How to choose a database for your project?
Technology choices are often influenced by hype or company politics. People tend to over-engineer, picking tools that could handle “future” traffic which rarely materializes. And sometimes, of course, they just want to have fun by trying something new.
Today, let’s have a solid discussion about databases! We’re going to start with the theory and the qualities to consider when choosing a database solution. Then we'll talk about data models and types - SQL, NoSQL, and NewSQL. We’ll dive into some of the most popular open-source databases and the most common use cases, with ideas on how to handle them. The goal is to learn how to make better choices.
CAP Theorem / Brewer’s Theorem & later update
CAP is something most of you have already heard about, but let’s do a small recap.
The original CAP Theorem
In 2000, Eric Brewer presented a conjecture concerning distributed systems (if you have 2 database nodes, it probably already applies to you). It states that such a system can satisfy only 2 of 3 specific traits - Consistency, Availability, and Partition Tolerance.
What do they mean?
- Consistency - all clients querying the system see exactly the same data, no matter which node they are connected to
- Availability - all requests made to the system get successful responses (i.e. it does not fail because a part of the system is not accessible)
- Partition Tolerance - if there is a network issue, e.g. a split brain breaking the cluster into 2 independent parts, then the system still works
When you google CAP, the one thing common to all the articles is a graphic of a triangle, with each trait in one corner and the edges labeled with database types or specific products that match them. However, is it really so easy to classify modern systems? Not really. In 2012, Brewer published an update - CAP Twelve Years Later: How the "Rules" Have Changed - saying that the world is not so black and white. Today, databases often have tunable settings, which may apply to a specific cluster, a table, or sometimes even a single query. You have to be aware of them and know what tradeoffs each option brings. For more information, I recommend Brewer’s article, but also Please stop calling databases CP or AP by Martin Kleppmann.
In 2010, yet another related theorem was formulated. PACELC is a CAP extension saying that in the case of network Partition (P), the system has to choose between being Available (A) and Consistent (C). Else (E), when running normally, it has to choose between Latency (L) and Consistency (C).
Qualities - what do you need to think about when choosing a database technology?
Apart from CAP, we have more qualities & features we need to think about when making a choice.
Latency
Response time is important for user experience. Some solutions may slow down as they grow in size. In other cases, heavy load may influence how fast the database replies. The more complicated the queries, the longer they take. An insert that requires deduplication based on some unique value will always be slower than a pure upsert, because the database needs to do more work to handle the request. All kinds of transactions increase latency as well. If you want to fetch a lot of data, then disk speed, network throughput, and other factors will also influence the results.
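To make the deduplication-versus-upsert point concrete, here is a small sketch using Python's built-in sqlite3 module (the table and column names are invented for illustration): a plain insert must check the unique index and fail on a duplicate key, while an upsert resolves the conflict in a single statement.

```python
import sqlite3

# In-memory database; "users" has a uniqueness constraint on email.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (email TEXT PRIMARY KEY, name TEXT)")

# Plain insert: the database checks the unique index first,
# and a duplicate key makes the statement fail.
db.execute("INSERT INTO users VALUES ('a@example.com', 'Alice')")
try:
    db.execute("INSERT INTO users VALUES ('a@example.com', 'Alicia')")
except sqlite3.IntegrityError:
    print("duplicate rejected")

# Upsert: overwrite on conflict instead of failing.
db.execute(
    "INSERT INTO users VALUES ('a@example.com', 'Alicia') "
    "ON CONFLICT(email) DO UPDATE SET name = excluded.name"
)
print(db.execute("SELECT name FROM users").fetchone()[0])  # Alicia
```

Either way, the uniqueness check costs extra work compared to a store that blindly overwrites the value under a key.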
High availability & failure tolerance
Everybody would like to have a 100% available system, but that is not possible. Remember the things we talked about when discussing the CAP/PACELC theorems. In reality, to achieve higher availability, we need to add more nodes and more independent, geographically distributed data centers. A common analysis is how many nodes may fail at the same time with the system still working and replying to queries. In some cases, a failure won’t even be noticed; in others, there may be a short downtime while the database switches to a backup node.
Remember that in order to handle network partitions, we may need to sacrifice a bit of availability (writes) or data consistency.
Scalability
We can distinguish two basic forms of scaling:
- vertical - achieved by adding more resources - CPU, RAM
- horizontal - achieved by adding more instances to the system
Some databases just can’t be clustered and scaled horizontally. For high availability, they usually use the active-passive approach, where one database is active and the changes are synchronously replicated to a passive node. Scaling is also influenced by the data model. Relational data models are difficult to distribute, so the most common setups are a single writable node or sharded data. On the other hand, key-value stores can be scaled horizontally very easily.
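A toy sketch of why key-value stores shard so easily: the key alone determines the owning node, so every client can compute the placement without a coordinator. The node names below are hypothetical; real systems use consistent hashing or virtual nodes so that resizing the cluster does not remap most keys.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]  # hypothetical cluster members

def node_for(key: str) -> str:
    # Stable hash -> node index. With plain modulo, adding a node
    # remaps most keys - hence consistent hashing in real systems.
    digest = hashlib.sha256(key.encode()).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]

# Any client computes the same placement for the same key.
print(node_for("user:42"), node_for("user:43"))
```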
Durability
In order to be faster and achieve better latency, sacrifices are made. There are databases that acknowledge a client operation only once the data is written to disk. In many other cases, though, the database returns to the client as soon as the change is put in memory or the disk cache. In case of failure, some data may then be lost. Of course, there are preventive mechanisms - horizontally scaled clusters may still hold the data on other nodes.
Licenses & commercial variants
Databases are offered under various licenses. Usually, there are no limitations on commercial usage. In some cases, a few variants are offered - a basic, free & open-source one and more advanced, paid ones with additional features. Nowadays, hosted, as-a-service versions are available as well. The main limitation put into the licenses usually concerns offering managed & hosted versions - so that the main company does not get any competition.
Transactions & Consistency
Most people first meet SQL databases and only later enter the NoSQL world. It can be surprising that ACID transactions are actually a not-so-common feature. Eric Brewer (yes, the guy from the CAP theorem) defined another acronym, called BASE. It can be explained as:
- Basically available - this refers to the availability from CAP.
- Soft state - state of the database can change, even when there are no operations.
- Eventual consistency - indicates that the system will become consistent at some point in time. This can be a result of e.g. some data synchronization mechanism running in the background. Depending on the case, milliseconds may be enough.
Getting ACID and full consistency is difficult and costly. What is more, for many applications it is actually not needed. Think about the systems you have worked on: in how many cases could someone be reading the data 1 ms after it was written, with the response having to be already fully consistent with the write? Actually, not too often. Take a simple CRUD service for user accounts - people may say this must be strictly consistent! However, users don’t edit their accounts too often, and even when they do, it happens mostly from one session, and a read may not come within milliseconds of the write.
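The behaviour described above can be sketched in a few lines of Python: a toy three-replica store where a write is acknowledged after reaching one replica, and a background synchronization pass makes the others catch up. The sync logic is deliberately naive; real systems use mechanisms like read repair or hinted handoff.

```python
# Toy model of eventual consistency: a write lands on one replica
# and a background "anti-entropy" pass copies it to the others.
replicas = [{}, {}, {}]

def write(key, value):
    replicas[0][key] = value        # acknowledged after one replica

def read(replica_id, key):
    return replicas[replica_id].get(key)

def sync():
    merged = {}
    for r in replicas:
        merged.update(r)
    for r in replicas:
        r.update(merged)

write("theme", "dark")
print(read(2, "theme"))   # None - replica 2 has not caught up yet
sync()                    # background synchronization runs
print(read(2, "theme"))   # dark - all replicas converged
```

Between the write and the sync, different clients may see different values - that window is exactly what "eventual" means.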
There are many other factors that may influence your decision. Financial companies often need audit logs - and they are not supported by all solutions. Sometimes you’d like to get a stream of data changes (see CDC, Debezium, etc.). For others, it is important how easily the database deploys to Kubernetes. Companies may need built-in backup capabilities, data-at-rest encryption, and many other features.
Data models & query types - SQL, NoSQL or NewSQL?
In this section, we will discuss the most popular types of databases.
Relational
I think this is the most popular and best-known data model. It consists of tables connected using relations. Tables have many columns with specific types, rows are identified by primary keys - a great fit for storing structured data. Relations are defined using foreign keys. Data can be queried using any column, and additional indexes may improve the performance of those queries. Queries are defined, of course, using SQL (Structured Query Language), which has existed since 1974! SQL is very powerful, supports transactions, and is constantly being extended.
However, relational databases also have some drawbacks. Due to transactions and durability, latency won’t be ultra low. They don’t scale horizontally very well - relations make that difficult. To work around it, some databases support sharding or partitioning to divide data into smaller parts, but that's only a workaround. In recent years, a new category of databases emerged, called NewSQL. It brings horizontal scaling to the relational world, usually by employing consensus algorithms like Raft or Paxos. Unfortunately, you have to remember they are not a miraculous solution. A single NewSQL node can be even a few times slower than a single SQL node, but you can just set up a hundred of them.
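A minimal illustration of the relational model using Python's bundled sqlite3 (the schema is invented for the example): two tables linked by a foreign key, queried with a join on an arbitrary column.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE books (
        id INTEGER PRIMARY KEY,
        title TEXT,
        author_id INTEGER REFERENCES authors(id)  -- the relation
    );
    INSERT INTO authors VALUES (1, 'Ursula K. Le Guin');
    INSERT INTO books VALUES (1, 'The Dispossessed', 1);
""")

# Any column can drive a query; the JOIN follows the foreign key.
row = db.execute("""
    SELECT a.name, b.title
    FROM books b JOIN authors a ON a.id = b.author_id
    WHERE b.title LIKE 'The%'
""").fetchone()
print(row)  # ('Ursula K. Le Guin', 'The Dispossessed')
```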
Examples of SQL databases:
- PostgreSQL
- MySQL
- MariaDB
Examples of NewSQL databases:
- CockroachDB
- TiDB
- YugabyteDB
Key-value
A very simple model, based on keys & values (who would have guessed!), similar to a dictionary data structure. Values are put under specific keys and can be retrieved only by those keys. Values can be simple types like strings or numbers, or more complex ones like lists or maps. Advantages? Simple, fast, easy to scale. Disadvantages? Not very powerful read queries (mostly "get value by key").
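A Python dictionary makes the model - and its main limitation - easy to see (the keys and values below are made up):

```python
# A key-value store boils down to a dictionary with get/put/delete.
store = {}

store["session:abc"] = {"user_id": 42, "roles": ["admin"]}   # put
value = store.get("session:abc")                             # get by key
store.pop("session:missing", None)                           # delete

# The catch: you can only look up by key. Answering "which sessions
# belong to admins?" means scanning every value - no query planner.
admins = [k for k, v in store.items() if "admin" in v["roles"]]
print(value, admins)
```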
Examples of Key-value databases:
- Redis
- memcached
- etcd
Graph
Many graph-like structures can be modeled in other databases, e.g. relational ones; however, those are not the easiest or most efficient tools for the job. Graph databases focus on nodes and the edges between them. If you’d like to model a social network, showing who is friends with whom, and then find people with common friends, then a graph database is for you.
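As a toy illustration, the "common friends" query can be expressed over adjacency sets in plain Python; a real graph database would express the same thing declaratively (e.g. a MATCH pattern in Cypher).

```python
# Friendship graph as adjacency sets; in a graph database these
# would be nodes plus "FRIENDS_WITH" edges.
friends = {
    "ann": {"bob", "eve"},
    "bob": {"ann", "eve", "dan"},
    "eve": {"ann", "bob"},
    "dan": {"bob"},
}

def common_friends(a, b):
    # Intersection of the two neighbour sets.
    return friends[a] & friends[b]

print(sorted(common_friends("ann", "bob")))  # ['eve']
```

The appeal of graph databases shows up on deeper traversals (friends-of-friends, shortest paths), where hand-rolled joins or loops get expensive.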
Examples of Graph databases:
- Neo4j
- ArangoDB
- JanusGraph
Document
Document databases are actually a special subtype of key-value stores, where the values contain documents, e.g. in JSON, YAML, or XML. In many cases those documents do not have to share any schema (so they may also be used for storing unstructured data), but they can refer to each other ~ have relations! What about queries? They are more powerful than in raw key-value stores - often it is possible to build queries based on the fields of the inserted documents.
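A rough sketch of field-based querying over schemaless documents, using Python dicts in place of stored documents (all field names are invented); a document database would accept a similar filter declaratively, e.g. `{"customer.city": "Oslo"}` in Mongo-style syntax.

```python
# Documents in one "collection"; no shared schema is enforced.
docs = [
    {"_id": 1, "type": "order", "total": 99.5, "customer": {"city": "Oslo"}},
    {"_id": 2, "type": "order", "total": 10.0, "customer": {"city": "Kraków"}},
    {"_id": 3, "type": "invoice", "total": 99.5},   # different shape
]

# Query on document fields, including nested ones.
matches = [
    d for d in docs
    if d.get("type") == "order" and d.get("customer", {}).get("city") == "Oslo"
]
print([d["_id"] for d in matches])  # [1]
```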
Examples of Document-oriented databases:
- MongoDB
- CouchDB
- Couchbase
Wide column
This type is sometimes called tabular, column-oriented, or column family databases. We have tables here, but those tables are not connected by any relations. Additionally, rows are divided into partitions (based on a partition key), which determines their physical location in the cluster. Additional clustering keys define the order of rows inside a partition. Since this model usually comes with solutions that are easy to scale horizontally, queries are very limited - you may need to provide the partition key every time (so the database knows on which node the data is located). Due to the distributed nature, indexes are also not too efficient. This makes data modeling quite challenging. There are actually whole modeling methodologies for these databases, and they start with defining all the queries you'd like to run against your data. In many cases, it turns out that in order to satisfy them, you need to highly denormalize your data.
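The partition-key / clustering-key split can be sketched as a dictionary of sorted lists (the names are made up): queries within one partition are cheap, while cross-partition questions would require a full scan of every node.

```python
from collections import defaultdict

# Rows grouped by partition key (sensor id); inside a partition they
# are kept ordered by the clustering key (timestamp).
table = defaultdict(list)

def insert(sensor_id, timestamp, value):
    table[sensor_id].append((timestamp, value))
    table[sensor_id].sort()               # clustering-key order

insert("sensor-1", 10, 21.5)
insert("sensor-2", 10, 19.0)
insert("sensor-1", 5, 20.9)

# Efficient query: partition key given, range on the clustering key.
rows = [v for t, v in table["sensor-1"] if t >= 5]
print(rows)  # [20.9, 21.5]

# "Which sensor ever reported > 21?" would have to scan every
# partition on every node - this is why such queries are restricted.
```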
Examples of Wide Column databases:
- Apache Cassandra
- Apache HBase
- ScyllaDB
Multi-model
There are NoSQL databases that can actually be categorized into multiple models. ArangoDB is theoretically a graph database, but its nodes may contain documents. Redis extensions allow for graph operations. Datastax Enterprise adds Spark, Solr, and graph features to Cassandra. As you can see, databases do not need to stick to a single model. What should be important for you here? What queries you’d like to run in the future, and what the format of your data actually is. There is no single format that fits all problems.
Best known SQL & NoSQL databases
Nowadays, there are so many technologies on the market that it is difficult to cover all of them. In this section, we’re going to focus on a few popular ones that you may encounter during your work.
PostgreSQL / MySQL / MariaDB - SQL databases
MySQL, PostgreSQL, and MariaDB are the most popular open-source RDBMS (Relational Database Management Systems) for OLTP (Online Transaction Processing). They have been on the market for many years and include many features you would not expect to find. All of them can handle geospatial data. PostgreSQL even has a built-in pub/sub-like mechanism. In terms of clustering & high availability, the basic model leverages the active & passive approach, with failover when the active node fails. It is also possible to make a passive node a “Hot Standby”, available for read queries. This way, you can scale reads, but not writes. For writes, you need to partition your data.
As these relational databases are the most popular choices, all cloud providers offer their hosted versions including the high-availability features.
Redis - key-value store
Redis is a key-value store. Although this data model seems quite simple, Redis offers tons of additional features, including geospatial indexes. It keeps data in memory, periodically writing it to disk, which makes it ultra fast (see benchmarks), but adds a bit of risk related to durability. It is often used as a cache; however, it also supports transactions and Lua scripting, can be clustered, and includes a Pub/Sub mechanism, leveraged e.g. by the Signal messenger. All major cloud providers offer Redis in hosted variants, see e.g. Amazon ElastiCache or GCP Memorystore.
Apache Cassandra - wide column store
Apache Cassandra is often used together with Big Data. A few years ago, Apple announced their cluster had over 115 000 nodes and stored over 10 PB of data. Cassandra scales horizontally in an almost linear fashion, with no single point of failure. As usual, there must be some drawbacks. You won’t find full transactions here. As mentioned before, modeling is difficult, and before starting to use Cassandra you should really define all the queries you’ll need in your application - which, in many cases, is just not possible. DataStax offers commercially supported Cassandra distributions and hosted variants. The open-source version is offered in hosted form by other cloud providers as well.
If you’d like to learn more about Cassandra, take a look at another of our blog posts, "7 mistakes when using Apache Cassandra".
MongoDB - document database
MongoDB is probably the best-known document database. It offers a few clustering models, including sharding features and the active-passive approach. Mongo supports transactions and indexes, but also aggregation operations with map-reduce-like features. It is also offered in the cloud as MongoDB Atlas.
CockroachDB - NewSQL
CockroachDB is a NewSQL database (relational, but scaling horizontally) created by ex-Google employees and inspired by Google Spanner. It implements the PostgreSQL wire protocol, so it is compatible with PostgreSQL drivers. The core open-source version is free; however, Cockroach Labs offers other variants, including hosted dedicated clusters but also a serverless product.
Elasticsearch - search engine
Many companies say that they store data in Elastic. It offers a document data model with really great search capabilities. However, you may encounter opinions that it shouldn’t be used as your primary database - see Why one shouldn’t use ElasticSearch as a data storage.
Most common use cases and what to use when
Modeling user accounts
Modeling a CRUD (Create, Read, Update, Delete) user service is quite a common task. What to use? Honestly, if you are not dealing with tens of millions of users and a 5-minute inability to edit user accounts is not a problem, a relational database management system like PostgreSQL is enough.
Storing IoT readings
For this problem, you may consider multiple options. If you have huge amounts of data to store, Apache Cassandra may be a good choice. However, in order to process that data, you’ll need to leverage e.g. Apache Spark. Other options include typical time-series databases, such as:
- InfluxDB - a NoSQL database designed for time-series data.
- TimescaleDB - a time-series database built on top of the PostgreSQL relational database.
- Prometheus - often leveraged in monitoring systems.
Their main advantage is built-in support for a rich set of time- & window-related operations.
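For intuition, here is what such a window operation computes, sketched with a 60-second tumbling window in plain Python; a time-series database would do this server-side with a time-bucketing clause.

```python
from collections import defaultdict

# Timestamped sensor readings: (epoch seconds, value).
readings = [(3, 10.0), (62, 12.0), (65, 14.0), (130, 20.0)]

# 60-second tumbling-window average: bucket each reading by the
# start of its window, then average per bucket.
windows = defaultdict(list)
for ts, value in readings:
    windows[ts // 60 * 60].append(value)

averages = {start: sum(vs) / len(vs) for start, vs in sorted(windows.items())}
print(averages)  # {0: 10.0, 60: 13.0, 120: 20.0}
```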
Yet another approach would be to store the readings on Apache Kafka topics with infinite retention configured. Depending on the use case, they can be processed in real time in a streaming manner, or as a batch using e.g. Apache Spark.
Full text search
The most powerful full text search engines are Elastic and Solr. However, depending on what you actually need, they may not be required. Many other databases include built-in support for full-text-search-like operations, so as a first step, check what is possible in your current main database. As an example, see full text search in PostgreSQL.
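Under the hood, full text search engines are built around an inverted index; this toy Python sketch shows the core idea (tokenization is a naive whitespace split, with none of the stemming, scoring, or ranking a real engine adds).

```python
# A minimal inverted index: each token maps to the set of
# document ids containing it.
docs = {
    1: "quick brown fox",
    2: "lazy brown dog",
    3: "quick red fox",
}

index = {}
for doc_id, text in docs.items():
    for token in text.split():
        index.setdefault(token, set()).add(doc_id)

# AND-query: documents containing every search term.
def search(query):
    results = [index.get(t, set()) for t in query.split()]
    return sorted(set.intersection(*results)) if results else []

print(search("quick fox"))  # [1, 3]
```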
Caching
In many cases, having a local, in-memory cache may be enough. You can use libraries like Caffeine or Guava Cache. If you have a bigger amount of data to cache, or you want to share it among multiple application instances, then you can choose Redis or memcached.
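For illustration, a minimal local TTL cache takes only a few lines of Python (the class and parameter names are invented; libraries like Caffeine add size bounds and smarter eviction policies on top of this idea).

```python
import time

# Minimal local cache with time-to-live expiration.
class TTLCache:
    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.entries = {}          # key -> (expires_at, value)

    def put(self, key, value):
        self.entries[key] = (time.monotonic() + self.ttl, value)

    def get(self, key):
        entry = self.entries.get(key)
        if entry is None or entry[0] < time.monotonic():
            self.entries.pop(key, None)   # drop expired entry
            return None
        return entry[1]

cache = TTLCache(ttl_seconds=0.05)
cache.put("user:42", {"name": "Alice"})
print(cache.get("user:42"))        # {'name': 'Alice'}
time.sleep(0.06)
print(cache.get("user:42"))        # None - entry expired
```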
Summary
There is no perfect solution that fits all problems and requirements. Choosing a database is not an easy task, and usually there is more than one option that could work for you. In practice, most projects do not operate at a huge scale and don’t need ultra-scalable NoSQL databases. Short downtimes (e.g. 5 minutes to switch between the active & passive instances) can be completely hidden from the user with workarounds such as an additional caching layer. Moreover, a few minutes of inability to write is often just acceptable. Of course, there are exceptions for which solutions like Apache Cassandra are a great choice. Just remember that they shouldn’t be the first choice for every single project, since they carry additional overhead in the form of difficult data modeling and implementation. It may be safer to start a project with a relational database, and once you know all the queries and how the load changes, move to another solution.
Unfortunately, you won’t escape politics in your everyday work. I have personally seen companies that in the past allowed using only one selected database type, which was not the best fit for most of their services. Others bought Oracle licenses or racks many years ago and don’t allow anything else. The only thing you can do is negotiate, analyze what you really need to achieve, and refrain a bit from using whatever is currently hyped.
I hope the information gathered in this article will help you choose the right database for your system. Keep in mind that, depending on the project and how much data you need to store, the best choice may actually be to leverage multiple databases!
If you're looking for content around Big Data, subscribe to Data Times - the monthly newsletter curated by SoftwareMill's Engineers.