How To Choose a Database?
[Version française ici]
So you have a big idea that’ll change the world. As any company, startup or not, you need to store data. Fortunately, you have plenty of choices for your database… Fortunately? Not so sure! With all the choices, how do you decide which type of database you need? Here is a quick refresher on the different database types you can choose from.
Relational databases have a long history and are widely used. They were designed to optimize data storage and to offer control over your data, think about concepts like foreign keys, types, and consistency.
NoSQL, short for “Not only SQL,” is a term used for database systems that store information in a variety of formats to support scale requirements that traditional relational databases have difficulties addressing.
While NoSQL solves the scalability issue encountered with relational databases, it also removes the foreign keys, transactions (some can do multi-document transactions, but it’s another topic to discuss the differences between isolation levels offered), and types.
The last option is Distributed SQL databases. Distributed SQL is designed to offer consistency, ACID transactions, AND scale. Not all distributed SQL databases are equal. You have many of them relying on Postgres or MySQL engines behind the scene. Usage of Foreign Keys or Transactions can be complicated, as the core is still working with engines based on a single node database. CockroachDB was built from the ground up as a distributed SQL database.
Let’s explore the 3 options.
Relational databases
Relational databases are well-known nowadays and widely used, especially with offers from a load of vendors. You can think about Oracle, SQL Server, MySQL, and Postgres for example.
You can always find a library to connect and use your database whatever language you’re using in your application. A lot of standards exist also to help you define your schema, migrate your database, and consume your data.
The problem is at scale. These databases were designed to work on a single node. Relational data are linked by foreign-key definitions, transactions are locks on data access. All of that is working fine on a single node, but it’s another story to build something scalable.
This lets you with two options to scale:
- Vertical scaling
- Sharding
Vertical scaling
Vertical scaling is, in fact, upsizing your VM or bare-metal host. More CPUs, more memory, more disks, etc. This solution is usually easy when the upgrade is small enough, for example going from 4 vCPUs to 8 vCPUs.
This is another price when you need 256 vCPUs and 2Tb of memory. Add to this the fact that your database is still on only one node. If you want to be resilient, you’ll need at least 2 of these machines and start a replication and failover process between them.
Sharding
Sharding is even worse! It doesn’t solve the replication issue for your failover scenario, it adds complexity to it. Splitting the database into smaller parts sounds like a good idea, but it increases the complexity of development and management.
Each shard will become a single node of your database, so if you want to be able to use transactions and foreign keys, you need to have all the data involved in the same shard. You also need to implement fail-over for each shard.
NoSQL database
You have multiple choices when it comes to NoSQL as you don’t have standards. Each and every solution comes with different implementations and tradeoffs, but they commonly use multiple nodes to spread data. Compared to relational databases, you don’t need to do anything, you just implement a cluster, and not all the data needs to be replicated on each node.
One common conception is that a NoSQL database will offer ease of scalability and performance (I mean real-time). Many of them provide different flavors of SQL to interact with your data so it helps developers reuse their knowledge.
Moreover, they are designed to survive failures of nodes, at least the failure of a single node. Replication processes are not necessarily synchronous, think about speed, but data are replicated between nodes to ensure your application can survive failures on the database.
But the main issue is you don’t have any kind of relationship between data. The main concept is often JSON documents stored as-is, just like Object databases in their time.
Relations between data is defined by embedding document, or nested document. If your use case needs to decouple the document, you won’t be able to have a guarantee of correctness between your data unless it’s in your code. You can’t rely on any schema definition.
The other aspect is around transactions. Vendors are moving to offer transactions, but this is mostly around document versions and this is not Serializable.
The scalability and speed offered by NoSQL options are impressive, but you need to adapt your code to the database client library and put some extra effort into consistency.
Distributed SQL
Here comes a new challenger! The distributed SQL options. With distributed SQL, you’ll benefit from scalability and consistency.
Sounds like the best of both worlds. And, in fact, it kind of is.
Distributed SQL solves the scalability problem of relational databases by adopting the same NoSQL mechanisms while ensuring type guarantee (aka schema), transaction correctness, and consistency.
The main tradeoff here is speed. As data needs to be replicated between nodes AND consistent, physical laws can’t be broken. It is clear you won’t be able to meet the speed of a NoSQL option. If you compare a single node SQL database to a distributed SQL database, it is for sure the latest that’ll be the slower.
The benefits come when you look at data consistency, correctness, latency, and ease of maintenance at scale.
CockroachDB is the most advanced distributed SQL offer. It was designed from the ground up to ensure ACID compliance AND scalability. There is no sharding behind the scene, but a real NoSQL-like clustering to ensure you’ll be able to have all the benefits from a relational database while keeping ease of scalability.
Last but not least, CockroachDB offers regional scalability (ie worldwide spread cluster). NoSQL options use asynchronous replication to replicate data between EMEA, the US, and APAC. With CockroachDB you have the availability to use a single cluster and survive any kind of failure without compromising on consistency and availability.
You can explore more about CockroachDB here.
So, How do you decide what will be your next database?
As you can see, choices depend on expectations for the product. Do you expect a huge increase in dataset size? Do you want to survive failures easily without breaking your applications or with downtime?
If you’re looking to have a dataset dedicated to a country, with a small number of users, and an application that can have downtime, the relational database is still the best and easiest way to go.
If you focus only on sub-millisecond response time (aka cache-like), or some sort of specific features (full text search for example), you should have a look at NoSQL solutions. You’ll certainly find the best-tailored tool for your use case.
Other than that, CockroachDB is certainly your best option. With scalability and ACID transactions guaranteed by design, it’s the easiest to operationalize and develop against. You can also benefit from a managed offer with Cockroach Cloud, and get started in less than 5 minutes with a serverless cluster.