CockroachDB

Auto-Scaling Your Distributed Database: A Bad Idea?

Daniel Holt

Oct 24, 2022 • 2 min read

In the past few years, software engineers have become increasingly aware of the importance of scaling data-intensive applications to meet the needs of its users. As with most things, though, there are both good and bad ways to go about it. One particularly bad approach that deserves a closer look is auto-scaling, which involves letting your database scale itself according to usage patterns automatically in response to ever-changing conditions in your environment. While it might seem like a convenient shortcut, there are a number of reasons why you should avoid this practice at all costs. Find out why in this article!

When an auto-scaling solution doesn’t fit

Distributed databases are a great way to scale your application, but they also require more sophisticated management. As your data grows, the database will need to be scaled up by adding new nodes to handle the load. This can take a lot of time and resources if you don't know what you're doing.

In addition, there's an increased risk for downtime because any one node may have an issue that causes it to go offline. And because distributed databases rely on consensus algorithms to ensure correctness, it is possible for errors to propagate from one node to another before the original error is detected.

Auto Scaling in Detail

If you are running a database or any other software, it is not recommended to automatically scale in and out based on your workload. You should have some type of monitoring in place so that when things are running smoothly, you don't incur costs by scaling up and then back down again.

Another thing to consider is how long it takes for the system to reach full capacity from autoscale being triggered. What if there is an issue with the autoscale during this time period, can your application handle all the requests or does something else go wrong?

Let's look at an example of why this is a bad idea

Set up an alert for 80% CPU utilization over a sustained 60 Seconds - This means your server is under Load
Autoscaling kicks in and adds a node to an already under-load cluster - The DB cluster now has to move data to the new load to balance out the data and CPU utilization but doing this puts even more load on the existing servers meaning CPU utilization will be pushed higher on the already under load servers causing even more performance impact
Other cluster semantics have to rebalance to such as leaseholders in distributed database systems

As you can see the worst time to autoscale a cluster would be when its under load. A way around this is to scale the cluster when cpu utilization is much lower say around 50% but then your actually guess when you believe spikey traffic is going to hit the cluster

Avoiding Autoscaling Pitfalls

Don't autoscale prematurely. If your database is still growing, and you have not yet reached peak capacity, then you don't need to autoscale.

Don't autoscale for the wrong reasons. Just because your database is scaling does not mean it's time to add more nodes.

Don't assume infrastructure is the problem. You can look at optimising the workload before thinking about autoscaling

Don't over autoscale in response to a spike in traffic.

It is nearly always better to capacity plan for peak workloads. If this is not possible then its always better to scale up in advance of busy periods.

Sign up for more like this.