How to Tackle ECS Autoscaling?

When people talk about cloud services, the word “scalable” comes up almost every time. And of course, the statement “cloud services are scalable” is true. They make scaling easier and often work out of the box. But sometimes we need to take control and define exactly how that scaling should work. That was the case for our team when it came to scaling the microservices running in our ECS cluster. Let me walk you through our experience.

Product overview

Long story short - we’re a small team of three developers, sharing responsibilities across backend, frontend, and ops work, building a data platform. It doesn’t get high traffic all the time. Instead, we often see traffic spikes when clients run big workloads, like uploading many files. From a business perspective, we aim to process these workloads as quickly as possible so that users can start interacting with the newly uploaded data.

Our system is message-driven and asynchronous, with responsibilities spread across many different microservices. We use MSK (AWS-managed Kafka) as our durable message log, RDS as our main relational database, and OpenSearch (AWS-managed Elasticsearch) as the primary engine for handling data reads in our web UIs.

Initial setup and problems

At first, all our services ran in a single ECS cluster and used the same autoscaling policy. Scaling was based on memory usage - when memory usage went over 80%, the service would scale out; when it dropped below 20%, it would scale back in. It was the default policy, also applied in the clusters running our client’s other products.
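For context, a minimal sketch of what such a memory-based policy could look like when wired up with Application Auto Scaling via boto3. The cluster and service names, capacities, cooldowns, and evaluation periods below are placeholders, not our actual configuration:

```python
# Hypothetical reconstruction of the original memory-based policy using boto3
# and Application Auto Scaling; names, capacities and periods are placeholders.
import boto3

autoscaling = boto3.client("application-autoscaling")
cloudwatch = boto3.client("cloudwatch")

resource_id = "service/main-cluster/some-service"  # placeholder

# Let Application Auto Scaling manage the service's desired task count.
autoscaling.register_scalable_target(
    ServiceNamespace="ecs",
    ResourceId=resource_id,
    ScalableDimension="ecs:service:DesiredCount",
    MinCapacity=1,
    MaxCapacity=10,
)

# Add one task whenever the alarm below is in ALARM state.
scale_out = autoscaling.put_scaling_policy(
    PolicyName="memory-scale-out",
    ServiceNamespace="ecs",
    ResourceId=resource_id,
    ScalableDimension="ecs:service:DesiredCount",
    PolicyType="StepScaling",
    StepScalingPolicyConfiguration={
        "AdjustmentType": "ChangeInCapacity",
        "StepAdjustments": [{"MetricIntervalLowerBound": 0, "ScalingAdjustment": 1}],
        "Cooldown": 300,
        "MetricAggregationType": "Average",
    },
)

# Fire when the service's average memory utilization stays above 80%.
cloudwatch.put_metric_alarm(
    AlarmName="some-service-memory-high",
    Namespace="AWS/ECS",
    MetricName="MemoryUtilization",
    Dimensions=[
        {"Name": "ClusterName", "Value": "main-cluster"},
        {"Name": "ServiceName", "Value": "some-service"},
    ],
    Statistic="Average",
    Period=60,
    EvaluationPeriods=3,
    Threshold=80.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=[scale_out["PolicyARN"]],
)
# A mirror-image policy and alarm (below 20%) handles scaling back in.
```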

As we onboarded more customers and started seeing more traffic spikes, it became clear that this setup wasn’t sufficient. The metrics showed that memory usage wasn’t really a problem for our microservices. Instead, we were running into CPU contention. None of our services had hard limits on memory or CPU usage, so a single container task could use up all the resources on an EC2 instance - especially during traffic peaks. As a consequence, other services sometimes failed to respond to automated AWS health checks and fell into restart-and-retry loops.

Sometimes, when we knew a traffic peak was coming, we tried to get ahead of it by manually adding more EC2 instances and tasks to the cluster. It was a quick fix, but it also meant extra work and added stress. This approach sometimes helped, but it didn’t solve the core issues. We still faced CPU contention, didn’t know how many additional tasks of each service we should spin up, and didn’t have time to monitor the results and adjust dynamically.

Determine scaling factors

At some point, we finally had time to tackle our autoscaling issues. The first step in solving them is identifying the right scaling factors for your deployment. The best way to do that is by checking your metrics and understanding your code. As mentioned earlier, we already knew the CPU was a bottleneck, so that became our first scaling factor. The second one, common in message-driven systems, was processing lag on the event stream. In our case, that meant the consumer group lags on several core Kafka topics.
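Consumer group lag isn’t a metric ECS can see out of the box, so it has to be exposed somehow - for example as a CloudWatch custom metric. Below is a minimal sketch of that idea using the kafka-python library and boto3; the topic, consumer group, broker address, and metric namespace are hypothetical, and this is not our production code:

```python
# Sketch: compute a consumer group's total lag and publish it to CloudWatch
# so scaling policies can react to it. All names below are placeholders.
import boto3
from kafka import KafkaConsumer, TopicPartition

TOPIC = "file-uploads"        # hypothetical topic
GROUP_ID = "file-processor"   # hypothetical consumer group

consumer = KafkaConsumer(
    bootstrap_servers="my-msk-broker:9092",  # placeholder MSK endpoint
    group_id=GROUP_ID,
    enable_auto_commit=False,
)

partitions = [TopicPartition(TOPIC, p) for p in consumer.partitions_for_topic(TOPIC)]
end_offsets = consumer.end_offsets(partitions)

# Lag = newest offset in each partition minus the group's committed offset.
total_lag = 0
for tp in partitions:
    committed = consumer.committed(tp) or 0
    total_lag += max(end_offsets[tp] - committed, 0)

boto3.client("cloudwatch").put_metric_data(
    Namespace="Custom/Kafka",  # assumed namespace
    MetricData=[{
        "MetricName": "ConsumerGroupLag",
        "Dimensions": [{"Name": "ConsumerGroup", "Value": GROUP_ID}],
        "Value": total_lag,
        "Unit": "Count",
    }],
)
```

Run on a schedule (for example, every minute from a small job or Lambda), this gives CloudWatch a lag series that alarms and scaling policies can react to.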

We also realized that some of our services mainly handle HTTP requests that result in simple database reads and writes. These services don’t need as many resources as the ones performing CPU-intensive work, like image processing, whose load depends more on consumer group lag.

We decided to separate these two groups by splitting them into two ECS clusters: one for CPU-intensive services and another for lightweight ones. This allows us to apply different autoscaling strategies and monitor each cluster independently.

Determine resource limits

One can solve CPU contention by adding hard resource limits to ECS tasks. But that comes with some important trade-offs. First, if a task is idle, its reserved resources are wasted. Second, it’s tough to come up with good numbers for the hard limits - you need to know, or even measure, how much CPU and memory a particular service needs under different workloads.

Instead, we chose to use soft limits for our ECS container tasks. So what’s the difference? With soft limits, tasks running on the same instance can share unused resources with each other as needed. For example, if two tasks each have a soft limit of 0.5 vCPU, and one is idle while the other is doing CPU-intensive work, the busy task can temporarily use the idle task’s unused 0.5 vCPU. When the idle task becomes active again, it simply gets back the CPU it needs - no contention or competition. This results in better overall resource utilization and doesn’t require the same precision or strict configuration as hard limits.
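In an ECS task definition, this comes down to which limits you set on the container. Here’s a hedged sketch using boto3 - the family, image, and numbers are illustrative, not our exact values:

```python
# A sketch of registering a task definition with soft limits via boto3.
# The family, image and numbers are illustrative, not our exact values.
import boto3

ecs = boto3.client("ecs")

ecs.register_task_definition(
    family="image-processor",  # hypothetical service
    requiresCompatibilities=["EC2"],
    networkMode="bridge",
    containerDefinitions=[{
        "name": "image-processor",
        "image": "123456789012.dkr.ecr.eu-west-1.amazonaws.com/image-processor:latest",
        # Soft CPU limit: 512 CPU units = 0.5 vCPU. On EC2 this acts as a
        # relative share, so a busy task can borrow an idle neighbour's CPU.
        "cpu": 512,
        # Soft memory limit: ECS reserves this much when placing the task,
        # but the container may use more if the instance has spare memory.
        "memoryReservation": 1024,
        # Deliberately no hard "memory" limit here - with one set, ECS would
        # kill the container as soon as it exceeded that value.
        "essential": True,
    }],
)
```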

Regular cluster

As mentioned earlier, we split our services into two ECS clusters. The “regular” cluster runs services that mainly handle HTTP traffic. These services scale based on CPU usage. In practice, it’s rare for this cluster to spin up more EC2 instances beyond the initial minimum. During heavy workloads, a few core HTTP services that facilitate the most critical business processes, like file uploads and parsing, scale out with a couple of extra task instances, but that’s usually it.
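For a service in this cluster, CPU-based scaling can be expressed, for example, as a target tracking policy on the service’s average CPU utilization. A minimal sketch - the resource names, capacities, and the 60% target are assumptions, not our tuned values:

```python
# Sketch of a CPU-based target tracking policy for a service in the "regular"
# cluster. Names, capacities and the target value are illustrative.
import boto3

autoscaling = boto3.client("application-autoscaling")
resource_id = "service/regular-cluster/upload-api"  # placeholder

autoscaling.register_scalable_target(
    ServiceNamespace="ecs",
    ResourceId=resource_id,
    ScalableDimension="ecs:service:DesiredCount",
    MinCapacity=2,
    MaxCapacity=6,
)

autoscaling.put_scaling_policy(
    PolicyName="cpu-target-tracking",
    ServiceNamespace="ecs",
    ResourceId=resource_id,
    ScalableDimension="ecs:service:DesiredCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        # Keep average service CPU around 60%; tasks are added/removed as needed.
        "TargetValue": 60.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
        },
        "ScaleOutCooldown": 60,
        "ScaleInCooldown": 300,
    },
)
```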

Most of these services perform relational database operations on our RDS instance, which is shared across microservices, with each service owning a separate schema. During scale-out we sometimes hit the database connection limit, since each task maintains its own connection pool. To solve this, we introduced an RDS Proxy into our environment, which helps manage and distribute connections more efficiently. However, this comes with a trade-off - slightly increased latency. To balance this, a handful of core services critical for workload performance still connect directly to RDS, while less critical or less frequently used services route their database traffic through the proxy. We haven’t run into connection issues since then. In the future, we may scale our RDS instance vertically or analyze the load and take advantage of RDS read replicas.
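From a service’s point of view, the switch is mostly about which endpoint its connection pool targets. A small illustrative sketch - the endpoints, database name, and pool sizes are placeholders:

```python
# Illustrative only: the change for a service boils down to which endpoint its
# connection pool points at. Endpoints, database name and pool sizes are placeholders.
import os
import psycopg2.pool

# Core, latency-sensitive services keep connecting straight to RDS:
DIRECT_ENDPOINT = "platform-db.abc123xyz.eu-west-1.rds.amazonaws.com"
# Less critical services go through the RDS Proxy endpoint instead:
PROXY_ENDPOINT = "platform-db-proxy.proxy-abc123xyz.eu-west-1.rds.amazonaws.com"

pool = psycopg2.pool.SimpleConnectionPool(
    minconn=1,
    maxconn=5,  # small per-task pool; the proxy multiplexes connections across tasks
    host=os.environ.get("DB_HOST", PROXY_ENDPOINT),
    dbname="platform",
    user=os.environ["DB_USER"],
    password=os.environ["DB_PASSWORD"],
)
```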

Compute cluster

The second cluster handles our CPU-intensive services and scales primarily based on consumer group lag. By default, each service starts with a single running task. This is essentially a standby mode - we are ready to process some Kafka messages without needing to scale out if the workload is small. This cluster uses a more aggressive scaling strategy than the regular one.

This means that when lag stays above a defined threshold for a set period, services scale out in larger steps (depending on the service): from 1 task instance to 5, then to 10, and so on. Since we don’t use any fine-grained partitioning strategy for our Kafka topics, we can expect messages to be distributed relatively evenly, so we don’t need to worry about partition assignment during scaling. The maximum number of task instances is capped at the number of partitions per topic, which in our case is 30 - any consumers beyond that would just sit idle. The scale-in rule is straightforward: when consumer group lag drops to 0, we return to a single task instance.
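One way to express such a policy is step scaling on the custom lag metric from earlier, triggered by a CloudWatch alarm. The sketch below is illustrative - the thresholds, step boundaries, and names are assumptions that would come out of your own tuning:

```python
# Sketch of lag-driven step scaling for one compute-cluster service.
# Thresholds, step boundaries and names are illustrative, not our tuned values.
import boto3

autoscaling = boto3.client("application-autoscaling")
cloudwatch = boto3.client("cloudwatch")
resource_id = "service/compute-cluster/image-processor"  # placeholder

autoscaling.register_scalable_target(
    ServiceNamespace="ecs",
    ResourceId=resource_id,
    ScalableDimension="ecs:service:DesiredCount",
    MinCapacity=1,    # the single standby task
    MaxCapacity=30,   # never exceed the number of partitions per topic
)

policy = autoscaling.put_scaling_policy(
    PolicyName="lag-step-scaling",
    ServiceNamespace="ecs",
    ResourceId=resource_id,
    ScalableDimension="ecs:service:DesiredCount",
    PolicyType="StepScaling",
    StepScalingPolicyConfiguration={
        "AdjustmentType": "ExactCapacity",
        # Intervals are offsets from the alarm threshold (1,000 messages here):
        "StepAdjustments": [
            {"MetricIntervalLowerBound": 0, "MetricIntervalUpperBound": 9000, "ScalingAdjustment": 5},
            {"MetricIntervalLowerBound": 9000, "MetricIntervalUpperBound": 29000, "ScalingAdjustment": 10},
            {"MetricIntervalLowerBound": 29000, "ScalingAdjustment": 30},
        ],
        "Cooldown": 120,
        "MetricAggregationType": "Average",
    },
)

# Alarm on the custom consumer lag metric published earlier.
cloudwatch.put_metric_alarm(
    AlarmName="image-processor-lag-high",
    Namespace="Custom/Kafka",
    MetricName="ConsumerGroupLag",
    Dimensions=[{"Name": "ConsumerGroup", "Value": "image-processor"}],
    Statistic="Average",
    Period=60,
    EvaluationPeriods=2,
    Threshold=1000,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=[policy["PolicyARN"]],
)
# A second alarm on lag == 0 can trigger a policy that returns to 1 task.
```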

Fine-tuning

To find the right scaling thresholds and step sizes, we ran many tests with different configurations to see how the platform responded. There’s really no shortcut here - it takes time and hands-on testing with real workloads. We ran load tests on our staging environment and later in production as well.

At this stage, having proper monitoring in place is crucial. For us, the key metrics that indicate scaling effectiveness are CPU and memory utilization (both at the cluster and per-service level), the number of EC2 and task instances, response times, and consumer group lag. It’s also a good idea to monitor your single points of failure, like databases, as they can experience more load than your team is used to seeing once the platform scales out. In our case, we identified and optimized several SQL calls in services that were causing high CPU usage on the shared RDS instance.
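Most of this lives in dashboards, but the same data is easy to pull ad hoc. For example, a quick look at a service’s average and peak CPU over the last hour - the cluster and service names below are placeholders:

```python
# Sketch: fetch a service's average and maximum CPU utilization for the last
# hour from CloudWatch. Cluster and service names are placeholders.
import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client("cloudwatch")

stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/ECS",
    MetricName="CPUUtilization",
    Dimensions=[
        {"Name": "ClusterName", "Value": "compute-cluster"},
        {"Name": "ServiceName", "Value": "image-processor"},
    ],
    StartTime=datetime.now(timezone.utc) - timedelta(hours=1),
    EndTime=datetime.now(timezone.utc),
    Period=300,
    Statistics=["Average", "Maximum"],
)

for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], round(point["Average"], 1), round(point["Maximum"], 1))
```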

In the end, we found a setup that worked well for us. During heavy upload workloads, when large batches of files were sent to the platform, our compute cluster could keep up with all the asynchronous processing, providing the results almost immediately after the uploads finished. Fine-tuning depends heavily on the characteristics of the workload you’re trying to scale. It’s different for every project, so it’s essential to understand your system deeply in the first place.

Summary

The whole process took over a month, including fine-tuning and load testing. We learned a lot about our services and ECS autoscaling itself. From a business perspective, we delivered better overall platform performance, directly impacting how our clients use our product. As developers, we also gained peace of mind, as there’s less stress over traffic peaks and we no longer need to step in and try to improve something manually.

In my opinion, the two most important steps when working on ECS autoscaling are identifying the right scaling factors - which means understanding your workload - and fine-tuning the scaling parameters. Every product is different and some things are hard to generalize - there’s no one-size-fits-all solution. Take your time, rely on your metrics and load test results, and you’ll eventually land in the right spot.

Thanks for reading. I hope you find this short article helpful.

P.S. If you are looking for an opportunity to improve observability with Grafana, read this article.

And if you're looking for a trustworthy expert specializing in Apache Kafka, visit our Confluent Premium Partner page. It might be the beginning of a great adventure.

Reviewed by: Michał Ostruszka
