Contents

Building a multi-regional, highly available scheduler with AWS

Building a multi-regional, highly available scheduler with AWS webp image

Modern distributed systems often operate across multiple regions to ensure high availability and low latency. These systems frequently require tools to schedule HTTP requests for future execution, enabling workflows like notifications, deferred API calls, or time-based triggers. In a multi-regional setup, such a scheduler must address challenges like consistency and failover resilience. This article outlines how to build a highly available, multi-regional scheduler using AWS services, ensuring at least one delivery policy for critical workflows.

There are also some additional requirements for this tool to fulfill:

  • Ability to send thousands of requests each second with seconds accuracy (In a happy path scenario. It is okay to deliver them a bit later in case of some service outages).
  • At least one delivery is a must. We also have to aim for the best effort exactly once delivery.
  • Sending each request in the region where it was created (cost savings + easier data consistency) - our target endpoints may be regional, i.e., the scheduler needs to be aware of a region where it should send the request to.
  • Ability to perform region failover - we need to be able to drain traffic coming out of the scheduler in one region and switch it fully to a different one in order to deliver all schedules there. (i.e., even though given schedules were created in us-east-1, they should be delivered in us-east-2 during the region failover). Region failover does not mean killing all the service instances in a given region.

Why not use the Event Bridge Scheduler?

We surely thought about it at first, as it seemed like an out-of-the-box solution which would suit our use case, but we encountered the following issues when looking into it more deeply:

  • HTTP endpoints as scheduler targets are not supported. Lambda could be used here as a mid-layer, which is not that big an issue, but surely adds some complexity to the architecture
  • Most importantly though, EventBridge scheduler schedules are not replicated and kept in sync between the regions, hence it does not support region failover out of the box. Schedules that were created in a given region would have to be delivered there too.

All in all, while analyzing the EventBridge Scheduler approach against our requirements, we came to the conclusion that in order for this architecture to fulfill our requirements, we would need to add a lot of puzzle pieces next to this AWS service. This might make the whole thing very complicated and potentially flaky. Considering that, it would be difficult to call it an out-of-the-box solution. The custom tool on the other hand, would give us more flexibility and the ability to enhance the list of supported features according to our needs in the future. This is why we decided to build our own scheduler and deploy it following the multi-AZ and multi-regional setup. All the next sections describe the high-level architecture of this idea.

Persistency and simple API

First of all, we want to persist schedules in the database and have all the information there for easier management and audit.

We decided to go for DynamoDB as the NoSQL database would give us good performance, scalability and if we plan our data structure correctly, we should be able to live without performing complex queries.

Additionally, we chose the DDB global table feature to leverage the active-active replication across our regions to support region failover (more about it later).

In order to avoid having hot partitions the scheduleId is used as a partitionKey. The value can be generated on each schedule creation.

Since we have a database, which stores all of the schedules, we can now expose an API which will allow performing CRUD operations on schedules.

api

Cache

We could basically end this article here by creating a Sender component which will simply query the DynamoDB table every second (this would require having a dedicated GSI value to retrieve all schedules which are to be executed in a given second) and simply send them to their target endpoints.

As simple as it may sound, this is not the greatest idea out there, as calling a DynamoDB table every second would put a huge load on it and would increase the overall cost of this architecture. It seems much smarter to bucketize the schedules into e.g. 10 minute chunks and preload them into cache earlier and, when the right time comes, lookup the cache and send what needs to be sent.

sender

Sender fetching schedules directly from a database vs. Sender using a cache layer.

Additionally this idea gives us all the more obvious benefits of using cache, like decreased latency, costs of calling the database, better scalability etc.

It is important to note that these 10-minute buckets need to be regional, i.e. will contain schedules from a given region exclusively.

buckets

Two buckets covering the same time frame in two different regions contain different regional schedules.

What service can be used as the cache layer?

We had to reject the idea of using the in-memory cache pretty quickly as our service is supposed to work in a multi-region and, more importantly from the caching perspective, multi-instance (within one region) setup, hence the cache needs to be shared across the regional instances for synchronization. (i.e., we don’t want to send all the schedules by all the instances as might happen with independent cache per instance).

At first, it seemed that using ElastiCache would be the best choice we have, but we actually decided to go for an SQS queue instead. This (probably not obvious) choice can be explained because of the following reasons:

  • Messages in the SQS queue are persistent (having the same with ElastiCache with Redis would also be possible but would require some more puzzle pieces). This means that once the schedule ends up on the CacheQueue, we can trust the at-least-once delivery policy there and be sure that the schedule should be processed by the Sender eventually.
  • Having an SQS queue as a cache lets us have more than one Sender without the need to synchronize them.
  • We can leverage the queue’s retry policy if Sender unexpectedly fails for whatever reason.
  • SQS has a feature called message timers, which we decided to make use of in order to be able to execute schedules at a given point in time even though they were cached earlier.

sender

Sender

Nothing very fancy here, the Sender component simply polls for messages from the CacheQueue and once schedules appear, they get sent. Multiple Senders can be used there in order to increase the performance (we will use just one for simplicity). After a successful delivery the schedules are marked as sent (their GSI is removed and the status is changed) and stored in the database.

Reader

The high level job of this component is to “wake up” at a specified time, read all the schedules forming a given bucket from the database and load them onto the CacheQueue (calculating the delay first).

The actual shape of the component boils down to the following:

Reader component

Reader component built from the ReaderTaskProducer, a FIFO queue, and the ReaderTaskConsumer

ReaderTaskProducer

The only responsibility of this component is to wake up every ten minutes (a bit earlier than a given bucket starts - so if we are considering 12.10 AM - 12.20 AM bucket, it can wake up at e.g. 12.07 AM), create a ReaderTask for the given bucket and put it onto a FIFO queue. ReaderTask is a simple command which instructs the Reader that a given regional bucket should be loaded from the database and put onto cache.

This is the right time to emphasise the fact that our service may have multiple instances within a given region (or even AZ) working simultaneously. Our job is to make sure that each bucket/schedule does not get stored in cache multiple times.

The whole purpose of cutting the Reader into two components (ReaderTaskProducer + ReaderTaskConsumer) is to put an SQS FIFO queue in between. In order to achieve the single ReaderTask per bucket on the queue, the SQS FIFO deduplication mechanism has to be used (A bucket id which is the same across the service instances is used as the deduplicationId).

ReaderTaskConsumer

This part of the Reader polls for messages from the FIFO queue, and whenever a ReaderTask message occurs there, it will be processed by only one ReaderTaskConsumer. (To be more precise - ReaderTaskConsumers from other service instances will not process the record. VisibilityTimeout SQS feature has been used to achieve that)1.

Reading the provided footnotes is not necessary to get a high-level understanding of this architecture. However, some concepts are described there more thoroughly, so they definitely help with getting the full picture. 

If it comes to the processing itself - the component queries the database using the GSI to retrieve all the schedules from a given bucket. It then calculates the delay before putting each of them onto the CacheQueue.

GSI shape

nextAttemptTimestamp: 2025-10-01T17:16:04Z 
gsi: us-east-1#2025-10-01T17:10:00Z

NextAttemptTimestamp has been “rounded” down to the nearest 10 minutes in order to form the GSI which represents the 10-minute long buckets.

Retries

What if a given schedule’s delivery fails for whatever reason? Following the best practices, the retry policy is applied.

In order to support retries, the Sender component bumps the schedule’s attempts counter, calculates the next delivery time (exponential backoff with jitter can be used here) and, based on that, updates the GSI value of the record (if needed) and stores it in the database. This way, the schedule gets moved to a different bucket and will be processed by the upcoming bucket iterations2.
However, if the maximum number of attempts is drained, the schedule is marked as skipped and an alert gets raised (we will not focus on how the alerting mechanism was done within this article).

Flow Summary

  1. ReaderTaskProducers create ReaderTasks and store them onto the SQS queue. Only one message actually ends up there.
  2. One ReaderTaskConsumer processes the ReaderTask.
    a. The bucket gets fetched from the DynamoDb table.
    b. A delay for each of the schedules is calculated.
    c. All schedules are put onto the CacheQueue using the message timers feature.
  3. When the delay time elapses the Sender reads a given schedule and sends it to a target endpoint.
    a. In a successful scenario, the schedule gets marked as delivered.
    b. In case of a failure, the retry policy is applied.

Region Failover

As mentioned in the beginning of this article, one of the main challenges of this solution was to support the ability to “switch off” a given region at any point in time and deliver all the schedules from this moment on in one of the other regions (target endpoints are regional).

It is important to emphasise that this architecture leverages background tasks which do all the work. This means that it is not enough to route all the CRUD API calls to the service to another region as the schedules that were created earlier will still be sent in the potentially faulty region.

Hence, to fully drain traffic from a given region, all of the scheduler instances in the failovered region must pass the turn off command to all of their components, so that they stop doing their job.

A similar thing applies to the takeover region (region which will send the schedules from the failovered region, during the outage) - service instances' components need to be aware of a mode switch.

We decided to store the information about the current region failover status in an SSM Parameter Store and poll for its value changes to detect the region failover3.

Let’s start with the architecture changes to understand how to support this requirement:

architecture changes

FailoverPoller

This component is responsible for polling the SSM parameter value, and once the change is detected, it passes that information further to the Orchestrator.

ReaderTaskProducer

As mentioned earlier the buckets which are being fetched from the database are regional. A ReaderTask is a command which tells the ReaderTaskConsumer which bucket is supposed to be processed. In case of a region failover, the ReaderTaskProducer needs to stop working, so that no more ReaderTasks for this region are fed the FIFO queue. This way, the schedules will not be put onto cache, and hence, will not be sent from this region.

If it comes to the takeover region, an additional ReaderTaskProducer instance can be launched, next to the default one, in order to generate the commands for the failovered region. In return, the ReaderTaskConsumer will put the schedules onto cache for the Sender to process, to support the full region switch4.

Sender

The Sender needs to receive a set of regions from which it should process (send) the schedules. By default, the component is supposed to process schedules from the region where it operates, but once the failover happens, it should stop sending immediately. Other than that, it should purge the CacheQueue from all the remaining messages that are there (schedules will be picked up in the other region), so that they don’t get sent again once the failover is done and the Sender starts consuming from the queue in a default mode again.

Sender in the recovery region on the other hand, has to be simply told to send the schedules from its original region + from the failovered region5.

Orchestrator

In order for the ReaderTaskProducers and Sender to work in a specific mode that can be switched at runtime, an Orchestrator component should be introduced.

The Orchestrator owns the Sender and ReaderTaskProducers components instances. This way once the failover information is passed to the Orchestrator, it can switch off the working components and potentially start the new ones in a different mode.

Region failover example

Let’s assume that the us-east-1 region needs to be switched off because of some outage (or simply for testing, which by the way is a good practice too).

  1. SSM regional parameters are changed. us-east-1 is evacuated and us-east-2 is promoted to the recovery region.
  2. FailoverPoller detects the parameter value change and passes the information to the Orchestrator.
  3. us-east-1
    3.1 Orchestrator switches off the ReaderTaskProducer which was generating the us-east-1 ReaderTasks. It also tells the Sender to purge the messages related to us-east-1 schedules to drain the queue.
    3.2 No more schedules are fed to the cache.
    3.3 All the schedules that were in the cache already, get purged whenever they become visible to the Sender, hence nothing gets sent from this region.
  4. us-east-2
    4.1 Orchestrator starts a new ReaderTaskProducer instance which will be responsible for the us-east-1 buckets. It also tells the Sender to send schedules from us-east-1 and us-east-2 regions6.
    4.2 ReaderTaskProducers generate the ReaderTasks related to us-east-1 (starting e.g., a couple days earlier) and us-east-2 regions.
    4.3. ReaderTaskConsumers fetch buckets from us-east-1 and us-east-2 regions and feed the cache with all these schedules.
    4.4. Sender sends all the schedules from the two regions.

Region recovery example (failover is switched off)

  1. SSM regional parameters are changed. Life goes back to normal.
  2. FailoverPoller detects the parameter value change and passes the information to the Orchestrator.
  3. us-east-1
    2.1 Orchestrator starts the default ReaderTaskProducer and tells the Sender to start sending schedules from us-east-1.6
    2.2 ReaderTasks for us-east-1 get generated (starting e.g., a couple of days earlier), and the schedules end up in the cache.
    2.3 Schedules are sent.
  4. us-east-2
    3.1 Orchestrator stops the ReaderTaskProducer instance which was responsible for the us-east-1 buckets. It also tells the Sender to stop processing us-east-1 schedules and send the ones from us-east-2 exclusively.
    1. Only the us-east-2 ReaderTasks get generated, and these schedules are fed to the cache.
    2. Sender processes the us-east-2 schedules. From this moment on, all the us-east-1 schedules already in the cache are being purged from the queue.

The solution described above has been implemented and deployed successfully to the production environment, demonstrating its ability to handle desired workloads with reliability and efficiency. It ensures the expected scheduling precision, high availability, and fault tolerance, meeting the demands of the whole system.

Alternative Approaches and Considerations

Some of the decisions we had to make while building this architecture were perfect examples of the rule that applies to designing any system - trade-offs everywhere…

Talking about trade offs, we looked at this architecture idea before implementing and tried to think of some improvements. First of all, we thought that it would be nice to decouple the service even more by using existing AWS services instead of implementing all of the components as a single ECS task. Additionally, the components which leverage polling seemed suboptimal at a glance as well.

This made us think of an alternative - EventBridge scheduler + lambda target could have also been used to build the Reader component (instead of ReaderTaskProducer + SQS FIFO + ReaderTaskConsumer), but in order to support the region failover the lambda would have to be run in a specific mode, based on the SSM parameter value (It might need to call the SSM API to retrieve the value every time it is run). The same thing applies to the Sender component, which could also be replaced by the lambda. For this reason we decided that having an Orchestrator which owns the sub-components and can easily switch their modes is a choice we want to follow.

On the other hand, it should be possible to almost seamlessly replace the ReaderTaskConsumer with a lambda and this is something we consider changing in the future.

Anyway, we would love to hear some ideas on how to improve this architecture or how to solve the whole problem with a completely different approach!

Reviewed by: Rafał Maciak, Paweł Stępniewski, Jacek Centkowski

Notes


  1. The message will be processed by just one ReaderTaskConsumer in a happy path scenario. In case something goes wrong and the VisibilityTimeout elapses, it will be picked up by another instance and it is possible for the messages to actually end up on the CacheQueue multiple times. However, the service is supposed to do its best to send everything exactly once, but the at least once delivery principle is the major one. 

  2. In fact the logic needs to be a bit different for the schedules which are to be retried very soon. Whenever the next attempt is within or exactly 15 minutes from now (15 minutes is a max delay supported by SQS), they are put onto cache directly and stored to DynamoDB with a skip flag=true. This way a given schedule should always end up in cache, if it still belongs to the same bucket (next attempt is to happen in e.g., 5 seconds). At the same time, if it gets moved to the next bucket (next attempt e.g.,12 minutes later), it should not be duplicated by the ReaderTaskConsumer processing that next bucket, as the consumer will filter such a schedule out using the skip flag. 

  3. SSM parameter store parameters are regional and are not replicated out of the box. For this reason it is important to note that in reality each supported region has to have a dedicated parameter that is used by the service instances in that region. The regional parameters have to be kept in sync with each other.
    Moreover, it is necessary to explicitly set one of the regions as a takeover one for the failover. It can be done either within the parameter or potentially somewhere in the application configuration.
    Changing the parameters values can also be made automatic, leveraging an event based approach. This way, they can be kept in sync across regions automatically as well. However, this part of the architecture is out of scope of this article and will not be described here. We will only focus on what’s happening once the value is changed, informing the scheduler about the failover. 

  4. This additional ReaderTaskProducer instance starts its job by iterating over all the past buckets from e.g., a few days back until now, in order to create and feed the cache with the ReaderTasks for all of them. This way, we minimize the chance of any schedule to be lost as the takeover region should process them (if e.g., the SQS outage was the reason for the whole failover in the first place, the schedules that were to be executed between the outage and its detection might not be sent before the full region switch). As mentioned earlier, whenever a schedule is successfully sent, its GSI is removed, so such records will not be fetched from the database in the takeover region. 

  5. Whenever a given schedule is “moved” to a takeover region once failover happens, it is treated as it belonged to the takeover region from the very beginning, i.e. if retry policy is applied, next delivery attempts will happen in the latter region (unless that region becomes unavailable too). 

  6. Some delay can/should be applied here before starting the components in order to be extra sure that nothing gets sent twice. It is especially important to minimize the chance of sending the same schedule in multiple regions almost simultaneously because this may lead to data inconsistencies. As mentioned in the requirements section, it is okay to deliver the schedules later than expected when necessary, so delaying a few schedules due to the region failover/recovery is acceptable for us.  

Blog Comments powered by Disqus.