Building a multi-regional, highly available scheduler with AWS

Modern distributed systems often operate across multiple regions to ensure high availability and low latency. These systems frequently require tools to schedule HTTP requests for future execution, enabling workflows like notifications, deferred API calls, or time-based triggers. In a multi-regional setup, such a scheduler must address challenges like consistency and failover resilience. This article outlines how to build a highly available, multi-regional scheduler using AWS services, ensuring an at-least-once delivery policy for critical workflows.
There are also some additional requirements for this tool to fulfill:
- Ability to send thousands of requests each second with second-level accuracy (in a happy-path scenario; it is okay to deliver them a bit later in case of a service outage).
- At-least-once delivery is a must. We also have to aim for best-effort exactly-once delivery.
- Sending each request in the region where it was created (cost savings + easier data consistency) - our target endpoints may be regional, i.e., the scheduler needs to be aware of the region it should send the request to.
- Ability to perform a region failover - we need to be able to drain traffic coming out of the scheduler in one region and switch it fully to a different one in order to deliver all schedules there (i.e., even though given schedules were created in us-east-1, they should be delivered in us-east-2 during the region failover). A region failover does not mean killing all the service instances in a given region.
Why not use the EventBridge Scheduler?
We surely thought about it at first, as it seemed like an out-of-the-box solution which would suit our use case, but we encountered the following issues when looking into it more deeply:
- HTTP endpoints as scheduler targets are not supported. Lambda could be used here as a mid-layer, which is not that big an issue, but surely adds some complexity to the architecture
- Most importantly though, EventBridge scheduler schedules are not replicated and kept in sync between the regions, hence it does not support region failover out of the box. Schedules that were created in a given region would have to be delivered there too.
All in all, while analyzing the EventBridge Scheduler approach against our requirements, we came to the conclusion that we would need to add a lot of puzzle pieces next to this AWS service. This might make the whole thing very complicated and potentially flaky. Considering that, it would be difficult to call it an out-of-the-box solution. A custom tool, on the other hand, would give us more flexibility and the ability to extend the list of supported features according to our needs in the future. This is why we decided to build our own scheduler and deploy it following a multi-AZ and multi-regional setup. The next sections describe the high-level architecture of this idea.
Persistence and a simple API
First of all, we want to persist schedules in the database and have all the information there for easier management and audit.
We decided to go for DynamoDB, as this NoSQL database gives us good performance and scalability, and if we plan our data structure correctly, we should be able to live without performing complex queries.
Additionally, we chose the DDB global table feature to leverage the active-active replication across our regions to support region failover (more about it later).
In order to avoid hot partitions, the scheduleId is used as the partition key. The value can be generated on each schedule creation.
Since we have a database that stores all of the schedules, we can now expose an API that allows performing CRUD operations on them.
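To make this data model a bit more concrete, here is a minimal sketch (Python with boto3) of what the create operation could look like; the table name, attribute names and statuses are assumptions for the example rather than the actual production schema (the bucket GSI attribute described later would be set here as well):

```python
import uuid
from datetime import datetime, timezone

import boto3

# Table and attribute names below are illustrative, not the actual production schema.
table = boto3.resource("dynamodb").Table("schedules")


def create_schedule(target_url: str, payload: str, next_attempt: datetime, region: str) -> str:
    """Persist a new schedule; a random scheduleId as the partition key spreads the load evenly."""
    schedule_id = str(uuid.uuid4())
    table.put_item(
        Item={
            "scheduleId": schedule_id,
            "targetUrl": target_url,
            "payload": payload,
            "nextAttemptTimestamp": next_attempt.astimezone(timezone.utc).isoformat(),
            "region": region,  # the region where the schedule was created and should be sent
            "status": "PENDING",
            "attempts": 0,
        }
    )
    return schedule_id
```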

Cache
We could basically end this article here by creating a Sender component which would simply query the DynamoDB table every second (this would require having a dedicated GSI (Global Secondary Index) value to retrieve all schedules which are to be executed in a given second) and send them to their target endpoints.
As simple as it may sound, this is not the greatest idea out there, as calling a DynamoDB table every second would put a huge load on it and would increase the overall cost of this architecture. It seems much smarter to bucketize the schedules into, e.g., 10-minute chunks, preload them into a cache earlier and, when the right time comes, look up the cache and send what needs to be sent.

Sender fetching schedules directly from a database vs. Sender using a cache layer.
Additionally, this idea gives us all the more obvious benefits of using a cache, like decreased latency, lower database costs, better scalability, etc.
It is important to note that these 10-minute buckets need to be regional, i.e., they will contain schedules from a given region exclusively.

Two buckets covering the same time frame in two different regions contain different regional schedules.
What service can be used as the cache layer?
We had to reject the idea of using an in-memory cache pretty quickly, as our service is supposed to work in a multi-region and, more importantly from the caching perspective, multi-instance (within one region) setup, hence the cache needs to be shared across the regional instances for synchronization (i.e., we don't want all the schedules to be sent by all the instances, as might happen with an independent cache per instance).
At first, it seemed that using ElastiCache would be the best choice we had, but we actually decided to go for an SQS queue instead. This (probably not obvious) choice can be explained by the following reasons:
- Messages in the SQS queue are persistent (having the same with ElastiCache with Redis would also be possible but would require some more puzzle pieces). This means that once a schedule ends up on the CacheQueue, we can trust the at-least-once delivery policy there and be sure that the schedule will eventually be processed by the Sender.
- Having an SQS queue as a cache lets us have more than one Sender without the need to synchronize them.
- We can leverage the queue's retry policy if a Sender unexpectedly fails for whatever reason.
- SQS has a feature called message timers, which we decided to make use of in order to be able to execute schedules at a given point in time even though they were cached earlier (a minimal sketch of this follows below).
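To make the message-timer idea concrete, here is a rough sketch of caching a schedule this way; the queue URL and the message shape are assumptions for the example:

```python
import json

import boto3

sqs = boto3.client("sqs")
# Placeholder URL for the regional CacheQueue.
CACHE_QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/cache-queue"


def put_on_cache(schedule: dict, delay_seconds: int) -> None:
    """Cache a schedule on SQS so that it becomes visible only when it is due to be sent."""
    sqs.send_message(
        QueueUrl=CACHE_QUEUE_URL,
        MessageBody=json.dumps(schedule),
        # SQS message timers accept at most 900 seconds (15 minutes) of delay.
        DelaySeconds=min(max(delay_seconds, 0), 900),
    )
```

A message posted with DelaySeconds stays invisible to consumers until the delay elapses, which is exactly the behaviour the Sender relies on. Note the hard 900-second cap: with 10-minute buckets preloaded only a few minutes ahead of time, the required delay always fits within it.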

Sender
Nothing very fancy here: the Sender component simply polls for messages from the CacheQueue and, once schedules appear, they get sent. Multiple Senders can be used in order to increase performance (we will use just one for simplicity). After a successful delivery, the schedules are marked as sent (their GSI is removed and the status is changed) and stored in the database.
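A stripped-down Sender iteration could look roughly like the sketch below (Python with boto3 and the requests library; attribute names and the lack of error handling are simplifications for illustration):

```python
import json

import boto3
import requests

sqs = boto3.client("sqs")
table = boto3.resource("dynamodb").Table("schedules")
CACHE_QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/cache-queue"


def run_sender_once() -> None:
    """Poll the CacheQueue once and deliver whatever has become visible."""
    response = sqs.receive_message(
        QueueUrl=CACHE_QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20
    )
    for message in response.get("Messages", []):
        schedule = json.loads(message["Body"])
        # Deliver the schedule to its target endpoint.
        requests.post(schedule["targetUrl"], data=schedule["payload"], timeout=10)
        # Mark the schedule as sent: drop the GSI attribute and flip the status.
        table.update_item(
            Key={"scheduleId": schedule["scheduleId"]},
            UpdateExpression="REMOVE gsi SET #s = :sent",
            ExpressionAttributeNames={"#s": "status"},  # "status" is a DynamoDB reserved word
            ExpressionAttributeValues={":sent": "SENT"},
        )
        # Only now remove the message from the queue, so a crash before this point
        # leads to a redelivery (at-least-once) rather than a lost schedule.
        sqs.delete_message(
            QueueUrl=CACHE_QUEUE_URL, ReceiptHandle=message["ReceiptHandle"]
        )
```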
Reader
The high-level job of this component is to “wake up” at a specified time, read all the schedules forming a given bucket from the database, and load them onto the CacheQueue (calculating the delay first).
The actual shape of the component boils down to the following:

Reader component built from the ReaderTaskProducer, a FIFO queue, and the ReaderTaskConsumer
ReaderTaskProducer
The only responsibility of this component is to wake up every ten minutes (a bit earlier than a given bucket starts - so if we are considering the 12:10 AM - 12:20 AM bucket, it can wake up at, e.g., 12:07 AM), create a ReaderTask for the given bucket and put it onto a FIFO queue. A ReaderTask is a simple command which instructs the Reader that a given regional bucket should be loaded from the database and put onto the cache.
This is the right time to emphasise the fact that our service may have multiple instances within a given region (or even AZ) working simultaneously. Our job is to make sure that each bucket/schedule does not get stored in cache multiple times.
The whole purpose of cutting the Reader into two components (ReaderTaskProducer + ReaderTaskConsumer) is to put an SQS FIFO queue in between. In order to achieve the single ReaderTask per bucket on the queue, the SQS FIFO deduplication mechanism has to be used (A bucket id which is the same across the service instances is used as the deduplicationId).
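A sketch of what producing such a ReaderTask could look like; the queue URL, the bucket-id format and the MessageGroupId choice are assumptions for the example:

```python
import json
from datetime import datetime, timezone

import boto3

sqs = boto3.client("sqs")
# Placeholder URL for the ReaderTask FIFO queue.
READER_TASK_QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/reader-tasks.fifo"


def produce_reader_task(region: str, bucket_start: datetime) -> None:
    """Emit one ReaderTask per regional 10-minute bucket; FIFO deduplication drops the copies."""
    bucket_id = f"{region}#{bucket_start.astimezone(timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ')}"
    sqs.send_message(
        QueueUrl=READER_TASK_QUEUE_URL,
        MessageBody=json.dumps({"bucketId": bucket_id}),
        MessageGroupId=region,
        # Every instance computes the same bucket id, so duplicates sent within the
        # deduplication window are silently discarded by SQS.
        MessageDeduplicationId=bucket_id,
    )
```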
ReaderTaskConsumer
This part of the Reader polls for messages from the FIFO queue, and whenever a ReaderTask message occurs there, it will be processed by only one ReaderTaskConsumer (to be more precise - ReaderTaskConsumers from other service instances will not process the record; the SQS VisibilityTimeout feature has been used to achieve that)¹.
Reading the provided footnotes is not necessary to get a high-level understanding of this architecture. However, some concepts are described there more thoroughly, so they definitely help with getting the full picture.
When it comes to the processing itself, the component queries the database using the GSI to retrieve all the schedules from a given bucket. It then calculates the delay before putting each of them onto the CacheQueue.
GSI shape

```
nextAttemptTimestamp: 2025-10-01T17:16:04Z
gsi: us-east-1#2025-10-01T17:10:00Z
```

The nextAttemptTimestamp has been “rounded” down to the nearest 10 minutes in order to form the GSI, which represents the 10-minute-long buckets.
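For illustration, here is a rough sketch of how such a bucket key could be derived and then used to query the index (Python with boto3; the index name gsi-index and the omission of pagination are assumptions for the example):

```python
from datetime import datetime, timezone

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("schedules")
BUCKET_MINUTES = 10


def bucket_gsi_value(region: str, next_attempt: datetime) -> str:
    """Round the attempt time down to the nearest 10 minutes and prefix it with the region."""
    utc = next_attempt.astimezone(timezone.utc)
    rounded = utc.replace(minute=utc.minute - utc.minute % BUCKET_MINUTES, second=0, microsecond=0)
    return f"{region}#{rounded.strftime('%Y-%m-%dT%H:%M:%SZ')}"


def fetch_bucket(region: str, bucket_start: datetime) -> list[dict]:
    """Fetch all pending schedules belonging to one regional 10-minute bucket."""
    response = table.query(
        IndexName="gsi-index",  # assumed index name
        KeyConditionExpression=Key("gsi").eq(bucket_gsi_value(region, bucket_start)),
    )
    return response["Items"]  # pagination via LastEvaluatedKey omitted for brevity
```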
Retries
What if a given schedule’s delivery fails for whatever reason? Following best practices, a retry policy is applied.
In order to support retries, the Sender component bumps the schedule’s attempts counter, calculates the next delivery time (exponential backoff with jitter can be used here) and, based on that, updates the GSI value of the record (if needed) and stores it in the database. This way, the schedule gets moved to a different bucket and will be processed by the upcoming bucket iterations².
However, if the maximum number of attempts is exhausted, the schedule is marked as skipped and an alert gets raised (we will not focus on how the alerting mechanism was done within this article).
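Putting the retry handling together, a simplified sketch could look as follows; it reuses the hypothetical bucket_gsi_value helper from the sketch above, and the base delay and attempt limit are illustrative, not the production values:

```python
import random
from datetime import datetime, timedelta, timezone

import boto3

table = boto3.resource("dynamodb").Table("schedules")
BASE_DELAY_SECONDS = 30  # illustrative
MAX_ATTEMPTS = 10        # illustrative


def schedule_retry(schedule: dict) -> None:
    """Bump the attempt counter and move the schedule to the bucket of its next attempt."""
    attempts = schedule["attempts"] + 1
    if attempts > MAX_ATTEMPTS:
        # Attempts exhausted: mark as skipped and take the record out of the bucket index.
        table.update_item(
            Key={"scheduleId": schedule["scheduleId"]},
            UpdateExpression="REMOVE gsi SET #s = :skipped",
            ExpressionAttributeNames={"#s": "status"},
            ExpressionAttributeValues={":skipped": "SKIPPED"},
        )
        return
    # Exponential backoff with full jitter.
    delay = random.uniform(0, BASE_DELAY_SECONDS * 2 ** attempts)
    next_attempt = datetime.now(timezone.utc) + timedelta(seconds=delay)
    table.update_item(
        Key={"scheduleId": schedule["scheduleId"]},
        UpdateExpression="SET attempts = :a, nextAttemptTimestamp = :t, gsi = :g",
        ExpressionAttributeValues={
            ":a": attempts,
            ":t": next_attempt.isoformat(),
            ":g": bucket_gsi_value(schedule["region"], next_attempt),
        },
    )
```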
Flow Summary
- ReaderTaskProducers create ReaderTasks and store them onto the SQS queue. Only one message actually ends up there.
- One ReaderTaskConsumer processes the ReaderTask.
  a. The bucket gets fetched from the DynamoDB table.
  b. A delay for each of the schedules is calculated.
  c. All schedules are put onto the CacheQueue using the message timers feature.
- When the delay time elapses, the Sender reads a given schedule and sends it to a target endpoint.
  a. In a successful scenario, the schedule gets marked as delivered.
  b. In case of a failure, the retry policy is applied.
Region Failover
As mentioned at the beginning of this article, one of the main challenges of this solution was to support the ability to “switch off” a given region at any point in time and deliver all the schedules from this moment on in one of the other regions (target endpoints are regional).
It is important to emphasise that this architecture leverages background tasks which do all the work. This means that it is not enough to route all the CRUD API calls to the service to another region, as the schedules that were created earlier will still be sent in the potentially faulty region.
Hence, to fully drain traffic from a given region, all of the scheduler instances in the failovered region must pass the turn-off command to all of their components, so that they stop doing their job.
A similar thing applies to the takeover region (the region which will send the schedules from the failovered region during the outage) - service instances' components need to be aware of a mode switch.
We decided to store the information about the current region failover status in an SSM Parameter Store parameter and poll for its value changes to detect the region failover³.
Let’s start with the architecture changes to understand how to support this requirement:

FailoverPoller
This component is responsible for polling the SSM parameter value, and once the change is detected, it passes that information further to the Orchestrator.
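A minimal polling loop could be as simple as the sketch below; the parameter name and the Orchestrator callback are assumptions for the example:

```python
import time

import boto3

ssm = boto3.client("ssm")
# Placeholder name for the regional failover parameter.
FAILOVER_PARAMETER = "/scheduler/failover-status"


def poll_failover_status(orchestrator, interval_seconds: int = 30) -> None:
    """Poll the SSM parameter and notify the Orchestrator whenever its value changes."""
    last_value = None
    while True:
        value = ssm.get_parameter(Name=FAILOVER_PARAMETER)["Parameter"]["Value"]
        if value != last_value:
            orchestrator.on_failover_status_changed(value)
            last_value = value
        time.sleep(interval_seconds)
```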
ReaderTaskProducer
As mentioned earlier, the buckets which are being fetched from the database are regional. A ReaderTask is a command which tells the ReaderTaskConsumer which bucket is supposed to be processed. In case of a region failover, the ReaderTaskProducer needs to stop working, so that no more ReaderTasks for this region are fed to the FIFO queue. This way, the schedules will not be put onto the cache and, hence, will not be sent from this region.
When it comes to the takeover region, an additional ReaderTaskProducer instance can be launched next to the default one, in order to generate the commands for the failovered region. In return, the ReaderTaskConsumer will put the schedules onto the cache for the Sender to process, to support the full region switch⁴.
Sender
The Sender needs to receive a set of regions from which it should process (send) the schedules. By default, the component is supposed to process schedules from the region where it operates, but once the failover happens, it should stop sending immediately. Other than that, it should purge all the remaining messages from the CacheQueue (those schedules will be picked up in the other region), so that they don’t get sent again once the failover is over and the Sender starts consuming from the queue in the default mode again.
The Sender in the recovery region, on the other hand, simply has to be told to send the schedules from its original region plus the failovered region⁵.
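Building on the earlier Sender sketch, the mode switch could boil down to filtering (and purging) messages by the schedule’s region; active_regions is whatever set the Orchestrator last handed to the Sender, and deliver stands for the normal delivery path:

```python
import json

import boto3

sqs = boto3.client("sqs")
CACHE_QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/cache-queue"


def handle_cached_message(message: dict, active_regions: set[str], deliver) -> None:
    """Send only schedules from the regions this Sender currently owns; purge the rest."""
    schedule = json.loads(message["Body"])
    if schedule["region"] not in active_regions:
        # The schedule belongs to a region this Sender no longer owns (it is, or will
        # be, handled by the takeover region), so drop the message instead of sending it.
        sqs.delete_message(
            QueueUrl=CACHE_QUEUE_URL, ReceiptHandle=message["ReceiptHandle"]
        )
        return
    deliver(schedule)  # the same delivery-and-mark-as-sent path as in the normal mode
```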
Orchestrator
In order for the ReaderTaskProducers and Sender to work in a specific mode that can be switched at runtime, an Orchestrator component should be introduced.
The Orchestrator owns the Sender and ReaderTaskProducer component instances. This way, once the failover information is passed to the Orchestrator, it can switch off the working components and potentially start new ones in a different mode.
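A condensed sketch of such an Orchestrator; the component interfaces (stop(), set_active_regions(), the producer factory) and the JSON shape of the parameter value are assumptions made for illustration, and only the failover direction is shown (recovery would be symmetrical):

```python
import json


class Orchestrator:
    """Owns the background components and switches their modes on failover events."""

    def __init__(self, home_region: str, sender, producer_factory):
        self.home_region = home_region
        self.sender = sender
        # producer_factory(region) creates and starts a ReaderTaskProducer for that region.
        self.producer_factory = producer_factory
        self.producers = {home_region: producer_factory(home_region)}

    def on_failover_status_changed(self, raw_value: str) -> None:
        # Assumed parameter shape, e.g. {"evacuatedRegion": "us-east-1", "takeoverRegion": "us-east-2"}.
        status = json.loads(raw_value)
        evacuated, takeover = status.get("evacuatedRegion"), status.get("takeoverRegion")
        if evacuated == self.home_region:
            # This region is being drained: stop producing tasks and stop sending.
            producer = self.producers.pop(evacuated, None)
            if producer:
                producer.stop()
            self.sender.set_active_regions(set())
        elif takeover == self.home_region and evacuated:
            # This region takes over: produce tasks and send for both regions.
            self.producers[evacuated] = self.producer_factory(evacuated)
            self.sender.set_active_regions({self.home_region, evacuated})
```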
Region failover example
Let’s assume that the us-east-1 region needs to be switched off because of some outage (or simply for testing, which by the way is a good practice too).
1. SSM regional parameters are changed. us-east-1 is evacuated and us-east-2 is promoted to the recovery region.
2. FailoverPoller detects the parameter value change and passes the information to the Orchestrator.
3. us-east-1:
   3.1 Orchestrator switches off the ReaderTaskProducer which was generating the us-east-1 ReaderTasks. It also tells the Sender to purge the messages related to us-east-1 schedules to drain the queue.
   3.2 No more schedules are fed to the cache.
   3.3 All the schedules that were already in the cache get purged whenever they become visible to the Sender, hence nothing gets sent from this region.
4. us-east-2:
   4.1 Orchestrator starts a new ReaderTaskProducer instance which will be responsible for the us-east-1 buckets. It also tells the Sender to send schedules from the us-east-1 and us-east-2 regions⁶.
   4.2 ReaderTaskProducers generate the ReaderTasks related to the us-east-1 (starting, e.g., a couple of days earlier) and us-east-2 regions.
   4.3 ReaderTaskConsumers fetch buckets from the us-east-1 and us-east-2 regions and feed the cache with all these schedules.
   4.4 The Sender sends all the schedules from the two regions.
Region recovery example (failover is switched off)
1. SSM regional parameters are changed. Life goes back to normal. FailoverPoller detects the parameter value change and passes the information to the Orchestrator.
2. us-east-1:
   2.1 Orchestrator starts the default ReaderTaskProducer and tells the Sender to start sending schedules from us-east-1⁶.
   2.2 ReaderTasks for us-east-1 get generated (starting, e.g., a couple of days earlier), and the schedules end up in the cache.
   2.3 Schedules are sent.
3. us-east-2:
   3.1 Orchestrator stops the ReaderTaskProducer instance which was responsible for the us-east-1 buckets. It also tells the Sender to stop processing us-east-1 schedules and send the ones from us-east-2 exclusively.
   3.2 Only the us-east-2 ReaderTasks get generated, and these schedules are fed to the cache.
   3.3 The Sender processes the us-east-2 schedules. From this moment on, all the us-east-1 schedules already in the cache are purged from the queue.
The solution described above has been implemented and deployed successfully to the production environment, demonstrating its ability to handle desired workloads with reliability and efficiency. It ensures the expected scheduling precision, high availability, and fault tolerance, meeting the demands of the whole system.
Alternative Approaches and Considerations
Some of the decisions we had to make while building this architecture were perfect examples of the rule that applies to designing any system - trade-offs everywhere…
Talking about trade-offs, we looked at this architecture idea before implementing it and tried to think of some improvements. First of all, we thought that it would be nice to decouple the service even more by using existing AWS services instead of implementing all of the components as a single ECS task. Additionally, the components which leverage polling seemed suboptimal at a glance as well.
This made us think of an alternative - the EventBridge Scheduler with a Lambda target could also be used to build the Reader component (instead of ReaderTaskProducer + SQS FIFO + ReaderTaskConsumer), but in order to support the region failover the Lambda would have to run in a specific mode, based on the SSM parameter value (it might need to call the SSM API to retrieve the value every time it is run). The same thing applies to the Sender component, which could also be replaced by a Lambda. For this reason we decided that having an Orchestrator which owns the sub-components and can easily switch their modes is the choice we want to follow.
On the other hand, it should be possible to almost seamlessly replace the ReaderTaskConsumer with a Lambda, and this is something we are considering changing in the future.
Anyway, we would love to hear some ideas on how to improve this architecture or how to solve the whole problem with a completely different approach!
We also encourage you to explore our AWS services. Perhaps there is room for a brilliant cooperation.
Reviewed by: Rafał Maciak, Paweł Stępniewski, Jacek Centkowski
Notes
1. The message will be processed by just one ReaderTaskConsumer in a happy-path scenario. In case something goes wrong and the VisibilityTimeout elapses, it will be picked up by another instance, and it is possible for the messages to actually end up on the CacheQueue multiple times. However, the service is supposed to do its best to send everything exactly once, but the at-least-once delivery principle is the major one.
2. In fact, the logic needs to be a bit different for the schedules which are to be retried very soon. Whenever the next attempt is within or exactly 15 minutes from now (15 minutes is the maximum delay supported by SQS), they are put onto cache directly and stored to DynamoDB with a skip flag set to true. This way a given schedule should always end up in cache if it still belongs to the same bucket (next attempt is to happen in, e.g., 5 seconds). At the same time, if it gets moved to the next bucket (next attempt, e.g., 12 minutes later), it should not be duplicated by the ReaderTaskConsumer processing that next bucket, as the consumer will filter such a schedule out using the skip flag.
3. SSM Parameter Store parameters are regional and are not replicated out of the box. For this reason it is important to note that in reality each supported region has to have a dedicated parameter that is used by the service instances in that region. The regional parameters have to be kept in sync with each other. Moreover, it is necessary to explicitly set one of the regions as a takeover one for the failover. It can be done either within the parameter or potentially somewhere in the application configuration. Changing the parameter values can also be made automatic, leveraging an event-based approach. This way, they can be kept in sync across regions automatically as well. However, this part of the architecture is out of scope of this article and will not be described here. We will only focus on what’s happening once the value is changed, informing the scheduler about the failover.
4. This additional ReaderTaskProducer instance starts its job by iterating over all the past buckets from, e.g., a few days back until now, in order to create and feed the cache with the ReaderTasks for all of them. This way, we minimize the chance of any schedule being lost, as the takeover region should process them (if, e.g., an SQS outage was the reason for the whole failover in the first place, the schedules that were to be executed between the outage and its detection might not be sent before the full region switch). As mentioned earlier, whenever a schedule is successfully sent, its GSI is removed, so such records will not be fetched from the database in the takeover region.
5. Whenever a given schedule is “moved” to a takeover region once failover happens, it is treated as if it belonged to the takeover region from the very beginning, i.e., if the retry policy is applied, the next delivery attempts will happen in the latter region (unless that region becomes unavailable too).
6. Some delay can/should be applied here before starting the components in order to be extra sure that nothing gets sent twice. It is especially important to minimize the chance of sending the same schedule in multiple regions almost simultaneously, because this may lead to data inconsistencies. As mentioned in the requirements section, it is okay to deliver the schedules later than expected when necessary, so delaying a few schedules due to the region failover/recovery is acceptable for us.

