Things I wish I knew when I started with Event Sourcing - part 2, consistency
In the first part of the series, we went through some great strengths of event sourcing: time traveling and the ability to answer virtually any analytical question related to the service with the event stream at hand. Here, let’s first look at some common beliefs about consistency guarantees when using event sourcing and then contrast them with facts.
The deal with consistency
It’s common to see eventual consistency mentioned in one go with CQRS and specifically event sourcing. It’s usually taken for granted that event sourced systems are eventually consistent: there is a write side that adds events to the storage, and there is a whole separate read side consuming those events, where particular read models can be built per use case, with different storage options depending on the requirements and the case at hand. That’s obviously not a bad setup, as it gives certain benefits: a dedicated database type for some reads, isolation, scaling reads independently from writes, etc. But as we know, there is no universally best choice - it’s all about tradeoffs, deciding which less important traits you’re willing to give up to gain the more important ones. But is eventual consistency the only option? What if you don’t want to, or can’t, deal with it for some reason?
The truth is that eventual consistency, and specifically a separate database for reads, is in no way a strict requirement for event sourced systems (or for the CQRS approach in general). Not only do you not need separate databases, you can also achieve strict consistency for your read side if needed. And again, there is no single best approach to that; rather, there are different possible options depending on what you design for.
Speaking of eventual consistency, you may experience it even with one database for both reads and writes! If your writes go through a primary node (or leader, coordinator, etc., you name it) but you read from one of the asynchronously updated replica nodes - boom, your reads are now eventually consistent, even with a single logical database.
Let’s see how to achieve strict consistency when needed, what we have to trade off for it, and how we can progressively move towards eventual consistency, by discussing a range of potential solutions. We’ll only talk about read side consistency aspects and how to deal with them, assuming writes (command handling) are always consistent, e.g. a single ACID-compliant node handling write operations exclusively. We’ll hopefully talk about scaling writes beyond that limitation later on.
On-demand reads based on event stream
We already know (from the way we handle commands) how to get the latest state: replay and apply all events from a given stream, then process the command with that state at hand. If your requirement is to always generate replies derived from the most up-to-date state, you may decide to do this exact same thing on the query side: serve queries with state built on demand at query time, using event streams from your primary storage - the same database you write events to. That way you always end up with the most up-to-date state, rebuilt from the most recent event stream.
Of course, there are some gotchas to this approach. In practice, it’s limited to cases where queries span a single aggregate or only a few of them, and their event streams are not too long (what these two mean depends on your specific requirements). Otherwise, it may put a lot of pressure on both the database and the application, because they need to fetch lots of events, replay them, and compute all the states before replying. As a result, these queries may not perform well, and there is a significant risk of them harming write performance when a bunch of such reads happen at the same time.
A somewhat classic example here may be a BankAccount entity that tracks the account balance. If for some reason you require current balance queries to always be up to date, this is one way to do that - just be aware of all the tradeoffs.
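To make this more concrete, here is a minimal sketch in Scala of what serving such a query could look like. All names here (BankAccountEvent, EventStore, BalanceQueries) are made up for illustration, not taken from any particular library:

```scala
// Minimal sketch: the current balance is rebuilt from scratch on every query,
// so it always reflects the latest persisted events.
sealed trait BankAccountEvent
final case class Deposited(amount: BigDecimal) extends BankAccountEvent
final case class Withdrawn(amount: BigDecimal) extends BankAccountEvent

trait EventStore {
  // Reads the full event stream of a single account from the primary storage
  def loadStream(accountId: String): List[BankAccountEvent]
}

final class BalanceQueries(store: EventStore) {
  def currentBalance(accountId: String): BigDecimal =
    store.loadStream(accountId).foldLeft(BigDecimal(0)) {
      case (balance, Deposited(amount)) => balance + amount
      case (balance, Withdrawn(amount)) => balance - amount
    }
}
```

The price is visible right in the code: every query fetches and folds the whole stream, which is fine for short streams and painful for long ones.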
Additionally, this requirement of queries (especially queries from the user interface) being based on the most current state is actually quite weak - after all, by the time a response reaches the client, the data it carries is potentially stale anyway; it might have already changed.
Transactional read models
I’m not sure if there is a better name for this one, but bear with me until I explain it. One of the drawbacks of the approach above is that you need to replay all the events on every query request. Yes, you may say that there are snapshots, but there are still events past the most recent snapshot that you need to replay. Depending on the query, the number of entities it spans, their stream sizes, etc., this may have a significant performance and latency impact. It would be great to keep that data pre-calculated but still updated immediately whenever the entity state changes (new events are added). Sounds like a plain old read model, right? Indeed, but with a slight twist. While the “classic” read model approach is seen as eventually consistent because a separate process manages it, here we update the read model in the same transaction in which we save our events, and we use this consistent read model to serve queries.
This has the obvious advantage of keeping read models strictly up to date without the need to replay events for queries. Using the BankAccount example above, you may e.g. have an account_balances table that holds and always correctly reflects the current accounts’ state. You also gain more flexibility when building read models - the data may aggregate many entities without the need to fetch all their events every time, etc.
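Sketched with plain JDBC, the idea could look roughly like this. The table names, columns, and the reduction of the event payload to a single numeric delta are assumptions made for brevity:

```scala
import java.sql.Connection

// Sketch only: append the new event AND update the account_balances read model
// in one database transaction, so the read model can never lag behind.
final class TransactionalAccountRepository(connection: Connection) {
  def saveDeposit(accountId: String, amount: BigDecimal): Unit = {
    connection.setAutoCommit(false)
    try {
      val insertEvent = connection.prepareStatement(
        "INSERT INTO events (stream_id, event_type, payload) VALUES (?, 'Deposited', ?)")
      insertEvent.setString(1, accountId)
      insertEvent.setBigDecimal(2, amount.bigDecimal)
      insertEvent.executeUpdate()

      val updateReadModel = connection.prepareStatement(
        "UPDATE account_balances SET balance = balance + ? WHERE account_id = ?")
      updateReadModel.setBigDecimal(1, amount.bigDecimal)
      updateReadModel.setString(2, accountId)
      updateReadModel.executeUpdate()

      connection.commit() // both writes become visible atomically
    } catch {
      case e: Exception =>
        connection.rollback() // a bug in the projection also fails the command
        throw e
    }
  }
}
```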
Drawbacks? Of course there are drawbacks; here are a few of them. First, the solution is limited to event stores that support strong transactional semantics. Then, extending the transaction beyond a simple event insert to handle additional updates may have a performance impact, especially in write-heavy scenarios; if you decide to go this way, you’d better limit this approach to critical read model updates only. The more operations you bundle together in one transaction, the longer it takes to commit and the higher the chance something goes south. Next, the success of handling a command and storing the corresponding events now strictly depends on whether your read model logic is correct and does not panic: a bug in your projection update logic can cause the entire transaction to be rolled back, which effectively fails your command processing and the storing of new events. Additionally, you can no longer scale the write and read sides independently. Also be aware that not all event sourcing frameworks are flexible enough to give you the ability to build transactional read models and save projection updates together with persisting events - e.g. Akka/Pekko in its persistence extension doesn’t allow for that.
Serve reads from DB replicas
Now we’re entering eventual consistency territory, where reads are not guaranteed to be always up to date with the write side. As said above there are at least a few flavors we can choose from.
For read-heavy systems you may want to scale reads aggressively without necessarily scaling writes, but maybe for some reason the “classic” asynchronous read model is a no-go due to unacceptable delays in data consistency - e.g. your read models are updated by regular polling of the event store for new events, and this naturally introduces too much delay. If you’re ok with some delay but want it to be as small as possible, there is a variation of the approach above: you still keep updating your projections transactionally during command handling, but you dedicate database replica nodes to serve reads from. Replication in modern database engines is usually quick enough to meet these needs. While eventually consistent, this approach attempts to reduce the inconsistency window to a minimum.
There is another, more dynamic flavor of it though. For some time after a write (usually a few seconds, just to be fully covered) you keep serving strictly consistent queries from the primary node, and you switch to replicas once they have caught up with the data after these few seconds. This one is more complicated from the application perspective, as you need to keep track of the replication state and control the switch moment, but it can practically close the inconsistency gap (barring major issues), giving the benefits of both strict consistency and read scalability in some scenarios.
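A rough sketch of that switching logic could look as follows. Tracking a single per-entity “last write” timestamp is a simplification made for illustration; a real setup might track replication lag or log positions instead:

```scala
import java.time.{Duration, Instant}

// Sketch: route reads to the primary right after a write, to replicas afterwards.
final class ReadRouter(replicationGrace: Duration = Duration.ofSeconds(5)) {
  private val lastWrite = scala.collection.concurrent.TrieMap.empty[String, Instant]

  // Called by the command handling side after a successful write
  def recordWrite(entityId: String): Unit =
    lastWrite.put(entityId, Instant.now())

  // Called by the query side to decide which node to hit
  def pickNode(entityId: String): String = {
    val recentlyWritten = lastWrite.get(entityId)
      .exists(t => Duration.between(t, Instant.now()).compareTo(replicationGrace) < 0)
    if (recentlyWritten) "primary" else "replica"
  }
}
```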
Version check on read
One more interesting option that may mitigate some drawbacks of eventual consistency is to rely on version checks when querying. For this to work, clients need to be aware of it and work in tandem with our backend - here is how it works. On every successful command we return the new entity version number to the client. The version number is usually an ever-increasing sequence number, bumped by 1 every time a new event is saved. It’s also useful for conflict detection on the write side, but here we simply return it to the client, saying e.g.: I processed your command and now your BankAccount entity has version 42. From now on, whenever the client queries the entity state, it includes the last version number it knows about (here, 42). If it happens that the read side has not yet caught up with the event assigned version 42 (say it has only seen up to 41), we simply detect that and return e.g. a Retry-After header until it catches up and the versions match, at which point we return a proper response.
This approach is useful if our reads span one or only a few entities and we can easily keep track of their version numbers. We could try relying on some kind of global sequence number for all events in all streams, but that would mean that a potential inconsistency of one entity would impact queries on another, potentially unrelated one. Sometimes, though, this may be exactly what you want - e.g. for aggregating reports whose consistency you track across all streams - but then, instead of returning only a single entity version, you’d also have to return the corresponding “global” sequence number. Whether it’s useful and worth doing is up to your specific use case at hand.
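A minimal sketch of such a version-aware query handler might look like this - the ReadModel trait and the result types are made up for illustration; in an HTTP API the NotYetCaughtUp case would typically translate into a Retry-After response:

```scala
// The view stored by the projection, including the last applied version
final case class AccountView(accountId: String, balance: BigDecimal, version: Long)

sealed trait QueryResult
final case class Ok(view: AccountView) extends QueryResult
final case class NotYetCaughtUp(retryAfterMillis: Long) extends QueryResult

trait ReadModel {
  def find(accountId: String): Option[AccountView]
}

final class VersionCheckedQueries(readModel: ReadModel) {
  // expectedVersion is the version number the client received
  // in the reply to its last command (e.g. 42).
  def accountState(accountId: String, expectedVersion: Long): QueryResult =
    readModel.find(accountId) match {
      case Some(view) if view.version >= expectedVersion => Ok(view)
      // The projection has not seen the client's latest event yet:
      // ask the client to retry shortly.
      case _ => NotYetCaughtUp(retryAfterMillis = 200)
    }
}
```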
Embrace eventual consistency
It’s no secret that if you want to scale beyond a certain point, eventual consistency is your friend. With enough load, performance needs, and data size, you may reach the limits of a single database with strict consistency (or fully synchronous replication). Of course, these are the problems everyone would like to face, while most of us don’t in reality - I assume here that these are real needs coming from project growth and success, and not just us frantically overengineering things for no good reason.
Eventual consistency gives a lot of benefits: from independent scaling of reads and writes, through isolating both sides completely in terms of performance, availability, and technologies used, to the ability to build an unlimited number of read models dedicated to specific tasks, each with dedicated storage.
Sometimes it seems like strict consistency is absolutely required, but asking the right questions of stakeholders or product owners during business process discovery can reveal that it’s in fact not the case and you can do just fine with eventual consistency. Maybe embracing it can even generate interesting business opportunities? On the other hand, maybe instead of chasing strict consistency at any price, it’s cheaper and good enough business-wise to deal with inconsistencies by resolving them manually at the business level? For example, in a web shop you may consider invalidating client orders when you’re out of stock and sending a gift card instead. Depending on how often that happens, it may be the right choice for the business.
Sometimes, though, approaching the same problem from another angle may surface options for using eventual consistency locally while staying consistent to the outside world.
To illustrate that, let me pick yet another classic example: ensuring uniqueness of some data across the domain. In our BankAccount entities we want owners to have a secondary account number associated with each account (e.g. the owner can only withdraw and transfer to this account), and for some reason we want this secondary account number to be unique across our users - no two users can have the same secondary account number assigned.
Let’s ignore our inner selves asking whether the domain here is modeled the right way or whether there is something fishy (hint: there probably is) - it’s just an example, so bear with me for a while.
In a CRUD world we’d probably build a secondary_accounts table linked with accounts by id, with a unique constraint set up on the secondary_account_number field, and we’d be good to go. Every insert or update to the secondary_accounts table would be subject to a uniqueness check at the DB level, and we’d be fully covered thanks to the strict consistency part of the ACID guarantees.
Things change quite a bit when event sourcing is at play. Now our consistency boundaries are set around a particular account, and because the only thing we save during command handling are the resulting events, we’re not able to ensure global uniqueness at the level of command handling just from the event streams themselves.
What we can do instead is handle the process of the user requesting a change of the secondary account number in two steps. First, we accept the command (granted it’s syntactically valid and all other conditions are met) and emit an event like NewSecondaryAccountNumberAssigned, turning this account into a “change pending” mode. Next, on the read side we have a regular database table with a unique constraint like the one above - secondary_accounts - into which the read side, listening for NewSecondaryAccountNumberAssigned events, attempts to store the account numbers.
Handling the initial command and processing its event on the read side
Handling uniqueness is a piece of cake now: if adding this secondary number succeeds, it means it’s not yet taken, and we issue a command like ConfirmSecondaryAccountNumberChange (or, more in a notification-like style, SecondaryAccountNumberAvailable) back to our entity, which, as a result of handling this message, emits SecondaryAccountNumberChanged. From now on, the secondary account number is considered valid, unique, and ready to be used.
Insert successful, no duplication. Send OK command/message
Otherwise, if the insert fails because of a unique constraint violation, meaning the account number is already taken, we issue a command RejectSecondaryAccountNumberChange (or, again as a notification, SecondaryAccountNumberDuplicated) that results in an event like SecondaryAccountNumberChangeRejected. In this case we invalidate the pending change in the entity and probably inform the user about that fact.
Insert failed, unique constraint violated. Send NOT OK command/message
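Putting the read-side part of this process together, a sketch of the projection could look roughly like this. The command bus, the event and command shapes, and the table layout are assumptions; the command and event names mirror the ones used in the text:

```scala
import java.sql.{Connection, SQLException}

final case class NewSecondaryAccountNumberAssigned(accountId: String, number: String)

sealed trait SecondaryAccountCommand
case object ConfirmSecondaryAccountNumberChange extends SecondaryAccountCommand
case object RejectSecondaryAccountNumberChange extends SecondaryAccountCommand

trait CommandBus {
  def send(accountId: String, command: SecondaryAccountCommand): Unit
}

// Assumes a table created roughly as:
//   CREATE TABLE secondary_accounts (
//     account_id TEXT PRIMARY KEY,
//     secondary_account_number TEXT UNIQUE NOT NULL
//   )
final class SecondaryAccountProjection(connection: Connection, commands: CommandBus) {
  def handle(event: NewSecondaryAccountNumberAssigned): Unit = {
    val insert = connection.prepareStatement(
      "INSERT INTO secondary_accounts (account_id, secondary_account_number) VALUES (?, ?)")
    insert.setString(1, event.accountId)
    insert.setString(2, event.number)
    try {
      insert.executeUpdate()
      // Insert succeeded, so the number was free: confirm the pending change.
      commands.send(event.accountId, ConfirmSecondaryAccountNumberChange)
    } catch {
      case _: SQLException =>
        // Insert failed: treat it as "number already taken" and reject.
        commands.send(event.accountId, RejectSecondaryAccountNumberChange)
    }
  }
}
```

In a real implementation you’d inspect the vendor-specific error code to make sure the failure really was a unique constraint violation and not, say, a dropped connection.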
We can also use this projection as a preliminary uniqueness validation, e.g. to improve UX and prevent users from submitting already taken account numbers, but because of eventual consistency we need to keep the entire process above in place - this validation can only minimize, not entirely eliminate, the chances of a duplicate slipping through.
Yes, it’s quite a dance and quite a lot of code to write, as it requires implementing this entire process, but hey, it’s there if you need it. Should it be used blindly and everywhere, like event sourcing or virtually any other tool? Absolutely not. It’s all about tradeoffs - you give up something in one place to gain something in another. Know your options and make well-informed decisions based on the actual drivers at hand.
One more thing: if you feel uneasy about all the events we emitted above just to handle a simple uniqueness scenario, keep in mind that not all of them have to be public. Event sourcing is a local approach to state management, and it’s your choice how you communicate with the outside world. Maybe you want other systems to know about every single attempt to change a secondary account number, or maybe you want to signal the change only once you’re sure it was ultimately applied. That’s probably a topic for another chapter of the series, covering public and private events, which I hopefully will write soon.
Recap
As you can see, it’s not true that event sourcing (and CQRS) is all and only about eventual consistency. Sure, it enables it and makes great use of its advantages in the right cases, but it doesn’t mandate it. As we’ve seen above, there are ways to build event sourced systems with strict consistency if required. But again: you have tools at hand, and it’s your job to know the strengths and weaknesses of each and choose wisely.
Read more about Event Sourcing
In the next part we’ll look at possible storage options for our event sourced systems. Which one to choose and when. Stay tuned!
Reviewed by: Krzysztof Ciesielski, Marcin Baraniecki