
Things I wish I knew when I started with Event Sourcing - part 1

Michał Ostruszka

28 Jun 2024, 14 minutes read


Over the last several years, I have had a great chance to learn and work with various services built on event sourcing. Sure, there were times when our relationship was more “hate” than “love”, and there were times when event sourcing was overused or at least not needed.

After all these years, I see it as a really powerful, useful and definitely underappreciated tool in an engineer’s toolbox. Still, it’s just another approach: it is meant to help with some relatively narrow aspects of our engineering work, not a silver bullet that promises to solve all the project issues and challenges we face.

In this upcoming series, I’d like to take you through the lessons I’ve learned so far, as well as the findings, realizations, and surprises that caught me over these years of using event sourcing in the systems I was building. In this article, I’ll describe some of the fundamental strengths of this technique and how I learned to appreciate them. Later on, I’ll write more about the challenges and a few gotchas I experienced. With that said, let’s go!

Time-traveling superpowers

Let’s start with probably the most obvious and most popular trait of event sourcing, and usually the main argument of people trying to “sell” it.

You only appreciate the real power of time traveling when you have to troubleshoot and investigate a bug that somehow sneaked into production. Suddenly, for some users, your service goes bananas: they either cannot access a specific feature or it displays garbage data.

You put your detective hat on and first check all the metrics and system health checks - all good there. Then a quick look at the recent logs, but again, there is nothing suspicious other than messages saying that something is wrong with this feature for a given user.

In “traditional” systems, with state-snapshot-based persistence, the next thing you usually do is something like SELECT * FROM accounts WHERE user_id = 123 and see what’s inside, in the hope of getting an answer like “Oh, the state is incorrect.” Maybe you immediately spot what’s wrong and know how to fix it with UPDATE accounts SET … WHERE user_id = 123, but what if that’s not the case? How do you figure out why, how, and when the entity landed in this incorrect state?

The only thing you have at hand is the current state snapshot and maybe, if you’re lucky (and somewhat log-paranoid like me), a bunch of old logs. But sometimes that may not be enough; you are flying blind, and here’s where the investigation gets tricky. It’s not to say it’s impossible - we’ve all been there; usually, there is a way to fix this; it may just be a bit more difficult to connect all the pieces.

If your state persistence is built on event sourcing instead, you suddenly have way more context and data for the investigation. You have all the facts that happened to the entity over time, facts that caused all its state changes, time-ordered, with all the details baked in.

Then you can start traveling back in time: replay all the events one by one up to a certain moment (e.g., 2 weeks ago) and see the resulting state back then. Is it ok? Then replay some more until you end up in the wrong state. Is it already the wrong one? Go back and pinpoint the event (or events) that caused the state corruption in question.

Together with the sequence of a few more events before and after, you can build the full picture of what happened and why. Congratulations! You now know what most likely caused the issue. Fixing the place in the code where your logic is incorrect (maybe an unsupported edge case?) should be relatively easy now.
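To make the replay mechanics concrete, here is a minimal sketch in Scala of what “rebuilding the state as of 2 weeks ago” boils down to. The account events and state are made up for this example; the point is that the state at any past moment is just a fold over the events recorded up to that moment.

```scala
import java.time.Instant

// Hypothetical event and state types - your real entity will have its own.
sealed trait AccountEvent { def occurredAt: Instant }
final case class Deposited(amount: BigDecimal, occurredAt: Instant) extends AccountEvent
final case class Withdrawn(amount: BigDecimal, occurredAt: Instant) extends AccountEvent

final case class AccountState(balance: BigDecimal)

// The same function the entity uses to apply a single event to its state.
def applyEvent(state: AccountState, event: AccountEvent): AccountState = event match {
  case Deposited(amount, _) => state.copy(balance = state.balance + amount)
  case Withdrawn(amount, _) => state.copy(balance = state.balance - amount)
}

// "Time travel": fold only the events that happened up to the chosen moment.
def stateAt(events: Seq[AccountEvent], upTo: Instant): AccountState =
  events
    .takeWhile(e => !e.occurredAt.isAfter(upTo))
    .foldLeft(AccountState(balance = BigDecimal(0)))(applyEvent)
```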

This whole troubleshooting story sounds a bit like a Hollywood one, doesn’t it? It’s not all roses, though, and there are a few things I haven’t mentioned yet for the full picture. How exactly do you apply this time traveling to production data? How do you inspect the intermediate states if the only thing you have is a sequence of events, sometimes even in a binary, not human-readable form (more on that later in the series)? And finally, how do you fix the bug for the users already impacted - do you change the past? Let’s go through these one by one.

Invest in and build your project tooling

How can you inspect the state after each event in sequence for troubleshooting purposes if there is no concept of “persistent” or “materialized” state in event sourcing?

Basically, you do the same drill your entity does when replaying its event stream for command handling. But while the entity replays all the events from the beginning to the very end in order to reach the latest state to handle the command with, you want to stop earlier: before or after some particular event, before or right after some defined timestamp, etc.

Such a devtool is usually the first one I build when working with event-sourced services, and it’s also one of the first ones I reach for when troubleshooting these kinds of issues. Typically, it’s some kind of command-line tool or an easy-to-run program living next to my production code, so I can replay events the same way my production entities do, just limiting the sequence of events fed into it. I run the tool against production data either via a dump of a given entity’s event stream loaded into a local database or by accessing a read-only production database if possible.

It takes the faulty entity id and either an event sequence number or a timestamp up to which to replay the event stream as arguments, and it uses the same code as production entities to fetch, decode, and apply events. Once it reaches the limit, it simply prints out the current entity state on the console. Sometimes I add one more feature that, after reaching a certain event, applies the next one in the stream only “on key press”, printing out the state after each event. It’s like writing your own little debugger for your entities.

Even if your events are persisted in a binary format, by making use of the real event-handling machinery, you don’t have to deal with that fact explicitly or decode them manually. With such a tool in place, you can easily travel in time and see what the exact entity state was, e.g. 2 weeks ago, and how it changed after every event. It’s just reading a bunch of records from the database; no side effects are invoked when applying events, so you can do as many trials as you need to get to the faulty event.
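For illustration, the core of such a devtool can be surprisingly small. The sketch below is deliberately generic: you plug in the same decode and apply functions your production entity uses, together with the raw events fetched from the store for the faulty entity (how you load them depends on your event store). The StoredEvent shape and all the names here are assumptions made for this example.

```scala
import scala.io.StdIn

// A raw record as it might come out of the event store: position in the stream plus an opaque payload.
final case class StoredEvent(seqNr: Long, payload: Array[Byte])

// Replays events up to `upToSeqNr`, printing the rebuilt state after each one.
// With `step = true` it waits for Enter before applying the next event - a tiny "debugger" for the stream.
def debugReplay[S, E](
    events: Seq[StoredEvent],    // the faulty entity's events, already fetched from the store
    emptyState: S,               // the entity's initial state
    decode: Array[Byte] => E,    // the same codec production uses (protobuf, JSON, ...)
    applyEvent: (S, E) => S,     // the same event handler production uses
    upToSeqNr: Long,
    step: Boolean
): S =
  events.takeWhile(_.seqNr <= upToSeqNr).foldLeft(emptyState) { (state, stored) =>
    val next = applyEvent(state, decode(stored.payload))
    println(s"state after event #${stored.seqNr}: $next")
    if (step) StdIn.readLine("press Enter to apply the next event...")
    next
  }
```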

Later on, the project’s dev toolbox usually grows bigger and bigger, and some specific, ad-hoc views/projections are built to quickly answer common questions. There are more tools around entities that help with daily work, especially with troubleshooting, but the one for time traveling and debugging event streams is usually the first one I build.

Correcting the past

Now that you have identified the issue, it’s time to fix it. You know the drill: make a code change, write and run a bunch of tests, wait for the CI pipeline to finish, and ship it. But that’s only half of the story. You now only have correct behavior for commands coming into the system from now on, and for the new events that will be generated in response to them.

What about entities that were already impacted by the bug and have these faulty or simply incorrect events in their streams? It’s no longer as simple as issuing an UPDATE statement on a bunch of database rows because there is no actual state. Remember, you cannot change the past: events are immutable facts you should not alter.

But actually, what’s the rationale behind that? At first sight, it sounds like a good approach - overwrite the faulty event with a proper one in the event store, and you’re good to go. First, if you do that, you have to make sure that the new version of the event can actually be applied properly to the corresponding state, meaning the event makes logical sense in the sequence. But that’s the easier part; this one can usually be verified. The less obvious and usually invisible gotcha is the “locality” of this manual intervention.

Aside from the writing part, event-sourced systems also have so-called read sides: parts of the code that consume the streams of events written by your entities and act on them. The actions they take range from building views/projections on the data (e.g., reports, graphs), through building specialized read models in external databases required for more or less critical business decisions, to regular event handlers explicitly calling other services or publishing messages (or “public events”) to queues to drive other parts of the system.

All the downstream components rely on the logical consistency of your event streams and on the fact that if an event is saved, it’s there forever, unchanged. That means all the reports and read models have already consumed that event and applied it on their side (probably resulting in wrong figures - maybe that’s how you found out about the original issue in the first place). All the event handlers have consumed the event as well, calling other systems or publishing messages relying on this event’s data. Once they’re done handling an event, they move along and never go back, so they won’t even notice you manually changed one of the past events in the stream.

While you “fixed” the event itself, you didn’t fix all the parts that were impacted by this wrong event. Sure, you can try to alter or rebuild your read models manually and re-trigger other service calls with proper payloads, but are you sure you know and control all the downstream consumers of your events? Also, what about data consistency in case of possible audits?

Another strong selling point of event sourcing and its event log is the ease of auditability, which you get basically for free. What if you’re in a strongly regulated environment or just need to be audited regularly? Such data inconsistencies introduced manually will pop up sooner or later during the audit, and you’ll have to deal with the rather unpleasant consequences. Allowing updates to past events breaks any trust an auditor could put in the event store. Immutability is a hard requirement of any such log-type data, especially if it needs to comply with regulations.

Don’t get me wrong - overwriting events is not something you can never do. There are services and cases where it may be totally ok to do that, where there are just a few downstream consumers you fully control, they’re easy to correct, as are the read models, the system is not “regulated”, etc. I myself have done it a few times so far - quite a stressful experience, doing surgery on history, but it worked out well in all those cases.

Anyway, the right and most elegant way of handling such issues is to be explicit about the correction. That means issuing a healing/compensating command to the impacted entity that, as a result, saves a compensating event correcting whatever was wrong with the state before.

Put whatever you need in the event payload: corrective values, a reference to the event/action you’re correcting - it’s up to you. Yes, you can argue it’s a form of overengineering, as it usually takes some time and needs new code to make it happen (and it’s sometimes one-off code), but hey, it’s all about tradeoffs. Let’s see what you get in exchange if you do so.
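As a sketch of what that could look like (all the names here - CorrectBalance, BalanceCorrected, the account state - are invented for the example), the compensating command goes through the entity like any other command and leaves a regular event in the stream:

```scala
import java.time.Instant

// A one-off corrective command and the compensating event it produces (other commands/events omitted).
final case class CorrectBalance(accountId: String, amount: BigDecimal, reason: String, faultySeqNr: Long)

sealed trait AccountEvent
final case class BalanceCorrected(
    amount: BigDecimal,
    reason: String,      // e.g. a short note on the bug being compensated
    correctsSeqNr: Long, // explicit reference to the event being corrected
    occurredAt: Instant
) extends AccountEvent

final case class AccountState(balance: BigDecimal)

// Handled like any other command: validate against the current state, emit an event (or reject).
def handle(state: AccountState, cmd: CorrectBalance): Either[String, AccountEvent] =
  if (state.balance + cmd.amount < BigDecimal(0)) Left("correction would make the balance negative")
  else Right(BalanceCorrected(cmd.amount, cmd.reason, cmd.faultySeqNr, Instant.now()))

// And applied like any other event when (re)building the state.
def applyEvent(state: AccountState, event: AccountEvent): AccountState = event match {
  case BalanceCorrected(amount, _, _, _) => state.copy(balance = state.balance + amount)
}
```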

By publishing this new type of healing/compensating event, you first make the fact of the correction explicit: an event is persistently recorded in the event stream, so there are no more inconsistency threats in case of an audit.

As a consequence, you somehow “force” all the read models and event handlers to deal with this new event explicitly. Sure, they may choose to ignore it, but at least they need to acknowledge they got it. They may handle it by, e.g., correcting their report figures, sending corrected messages downstream, etc. - it’s up to them.

You made your correcting decision explicit, stating a fact that happened to the entity, and whoever was interested in this entity’s state changes had to choose the right way (in their context) to react to it.

On the other hand, it may be that your entity logic is perfectly fine and the bug you encountered lies exclusively on the read side, e.g. some reports generate wrong data, some views don’t display what they should, etc. There is usually a relatively easy and safe way to fix these (putting aside the inherent complexity of figuring out the change itself).

Remember, your event streams are the single source of truth, and you can rebuild state at any moment in time just by replaying events and looking at the resulting state from a perspective of your choice. This is exactly how read models are powered: they simply replay event streams and interpret incoming events the way they want.

So you simply need to fix your faulty projection (read-side) and let it regenerate from the very beginning (or from some later point before the issue happened). Sure, it may take some time, depending on the number of events to process, and you may need to figure out how to do a graceful switchover if needed, but basically, it’s as simple as that.
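A rough sketch of such a rebuild, assuming a hypothetical daily-deposits report and some way to read the whole event stream from the beginning (loadAllEvents below is a stand-in for whatever your event store offers):

```scala
import java.time.{Instant, LocalDate, ZoneOffset}

// A hypothetical domain event and a read-side projection keeping daily deposit totals.
final case class Deposited(amount: BigDecimal, occurredAt: Instant)

final case class DailyDepositsReport(totals: Map[LocalDate, BigDecimal]) {
  // The (now fixed) projection logic: interpret one event and update the report.
  def handle(e: Deposited): DailyDepositsReport = {
    val day = e.occurredAt.atZone(ZoneOffset.UTC).toLocalDate
    copy(totals = totals.updated(day, totals.getOrElse(day, BigDecimal(0)) + e.amount))
  }
}

// Rebuild: throw away the old (wrong) report and re-run the projection over the whole history.
def rebuild(loadAllEvents: () => Iterator[Deposited]): DailyDepositsReport =
  loadAllEvents().foldLeft(DailyDepositsReport(Map.empty))(_.handle(_))
```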

By having the entire history of changes in your entities, you’re free to replay it either from the start or from some point in time every time you want.

Architecting for tomorrow’s questions

Speaking of event streams being the source of truth and the potential of virtually “free replays”, there is a quote that struck me some time ago. It’s by Anita Kvamme, and I first saw it quoted by Oskar Dudycz (does it count as a second-level meta-quote now?).


It basically shows the greatest (at least to me) strength of event sourcing in one sentence. I’ve had cases several times when, after some time of the service running in production, a request came from the so-called “product team”, asking if we could tell this or that based on the activity and actions in the service by then. They usually wanted some specialized view of the data, e.g., how many, how much, how often, what’s the sum, total, average, trend, etc.

At first glance, it all looks like a task for regular business metrics you should have baked into your system regardless of how it’s built. And that’s totally right; some of these aspects can be read from metrics. But the thing is, you cannot anticipate all the possible metrics that you will need in the future.

You can only code against what you know at the moment. You cannot guarantee that every business-important change has a corresponding metric in place, as metrics are just optional extra steps, not necessary for the process to work. With event sourcing, on the other hand, just to support the business flow as is, you need to save all the information required to transition from one state to another, in the form of an event stream.

That means you potentially capture all the relevant information about the changes, which in turn means you can view that data from various angles, depending on the business needs later on. Even if some reports were not envisioned before, as long as you have the data, it’s a matter of building a new read model or projection that answers the questions asked by focusing on certain aspects of the data.
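For example, a question like “how many large withdrawals do we see per month?”, which nobody asked for at design time, can be answered with a one-off projection over the historical events. Again, the event types below are invented for the sketch:

```scala
import java.time.{Instant, YearMonth, ZoneOffset}

// Hypothetical events that were recorded to support the business flow,
// not with this particular report in mind.
sealed trait AccountEvent { def occurredAt: Instant }
final case class Deposited(amount: BigDecimal, occurredAt: Instant) extends AccountEvent
final case class Withdrawn(amount: BigDecimal, occurredAt: Instant) extends AccountEvent

// An ad-hoc read model answering a question asked months after go-live.
def largeWithdrawalsPerMonth(events: Iterator[AccountEvent], threshold: BigDecimal): Map[YearMonth, Int] =
  events
    .collect { case Withdrawn(amount, at) if amount > threshold => YearMonth.from(at.atZone(ZoneOffset.UTC)) }
    .toSeq
    .groupBy(identity)
    .map { case (month, occurrences) => month -> occurrences.size }
```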

This exact thing has happened to me many times so far: sometimes there were ad-hoc reporting requests from the business that we could satisfy with the data from events; other times, new read-side event handlers were built to push a subset of our historical data to external systems for some new business feature, etc.

Generally speaking, the longer the service lives and the more data you have, the higher the chance that this data will be useful for making new business decisions that you didn’t even think of when building the service initially.

Summary

That’s it for now. In the first, somewhat lengthy part, I focused on some not-so-obvious features of event-sourced systems when it comes to the way the state is represented there. I have also mentioned the ability to travel in time, focusing on how to do it and what this fact means to engineers, for troubleshooting and bugfixes, and to product people, for growing the business by getting more insights into the data.

In the next part, I’ll focus on practices and some gotchas of modeling event-sourced systems - basically, how to increase your chances of not shooting yourself in the foot while building one. In part 3, I’ll examine the possible event persistence options available, with consistency being one of the key factors.

Stay tuned!

Reviewed by: Bartłomiej Żyliński, Krzysztof Ciesielski 
