Reactive Event Sourcing benchmarks, part 1: PostgreSQL

This post starts a new series focused on benchmarking Event Sourcing solutions, described in the previous Reactive Event Sourcing in Java series, where we implement an Event Sourcing solution based on the Akka stack in the Java language. Some of my clients are afraid that Event Sourcing might be too slow for them and a classic approach to state persistence will be faster - nothing could be farther from the truth. It's time to validate the performance of an Event Sourcing application in several different business scenarios and hardware configurations.

Disclaimers

As with any performance test, the numbers are valid only for a given context. Your context will be different. You will have a different event model, different business logic, and a different hardware setup. The idea behind the tests below is to show you what you should expect from a reactive implementation. Our goal is to deliver a low latency system. What is the definition of a low latency solution? Opinions are divided. From my perspective, measuring latency is not enough: performance is also a matter of expected throughput and bearable costs. Let's assume that with an RDBMS as the Event Store, anything above 50 ms for the whole HTTP request processing will not be acceptable.

Setup

1. Application

A simple Event Sourcing application from a GitHub repo, based on Spring Webflux (controller layer) and Akka (application layer). The code is exactly the same as in the previous series, except that:

  • Java serialization is replaced by Protocol Buffers from Google;
  • dependencies are updated to the latest versions (Akka 2.6.19, Spring Boot 2.7.1);
  • for tests, I'm using only 2 endpoints which mutate the state:
    ▪️ CreateShow command
    ▪️ ReserveSeat/CancelSeatReservation commands

2. Runtime

  • everything except the database is dockerized and deployed on k8s
  • Java 17, -Xmx4G, -XX:ActiveProcessorCount=number_of_available_cpus
  • Hikari JDBC connection pool with max 10 connections
  • Gatling shared connections enabled

3. Hardware

  • a dedicated host for the application pods and a separate host for the Gatling performance test to avoid interference between pods
  • compact placement for GKE nodes
  • CPU power is not limited for a pod, but the pod will be limited by node configuration
    ▪️ for 2 CPU cores the allocatable CPU is 1930m
    ▪️ for 4 CPU cores the allocatable CPU is 3920m

4. Database

  • Google Cloud SQL with Postgres 11
  • 4 CPU cores, 8 GB memory

Serial reservation simulation

The first test scenario is very simple.

scenario("Create Show and reserve seats")
    .feed(showIdsFeeder)
    .exec(http("create-show") //1
        .post("/shows")
        .body(createShowPayload))
    // run the reserve-seat request once per seat number, one after another
    .foreach(randomSeatNums.asJava(), "seatNum").on(
        exec(http("reserve-seat") //2
            .patch("/shows/#{showId}/seats/#{seatNum}")
            .body(reserveSeatPayload)));

A virtual user creates a Show with 50 seats (//1) and then reserves the seats one by one, serially (//2). Why 50 and not 100 or 300 seats? TBH it doesn't matter. With 50 seats, I can produce more ShowCreated events. The ShowCreated event's payload after serialization is much bigger (1138 bytes) than the rest of the events (~50 bytes). A careful reader will notice that this test scenario is not very realistic because, in real life, many users will try to reserve seats for the same Show concurrently. That's true, so let's assume that in our business the chances of concurrent seat booking are pretty low. We will cover different business scenarios in a separate blog post.
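To see why the number of seats changes the event mix, here is a back-of-the-envelope sketch. The byte sizes and the 50-seat Show come from the scenario above; the class and method names are hypothetical, and I'm assuming that every request produces exactly one event and that all seats get reserved.

```java
// Rough sizing of the events written by one virtual user.
// Assumption: 1 ShowCreated event followed by one small event per reserved seat.
final class EventMix {
    static final int SHOW_CREATED_BYTES = 1138; // serialized ShowCreated (from the post)
    static final int SEAT_EVENT_BYTES = 50;     // approximate size of the other events

    /** Total serialized bytes written by one virtual user. */
    static long bytesPerUser(int seats) {
        return SHOW_CREATED_BYTES + (long) seats * SEAT_EVENT_BYTES;
    }

    /** Average serialized event size across the 1 + seats events of one user. */
    static double averageEventBytes(int seats) {
        return (double) bytesPerUser(seats) / (seats + 1);
    }

    public static void main(String[] args) {
        for (int seats : new int[] {50, 100, 300}) {
            System.out.printf("%d seats: %d bytes per user, %.1f bytes per event on average%n",
                    seats, bytesPerUser(seats), averageEventBytes(seats));
        }
    }
}
```

With fewer seats per Show, the big ShowCreated event makes up a larger share of the writes, so the average event gets heavier.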

2 CPU cores

Setup:

  • e2-standard-2 (2 CPU cores, 8 GB RAM)
  • -XX:ActiveProcessorCount=2

Results:

req/s   99th percentile (ms)   CPU (%)
500     5                      28
750     5                      42
1000    5                      55
1250    6                      67
1500    7                      80
1750    14                     90
2000    31                     88
2250    200+                   85

[Figure: 2 CPU cores results]

Each step takes 5 minutes and adds 250 req/s. With only 2 cores and 90% CPU utilization, we are able to get 14 ms latency (99th percentile) for 1750 req/s. Above that, we can observe a performance degradation. The database CPU usage is under 20% for the whole test.
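The ramp described above is easy to reason about with a small sketch. I'm assuming here that the ramp starts at 500 req/s (the first row of the table) and has 8 steps, and that the load generator actually hits every target rate. The class is hypothetical and is not the Gatling injection profile used in the tests.

```java
import java.time.Duration;

// Step-wise load ramp: each step lasts a fixed time and adds a fixed increment.
final class LoadSteps {
    /** Target rate (req/s) of the given zero-based step. */
    static int rateAt(int startRate, int increment, int step) {
        return startRate + increment * step;
    }

    /** Requests sent over the whole ramp, assuming every target rate is reached. */
    static long totalRequests(int startRate, int increment, int steps, Duration stepLength) {
        long total = 0;
        for (int step = 0; step < steps; step++) {
            total += rateAt(startRate, increment, step) * stepLength.toSeconds();
        }
        return total;
    }

    public static void main(String[] args) {
        Duration fiveMinutes = Duration.ofMinutes(5);
        for (int step = 0; step < 8; step++) {
            System.out.println("step " + step + ": " + rateAt(500, 250, step) + " req/s");
        }
        System.out.println("total requests: " + totalRequests(500, 250, 8, fiveMinutes));
    }
}
```

Under these assumptions a single run lasts 40 minutes and sends a few million requests, which is enough for the 99th percentile of each step to be meaningful.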

3 * 2 CPU cores

Setup:

  • 3 * e2-standard-2 (2 CPU cores, 8 GB RAM)
  • -XX:ActiveProcessorCount=2

Results:

req/s   99th percentile (ms)   app CPU (%)   DB CPU (%)
1000    6                      31            12
1500    7                      40            18
2000    7                      52            25
2500    9                      63            30
3000    12                     70            37
3500    16                     80            42
4000    30                     82            47
4500    80                     85            50

For the clustered (3 pods) deployment of our application, the best performance is close to 3000-3500 req/s with 16 ms latency. With more load, we are putting more pressure on the underlying database, which is why there is an additional column with the DB CPU usage percentage. A clustered setup is a bit tricky because, with 3 pods, there is about a 2-in-3 (~67%) chance of an additional network hop between pods for a given request. This of course affects response times. Based on some approximation, the additional cost of sharding could be anywhere from 2.5 to even 30 ms (99th percentile).
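Where does the chance of an extra hop come from? A minimal sketch, assuming that requests are spread evenly across pods and that entities are spread evenly across shards. The class name and the 2 ms hop cost in the example are made up for illustration, and the result is a mean, not a 99th percentile.

```java
// Probability of a remote hop in a cluster with uniform routing and shard placement.
final class ShardingHop {
    /** Chance that the entity for a request lives on a different node than the one that received it. */
    static double remoteProbability(int nodes) {
        if (nodes < 1) {
            throw new IllegalArgumentException("nodes must be >= 1");
        }
        return (nodes - 1) / (double) nodes;
    }

    /** Mean latency added per request if every remote hop costs hopMillis. */
    static double expectedExtraMillis(int nodes, double hopMillis) {
        return remoteProbability(nodes) * hopMillis;
    }

    public static void main(String[] args) {
        System.out.printf("3 nodes: %.1f%% of requests need a hop%n", 100 * remoteProbability(3));
        // 2 ms per hop is a hypothetical value, not a measurement
        System.out.printf("mean extra latency at 2 ms per hop: %.2f ms%n", expectedExtraMillis(3, 2.0));
    }
}
```

Note that the chance grows with the cluster size: with more pods, almost every request pays for the hop.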

[Figure: sharding cost]

That's a lot and the worst thing is that it grows with the generated load. Based on that experiment, you could say that Akka Sharding is pretty expensive and it's questionable if we should use it. The problem here is the CPU power. If we want to play with low latency, then 2 CPU cores are simply not enough to handle incoming HTTP requests, DB queries, and network calls between nodes. Let's add more cores.

4 CPU cores

Setup:

  • e2-standard-4 (4 CPU cores, 16 GB RAM)
  • -XX:ActiveProcessorCount=4

Results:

req/s   99th percentile (ms)   app CPU (%)   DB CPU (%)
500     4                      14            6
1000    4                      27            11
1500    4                      37            16
2000    5                      45            22
2500    7                      53            27
3000    8                      63            32
3500    12                     63            32
4000    20                     52            25
4500    70                     44            15

[Figure: 4 CPU cores results]

Above 3500 req/s, we can observe a small performance drop. Interestingly, a single 4-core machine can replace 3 nodes with 2 CPU cores each: at 3500 req/s, the 99th percentile latency is more or less the same (12 ms vs 16 ms). Vertical scaling is not always a bad idea, especially in the case of the JVM, which performs better with more cores. Another thing to notice is that CPU usage goes down in the last steps. We will get back to this observation in the conclusions.
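One way to compare the two setups is throughput per core at the same load. This is only a rough sketch with a hypothetical helper, and it ignores the fact that a node never gets all of its cores (1930m out of 2, 3920m out of 4).

```java
// Requests per second handled by each CPU core of a deployment.
final class PerCoreThroughput {
    static double reqPerSecPerCore(int reqPerSec, int nodes, int coresPerNode) {
        return (double) reqPerSec / (nodes * coresPerNode);
    }

    public static void main(String[] args) {
        // both setups handled 3500 req/s with a comparable 99th percentile (12 ms vs 16 ms)
        System.out.printf("1 x 4 cores: %.0f req/s per core%n", reqPerSecPerCore(3500, 1, 4));
        System.out.printf("3 x 2 cores: %.0f req/s per core%n", reqPerSecPerCore(3500, 3, 2));
    }
}
```

At the same load and similar latency, the single bigger machine gets more work out of every core.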

3 * 4 CPU cores

Setup:

  • 3 * e2-standard-4 (4 CPU cores, 16 GB RAM)
  • -XX:ActiveProcessorCount=4

Results:

req/s   99th percentile (ms)   app CPU (%)   DB CPU (%)
2000    5                      27            24
3000    7                      35            35
4000    8                      45            46
5000    9                      50            56
6000    12                     55            64
7000    20                     60            66
8000    30                     57            63

[Figure: 3 * 4 CPU cores results]

Now we are talking. The sweet spot for this setup is between 6000-7000 req/s with 13-20ms latency. The cost of an additional network hop is from 1 to 4 ms.

[Figure: sharding cost with 4 CPU cores]

My rough experiments show that we could stabilize this metric more if we replace Spring Webflux with Akka HTTP so that we don't have two different reactive worlds and translation between them. With traffic around 6000 req/s, we should expect 1-2 ms for sharding.

Conclusions

Scaling reads is a relatively easy topic. We just need some sort of a cache. It could be a simple in-memory structure or something fancier like Redis, Memcached, etc. Scaling writes, especially writes that must be consistent, is a different story. Consistency requirements very often force us to use optimistic (or pessimistic) locking. With relational databases, it's a straightforward task, but not all databases support such locking mechanisms. This article shows that you can address this problem with a different strategy. Thanks to the Actor Model, you can achieve pretty high throughput and low latency, using cheap hardware under the hood. I will repeat myself, but remember that you should run your own benchmarks against your specific domain and hardware. The results above look really good (for the given setup) partly because of the specific load generation, which more or less simulates a message queue in front of a backend service, and that is not always the case. Make sure that you read the second part of the article to see the results with a different load generation strategy. Generic benchmarks are just there to give you some approximations and set your expectations.
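To illustrate the "different strategy", here is a toy sketch of the single-writer idea behind the Actor Model. This is not Akka and there is no persistence: I'm using a plain JDK single-thread executor as a stand-in for an actor mailbox, and all names are hypothetical. The point is only that commands for one Show are processed one at a time, so no database lock is needed to keep the state consistent.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** A toy stand-in for an actor: all commands for one show run on one thread, in order. */
final class ShowMailbox implements AutoCloseable {
    private final ExecutorService mailbox = Executors.newSingleThreadExecutor();
    // touched only by the mailbox thread, so a plain HashSet is safe here
    private final Set<Integer> reservedSeats = new HashSet<>();

    /** Completes with true if the seat was free, false if it was already taken. */
    CompletableFuture<Boolean> reserveSeat(int seatNum) {
        return CompletableFuture.supplyAsync(() -> reservedSeats.add(seatNum), mailbox);
    }

    @Override
    public void close() {
        mailbox.shutdown();
    }

    /** Fires concurrent attempts at the same seat and returns how many of them succeeded. */
    static long raceForSeat(int contenders) {
        ExecutorService callers = Executors.newFixedThreadPool(8);
        try (ShowMailbox show = new ShowMailbox()) {
            List<CompletableFuture<Boolean>> results = new ArrayList<>();
            for (int i = 0; i < contenders; i++) {
                // each caller thread sends its command to the mailbox
                results.add(CompletableFuture
                        .supplyAsync(() -> show.reserveSeat(1), callers)
                        .thenCompose(f -> f));
            }
            // join() waits for every attempt before the mailbox is closed
            return results.stream().filter(CompletableFuture::join).count();
        } finally {
            callers.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("successful reservations: " + raceForSeat(100));
    }
}
```

No matter how many callers race for the seat, only one reservation succeeds, and the others get a clean rejection instead of a lock conflict.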

There is a direct correlation between database access time and HTTP response time. After some threshold, the database is our main point of contention. That's natural, and in the case of optimistic locking we would hit this threshold much faster. The additional logic required by Event Sourcing is irrelevant from the performance point of view (in this context). Actually, thanks to Event Sourcing and append-only writes (enforced by this approach), we can get very fast database write operations. It's worth noticing that an RDBMS-based Event Store can be surprisingly fast, even on moderate hardware. You might argue that testing only writes is not very realistic. It depends on what you want to test. With the Akka stack, once you load your Event Sourced Entity into memory (this process can be optimised and adjusted with snapshotting and passivation), reads from the write model are extremely fast. Everything is served directly from memory. Because no database access is required (well, you need to access the database once), adding reads to the test would skew the percentiles (in a good way) and the overall performance would be even better. The second reason to focus only on writes is that with Event Sourcing, you will naturally embrace the CQRS pattern. Based on the event stream, you can build any read model optimized for a specific query. For effective event streaming from your RDBMS Event Store, you could for example use CDC (Change Data Capture).
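The "load once, then serve from memory" idea can be sketched in a few lines. This is a simplified illustration and not the code from the repo: ShowCreated is mentioned above, but the other event names, the fields, and the ShowReplay class are my assumptions. The journal is a plain list standing in for the events read from the database.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Rebuilding the in-memory state of a Show by replaying its append-only event journal.
final class ShowReplay {
    sealed interface ShowEvent permits ShowCreated, SeatReserved, SeatReservationCancelled {}
    record ShowCreated(String showId, int seats) implements ShowEvent {}
    record SeatReserved(int seatNum) implements ShowEvent {}              // hypothetical event name
    record SeatReservationCancelled(int seatNum) implements ShowEvent {}  // hypothetical event name

    /** In-memory state of one show; reads are answered from here without touching the database. */
    static final class Show {
        private int seats;
        private final Set<Integer> reserved = new HashSet<>();

        void apply(ShowEvent event) {
            if (event instanceof ShowCreated created) {
                seats = created.seats();
            } else if (event instanceof SeatReserved r) {
                reserved.add(r.seatNum());
            } else if (event instanceof SeatReservationCancelled c) {
                reserved.remove(c.seatNum());
            }
        }

        int availableSeats() {
            return seats - reserved.size();
        }
    }

    /** The one-time database read: fold all stored events into the current state. */
    static Show replay(List<ShowEvent> journal) {
        Show show = new Show();
        journal.forEach(show::apply);
        return show;
    }

    public static void main(String[] args) {
        Show show = replay(List.of(
                new ShowCreated("show-1", 50),
                new SeatReserved(1),
                new SeatReserved(2),
                new SeatReservationCancelled(1)));
        System.out.println("available seats: " + show.availableSeats());
    }
}
```

Snapshotting shortens exactly this replay step: instead of folding the whole journal, you start from a stored state and apply only the events written after it.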

An interesting observation is that when the load is too heavy for a given setup, the CPU is not overloaded. Actually, the usage goes down.

[Figure: reactive application overloaded]

Thanks to the reactive approach, we are consuming our precious resources more efficiently. We can't serve more mutation requests, and some of them will time out instead of getting a response, but the application is still responsive. We can, for example, handle GET requests that do not require any DB access.

Can we squeeze more out of this setup? Probably yes, since the CPU usage was close to 60% (for the 4-core machines). Such fine-tuning is very time-consuming and I would like to focus more on different use cases. In the next article, we will rerun all these tests with a more realistic business scenario, where many users will try to reserve seats for the same Show. We will try to impose more contention on our application layer. Stay tuned!

For the record, yes, I know that the 99th percentile is not enough. I should use 99.9 or 99.99. At the same time, if we want to focus on higher percentiles, we need better hardware. It's always a dance between throughput, latency, and costs.

Running performance tests without the help from SoftwareMill's experienced DevOps developers would be an extremely tedious task. Big credits to Grzegorz Kocur and Marek Rośkowicz for Kubernetes lab environment setup.
