
Reactive Event Sourcing benchmarks, part 2: PostgreSQL


Let's continue our benchmarks from the first part. This time, we will focus on a more pessimistic load generation, pessimistic in the sense that our Gatling test scenario will try to simulate concurrent seat reservations of the same Show aggregate (the domain is described in another series). Previously, in our scenario, a virtual user created a Show and then reserved seat after seat from this Show. This simulates a load where the chance of a concurrent reservation is low and the major contention point was the underlying Event Store based on the PostgreSQL database. From a real-life perspective, it is a good way to simulate the load when there is a message queue (partitioned by showId) in front of the backend service. Generally, we should avoid using loops to increase the generated load; the recommended way is to launch more virtual users.

It's time to check how the Actor System behaves under a different load and put more pressure on the application layer. The testing scenario will be divided into two parts. We will create all available Shows as the first step and then launch the proper load, where many concurrent virtual users will try to reserve seats (and cancel the reservations). Before that, we need to explain what ‘concurrent’ means. Even for extremely popular shows, the chances that two (or more) different users will launch the same request (or requests mutating the same aggregate) at exactly the same time (with millisecond precision) are relatively low. With a 50 req/s load, all requests could be spread across 1000 milliseconds. In a perfect case, one request every 20 milliseconds. If the response times are below 20 ms, e.g. 5 ms, we would not observe any contention and the load would be more linear (serial) than concurrent. That's of course not very realistic, but launching 50 req/s at the same millisecond is also wrong. My point is that we need to carefully design the load generation: not too optimistic and not too pessimistic. The best case is to somehow record the traffic from production, get statistics, and try to simulate the same with the load generator. Otherwise, we need to make some assumptions.
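
As a reference for how such a rate is expressed, below is a minimal sketch of Gatling's open injection model. It assumes the usual static imports from io.gatling.javaapi.core.CoreDsl; scn, httpProtocol, and the numbers are placeholders, not the exact setup used in these tests.

// Inside a Gatling Simulation class; scn and httpProtocol are placeholders.
setUp(
        scn.injectOpen(
                constantUsersPerSec(50).during(300) // 50 users/s spread evenly: one new user every 20 ms
                // appending .randomized() spreads them at random intervals instead of a fixed cadence
        )
).protocols(httpProtocol);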

The first part is very easy.

private List<String> showIds = IntStream.range(0, howManyShows)
            .mapToObj(__ -> UUID.randomUUID().toString()).toList();

Iterator<Map<String, Object>> showIdsFeeder = showIds.stream().map(showId -> Collections.<String, Object>singletonMap("showId", showId)).iterator();

    ScenarioBuilder createShows = scenario("Create show scenario")
            .feed(showIdsFeeder)
            .exec(http("create-show")
                    .post("/shows")
                    .body(createShowPayload)
            );
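
The createShowPayload body is not shown above; a minimal sketch of what it could look like is below. The field names are only an assumption for illustration, the real Show API may expect a different shape.

// Hypothetical request body (io.gatling.javaapi.core.Body); the real payload may differ.
Body createShowPayload = StringBody("""
        {
          "showId": "#{showId}",
          "title": "Show #{showId}",
          "maxSeats": 50
        }
        """);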

After launching the code above, our system will create all available Shows (10000 to be precise). Because 10000 Shows are not enough to run the simulation for a long time, we will reserve all the seats and then cancel all the reservations in a sort of loop controlled by a feeder (with a circular strategy).

ScenarioBuilder reserveSeatsOrCancelReservation = scenario("Reserve seats or cancel reservation")
    .feed(listFeeder(reserveOrCancelActions).circular())
    .doSwitch("#{action}").on(
    Choice.withKey(RESERVE_ACTION, tryMax(5).on(exec(http("reserve-seat") 
                                                     .patch("/shows/#{showId}/seats/#{seatNum}")
                                                     .body(reserveSeatPayload)))),
    Choice.withKey(CANCEL_RESERVATION_ACTION, tryMax(5).on(exec(http("cancel-reservation")
                                                                .patch("/shows/#{showId}/seats/#{seatNum}")
                                                                .body(cancelReservationPayload))))
);
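
Both phases can be chained with andThen, so the reservation load starts only after all Shows have been created. A minimal sketch, with assumed rates and durations rather than the exact values used in these benchmarks:

// create all Shows first, then fire the mixed reserve/cancel load (rates and durations are placeholders)
setUp(
        createShows.injectOpen(constantUsersPerSec(200).during(howManyShows / 200))
                .andThen(
                        reserveSeatsOrCancelReservation.injectOpen(
                                constantUsersPerSec(1000).during(600)
                        )
                )
).protocols(httpProtocol);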

The feeder itself is the most complex part.

final AtomicInteger counter = new AtomicInteger();

List<Map<String, Object>> reserveOrCancelActions = showIds.stream()
    // split the show ids into groups of requestsGroupingSize elements
    .collect(Collectors.groupingBy(it -> counter.getAndIncrement() / requestsGroupingSize))
    .values().stream()
    .flatMap(showIds -> {
        log.debug("generating new batch of seats reservations for group size: " + showIds.size());
        // for each group: all shuffled reservations first, then all shuffled cancellations
        List<Map<String, Object>> showReservations = prepareActions(showIds, RESERVE_ACTION);
        List<Map<String, Object>> showCancellations = prepareActions(showIds, CANCEL_RESERVATION_ACTION);
        return concat(showReservations.stream(), showCancellations.stream());
    })
    .toList();

private List<Map<String, Object>> prepareActions(List<String> showIds, String action) {
    List<Map<String, Object>> showReservations = showIds.stream()
        .flatMap(showId ->
                 IntStream.range(0, maxSeats).boxed()
                 .map(seatNum -> Map.<String, Object>of("showId", showId, "seatNum", seatNum, "action", action)))
        .collect(Collectors.toList());
    Collections.shuffle(showReservations);
    return showReservations;
}

For 1000 req/s, Gatling will spread the requests across 1 second = 1000 milliseconds, so 1 request each millisecond. Strictly speaking, I should use the term virtual users per second, but since each user fires only one request, it's the same as requests per second. The Show aggregate is configured to have 50 seats (as in the previous post). If we generate 50 reservations for a given Show, all of them will be launched within a 50-millisecond window. That's too pessimistic; we need to group and shuffle them with requests that mutate other aggregates. Our feeder for 1000 req/s and requestsGroupingSize = 5 will reduce the share of requests for the same aggregate to 20% within a 50-millisecond window. It might be hard to grasp after the first reading, so let's visualise it for Shows with 5 seats.

50 ms     | 50 ms     | 50 ms     | 50 ms     | 50 ms     | 50 ms
1,1,1,1,1 | 2,2,2,2,2 | 3,3,3,3,3 | 4,4,4,4,4 | 5,5,5,5,5 | x,x,x,x,x

Instead of launching all requests to the aggregate with id = 1, then with id = 2, etc., within a 50 ms window for each (table above), we want to mix them with requests to other aggregates.

50 ms     | 50 ms     | 50 ms     | 50 ms     | 50 ms     | 50 ms
1,3,2,4,5 | 3,5,4,1,2 | 3,1,2,4,5 | 5,4,1,2,3 | 1,2,5,3,4 | x,x,x,x,x

At the end of the second, the outcome will be exactly the same: 5 (or 50) requests will mutate the same aggregate, but we have more control over when exactly, within this 1-second window, the requests will be fired. It might look like a detail, but trust me, it changes the overall performance significantly. Play with it yourself. Unfortunately, requestsGroupingSize must be adjusted to a given load, because with 2000 req/s, we would send 100 requests within 50 ms. To keep the share of requests for the same aggregate in a batch at 20%, we need to increase the grouping size to 10.
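
One way to avoid tuning it by hand is to derive the grouping size from the target rate. A sketch of that proportion (targetReqPerSec is a hypothetical parameter, this is not code from the original tests):

// 5 shows per batch at 1000 req/s, 10 at 2000 req/s, etc. - scaling the group size linearly
// with the rate keeps the number of same-aggregate requests within a 50 ms window roughly constant
int requestsGroupingSize = Math.max(1, Math.round(targetReqPerSec / 200f));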

After reservations, the same strategy is applied to cancellations, which, from the performance perspective, are the same requests.

Concurrent reservation simulation

2 CPU cores

Setup:

  • e2-standard-2 (2 CPU cores, 8 GB RAM)
  • -XX:ActiveProcessorCount=2

Results:

req/s | 99th (ms) concurrent | 99th (ms) serial | CPU (%) concurrent | CPU (%) serial
500   | 6                    | 5                | 28                 | 28
750   | 11                   | 5                | 40                 | 42
1000  | 18                   | 5                | 58                 | 55
1250  | 50                   | 6                | 70                 | 67
1500  | 130                  | 7                | 81                 | 80
1750  | -                    | 14               | -                  | 90


As expected, for concurrent reservations, the overall performance drops sooner than for serial load generation. Still, 1000 req/s under 18 ms (99th percentile) is not so bad for a 2-CPU-core machine.

3 * 2 CPU cores

Setup:

  • 3 * e2-standard-2 (2 CPU cores, 8 GB RAM)
  • -XX:ActiveProcessorCount=2

Results:

req/s | 99th (ms) concurrent | 99th (ms) serial | CPU (%) concurrent | CPU (%) serial
1000  | 11                   | 6                | 25                 | 31
1500  | 14                   | 7                | 38                 | 40
2000  | 17                   | 7                | 50                 | 52
2500  | 21                   | 9                | 62                 | 63
3000  | 80                   | 12               | 75                 | 70
3500  | 100                  | 16               | 82                 | 80
4000  | -                    | 30               | -                  | 82
4500  | -                    | 80               | -                  | 85


This time, the sweet spot is around 2500 req/s vs 3500 req/s with serial load generation. The throughput improvement factor after scaling to 3 nodes is 2.5, which is a very good result.

4 CPU cores

Setup:

  • e2-standard-4 (4 CPU cores, 16 GB RAM)
  • -XX:ActiveProcessorCount=4

Results:

req/s | 99th (ms) concurrent | 99th (ms) serial | CPU (%) concurrent | CPU (%) serial
500   | 4                    | 4                | 14                 | 13
1000  | 6                    | 4                | 28                 | 27
1500  | 9                    | 5                | 38                 | 36
2000  | 13                   | 6                | 47                 | 43
2500  | 20                   | 7                | 58                 | 53
3000  | 76                   | 10               | 62                 | 58
3500  | 150                  | 20               | 58                 | 48


The best performance is close to 2500 req/s, with a 99th percentile response time of 20 ms.

3 * 4 CPU cores

Setup:

  • 3 * e2-standard-4 (4 CPU cores, 16 GB RAM)
  • -XX:ActiveProcessorCount=4

Results:

req/s | 99th (ms) concurrent | 99th (ms) serial | CPU (%) concurrent | CPU (%) serial
2000  | 9                    | 5                | 25                 | 27
3000  | 15                   | 7                | 36                 | 35
4000  | 25                   | 8                | 46                 | 45
5000  | 60                   | 9                | 53                 | 50
6000  | 130                  | 12               | 60                 | 55
7000  | -                    | 20               | -                  | 60
8000  | -                    | 30               | -                  | 57


The throughput improvement factor is only 1.6 (4000 req/s with 3 nodes vs 2500 req/s with 1 node). In the last simulation, the difference between serial and concurrent load generation is the biggest: 7000 req/s vs 4000 req/s.

Conclusions

A proper load generation strategy is the key to understanding our application's performance characteristics. I've intentionally picked two different ones to show you the discrepancies in the results. I've called them serial and concurrent, but that's not entirely accurate. In the case of the serial strategy, not all requests are fired one after another; only the ones targeting a given aggregate are, while many aggregates are probed simultaneously. Conversely, the concurrent strategy can, to some degree, be seen as a serial one.

Nevertheless, understanding your domain is necessary to simulate the load correctly. In some systems, the load spreads evenly across all aggregates. However, very often a few aggregates receive 80-90% of the traffic. Fortunately, the Gatling API is flexible enough to model the load as close as possible to the production environment. It will never be 100% the same, but at least it won't be completely detached from reality. I will repeat myself, but from a technical perspective, my advice is to avoid using loops for load multiplication (unless you know that it's the correct way) and to use virtual users instead. For HTTP traffic, don't forget about the shareConnections option. Enable it only if there is a connection pool between your application and the client (a gateway is also a client). You would be surprised how costly it is to create a new connection for each request; it's one of the performance killers.
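
For reference, in the Gatling Java DSL this is the shareConnections switch on the HTTP protocol. A minimal sketch with a placeholder base URL (assuming static imports from io.gatling.javaapi.http.HttpDsl):

HttpProtocolBuilder httpProtocol = http
        .baseUrl("http://localhost:8080") // placeholder URL
        .shareConnections();              // reuse connections across virtual users instead of opening a new one per user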

An interesting observation from our tests is that if we put a message queue in front of our service (the serial generation strategy), we should expect increased throughput. From the application client's perspective, the latency will be higher, but we can hide it with a different communication pattern (async vs sync). It's worth remembering this benefit.

Even with more contention on the application layer, the Akka stack performs really well with a relational database like PostgreSQL. Under a high load, there is still some CPU headroom to handle other application activities. The plan for the next part is to play with the pluggable architecture of Akka Persistence and check the performance with a distributed database like Apache Cassandra. Switching to another Event Store is just a matter of a different configuration. No changes to the existing code are required.

If you would like to check out the full source of Gatling tests, they are available here.
