Benchmarking Tapir: Part 3 (Loom)
Introduction
Recently, I wrote two blog posts about Tapir performance benchmarks: Part 1 and Part 2. These tests allowed us to implement many significant optimizations (see the GitHub milestone here). As a bonus, I’d like to use all the tools and knowledge gathered during this journey and take one more deep dive to investigate two special Tapir backends: tapir-netty-loom and tapir-nima. These servers leverage Java Virtual Threads, introduced in JDK 21, which means an entirely different runtime is used to manage concurrency. The tapir-netty-loom server shares many parts of its implementation with the other Netty-based backends, but it doesn’t use Scala Futures. Instead, it runs asynchronous operations on virtual threads, which allows writing the server logic in direct style, without wrapping results in Future[T], as long as that logic is composed of non-blocking operations, or of asynchronous operations which also leverage virtual threads, depending on the libraries used. The same applies to tapir-nima, which is built on top of Helidon Nima, a framework with allegedly outstanding performance.
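To make the difference in programming model concrete, here is a minimal, hypothetical sketch (the names below are made up for illustration and this is not the exact Tapir server-logic API): Future-based backends express endpoint logic as functions returning Future[T], while the Loom-based backends accept plain, direct-style functions that the runtime schedules on virtual threads.

```scala
object DirectVsFuture {
  import scala.concurrent.{ExecutionContext, Future}

  // Some potentially blocking call, e.g. a database lookup (made up for illustration)
  def lookupUser(id: Int): String = s"user-$id"

  // Future-based style: every result is wrapped in Future[T]
  def futureLogic(id: Int)(implicit ec: ExecutionContext): Future[String] =
    Future(lookupUser(id))

  // Direct style, as enabled by virtual threads: plain values, no wrapper type;
  // blocking inside this function suspends only the virtual thread,
  // not the underlying OS carrier thread
  def directLogic(id: Int): String =
    lookupUser(id)
}
```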
We are going to compare the following backends: http4s-vanilla, pekko-vanilla, vertx-vanilla, tapir-http4s, tapir-pekko, tapir-vertx (Future-based Vert.x server), tapir-vertx-ce (Tapir Vert.x backend using Cats Effect), and finally, our fresh contenders: tapir-nima and tapir-netty-loom.
A vanilla backend is a simple set of HTTP endpoints written directly with the given library, without wrapping them in Tapir.
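To illustrate the distinction, here is a rough sketch using Pekko HTTP with a hypothetical "hello" endpoint (the path, response, and wiring are assumptions, not the benchmark’s actual endpoints):

```scala
object VanillaVsTapir {
  import scala.concurrent.{ExecutionContext, Future}
  import org.apache.pekko.http.scaladsl.server.Directives._
  import org.apache.pekko.http.scaladsl.server.Route
  import sttp.tapir._
  import sttp.tapir.server.pekkohttp.PekkoHttpServerInterpreter

  // "Vanilla": the endpoint is written directly against the library's routing DSL
  val vanillaRoute: Route =
    path("hello") {
      get {
        complete("ok")
      }
    }

  // Tapir: the endpoint is described abstractly, then interpreted as a Pekko HTTP route
  def tapirRoute(implicit ec: ExecutionContext): Route =
    PekkoHttpServerInterpreter().toRoute(
      endpoint.get
        .in("hello")
        .out(stringBody)
        .serverLogicSuccess[Future](_ => Future.successful("ok"))
    )
}
```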
Testing methodology
I ran the following test for each backend:
- A simple GET request (SimpleGet)
- A POST request with a small byte array body (256B) (PostBytes)
- A POST request with a large byte array body (5MB) (PostLongBytes)
Each scenario is preceded by 5 seconds of warm-up execution, during which results are not taken into account. Then it runs with 128 concurrent users, ramped up within 5 seconds (to avoid latency issues in the first seconds of the test), and continues for 30 seconds with the full user count.
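For illustration, a Gatling simulation approximating this injection profile could look like the sketch below (the base URL, endpoint path, and warm-up handling are assumptions; the actual benchmark suite may be wired differently):

```scala
import io.gatling.core.Predef._
import io.gatling.http.Predef._
import scala.concurrent.duration._

class SimpleGetSimulation extends Simulation {
  private val httpProtocol = http.baseUrl("http://localhost:8080")

  // Each user keeps issuing GET requests in a loop until the run ends
  private val scn = scenario("SimpleGet")
    .forever(exec(http("SimpleGet").get("/hello")))

  // 128 users ramped up over 5 seconds, then held until the 35-second cap
  // (5s ramp + 30s at full load; the separate 5s warm-up run is not shown here)
  setUp(scn.inject(rampUsers(128).during(5.seconds)))
    .protocols(httpProtocol)
    .maxDuration(35.seconds)
}
```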
Tests were executed on a Linux PC with an AMD Ryzen 9 5950X (32 threads) @ 3.4GHz and 64GB RAM, using JDK 21, Scala 2.13 (forced by Gatling), and sbt 1.9.8.
I tested Tapir 1.10.0, which already includes many performance tweaks implemented after the previous rounds of performance tests. See the GitHub milestone for details.
Results
SimpleGet
First, let’s see the average throughput achieved by each backend:
(I analyzed Gatling reports to confirm that the throughput wasn’t too variable during the entire 30 seconds).
As we can see, tapir-nima is the new leader in the test, while tapir-netty-loom is the second fastest! Let’s analyze the latency distribution for this test:
All backends keep their GET endpoint latency between 3 and 7 ms up to the 99.9th percentile, but above that an outlier starts to stand out: tapir-http4s, together with the vanilla http4s server. Some requests take much longer, and these requests are distributed fairly evenly over the test, as the detailed Gatling reports confirm. Let’s take http4s out of the chart and plot the latency distribution again:
All the servers stay under 20ms up to the 99.999th percentile. The tapir-nima server remains very strong, keeping latency under 7ms. Tapir-netty-loom also looks good: it never crossed 13ms, which is the third-best result.
PostBytes
Results look surprisingly different if, instead of GETs, we start to send POST requests with byte arrays (256B), which are read into memory before sending a response. First, the mean throughput:
In this scenario, tapir-nima and tapir-netty-loom have lower throughput than several other servers. What about latencies?
The latency chart confirms that tapir-netty-loom suffers from latency issues; a good next step would be to run PostBytes under a profiler and look for hot frames. The other backends stay between 25 and 50ms at the 99.999th percentile, with tapir-vertx-ce standing out as a clear, impressive outlier. Tapir-nima is the second best up to the 99.99th percentile, where it becomes slightly slower than tapir-pekko and vanilla http4s.
PostLongBytes
Our last test sends data to the same POST endpoint, but this time, it’s 5 megabytes that have to be processed and loaded into a byte array in memory before sending a response.
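For reference, the endpoint exercised by PostBytes and PostLongBytes has roughly the following shape; the sketch below assumes a Tapir definition using byteArrayBody, with a made-up path and output (not the benchmark’s exact code):

```scala
import sttp.tapir._

object PostBytesEndpoint {
  // The whole request body is read into an in-memory byte array
  // before any response is produced
  val postBytes: PublicEndpoint[Array[Byte], Unit, String, Any] =
    endpoint.post
      .in("post" / "bytes")
      .in(byteArrayBody)
      .out(stringBody)
}
```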
Both tapir-netty-future and tapir-netty-loom lead in this test. Underneath, these backends use the netty-reactive-streams integration, which processes larger byte streams efficiently. The tapir-nima server is again significantly slower, probably due to its direct method of obtaining the byte array. What’s surprising is that tapir-vertx-ce now becomes the slowest, although it was one of the best backends for short input arrays. Perhaps its way of obtaining data (calling routingContext.body.buffer().getBytes) performs poorly when dealing with larger buffers. Let’s analyze the latency distribution:
We can see that tapir-nima stands out already at the 90th percentile and becomes significantly worse than the rest at higher percentiles. The other servers are fairly consistently distributed between 430ms and 700ms, with the lowest latency observed for http4s, while tapir-netty-loom ends up at the higher end of this range.
Conclusions
The Loom-based backends show good performance characteristics, but they need to be optimized to process incoming request data more efficiently. They show outstanding results for simple GETs, so there’s a good chance their POST processing can be made similarly performant.
Additionally, we confirmed that the leading Tapir backends do not lag behind their vanilla counterparts in terms of either latency or throughput, another strong indication that Tapir is a safe and performant library for demanding projects.