Benchmarking Tapir: Part 2
In the previous part, I analyzed how a combination of Tapir and a low-level server handles a high load of HTTP requests, compared with using unwrapped, “vanilla” servers directly. It allowed us to identify a few bottlenecks, as well as confirm Tapir’s low overhead in many cases. In this part, I’ll run some of these tests with a profiler to analyze how Tapir affects CPU usage, and I’ll also check how much additional memory it consumes. Finally, I’ll test WebSockets in the http4s, pekko-http and Vert.X backends, analyzing latency distribution.
Reflections on part 1
After getting feedback from the first series of tests, as well as preparing this second part, I gathered some insights:
- Performance tests can fall prey to the Coordinated Omission Problem, which can be mitigated by ensuring a high number of concurrent users. When running part 2, I also checked Gatling’s reports showing throughput distribution over time, confirming no such omissions.
- Measuring latency using Gatling gives us percentiles up to p99, but some important issues may be spotted only at higher percentiles. I added a High Dynamic Range histogram, which records percentiles up to 99.9999%, and used it exclusively for the WebSocket tests (see the short sketch below). Latency comparison wasn’t a part of our previous analysis, but it will probably be revisited in the future with this more precise tool.
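For reference, recording into such a histogram is straightforward. Here’s a minimal usage sketch of the HdrHistogram API; the bounds and precision are illustrative, not the exact benchmark configuration:

```scala
import org.HdrHistogram.Histogram

// Minimal sketch (illustrative values, not the exact benchmark configuration):
// track latencies up to 1 hour in milliseconds, with 3 significant digits.
val histogram = new Histogram(3600000L, 3)

val latencyMillis: Long = 12 // e.g. a single measured latency
histogram.recordValue(latencyMillis)

// Read back a high percentile, e.g. p99.9999
println(histogram.getValueAtPercentile(99.9999))
```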
Profiling
Testing Methodology
I tested the same backends as in the previous part: http4s, Pekko-http, Play, Vert.X, and Netty. The last two are additionally divided into sub-flavors: Future-based and Cats-effect-based implementations. I ran the following tests for each backend:
- A simple GET request (SimpleGet)
- A simple GET request to /path127 (SimpleGetMultiRoute)
- A POST request with a small byte array body (256B) (PostBytes)
- A POST request with a large byte array body (5MB) (PostLongBytes)
The SimpleGetMultiRoute test is a special scenario, which sends a GET /path127 request to a server with 128 endpoints named /path0, /path1, …, /path127. Part 1 revealed that routing to such a “distant” endpoint may be pretty expensive for some backends, while Tapir avoids this cost thanks to its path pre-matching mechanism.
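To make the scenario more concrete, here’s a rough sketch of what such a set of endpoints could look like in Tapir; this is illustrative code, not the exact benchmark sources:

```scala
import sttp.tapir._

// Illustrative sketch (not the exact benchmark code): 128 GET endpoints,
// /path0 ... /path127, each returning a plain-text response.
val pathEndpoints: List[PublicEndpoint[Unit, Unit, String, Any]] =
  (0 until 128).toList.map { i =>
    endpoint.get.in(s"path$i").out(stringBody)
  }
```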
All scenarios run with 128 concurrent users for 30 seconds, preceded by 5 seconds of warm-up, the results of which are discarded.
Tests were executed on a Linux PC with AMD Ryzen 9 5950X (32) @3.4GHz, 64GB RAM, using JDK 21, Scala 2.13 (forced by Gatling), sbt 1.9.8.
For profiling, I’m using IntelliJ IDEA with its built-in Async Profiler, additionally reading memory consumption from the generated JFR snapshots using Java Mission Control.
Let’s now analyze the results for the mentioned tests and servers.
http4s: SimpleGet, SimpleGetMultiRoute
Vanilla http4s shows CPU reaching 50%, with frequent drops to 15-20%.
Memory usage reaches 900MB in both cases.
tapir-http4s
CPU usage looks similar, although a bit lower, with smaller drops in the SimpleGetMultiRoute test - not a significant difference.
Memory usage stays at the same level, occasionally rising to 1GB.
Analyzing Flamegraphs
Flamegraphs are very useful visualizations showing what percentage of samples was taken in different code locations, allowing us to identify “hot frames”. A flamegraph can also filter Tapir-specific frames and show how much they take in total. For example, in the SimpleGet + SimpleGetMultiRoute test, Tapir frames consume 10.38% of CPU samples. However, this number alone doesn’t tell us much about the overhead: some Tapir code replaces native backend code (like request/response encoding and decoding in some cases). The total CPU consumption is therefore useful mainly when comparing two tests over similar Tapir backends, like Tapir vs Tapir+DefaultServerLog interceptor.
Individual hot frames on the chart may sometimes be worth investigating on their own. Let’s see what takes CPU when Tapir is added.
Disclaimer: each frame represents samples registered when the CPU was busy in the given method. The wider a frame, the more samples it contains, with sub-frames on top of it representing the stack of its constituents. Beige frames are Tapir’s code, while the heather color represents the rest (Java and libraries).
DecodeBasicInputs is part of basic Tapir logic, so we expect to see it occupying the CPU a bit. We are going to see a pattern of such a flame growing on top of Tapir’s ServerInterpreter very often. Similarly, a secondary flame (or even two flames) calling sttp.tapir.server.interpreter.ResultOrValue - together they are going to be the main pattern. Let’s dive in and see if the child frames of such frames contain some CPU-occupying code that’s worth optimizing.
If anything is noteworthy, it’s that in the GET tests we ran into Http4sServerRequest.acceptsContentTypes occupying 1.72% of the CPU. Indeed, underneath there’s some pretty complex parsing logic from the sttp-model library, parsing the “Accept” headers. It’s some CPU cost, but within a reasonable range, so let’s not treat it as an immediate target for optimization.
tapir-http4s + DefaultServerLog Interceptor
The first part of this blog post showed that Tapir’s DefaultServerLog interceptor significantly affects throughput. Since it’s turned on by default, it’s an important thing to fix, especially when we look at https://github.com/softwaremill/tapir/issues/3272, where Kamil Kloch analyzes this impact in detail.
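For context, the interceptor can be switched off when building server options, which is how its cost can be isolated in benchmarks. A rough sketch for the http4s backend, assuming the Tapir 1.x CustomiseInterceptors builder (verify the exact method names against your version):

```scala
import cats.effect.IO
import sttp.tapir.server.http4s.Http4sServerOptions

// Rough sketch, assuming a recent Tapir 1.x API: build http4s server options
// with the server-log interceptor disabled, e.g. to measure its impact.
val optionsWithoutServerLog: Http4sServerOptions[IO] =
  Http4sServerOptions
    .customiseInterceptors[IO]
    .serverLog(None) // turn off DefaultServerLog
    .options
```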
Running SimpleGet and SimpleGetMultiRoute tests and looking into the flamegraph shows:
That’s quite a hit, so let’s deconstruct this offending frame to find that the most expensive sub-frames are:
- sttp.tapir.server.http4s.Http4sServerRequest.uri() - 7.64% (!)
- sttp.model.Uri.toString() - 1.96%
- sttp.tapir.Endpoint.showPathTemplate(Function2, Option, boolean, String, Option, Option) - 4.44%
- sttp.tapir.Endpoint.method() - 1.75%
Voila, we have our first set of candidates to improve, registered as GitHub issues 3545 and 3546.
http4s: PostBytes, PostLongBytes
These two tests analyze Tapir’s performance hit when receiving byte arrays, especially in cases where they need to be built from multiple parts, possibly incorporating underlying streams.
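As a reminder of what’s being exercised, such an endpoint might be declared roughly like this; a sketch, not the exact benchmark code:

```scala
import sttp.tapir._

// Sketch of a byte-array-consuming endpoint, similar in spirit to the ones
// exercised by PostBytes / PostLongBytes: the whole request body is
// accumulated into an Array[Byte] before the server logic runs.
val postBytes: PublicEndpoint[Array[Byte], Unit, String, Any] =
  endpoint.post.in("post-bytes").in(byteArrayBody).out(stringBody)
```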
http4s-vanilla
Vanilla http4s shows CPU busy at 8-11% for PostBytes and up to 60% for PostLongBytes, with frequent drops even down to 35%.
Consumed memory is around 120MB for PostBytes, while PostLongBytes keeps pushing it to levels near 2GB.
tapir-http4s
CPU for PostBytes looks similar to the vanilla run, but for PostLongBytes it’s quite different, reaching no more than 50% at the highs and dropping to ~20% at the lows.
Memory usage reaches 256MB for PostBytes, while for PostLongBytes it consumes 2.2GB.
Let’s take a look at the flamegraph.
Here’s an interesting example showing how we may get misled into thinking that Tapir’s code adds a lot of overhead. At first, we can see that sttp.tapir.server.http4s.Http4sRequestBody takes a lot of CPU, essentially running fs2.Chunk.toArray underneath.
However, the vanilla server also needs to run fs2.Chunk.toArray at some point, so we shouldn’t really count Tapir’s code as extra work:
In such cases we should look for hot frames elsewhere, but nothing extra shows up in addition to what we’ve already noticed in the SimpleGet and SimpleGetMultiRoute tests.
pekko-http: SimpleGet, SimpleGetMultiRoute
vanilla-pekko-http
Starting with SimpleGet and SimpleGetMultiRoute tests, we can see much higher CPU pressure than in our http4s tests!
Memory consumption reaches 512MB for SimpleGet, and 1.5GB for SimpleGetMultiRoute.
Let’s quickly look at the flamegraph:
The culprit lies in the way pekko-http deals with multiple routes concatenated together to do the matching. In the first part, we saw how seriously it affects throughput. Now we can confirm that it’s indeed the routing that is responsible, additionally eating a lot of CPU unnecessarily.
tapir-pekko-http
Memory consumption seems lower with Tapir, stabilizing around 380MB for SimpleGet and 1.5GB for SimpleGetMultiRoute. With Tapir, the high CPU load in the SimpleGetMultiRoute test seems to be gone. Let’s confirm by looking at the flamegraph:
Indeed, routing doesn’t seem to be a concern anymore. Apart from the standard minimal overhead caused by the ServerInterpreter’s components, we see a few hot frames when preparing a response. They may be worth optimizing (GitHub issue):
- sttp.tapir.server.pekkohttp.PekkoServerRequest.acceptsContentTypes() - 2.4%
- sttp.tapir.server.pekkohttp.PekkoToResponseBody.parseContentType(String) - 3.53%
- sttp.tapir.server.pekkohttp.PekkoModel$.parseHeadersOrThrowWithoutContentHeaders(HasHeaders) - 2.34%
pekko-http: PostBytes, PostLongBytes
vanilla-pekko-http
CPU for PostBytes oscillates around 4-6%, while for PostLongBytes it’s in the range of 35%-55%.
The PostBytes test makes the backend use no more than 75MB of memory, while PostLongBytes stabilizes at 3.8GB.
Let’s confirm with the flamegraph that the routing part is not an issue in this case:
tapir-pekko-http
Wrapping with Tapir makes the PostBytes CPU graph rise to 9-11%, while PostLongBytes doesn’t look significantly different:
Memory consumption is up to 125MB for PostBytes, and occasionally 4GB for PostLongBytes. What about hot frames?
Beige elements, which are “Tapir flames”, show pretty much the same patterns I saw in the previous tests, so no new worrying data appears here.
Netty (Future): SimpleGet, SimpleGetMultiRoute
Since there’s no “Vanilla Netty”, we will focus on comparing the netty-future and netty-cats-effect backends. We are also interested in spotting any noteworthy hot frames on the flamegraph.
CPU is around 35%-38%. Memory consumption doesn’t exceed 512 MBs in either case.
A significant part of the consumed time is spent in Netty’s writeAndFlush, which seems OK. Tapir’s logic shows pretty much the same patterns as previously, with a few hot frames that may be worth optimizing:
- sttp.tapir.server.netty.NettyServerRequest.uri() - 4.41%
- sttp.tapir.model.ServerRequest.acceptsContentTypes$(ServerRequest) - 2.06%
It’s not the first time we’ve run into URI or “Accept” header parsing as an expensive part of the total collected samples. A GitHub issue has been added for NettyServerRequest.uri.
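One illustrative direction - a sketch of the general idea, not Tapir’s actual code - is to defer URI parsing until an endpoint actually inspects it:

```scala
import sttp.model.Uri

// Illustrative sketch, not Tapir's actual implementation: defer URI parsing
// until first access, so requests whose endpoints never inspect the full Uri
// (e.g. those matched purely on the raw path) skip the parsing cost.
final class LazilyParsedRequest(rawUri: String) {
  lazy val uri: Uri = Uri.unsafeParse(rawUri) // parsed at most once, on demand
}
```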
Netty (Future): PostBytes, PostLongBytes
During these tests, CPU was between 15% and 23% for PostBytes, and between 40% and 55% for PostLongBytes.
Not much memory is needed in the case of PostBytes - 128MB. PostLongBytes requires up to 3.2GB.
Underneath, Tapir uses netty-reactive-streams with its Subscribers and Publishers. The flamegraph immediately shows that our SimpleSubscriber does expensive work here:
- sttp.tapir.server.netty.internal.reactivestreams.SimpleSubscriber.onNext(HttpContent) - 18.7%
This is expected, because onNext does the actual job of reading Netty’s byte buffer and copying it into our internal representation.
However, there’s also:
- sttp.tapir.server.netty.internal.reactivestreams.SimpleSubscriber.onComplete() - 18.86%
This part is responsible for iterating over all the collected byte arrays and copying them into one single result array. This can definitely be optimized to avoid such memory and CPU overhead. GitHub issue: 3548.
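One possible direction - a rough sketch of the idea, not Tapir’s actual implementation - is to pre-allocate the result array when Content-Length is known and copy each chunk directly as it arrives, leaving nothing to do on completion:

```scala
// Rough sketch of the idea, not Tapir's actual code: with a known
// Content-Length we can write each incoming chunk straight into the final
// array, so completion requires no second pass over the collected chunks.
final class PreallocatedCollector(contentLength: Int) {
  private val result = new Array[Byte](contentLength)
  private var offset = 0

  def append(chunk: Array[Byte]): Unit = { // called once per received chunk
    System.arraycopy(chunk, 0, result, offset, chunk.length)
    offset += chunk.length
  }

  def complete(): Array[Byte] = result // nothing left to copy
}
```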
Netty (Cats Effect): SimpleGet, SimpleGetMultiRoute
The total load is slightly lower than in the Netty Future server.
Memory usage is higher than in the Future-based backend, up to 600MB.
The flamegraph shows similar patterns to the Future-based implementation, where flames are “growing on top of” base frames like cats.effect.IOFiber.run. The CPU spends significant time in the writeAndFlush method, but that’s expected.
Netty (Cats Effect): PostBytes, PostLongBytes
When dealing with small POSTs, CPU reached 12% - 15%, while the PostLongBytes test made it reach values between 40 and 53%.
Memory consumption is 256MB for PostBytes, and up to 2GB for PostLongBytes.
In the case of the Cats Effect-based backend, the underlying reactive stream doesn’t use our custom code, but fs2’s implementation of these interfaces. That’s why we’re seeing fs2.Chunk.toArray consuming so much CPU time, similarly to, for example, http4s. This pattern doesn’t look like anything we should be worried about. The other flames are standard Tapir blocks.
Play: SimpleGet, SimpleGetMultiRoute
In Part 1 we noticed that tapir-play suffers from high overhead when there are multiple routes. Let’s dive deeper.
play-vanilla
Our vanilla server reached 55-62% of CPU during these tests.
The backend takes up to 1.5GB of RAM in either case.
This can be quickly analyzed on the flamegraph, where we can see that composing a high number of routes out of partial functions with .orElse is quite an expensive process.
We can expect a normal Play app to consume much less CPU when there are fewer routes and when these routes aren’t built dynamically using .orElse.
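To illustrate the shape of the problem - a simplified sketch, not Play’s actual routing code - a request matching the last route first fails isDefinedAt on every earlier partial function in the chain:

```scala
// Simplified sketch, not Play's actual routing code: 128 routes composed with
// orElse form a linear chain, so matching "/path127" evaluates ~127
// non-matching cases before the final one succeeds.
def route(i: Int): PartialFunction[String, String] = {
  case p if p == s"/path$i" => s"response $i"
}

val routes: PartialFunction[String, String] =
  (0 until 128).map(route).reduce(_ orElse _)

routes("/path127") // walks through the whole chain before matching
```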
tapir-play
Wrapping with Tapir makes SimpleGet and SimpleGetMultiRoute hit a very high CPU load!
Consumed memory doesn’t cross 1GB for SimpleGet and 1.5GB for SimpleGetMultiRoute.
The flamegraph confirms that PlayServerInterpreter.toRoutes is very expensive, dominating the CPU.
Play: PostBytes, PostLongBytes
play-vanilla
In these tests, the vanilla server doesn’t suffer from path concatenation issues, consuming 9% CPU for PostBytes and up to 23% for PostLongBytes.
Memory usage reaches 250MB for PostBytes and 790MB for PostLongBytes.
tapir-play
The CPU load is around 33-35% for PostBytes and oscillates around 50% for PostLongBytes.
Memory needed by the PostBytes test reaches 500MB, while for PostLongBytes it’s 5GB, a lot more than in the vanilla server! The PlayServerInterpreter doesn’t need to spend that much time on toRoutes anymore:
Still, this frame and its parent frames indicate a non-trivial overhead that will be reduced once toRoutes is fixed (GitHub issue). Other than that, we see an expected wide frame concerned with copyToArray, caused by PostLongBytes, and the standard ServerInterpreter operation stack.
Vert.X: SimpleGet, SimpleGetMultiRoute
vertx-vanilla
Vert.X CPU usage in these tests looks pretty impressive, reaching only up to 2%!
Memory consumption is also very low, stabilizing below 50MB during SimpleGet and around 115MB during SimpleGetMultiRoute.
tapir-vertx (Future)
The Future-based Tapir wrapper keeps CPU usage very low:
Memory usage for SimpleGet reaches 120MB, while SimpleGetMultiRoute requires 160MB.
Still, let’s see if the CPU usage can be optimized.
There are two frames that may be worth looking at to reduce the load:
- sttp.tapir.server.vertx.decoders.VertxServerRequest.acceptsContentTypes() - 3.34%
- sttp.tapir.server.vertx.decoders.VertxServerRequest.uri() - 5.99%
This is a similar pattern to what we’ve seen in other servers, happening during Tapir’s standard input decoding using sttp-model parsers. A GitHub issue has been created to explore the possibility of spending less time parsing the Uri.
tapir-vertx (Cats Effect)
After wrapping the backend with Tapir based on Cats Effect, we notice that the GET tests consume more CPU, oscillating around 8%.
Memory usage reaches 200MB for SimpleGet and 300MB for SimpleGetMultiRoute.
The flamegraph doesn’t reveal any bottlenecks introduced by Tapir, except for the already known processing of Uris. Any additional load is introduced by the Cats Effect runtime.
Vert.X: PostBytes, PostLongBytes
vertx-vanilla
During the PostBytes test the vanilla Vert.X server still uses only 1-2% of CPU, while processing long byte arrays makes it a little busier, with occasional peaks up to 13%.
Memory consumption for PostBytes still stays around 50MB, and the PostLongBytes test makes the server consume up to 3.75GB.
tapir-vertx (Future)
CPU usage looks almost identical to the vanilla server:
As for memory, it consumes up to 128MB during PostBytes and up to 3.5GB during PostLongBytes, but with much more frequent garbage collections.
The decodeBody frame is quite hot, because underneath it calls io.vertx.ext.web.RequestBody.buffer(), which reads a large byte array. This could potentially be optimized by switching to the underlying Vert.X stream instead of calling buffer(), so we’ll keep it as an issue worth investigating.
tapir-vertx (Cats Effect)
We can notice that there’s some additional load, causing PostBytes to rise up to 4-5%, and PostLongBytes up to 19%.
Memory needed for the tests reached 256MB for PostBytes, while PostLongBytes first consumed 4.75GB, falling to a stable level of 1.75GB after 15 seconds.
This hot frame of VertxRequestBody.toRaw may look like additional work compared to the Future-based backend. However, it’s actually the equivalent of what we saw there as the decodeBody operation, just reading the final buffer from the underlying large request. Again, there’s a potential gain here if we switch to the streaming representation, but let’s not be overly optimistic about it - we saw that fs2 is pretty greedy for the CPU, for example in the http4s and Netty backends.
Conclusions
Let’s sum up all the findings from our investigations (see the GitHub milestone for a list of issues):
- The DefaultServerLog interceptor adds a significant overhead, and we now have more insight into the underlying causes - parsing the Uri and spending time in sttp.tapir.Endpoint.showPathTemplate and sttp.tapir.Endpoint.method.
- pekko-http routing issues are prevented by Tapir’s path pre-matching.
- pekko-http pays a non-trivial cost for calling sttp-model/ContentType.parse and sttp-model/HttpHeader.parse; a small boost may be possible here if these methods are avoided or optimized.
- The SimpleSubscriber used by the Netty Future backend can be optimized by avoiding copying byte arrays in onComplete, and possibly by optimizing onNext.
- PlayServerInterpreter.toRoutes is expensive and causes a very high overhead.
- All backends spend a few % of their time on sttp-model/Uri.unsafeParse; we should look into using it less frequently, switching to a different method, or optimizing sttp-model itself.
- Similarly with sttp-model/Accept.unsafeParse, but this would result in a smaller gain.
- Tapir adds up to 200MB of memory consumption, except for the Play backend under the PostLongBytes test, where the overhead was quite an anomaly (4GB). In some cases, maximum memory usage was lower for tests run with Tapir.
WebSockets
Testing Methodology
The test is inspired by the websocket-benchmark project by Andriy Plokhotnyuk and Kamil Kloch. We want to simulate a very high number of users (2500) subscribing to endpoints, each sending 600 requests and receiving responses at a constant rate. To achieve this, the endpoint reads data from a throttled stream, which emits elements every 100 milliseconds. Each user connects and sends a series of GET requests, and each emitted, throttled response contains the timestamp at which it was produced. The test client calculates latency manually by subtracting the timestamp found in the response body from the time the response was received. Data is registered using HdrHistogram. The code tests only the latency of the response flow back to the client.
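A rough sketch of such a throttled WebSocket endpoint in Tapir with fs2 - illustrative, not the exact benchmark code - could look like this:

```scala
import scala.concurrent.duration._
import cats.effect.IO
import fs2.{Pipe, Stream}
import sttp.capabilities.fs2.Fs2Streams
import sttp.tapir._

// Illustrative sketch, not the exact benchmark code: a WebSocket endpoint
// whose server logic ignores incoming messages and emits the current
// timestamp every 100 ms, which the client can compare with its receive time.
val wsEndpoint =
  endpoint.get
    .in("ws" / "ts")
    .out(webSocketBody[String, CodecFormat.TextPlain, String, CodecFormat.TextPlain](Fs2Streams[IO]))

val wsLogic: Pipe[IO, String, String] = in =>
  Stream
    .awakeEvery[IO](100.millis)
    .map(_ => System.currentTimeMillis().toString)
    .concurrently(in.drain) // keep consuming client frames
```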
Results
Most servers kept latency under 30ms, with the exception of vertx-tapir-ce (Cats Effect Tapir backend wrapping Vert.X):
Let’s remove it from the chart and first compare other backends:
The pekko-http backend is slightly slower, but just like other backends, it keeps up with the vanilla server up to 99.999%.
What about vertx-tapir-ce?
The chart above plots two additional lines:
- jvm-vertx-tapir-ce, showing the raw JVM latency
- control-vertx-tapir-ce, showing the OS latency
These two lines allow us to verify how much of the latency is caused by the CPU being busy outside of our app. They have been generated by injecting the jHiccup Java agent into the running server. It’s a great and simple OS instrumentation tool that can help analyze what constitutes the measured latency. One can also use it to find bugs in the tests themselves, if the app latency appears better than the JVM/OS latency.
There may be many reasons why our CPU is busy at the raw JVM level - extensive GC pauses, overloaded threads struggling to reach safepoints, or something else. However, we can still see that 80% of the load happens in app-related code.
After a quick analysis, we’ve noticed that vertx-tapir-ce websockets use sub-optimal fs2 stream operators for joining ping and pong signals, just as it has been recently noticed and fixed in the http4s-based backend. A quick test that removes these additional streams indeed shows a strong improvement:
This means that applying a fix analogous to the http4s one should be our first task. There’s still room for improvement, perhaps related to stream conversion between vertx-streams and fs2. Fortunately, we now have good tests that will let us quickly check how much we gain by optimizing that code. Here's a GitHub issue to track this.
Next steps
We learned a lot about Tapir’s performance, finding many cases where its overhead is negligible, as well as some bottlenecks worth optimizing. The “performance” milestone on GitHub tracks all these tasks.
If you’re considering Tapir for your system and you know your performance requirements, you’ll most probably find a set of robust backends to choose from. But this is not the end! As a next step, I’d like to take a look at tapir-netty-loom and tapir-nima backends, which have been recently added as servers based on Java 21 Virtual Threads, potentially allowing us to achieve even better results. Stay tuned!