
Benchmarking Tapir: Part 2

Krzysztof Ciesielski

05 Mar 2024, 15 minutes read


Introduction

In the previous part, I analyzed how a combination of Tapir and a low-level server handles a high load of HTTP requests in comparison with using unwrapped, “vanilla” servers directly. It allowed us to identify a few bottlenecks, as well as confirm Tapir’s low overhead in many cases. In this part, I’ll run some of these tests with a profiler to analyze how Tapir affects CPU usage, and I’ll also check how much additional memory it consumes. Finally, I’ll test WebSockets in the http4s, pekko-http, and Vert.X backends, analyzing latency distribution.

Reflections on part 1

After getting feedback from the first series of tests, as well as preparing this second part, I gathered some insights:

  • Performance tests can fall prey to the Coordinated Omission Problem, which can be mitigated by ensuring a high number of concurrent users. When running part 2, I also checked Gatling’s reports showing throughput distribution over time, confirming no such omissions.
  • Measuring latency using Gatling gives us percentiles up to p99, but some important issues may be spotted only at higher percentiles. I added a High Dynamic Range histogram, which collects percentiles up to 99.9999% in our tests, and used it exclusively for the WebSocket tests. Latency comparison wasn’t part of our previous analysis, but it will probably be revisited in the future with this more precise tool.

Profiling

Testing Methodology

I tested the same backends as in the previous part: http4s, Pekko-http, Play, Vert.X, and Netty. The last two are additionally divided into sub-flavors: a Future-based and a Cats-Effect-based implementation. I ran the following tests for each backend:

  • A simple GET request (SimpleGet)
  • A simple GET request to /path127 (SimpleGetMultiRoute)
  • A POST request with a small byte array body (256B) (PostBytes)
  • A POST request with a large byte array body (5MB) (PostLongBytes)

The SimpleGetMultiRoute test is a special scenario which sends a GET /path127 request to a server with 128 endpoints named /path0, /path1, …, /path127. Part 1 revealed that routing to such a “distant” endpoint may be pretty expensive for some backends, while Tapir mitigates this cost with its unique path pre-matching mechanism.
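For illustration, here’s a minimal sketch of what such a set of endpoints could look like on the Tapir side, assuming a Future-based server; the names and response bodies are illustrative, not the benchmark’s actual code:

```scala
import scala.concurrent.Future
import sttp.tapir._
import sttp.tapir.server.ServerEndpoint

// 128 GET endpoints: /path0, /path1, ..., /path127.
// The server interpreter pre-matches the request path against all of them
// before running any further decoding logic.
val multiRouteEndpoints: List[ServerEndpoint[Any, Future]] =
  (0 to 127).map { i =>
    endpoint.get
      .in(s"path$i")
      .out(stringBody)
      .serverLogicSuccess[Future](_ => Future.successful(s"response from /path$i"))
  }.toList
```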

All scenarios run with 128 concurrent users for 30 seconds, preceded by 5 seconds of warm-up execution, the results of which are not considered.
Tests were executed on a Linux PC with AMD Ryzen 9 5950X (32) @3.4GHz, 64GB RAM, using JDK 21, Scala 2.13 (forced by Gatling), sbt 1.9.8.
For profiling, I’m using IntelliJ IDEA with a built-in Async Profiler, additionally reading memory consumption from generated jfr snapshots using Java Mission Control.
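For context, a Gatling simulation along these lines could be sketched as follows; the endpoint and target address are illustrative assumptions, not the actual benchmark code:

```scala
import io.gatling.core.Predef._
import io.gatling.http.Predef._
import scala.concurrent.duration._

class SimpleGetSimulation extends Simulation {
  // Illustrative target address; the real benchmark parameterizes the backend under test.
  val httpProtocol = http.baseUrl("http://localhost:8080")

  // 128 concurrent users, each looping a simple GET for the duration of the test.
  val scn = scenario("SimpleGet")
    .during(35.seconds) { // 5s of warm-up plus 30s of measured traffic
      exec(http("simple-get").get("/path0"))
    }

  setUp(scn.inject(atOnceUsers(128))).protocols(httpProtocol)
}
```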

Let’s now analyze the results for the mentioned tests and servers.

http4s: SimpleGet, SimpleGetMultiRoute

Vanilla http4s shows CPU reaching 50%, with frequent drops to 15-20%.
cpu-vanilla-get
Memory usage reaches 900MB in both cases.

tapir-http4s

CPU usage looks similar, although a bit lower, with smaller drops in the SimpleGetMultiRoute test, which isn’t very relevant.
cpu-tapir-get
Memory usage stays on the same level, occasionally rising to 1GB.

Analyzing Flamegraphs

Flamegraphs are very useful visualizations showing what percentage of samples was taken in different code locations, allowing us to identify “hot frames”. A flamegraph can also filter Tapir-specific frames and show how much they take in total. For example, in the SimpleGet + SimpleGetMultiRoute tests, Tapir frames consume 10.38% of CPU samples. However, this number doesn’t tell us very much about the overhead. Some Tapir code replaces native backend code (like request/response encoding and decoding in some cases). The total CPU consumption may then be useful only when comparing two tests over similar Tapir backends, like Tapir vs Tapir+DefaultServerLog interceptor.

Individual hot frames on the chart may sometimes be worth investigating on their own, so let’s see what takes the CPU when Tapir is added.

flame-tapir-get
Disclaimer: each frame represents samples registered when the CPU was busy in the given method. The wider the frame, the more samples it contains, with sub-frames on top of it representing the stack of its constituents. Beige frames are Tapir’s code, while the heather color represents the rest (Java and libraries).

DecodeBasicInputs is part of basic Tapir logic, so we expect to see it occupying the CPU a bit. We are going to see the pattern of such a flame growing on top of Tapir’s ServerInterpreter very often, along with a secondary flame (or even two) calling sttp.tapir.server.interpreter.ResultOrValue - these are going to be the main patterns. Let’s dive in and see if the child frames of such flames contain some CPU-occupying code that’s worth optimizing.

cpu-tapir-get-2
If anything is noteworthy, it’s that in the GET tests we ran into Http4sServerRequest.acceptContentTypes occupying 1.72% of the CPU. Indeed, underneath there’s some pretty complex parsing logic from the sttp-model library, parsing the “Accept” headers. It’s some CPU cost, but within a reasonable range, so let’s not treat it as an immediate target for optimization.

tapir-http4s + DefaultServerLog Interceptor

The first part of my blog post showed that Tapir’s DefaultServerLog interceptor significantly affects throughput. Since it’s turned on by default, it’s an important thing to fix, especially when we look at https://github.com/softwaremill/tapir/issues/3272, where Kamil Kloch analyzes this impact in detail.

Running SimpleGet and SimpleGetMultiRoute tests and looking into the flamegraph shows:

http4s-cpu-itapir-get
That’s quite a hit, so let’s deconstruct this offending frame to find that the most expensive sub-frames are:

  • sttp.tapir.server.http4s.Http4sServerRequest.uri() 7.64% (!)
  • sttp.model.Uri.toString() 1.96%
  • sttp.tapir.Endpoint.showPathTemplate(Function2, Option, boolean, String, Option, Option) 4.44%
  • sttp.tapir.Endpoint.method() 1.75%

Voila, we have our first set of candidates to improve, registered as GitHub issues 3545 and 3546.

http4s: PostBytes, PostLongBytes

These two tests analyze Tapir’s performance hit when receiving byte arrays, especially in cases where they need to be built from multiple parts, possibly incorporating underlying streams.

http4s-vanilla

Vanilla http4s shows CPU busy at 8-11% for PostBytes and up to 60% for PostLongBytes, with frequent drops even down to 35%.

http4s-cpu-vanilla-post
Consumed memory is around 120MB for PostBytes, while PostLongBytes repeatedly pushes it to levels near 2GB.

tapir-http4s

CPU for PostBytes looks similar to the vanilla run, but for PostLongBytes it’s quite different, reaching no more than 50% at its peaks and dropping to ~20% at its lows.

http4s-cpu-tapir-post
Memory usage reaches 256MB for PostBytes, while for PostLongBytes it consumes 2.2GB.

Let’s take a look at the flamegraph.

Here’s an interesting example showing how we may be misled into thinking that Tapir’s code adds a lot of overhead. At first, we can see that sttp.tapir.server.http4s.Http4sRequestBody takes a lot of CPU, essentially running fs2.Chunk.toArray underneath.

http4s-tapir-flame-post
However, the vanilla server also needs to run fs2.Chunk.toArray at some point, so we shouldn’t really interpret Tapir’s code as extra work:

http4s-vanilla-flame-post
In such cases we should look for hot frames elsewhere, but nothing extra shows up in addition to what we’ve already noticed in the SimpleGet and SimpleGetMultiRoute tests.

pekko-http: SimpleGet, SimpleGetMultiRoute

vanilla-pekko-http

Starting with SimpleGet and SimpleGetMultiRoute tests, we can see much higher CPU pressure than in our http4s tests!

pekko-vanilla-cpu-get
Memory consumption reaches 512MB for SimpleGet, and 1.5GB for SimpleGetMultiRoute.

Let’s quickly look at the flamegraph:
pekko-vanilla-flame-get

The culprit lies in the way pekko-http deals with multiple routes concatenated together to do the matching. In the first part, we saw how seriously it affects throughput. Now we can confirm that it’s indeed the routing that is responsible, additionally eating a lot of CPU unnecessarily.
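To make this concrete, the shape of the problem is roughly the following; this is a simplified sketch, not the benchmark’s actual route definitions:

```scala
import org.apache.pekko.http.scaladsl.server.Directives._
import org.apache.pekko.http.scaladsl.server.Route

// 128 routes chained with `~`: a GET /path127 request is tried against
// /path0, /path1, ... in order, failing 127 times before it finally matches.
val route: Route =
  (0 to 127)
    .map(i => path(s"path$i") { complete(s"response from /path$i") })
    .reduce(_ ~ _)
```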

tapir-pekko-http

pekko-tapir-cpu-get
Memory consumption seems lower with Tapir, stabilizing around 380MB for SimpleGet and 1.5GB for SimpleGetMultiRoute. With Tapir, the heavy CPU load in the SimpleGetMultiRoute test seems to be gone. Let’s confirm by looking at the flamegraph:

pekko-tapir-flame-get
Indeed, routing doesn’t seem to be a concern anymore. Apart from the standard minimal overhead caused by the ServerInterpreter’s components, we see a few hot frames when preparing a response. They may be worth optimizing (GitHub issue):

  • sttp.tapir.server.pekkohttp.PekkoServerRequest.acceptsContentTypes() 2.4%
  • sttp.tapir.server.pekkohttp.PekkoToResponseBody.parseContentType(String) 3.53%
  • sttp.tapir.server.pekkohttp.PekkoModel$.parseHeadersOrThrowWithoutContentHeaders(HasHeaders) 2.34%

pekko-http: PostBytes, PostLongBytes

vanilla-pekko-http

CPU for PostBytes oscillates around 4-6%, while for PostLongBytes it’s in the range of 35%-55%.

pekko-vanilla-cpu-post
The PostBytes test makes the backend use no more than 75MB of memory, while PostLongBytes stabilizes at 3.8GB.

Let’s confirm with the flamegraph that the routing part is not an issue in this case:

pekko-vanilla-flame-post

tapir-pekko-http

Wrapping with Tapir makes the PostBytes CPU graph rise to 9-11%, while PostLongBytes doesn’t look significantly different:

pekko-tapir-cpu-post
Memory consumption is up to 125MB for PostBytes, and occasionally 4GB for PostLongBytes. What about hot frames?

pekko-tapir-flame-post
Beige elements, which are the “Tapir flames”, show pretty much the same patterns I saw in previous tests, so no new worrying data appears here.

Netty (Future): SimpleGet, SimpleGetMultiRoute

Since there’s no “Vanilla Netty”, we will focus on comparing netty-future vs netty-cats-effect backends. We are also interested in spotting any interesting hot frames on the flamegraph.

netty-future-cpu-get
CPU is around 35-38%. Memory consumption doesn’t exceed 512MB in either case.

netty-future-flame-get
A significant part of the consumed time is spent in Netty’s writeAndFlush, which seems OK. Tapir’s logic shows pretty much the same patterns as previously, with a few hot frames that may be worth optimizing:

  • sttp.tapir.server.netty.NettyServerRequest.uri() 4.41%
  • sttp.tapir.model.ServerRequest.acceptsContentTypes$(ServerRequest) 2.06%

This is not the first time we’ve run into parsing the URI or the “Accept” headers as an expensive part of the total collected samples. A GitHub issue has been added for NettyServerRequest.uri.
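One possible direction is to defer the parse so its cost is only paid when something actually reads the URI. The sketch below is purely hypothetical and is not the actual NettyServerRequest implementation; only Uri.unsafeParse is the real sttp-model API:

```scala
import sttp.model.Uri

// Hypothetical request wrapper, only to illustrate the eager-vs-lazy trade-off.
final class IllustrativeServerRequest(rawUri: String) {
  // Eager: Uri.unsafeParse runs for every request, even if no interceptor
  // or endpoint ever reads the parsed result.
  // val uri: Uri = Uri.unsafeParse(rawUri)

  // Lazy: the parse happens at most once, and only on first access.
  lazy val uri: Uri = Uri.unsafeParse(rawUri)
}
```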

Netty (Future): PostBytes, PostLongBytes

During these tests, CPU was between 15% and 23% for PostBytes, and between 40% and 55% for PostLongBytes.

netty-future-cpu-post
Not much memory is needed in the case of PostBytes - 128MB. PostLongBytes requires up to 3.2GB.

netty-future-flame-post
Underneath, Tapir uses netty-reactive-streams with its Subscribers and Publishers. The flamegraph immediately shows that our SimpleSubscriber does expensive work here:

  • sttp.tapir.server.netty.internal.reactivestreams.SimpleSubscriber.onNext(HttpContent) 18.7%

This is expected because onNext does the actual job of reading Netty’s byte buffer and copying it into our internal representation.
However, there’s also

  • sttp.tapir.server.netty.internal.reactivestreams.SimpleSubscriber.onComplete() 18.86%

This part is responsible for iterating over all collected byte arrays and copying them into one single result array. This can definitely be optimized to avoid such memory and CPU overhead. GitHub issue: 3548.
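To make the cost concrete, onComplete essentially has to do the kind of work sketched below; this is a simplified illustration, not the actual SimpleSubscriber code:

```scala
// Chunks accumulated by onNext, e.g. one byte array per received HttpContent.
def concatChunks(chunks: Seq[Array[Byte]]): Array[Byte] = {
  val total = chunks.iterator.map(_.length).sum
  val result = new Array[Byte](total) // one extra allocation of the full body size...
  var offset = 0
  chunks.foreach { chunk =>           // ...plus a second full copy of every byte received
    System.arraycopy(chunk, 0, result, offset, chunk.length)
    offset += chunk.length
  }
  result
}
```

One plausible improvement, if the expected size is known up front, would be to write incoming chunks directly into a single pre-allocated array and skip the second copy entirely.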

Netty (Cats Effect): SimpleGet, SimpleGetMultiRoute

The total load is slightly lower than in the Netty Future server.

netty-cats-cpu-get
Memory usage is higher than in the Future-based backend, up to 600MB.

netty-cats-flame-get
The flamegraph shows similar patterns to the future-based implementation, where flames are “growing on top of” base frames like cats.effect.IOFiber.run. The CPU spends significant time on the writeAndFlush method, but that’s expected.

Netty (Cats Effect): PostBytes, PostLongBytes

When dealing with small POSTs, CPU reached 12% - 15%, while the PostLongBytes test made it reach values between 40 and 53%.

netty-cats-cpu-post
Memory consumption is 256MB for PostBytes, and up to 2GB for PostLongBytes.

netty-cats-flame-post
In the case of the Cats-Effect-based backend, the underlying reactive stream doesn’t use our custom code, but fs2’s implementation of these interfaces. That’s why we’re seeing fs2.Chunk.toArray consuming so much CPU time, similarly to, for example, http4s. This pattern doesn’t look like anything we should be worried about. Other flames are standard Tapir blocks.

Play: SimpleGet, SimpleGetMultiRoute

In Part 1 we noticed that tapir-play suffers from high overhead when there are multiple routes. Let’s dive deeper.

play-vanilla

Our vanilla server reached 55-62% of CPU during these tests.

play-vanilla-cpu-get
The backend takes up to 1.5GB of RAM in either case.

This can be quickly analyzed on the flamegraph, where we can see that composing a high number of routes out of partial functions with .orElse is quite an expensive process.

play-vanilla-flame-get
We can expect a normal Play app to consume much less CPU when there are fewer routes and when these routes aren’t built dynamically using .orElse.
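The pattern itself can be reproduced with plain partial functions; this is a schematic illustration, not Play’s actual routing types:

```scala
// Each lookup walks the orElse chain until some partial function is defined at the input.
type Routes = PartialFunction[String, String]

val routes: Routes =
  (0 to 127)
    .map { i =>
      val single: Routes = { case p if p == s"/path$i" => s"response from /path$i" }
      single
    }
    .reduce(_ orElse _)

// Matching "/path127" evaluates (and rejects) 127 guards before the last one succeeds.
routes("/path127")
```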

tapir-play

Wrapping with Tapir makes SimpleGet and SimpleGetMultiRoute hit a very high CPU load!

play-tapir-cpu-get
Consumed memory doesn’t cross 1GB for SimpleGet and 1.5GB for SimpleGetMultiRoute.

play-tapir-flame-get
The flamegraph confirms that PlayServerInterpreter.toRoutes is very expensive, dominating the CPU.

Play: PostBytes, PostLongBytes

play-vanilla

In these tests, the vanilla server doesn’t suffer from path concatenation issues, consuming 9% CPU for PostBytes and up to 23% for PostLongBytes.

play-vanilla-cpu-post
Memory usage reaches 250MB for PostBytes and 790MB for PostLongBytes.

tapir-play

The CPU load is around 33-35% for PostBytes and oscillates around 50% for PostLongBytes.

play-tapir-cpu-post
Memory needed by the PostBytes test reaches 500MB, while for PostLongBytes it’s 5GB, a lot more than for the vanilla server! The PlayServerInterpreter doesn’t need to spend that much time on toRoutes anymore:

play-tapir-flame-post
Still, this frame and its parent frames indicate a non-trivial overhead that will be reduced once toRoutes is fixed (GitHub issue). Other than that, we see an expected wide frame concerned with copyToArray, caused by PostLongBytes, and the standard ServerInterpreter operation stack.

Vert.X - SimpleGet, SimpleGetMultiRoute

vertx-vanilla

Vert.X CPU usage in these tests looks pretty impressive, reaching only up to 2%!

vertx-vanilla-cpu-get
Memory consumption is also very low, stabilizing below 50MB during SimpleGet and around 115MB during SimpleGetMultiRoute.

tapir-vertx (Future)

The Future-based Tapir wrapper keeps CPU usage very low:

vertx-tapir-future-get-cpu
Memory usage for SimpleGet reaches 120MB, while SimpleGetMultiRoute requires 160MB.
Still, let’s see if the CPU usage can be optimized.

vertx-tapir-future-get-cpu
There are two frames that may be worth looking at to reduce the load:

  • sttp.tapir.server.vertx.decoders.VertxServerRequest.acceptsContentTypes() 3.34%
  • sttp.tapir.server.vertx.decoders.VertxServerRequest.uri() 5.99%

This is a similar pattern to what we’ve seen in other servers, happening during Tapir’s standard input decoding using sttp-model parsers. A GitHub issue has been created to explore the possibility of spending less time parsing the Uri.

tapir-vertx (Cats Effect)

After wrapping the backend with Tapir based on Cats Effect, we notice that the GET tests consume more CPU, oscillating around 8%.

vertx-ce-tapir-get-cpu
Memory usage reaches 200MB for SimpleGet and 300MB for SimpleGetMultiRoute.
vertx-ce-tapir-flame-get
The flamegraph doesn’t reveal any bottlenecks introduced by Tapir, except the already known processing of Uris. Any additional load is introduced by the Cats Effect runtime.

Vert.X - PostBytes, PostLongBytes

vertx-vanilla

During the PostBytes test, the vanilla Vert.X server still uses only 1-2% of CPU, while processing long byte arrays makes it a little busier, with occasional peaks up to 13%.

vertx-vanilla-cpu-post
Memory consumption for PostBytes still stays around 50MB, and the PostLongBytes test makes the server consume up to 3.75GB.

tapir-vertx (Future)

CPU usage looks almost identical to the vanilla server:

vertx-tapir-cpu-post
As for memory, it consumes up to 128MB during PostBytes and up to 3.5GB during PostLongBytes, but with much more frequent garbage collections.

vertx-tapir-flame-post
The decodeBody frame is quite hot because underneath it calls io.vertx.ext.web.RequestBody.buffer(), which reads a large byte array. This could potentially be optimized by switching to the underlying Vert.X stream instead of calling buffer(), so we’ll keep it as an issue worth investigating.

tapir-vertx (Cats Effect)

We can notice that there’s some additional load, causing PostBytes to rise up to 4-5%, and PostLongBytes up to 19%.

vertx-ce-tapir-cpu-post
Memory needed for the tests reached 256MB for PostBytes, while PostLongBytes first consumed 4.75GB, then fell to a stable level of 1.75GB after 15 seconds.

vertx-ce-tapir-flame-post
This hot frame of VertxRequestBody.toRaw may look like additional work compared to the Future-based backend. However, it’s actually an equivalent of what we saw there as the decodeBody operation, just reading the final buffer from the underlying large request. Again, there’s a potential gain here if we switch to the streaming representation, but let’s not be overly optimistic about it - we saw that fs2 is pretty greedy for the CPU, for example, in the http4s and Netty backends.

Conclusions

Let’s sum up all the findings from our investigations (see the GitHub milestone for a list of issues):

  • The DefaultServerLog interceptor adds significant overhead, and we now have more insight into the underlying causes - parsing the Uri and spending time in sttp.tapir.Endpoint.showPathTemplate and sttp.tapir.Endpoint.method.
  • pekko-http routing issues are prevented by Tapir’s path pre-matching.
  • pekko-http pays a non-trivial cost for calling sttp-model/ContentType.parse and sttp-model/HttpHeader.parse; a small boost may be possible here if these methods are avoided or optimized.
  • The SimpleSubscriber used by the Netty Future backend can be optimized by avoiding copying byte arrays in onComplete, and possibly by optimizing onNext.
  • PlayServerInterpreter.toRoutes is expensive and causes a very high overhead.
  • All backends spend a few % of their time on sttp-model/Uri.unsafeParse; we should look into calling it less frequently, switching to a different method, or optimizing sttp-model itself.
  • Similarly with sttp-model/Accept.unsafeParse, but this would result in a smaller gain.
  • Tapir adds up to 200MB of memory consumption, except for the Play backend under PostLongBytes test, where the overhead was quite an anomaly (4GB). In some cases, maximum memory usage was lower for tests run with Tapir.

WebSockets

Testing Methodology

The test is inspired by the websocket-benchmark project by Andriy Plokhotnyuk and Kamil Kloch. We want to simulate a very high number of users (2500) subscribing to endpoints, sending 600 requests and receiving responses at a constant rate. To achieve this, the endpoint reads data from a throttled stream, which emits elements every 100 milliseconds. Each user connects and sends a series of GET requests; the throttled responses contain the timestamp of the moment they were sent. The test client calculates latency manually by subtracting the timestamp found in the response’s content from the time the response was received. Data is registered using an HdrHistogram. The code tests only the latency of the response flow back to the client.
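A minimal sketch of the client-side measurement described above, with illustrative names; only the HdrHistogram calls are the library’s real API:

```scala
import org.HdrHistogram.Histogram

// Tracks values up to 1 hour (in milliseconds) with 3 significant digits of precision.
val histogram = new Histogram(3600000L, 3)

// Called for every WebSocket frame received by a simulated user.
def onFrame(framePayload: String): Unit = {
  val sentAtMillis = framePayload.toLong                        // timestamp embedded by the server
  val latencyMillis = System.currentTimeMillis() - sentAtMillis // response-flow latency
  histogram.recordValue(math.max(0L, latencyMillis))
}

// After the run, high percentiles are available directly, e.g.:
// histogram.getValueAtPercentile(99.9999)
```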

Results

Most servers kept latency under 30ms, with the exception of vertx-tapir-ce (Cats Effect Tapir backend wrapping Vert.X):

ws-all
Let’s remove it from the chart and first compare other backends:

ws-all-but-vertx
The pekko-http backend is slightly slower, but just like other backends, it keeps up with the vanilla server up to 99.999%.
What about vertx-tapir-ce?

vertx-tapir-ce
The chart above plots two additional lines:

  • jvm-vertx-tapir-ce, showing the raw JVM latency
  • control-vertx-tapir-ce showing the OS latency

These two lines allow us to verify how much of the latency is caused by the CPU being busy outside of our app. They have been generated by injecting the jHiccup Java agent into the running server. It’s a great and simple OS instrumentation tool that can help analyze what constitutes the measured latency. One can even use it to find bugs in the tests themselves, if the app latency appears better than the JVM/OS latency.

There may be many reasons why our CPU is busy at the raw JVM level - extensive GC pauses, overloaded threads struggling to reach safepoints, or something else. However, we can still see that 80% of the load happens in app-related code.

After a quick analysis, we noticed that vertx-tapir-ce WebSockets use sub-optimal fs2 stream operators for joining ping and pong signals, just as was recently noticed and fixed in the http4s-based backend. A quick test that removes these additional streams indeed shows a strong improvement:

vertx-tapir-ce-fast
This means that applying a fix analogous to http4s’s should be our first task. There’s still room for improvement, perhaps related to stream conversion between Vert.X streams and fs2. Fortunately, we now have good tests that will let us quickly check how much we’ll gain by optimizing that code. Here’s a GitHub issue to track this.
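For illustration, the kind of stream composition involved looks roughly like this; it’s a sketch with assumed intervals, not the backend’s actual code:

```scala
import cats.effect.IO
import fs2.Stream
import scala.concurrent.duration._

// Responses emitted every 100 ms, carrying the send timestamp, as in the benchmark.
val responses: Stream[IO, String] =
  Stream.awakeEvery[IO](100.millis).map(_ => System.currentTimeMillis().toString)

// A periodic keep-alive ping stream.
val pings: Stream[IO, String] =
  Stream.awakeEvery[IO](10.seconds).as("ping")

// Joining the two with `merge` adds an extra concurrent stage that every
// response has to pass through:
val withPings: Stream[IO, String] = responses.merge(pings)

// When pings aren't needed, skipping the merge keeps the hot path a single
// stream - roughly the shape of the improvement measured above.
val withoutPings: Stream[IO, String] = responses
```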

Next steps

We learned a lot about Tapir’s performance, finding many cases where its overhead is negligible, as well as some bottlenecks worth optimizing. The “performance” milestone on GitHub tracks all these tasks.

If you’re considering Tapir for your system and you know your performance requirements, you’ll most probably find a set of robust backends to choose from. But this is not the end! As a next step, I’d like to take a look at tapir-netty-loom and tapir-nima backends, which have been recently added as servers based on Java 21 Virtual Threads, potentially allowing us to achieve even better results. Stay tuned!
