
Benchmarking Tapir: Part 1

Krzysztof Ciesielski

19 Feb 2024, 12 minutes read


Tapir is a well-adopted Scala library that makes building web APIs easier. It uses Scala's strong type system, letting you create endpoints, generate documentation, handle errors, work with cross-cutting concerns (interceptors, security), and work with JSON, all with the help of the compiler. You can then plug your API design into different Scala server backends like http4s, pekko-http, or Play. Over time, Tapir has become rich with features and very flexible. Its large number of adopters shows it's very reliable, but we want to see how much overhead it adds on top of the servers it uses. From time to time, GitHub issues reveal performance problems, like the great analysis and fix Kamil Kloch made for WebSockets (check it out here). The time has come to do a proper deep and broad benchmark and lift Tapir to an even higher level of reliability, with a solid set of tools that allow repeating performance tests easily and tracking the impact of fixes and updates that will come in the future.
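For readers who haven't used Tapir, here is a minimal sketch of what an endpoint description might look like (the names below are illustrative and not taken from the benchmark code):

import sttp.tapir._

// Describes GET /hello?name=... returning a plain-text body.
// The description itself is server-agnostic.
val helloEndpoint: PublicEndpoint[String, Unit, String, Any] =
  endpoint.get
    .in("hello")
    .in(query[String]("name"))
    .out(stringBody)

Such a description is later interpreted as a route for a chosen backend: http4s, pekko-http, Netty, and so on.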

Testing methodology

The main goal of part 1 is to compare Tapir servers with implementations exposing the same endpoints but written using only the underlying backends, AKA their "Vanilla" counterparts. We are also going to compare Tapir servers with each other. The tests measure latency and throughput under different scenarios and conditions. Analysis of memory consumption and garbage collection is also crucial, but we're leaving this, along with WebSocket benchmarks, for the next parts of our research.

We selected a few backends based on their popularity (Sonatype OSS downloads) and stability: http4s, Pekko, Play, Vert.x, and Netty. The last two are additionally divided into sub-flavors: a Future-based and a Cats-Effect-based implementation.

The scenarios we'll be checking are:

  1. A simple GET request (SimpleGet)
  2. A POST request with a small byte array body (256B) (PostBytes)
  3. A POST request with a small String body (256B) (PostString)
  4. A POST request with a large byte array body (5MB) (PostLongBytes)
  5. A POST request with a large String body (5MB) (PostLongString)
  6. A simple GET request to /path127 (SimpleGetMulti)

The last one is a special scenario, which sends a GET /path127 request to a server with 128 endpoints named /path0, /path1, …, /path127. The http4s, Pekko, and Play servers try to match all the endpoints individually, so it may take them a significant amount of time to go through all the /pathN endpoints before hitting /path127. Tapir's pre-matching of paths could reduce the latency of such requests, hence this specific test; we should also expect a similar optimization in a Vert.x server without Tapir. Another reason for including SimpleGetMulti is to show that in real-world scenarios, where there are many endpoints, performance may depend on more factors than just the overhead of handling a single endpoint.
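For illustration, such a multi-endpoint setup could be described roughly like this (a sketch only; the actual perf-tests code may differ):

import sttp.tapir._

// 128 GET endpoints: /path0, /path1, ..., /path127, each returning a small text body.
val pathEndpoints: Seq[PublicEndpoint[Unit, Unit, String, Any]] =
  (0 to 127).map { i =>
    endpoint.get.in(s"path$i").out(stringBody)
  }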

All scenarios run for 25 seconds, preceded by 5 seconds of a warm-up execution, the results of which are not considered.

Multiple test runs show a variability of around 10% for throughput, so this has been included on charts as error bars.

Concurrent users

The tests have been executed for single and concurrent users (128). The multi-user test represents a high load and allows an overall comparison of backends. The single-user variant gives us clear isolation; the requests are sent sequentially, and we can watch Tapir's throughput and latency footprints over a single flow without concurrency affecting it. It can be more reliably repeated in the future to check whether performance fixes are beneficial.

Adding an interceptor

An interesting observation reported by Kamil Kloch suggests that the ServerLog interceptor may have an excessive impact on performance. We are not profiling CPU or memory yet; we are just checking the latency/throughput metrics, but it’s worthwhile to compare our results with ServerLog enabled and disabled. This will be another axis of our test matrix.
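As an illustration, toggling the ServerLog interceptor for, say, the Netty Future backend might look roughly like this (the builder calls follow Tapir's interceptor customisation API, but treat the exact names as an assumption that may vary between versions):

import sttp.tapir.server.netty.{NettyFutureServer, NettyFutureServerOptions}

// Default options keep the ServerLog interceptor enabled.
val withLogging = NettyFutureServerOptions.default

// Disable the interceptor via the customisation DSL.
val withoutLogging = NettyFutureServerOptions.customiseInterceptors
  .serverLog(None)
  .options

val server = NettyFutureServer(withoutLogging)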

Different environments

I ran the tests primarily on my Linux PC, but I also wanted to verify whether the analysis could be replicated well, so I repeated the process on a MacBook Pro. When discussing results, I'll focus on values registered on the Linux machine. I'll refer to MBP results a few times where it might be relevant.

Linux PC specs: AMD Ryzen 9 5950X (32) @3.4GHz, 64GB RAM
MBP specs: Intel i9-9880H (16) @2.3GHz, 64GB RAM

Quick summary of testing methodology:

  • 2 Environments: Linux PC, MBP
  • JDK 21, Scala 2.13 (forced by Gatling), sbt 1.9.8
  • Backends under test: http4s 0.23.25, Netty 4.1.106, pekko-http 1.0.0, Play 3.0.1, Vert.x 4.5.1
  • Scenarios: Simple GET, GET /path127, POST: byte array, string, long byte array, long string
  • Interceptors: Test with and without ServerLog
  • Workloads: single user with sequential requests, 128 concurrent users in some cases, and 3 concurrent users in one special case.
  • Measurements: mean throughput (reqs/s), latency percentiles (p99, p95, p75)

Testing harness

To cover all the possible scenarios and variables, I needed a convenient test-running system that allows executing an entire suite with a single command. It has been implemented and documented as a sub-project in Tapir: perf-tests (GitHub link). We chose Gatling – a solid Scala tool for performance tests that drives request traffic against a running server. To read test suite parameters from the command line, I used scopt – a neat library that offers an elegant DSL for specifying the parsing rules and handles displaying a help screen as well as other hints:
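A minimal sketch of what such a scopt parser could look like (the option letters mirror the CLI flags used below, but the field names and defaults are assumptions, not the actual perf-tests code):

import scopt.OParser

// Hypothetical parameter holder for a test suite run.
case class SuiteParams(
  servers: Seq[String] = Nil,
  simulations: Seq[String] = Nil,
  durationSeconds: Int = 25,
  users: Int = 1
)

val builder = OParser.builder[SuiteParams]
val parser = {
  import builder._
  OParser.sequence(
    programName("PerfTestSuiteRunner"),
    opt[Seq[String]]('s', "servers")
      .action((x, p) => p.copy(servers = x))
      .text("comma-separated servers, wildcards allowed, e.g. netty.future.*"),
    opt[Seq[String]]('m', "simulations")
      .action((x, p) => p.copy(simulations = x))
      .text("comma-separated Gatling simulation names"),
    opt[Int]('d', "duration")
      .action((x, p) => p.copy(durationSeconds = x)),
    opt[Int]('u', "users")
      .action((x, p) => p.copy(users = x))
  )
}

// Prints help and error messages automatically when parsing fails.
def parse(args: Array[String]): Option[SuiteParams] =
  OParser.parse(parser, args, SuiteParams())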

Adding the possibility to specify entire groups of servers makes it easy to repeat specific configurations. For example, you can run:

perfTests/Test/runMain sttp.tapir.perf.PerfTestSuiteRunner -s netty.future* -m SimpleGet,SimpleGetMultiRoute,PostBytes,PostString,PostLongBytes,PostLongString -d 25 -u 128

to run the selected simulations on all netty.future.* servers for 25 seconds with 128 concurrent users. This will include netty.future.Tapir, netty.future.TapirInterceptorMulti, etc. In this blog post, each discussed suite is accompanied by the command that reproduces the test.

A single test in a suite follows these steps:

  1. Start the backend
  2. Run a Gatling scenario in warm-up mode (3 users, 5 seconds)
  3. Run a Gatling scenario in “real” mode (users and duration from parameters)
  4. Stop the backend
  5. Read the latest Gatling simulation.log file to calculate metrics

This way, we ensure that each test starts on a clean server unaffected by previous tests yet warmed up with a set of requests identical to those used in the test.

Running Gatling in such a loop isn’t the standard way of running tests with this tool, but fortunately, it exposes a Gatling.fromMap(props) method, which can be leveraged.
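A rough sketch of such a programmatic invocation (the simulation class name here is hypothetical, and the exact properties set by perf-tests may differ):

import io.gatling.app.Gatling
import io.gatling.core.config.GatlingPropertiesBuilder

// Build the property map that Gatling normally derives from its CLI arguments.
val props = new GatlingPropertiesBuilder()
  .simulationClass("sttp.tapir.perf.SimpleGetSimulation") // hypothetical simulation class
  .resultsDirectory("target/gatling")
  .build

// Runs the simulation and returns Gatling's exit code.
val exitCode: Int = Gatling.fromMap(props)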

After all tests are run, aggregated results are combined and put into HTML and CSV reports. To build the HTML report simply and cleanly, I used scalatags – a nice minimal library well suited for such purposes.
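For instance, a minimal scalatags report could be put together like this (the real report has more columns and styling):

import scalatags.Text.all._

// Renders a simple HTML table of test names and mean throughput values.
def report(rows: Seq[(String, Double)]): String =
  html(
    body(
      h1("Tapir performance results"),
      table(
        tr(th("Test"), th("Mean throughput (reqs/s)")),
        rows.map { case (name, reqsPerSec) =>
          tr(td(name), td(f"$reqsPerSec%.1f"))
        }
      )
    )
  ).render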

Results

First, let’s walk through each server individually and focus on the main Tapir vs Vanilla comparisons for different scenarios, user counts, interceptors, etc. Then, we’ll make an overall comparison between backends.

http4s

It is the most popular Tapir backend and one of the most established HTTP servers in the Scala ecosystem.

Let’s start our first test suite with:

perfTests/Test/runMain sttp.tapir.perf.PerfTestSuiteRunner -s http4s.* -m SimpleGet,SimpleGetMultiRoute,PostBytes,PostString,PostLongBytes,PostLongString -d 25

In the results, TapirMulti means a Tapir server with multiple (128) endpoints (GET /path0, /path1, …, /path127); the same goes for VanillaMulti.

Our first test shows that Tapir is very close to the vanilla server in such an isolated scenario. The (i) variant adds a ServerLog interceptor, which seems to reduce the performance further, but let's not forget the ~10% error margin. Latencies are 1ms or below (Gatling doesn't offer sub-millisecond precision) for small requests and between 5ms and 8ms for large requests, so putting them on a chart won't tell us much - we can assume that for such low values, there is no alarming overhead. We can keep focusing on the throughput.

A double-check of the same tests run on an MBP confirms no anomalies.

Command:

perfTests/Test/runMain sttp.tapir.perf.PerfTestSuiteRunner -s http4s.* -m SimpleGet,SimpleGetMultiRoute,PostBytes,PostString,PostLongBytes,PostLongString -d 25 -u 128

Adding concurrency shows results which further strengthen some initial assumptions: the ServerLog interceptor has a slight impact, noticeable only under a heavy load of GETs. It also looks like contention blurs the difference in throughput when handling POSTs.
Still, we can rest assured that Tapir's overhead doesn't become an issue when concurrency kicks in.

Netty

Command:

perfTests/Test/runMain sttp.tapir.perf.PerfTestSuiteRunner -s netty.* -m SimpleGet,SimpleGetMultiRoute,PostBytes,PostString,PostLongBytes,PostLongString -d 25

There is no notion of a "Vanilla Netty server" in our tests because raw Netty-based servers are not a typical setup. Netty represents a low-level communication layer over which other servers build their engines and APIs. We will thus analyze only Tapir backends based on Netty: one based on Futures, and another using Cats Effect.

Comparing them, we can notice that netty.cats handles small POST requests with significantly worse performance than the Future-based server, almost a 50% difference. Additionally, running the tests leads to memory leaks reported by Netty itself in the server logs, and even results in the backend freezing if we add concurrency and let it run for more than a few seconds. A fix has been applied, and our tests helped us to verify it. Compared to other servers, netty.future performs better than Play and http4s but a bit worse than pekko-http. How about throwing in some concurrency?


Command:

perfTests/Test/runMain sttp.tapir.perf.PerfTestSuiteRunner -s netty.future.* -m SimpleGet,SimpleGetMultiRoute,PostBytes,PostString,PostLongBytes,PostLongString -d 25 -u 128

The server performs really well under such load, showing no anomalies. The Future-based server is no longer much faster: it shows ~10% better throughput than the one based on Cats Effect. The overhead of the ServerLog interceptor isn't noticeable for POST requests.

pekko-http

pekko-http was forked from the hugely influential and popular akka-http server and is battle-tested in many production setups.


Command:

perfTests/Test/runMain sttp.tapir.perf.PerfTestSuiteRunner -s pekko.* -m SimpleGet,SimpleGetMultiRoute,PostBytes,PostString,PostLongBytes,PostLongString -d 25

A pure, unwrapped Pekko server handled GET /path0 25% faster. Tapir is 17-20% slower than vanilla pekko-http for small POST requests, and up to 32% slower with the ServerLog interceptor. Another interesting observation is that requesting GET /path127 makes the vanilla server struggle quite a bit: its throughput drops to ~18% of Tapir's interpreter in such a case! It looks like the mentioned path pre-matching really is a significant improvement.

How about running the tests with concurrent user requests?


Command:

perfTests/Test/runMain sttp.tapir.perf.PerfTestSuiteRunner -s pekko.* -m SimpleGet,SimpleGetMultiRoute,PostBytes,PostString,PostLongBytes,PostLongString -d 25 -u 128

With a simple GET request, we can see that pekko-http outperforms its Tapir-wrapped counterpart by 9%, and by 27% if we add an interceptor. However, the SimpleGetMulti scenario again shines on Tapir's side. The pre-matching seems to be a key benefit, letting Tapir keep pace while pekko-http runs into a significant bottleneck.

I repeated this test multiple times just to ensure that it consistently shows such a large difference: a drop from ~25k to ~5.5k. 128 concurrent users is quite a heavy load, though, so I checked a lighter scenario of 3 users. For such a case, the drop on the PC was from ~50k to ~5.5k, and on the MBP, it was from ~35k to ~2.5k. Still, quite a big deal.

Play

Command:

perfTests/Test/runMain sttp.tapir.perf.PerfTestSuiteRunner -s play.* -m SimpleGet,SimpleGetMultiRoute,PostBytes,PostString,PostLongBytes,PostLongString -d 25

We instantly notice a huge difference between play.TapirMulti and play.VanillaMulti. It may be one of the first very serious bottlenecks we've discovered among Tapir interpreters. When we reduce the number of endpoints to just a few, Tapir holds up much better (see play.Tapir and play.Tapir (i) on the chart). It's still noticeably slower than the raw server, reaching 35%-45% of its throughput. As for the huge drop on a multi-endpoint server, the throughput of SimpleGetMulti is ~700 reqs/s for 64 endpoints, ~1700 reqs/s for 32 endpoints, and ~2900 reqs/s for 8 endpoints, so the overhead really kicks in even for smaller servers.

Now, the concurrent test:

Command:

perfTests/Test/runMain sttp.tapir.perf.PerfTestSuiteRunner -s play.* -m SimpleGet,SimpleGetMultiRoute,PostBytes,PostString,PostLongBytes,PostLongString -d 25 -u 128

(This will run unnecessary tests like SimpleGetMulti on play.Tapir, so their results are removed from the chart.)
With high congestion and a small number of endpoints, the Tapir-based server handles simple GETs at 60% of the vanilla server's throughput, but it manages to process small POST requests 16%-18% faster.

Vert.x


Command:

perfTests/Test/runMain sttp.tapir.perf.PerfTestSuiteRunner -s vertx.* -m SimpleGet,SimpleGetMultiRoute,PostBytes,PostString,PostLongBytes,PostLongString -d 25

The vanilla server is significantly faster when dealing with simple GETs, but it also slows down seriously with many endpoints, a scenario in which Tapir isn't affected that much. That's probably another sign of the path pre-matching mechanism working to Tapir's benefit. The Cats-Effect-based backend is, as expected, not as fast as the Future-based one, but nothing out of the ordinary. Adding an interceptor, as with other backends, reduces the throughput by a few percent.


Command:

perfTests/Test/runMain sttp.tapir.perf.PerfTestSuiteRunner -s vertx.* -m SimpleGet,SimpleGetMultiRoute,PostBytes,PostString,PostLongBytes,PostLongString -d 25 -u 128

With concurrency, the Cats-Effect-based backend becomes noticeably faster than the Future-based one. It also reaches the performance of the vanilla server when processing POST requests. Simple GETs are 25% slower, but GET /path127 is over 30% faster on the Tapir-based backend. Also, the overhead introduced by interceptors becomes barely noticeable.

Comparing backends

To compare all backends in the concurrent test runs, I ran tests with 128 and 32 endpoints for the Play backend to illustrate how its performance drops depending on this number. All the other backends were tested with 128 endpoints. Netty servers are the fastest, but http4s and pekko-http are very close behind them when it comes to handling small POSTs. However, GET handlers are significantly slower with http4s.

The PostLongBytes test deserves a separate ranking. We wanted to benchmark how the underlying streaming mechanisms of different backends perform, and pekko-http is a clear winner. Our custom streaming components based on Reactive Streams in the Netty-based server are still very robust, comparable to the streaming solutions underlying http4s and Play, so that's good news.

Summary

The first round of performance tests allowed us to identify a few key issues to address in Tapir backends:

  • Play routing slows down quickly when increasing the number of endpoints
  • ServerLog interceptor adds overhead, which may accumulate with more interceptors
  • netty-cats backend needed urgent fixes for memory leaks

To answer the key question, "What's Tapir's overhead?", let's look at the chart below:


The chart doesn't include the Play backend because we've clearly seen it has issues. I have selected the "pessimistic" setup: 128 endpoints, the interceptor enabled, and very high concurrency (128 users). 100% on the Y axis represents the reference point - a vanilla server. For example, tapir-http4s reaches ~80% of the Tapir-less server's throughput for GETs, but typical POST requests are handled with pretty similar performance. In the SimpleGetMulti scenario, some Tapir backends nicely exceed the vanilla servers; for pekko-http, Tapir's throughput was 350x higher, way off the chart! We get initial confirmation that wrapping an HTTP server with Tapir often keeps performance good. Additionally, we learned that building Netty-based servers on top of Reactive Streams was worthwhile, and we'll keep working on improving this backend flavor as one of our most lightweight and dependency-free ones. However, we must still dig deeper, especially to profile memory consumption and benchmark WebSockets. Stay tuned!

Check: Benchmarking Tapir: Part 2 & Benchmarking Tapir: Part 3 (Loom)

Reviewed by: Adam Warski, Michał Matłoka
