Message queues are central to many distributed systems and often provide a backbone for asynchronous processing and communication between (micro)services. They are useful in a number of situations; any time we want to execute a task asynchronously, we put the task on a queue and some executor (could be another thread, process or machine) eventually runs it. Or when a stream of events arrives at our system, we want them to be received quickly, and processed asynchronously by other components later.

Various message queue implementations can give various guarantees on message persistence and delivery. For some use-cases, it is enough to have an in-memory, volatile message queue. For others, we want to be sure that once the message send completes, the message is persistently enqueued and will be eventually delivered, despite node or system crashes.

In the following tests we will be looking at systems on the 'safe' side of this spectrum, which try to make sure that messages are not lost by:

  • persisting messages to disk
  • replicating messages across the network

We will examine the characteristics of a number of message queueing and data streaming systems, comparing their features, replication schemes, supported protocols, operational complexity and performance. All of these factors might impact which system is best suited for a given task. In some cases, you might need top-performance, which might come with tradeoffs in terms of other features. In others, performance isn’t the main factor, but instead compatibility with existing protocols, message routing capabilities or deployment overhead play the central role.

When talking about performance, we’ll take into account both how many messages per second a given queueing system can process, but also at the processing latency, which is an important factor in systems which should react to new data in near real-time (as is often the case with various event streams). Another important aspect is message send latency, that is how long it takes for a client to be sure a message is persisted in the queueing system. This may directly impact e.g. the latency of http endpoints and end-user experience.

Version history

8 Dec 20202020 edition: extended feature comparison, updated benchmarks, new queues (Pulsar, PostgreSQL, Nats Streaming); dropping ActiveMQ 5 in favor of ActiveMQ Artemis
1 August 2017Updated the results for Artemis, using memory-mapped journal type and improved JMS test client
18 July 20172017 edition: updating with new versions; adding latency measurements; adding Artemis and EventStore
4 May 20152015 edition: updated with new versions, added ActiveMQ; new site
1 July 2014original at Adam Warski's blog

The 2020 edition is co-authored with Kasper Kondzielski.

Tested queues

There is a number of open-source messaging projects available, but only some support both persistence and replication. We'll evaluate the performance and characteristics of 10 message queues, in no particular order:

You might rightfully notice that not all of these are message queueing systems. Both MongoDB and PostgreSQL (and to some degree, EventStore) are general-purpose databases. However, using some of their mechanisms it’s possible to implement a message queue on top of them. If such a simple queue meets the requirements and the database system is already deployed for other purposes, it might be reasonable to reuse it and reduce operational costs.

Except for SQS, all of the systems are open-source, and can be self-hosted on your own servers or using any of the cloud providers, both directly or through Kubernetes. Some systems are also available as hosted, as-a-service offerings.

Queue characteristics

We’ll be testing and examining the performance of a single, specific queue usage scenario in detail, as well as discussing other possible queue-specific use-cases more generally.

In our scenario, as mentioned in the introduction, we’ll put a focus on safety. The scenario tries to reflect a reasonable default that you might start with when developing applications leveraging a message queue, however by definition we’ll cover only a fraction of possible use-cases.

There are three basic operations on a queue which we'll be using:

  • sending a message to the queue
  • receiving a message from the queue
  • acknowledging that a message has been processed

On the sender side, we want to have a guarantee that if a message send call completes successfully, the message will be eventually processed. Of course, we will never get a 100% "guarantee", so we have to accept some scenarios in which messages will be lost, such as a catastrophic failure destroying all of our geographically distributed servers. Still, we want to minimise message loss. That's why:

  • messages should survive a restart of the server, that is messages should be persisted to a durable store (hard disk). However, we accept losing messages due to unflushed disk buffers (we do not require fsyncs to be done for each message).
  • messages should survive a permanent failure of a server, that is messages should be replicated to other servers. We'll be mostly interested in synchronous replication, that is when the send call can only complete after the data is replicated. Note that this additionally protects from message loss due to hard disk buffers not being flushed. Some systems also offer asynchronous replication, where messages are accepted before being replicated, and thus there's more potential for message loss. We'll make it clear later which systems offer what kind of replication.

On the receiver side, we want to be able to receive a message and then acknowledge that the message was processed successfully. Note that receiving alone should not remove the message from the queue, as the receiver may crash at any time (including right after receiving, before processing). But that could also lead to messages being processed twice (e.g. if the receiver crashes after processing, before acknowledging); hence our queue should support at-least-once delivery.

With at-least-once delivery, message processing should be idempotent, that is processing a message twice shouldn't cause any problems. Once we assume that characteristic, a lot of things are simplified; message acknowledgments can be done asynchronously, as no harm is done if an ack is lost and the message re-delivered. We also don't have to worry about distributed transactions. In fact, no system can provide exactly-once delivery when integrating with external systems (and if it claims otherwise: read the fine print); it's always a choice between at-most-once and at-least-once.

By requiring idempotent processing, the life of the message broker is easier, however, the cost is shifted to writing application code appropriately.

Performance testing methodology

We'll be performing three measurements during the tests:

  • throughput in messages/second: how fast on average the queue is, that is how many messages per second can be sent, and how many messages per second can be received & acknowledged
  • 95th percentile of processing latency (over a 1-minute window): how much time (in milliseconds) passes between a message send and a message receive. That is, how fast the broker passes the message from the sender to the receiver
  • 95th percentile of send latency (over a 1-minute window): how long it takes for a message send to complete. That's when we are sure that the message is safely persisted in the cluster, and can e.g. respond to our client's http request "message received".

When setting up the queues, our goal is to have 3 identical, replicated nodes running the message queue server, with automatic fail-over. That’s not always possible with every queue implementation, hence we’ll make it explicit what’s the replication setup in each case.

The sources for the tests as well as the Ansible scripts used to setup the queues are available on GitHub.

Each test run is parametrised by the type of the message queue tested, optional message queue parameters, number of client nodes, number of threads on each client node and message count. A client node is either sending or receiving messages; in the tests we used from 1 to 20 client nodes of each type, each running from 1 to 100 threads. By default there are twice as many receiver nodes as sender nodes, but that’s not a strict rule and we’re modifying these proportions basing on what’s working best for a given queue implementation.

Each Sender thread tries to send the given number of messages as fast as possible, in batches of random size between 1 and 10 messages.

The Receiver tries to receive messages (also in batches of up to 10 messages), and after receiving them, acknowledges their delivery (which should cause the message to be removed from the queue). The test ends when no messages are received for a minute.

MQ test setup

The queues have to implement the Mq interface. The methods should have the following characteristics:

  • send should be synchronous, that is when it completes, we want to be sure (what "sure" means exactly may vary) that the messages are sent
  • receive should receive messages from the queue and block them from being received by other clients; if the node crashes, the messages should be returned to the queue and re-delivered, either immediately or after a time-out
  • ack should acknowledge delivery and processing of the messages. Acknowledgments can be asynchronous, that is we don't have to be sure that the messages really got deleted

Server setup

Both the clients, and the messaging servers used r5.2xlarge memory-optimized EC2 instances; each such instance has 8 virtual CPUs, 64GiB of RAM and SSD storage (in some cases additional gp2 disks where used).

All instances were started in a single availability zone (eu-west-1). While for production deployments it is certainly better to have the replicas distributed across different locations (in EC2 terminology - different availability zones), but as the aim of the test was to measure performance, a single availability zone was used to minimise the effects of network latency as much as possible.

The servers were provisioned automatically using Ansible. All of the playbooks are available in the github repository, hence the tests should be reproducible.

Test results were aggregated using Prometheus and visualized using Grafana. We'll see some dashboard snapshots with specific results later.

While the above might not guarantee the best possible performance for each queue (we might have used r5.24xlarge for example), the goal was to get some common ground for comparison between the various systems. Hence, the results should be treated only comparatively, and the tests should always be repeated in the target environment before making any decisions.


Versionserver 4.2, java driver 3.12.5
Replicationconfigurable, asynchronous & synchronous
Replication typeactive-passive

Mongo has two main features which make it possible to easily implement a durable, replicated message queue on top of it: very simple replication setup (we'll be using a 3-node replica set), and various document-level atomic operations, like find-and-modify. The implementation is just a handful of lines of code; take a look at MongoMq.

Replication in Mongo follows a leader-follower setup, that is there’s a single node handling writes, which get replicated to follower nodes. As an optimization, reads can be offloaded to followers, but we won’t be using this feature here. Horizontal scaling is possible by sharding and using multiple replica sets, but as far as message queueing is concerned this would make the queue implementation significantly more complex. Hence this queue implementation is bound by the capacity of the leader node.

Network partitions (split-brain scenarios), which are one of the most dangerous fault types in replicated databases, are handled by making sure that only the partition with the majority of nodes is operational.

We are able to control the guarantees which send gives us by using an appropriate write concern when writing new messages:

  • WriteConcern.W1 ensures that once a send completes, the messages have been written to disk (but the buffers may not be yet flushed, so it's not a 100% guarantee) on a single node; this corresponds to asynchronous replication
  • WriteConcern.W2 ensures that a message is written to at least 2 nodes (as we have 3 nodes in total, that's a majority) in the cluster; this corresponds to synchronous replication

The main downside of the Mongo-based queue is that:

  • messages can't be received in bulk – the find-and-modify operation only works on a single document at a time
  • when there's a lot of connections trying to receive messages, the collection will encounter a lot of contention, and all operations are serialised.

And this shows in the results: sends are much faster than receives. But the performance isn’t bad despite this.

A single-thread, single-node, synchronous replication setup achieves 958 msgs/s sent and received. The maximum send throughput with multiple thread/nodes that we were able to achieve is about 11 286 msgs/s (25 threads, 2 nodes), while the maximum receive rate is 7 612 msgs/s (25 threads, 2 nodes). An interesting thing to note is that the receive throughput quickly achieves its maximum value, and adding more threads (clients) only decreases performance. The more concurrency, the lower overall throughput.


With asynchronous replication, the results are of course better: up to 38 120 msg/s sent and 8 130 msgs/s received. As you can see, here the difference between send and receive performance is even bigger.

What about latencies? In both synchronous and asynchronous tests, the send latency is about 48 ms, and this doesn't deteriorate when adding more concurrent clients.

As for the processing latency, measurements only make sense when the receive rate is the same as the sending rate. When the clients aren't able to receive messages as fast as they are sent, the processing time goes arbitrarily up.

With 2 nodes running 5 threads each, Mongo achieved a throughput of 3 913 msgs/s with a processing latency of 48 ms. Anything above that caused receives to fall back behind sends. Here's the dashboard for that test:

MongoDB grafana

Performance results in detail when using synchronous replication are as follows:

ThreadsSender nodesReceiver nodesSend msgs/sReceive msgs/sProcessing latencySend latency
5123 913,003 913,0048,0048,00
251210 090,007 612,0060000,0048,00
1241 607,001 607,0048,0048,00
5245 532,005 440,0060000,0048,00
252411 286,004 489,0060000,0053,00

Overall, if you are already using Mongo in your application, you have small traffic and don’t need to use any of the more advanced messaging protocols or features, a queue implementation on top of Mongo might work just fine.


Versionserver 12.4, java driver 42.2.12
Replicationconfigurable, asynchronous & synchronous
Replication typeactive-passive

When implementing a queue on top of PostgreSQL, we are using a single jobs table:

  content TEXT NOT NULL, 
  next_delivery TIMESTAMPTZ NOT NULL)

Sending a message amounts to inserting data to the table. Receiving a message bumps the next delivery timestamp, making the message invisible for other receivers for some period of time (during which we assume that the message should be processed, or is otherwise redelivered). This is similar to how SQS works, which is discussed next. Acknowledging a message amounts to deleting the message from the database.

When receiving messages, we issue two queries (in a single transaction!). The first looks up the messages to receive, and puts a write lock on them. The second updates the next delivery timestamp:

SELECT id, content FROM jobs WHERE next_delivery <= $now FOR UPDATE SKIP LOCKED LIMIT n
UPDATE jobs SET next_delivery = $nextDelivery WHERE id IN (...)

Thanks to transactionality, we make sure that a single message is received by a single receiver at any time. Using write locks, FOR UPDATE and SKIP LOCKED help improve the performance by allowing multiple clients to receive messages concurrently, trying to minimise contention.

As with other messaging systems, we replicate data. PostgreSQL uses leader-follower replication, by setting the following configuration options, as described in a blog post by Kasper:

  • synchronous_standby_names is set to ANY 1 (slave1, slave2)
  • synchronous_commit is set to remote_write

It’s also possible to configure asynchronous replication, as well as require an fsync after each write. However, an important limitation of PostgreSQL is that by default, there’s no automatic failover. In case the leader fails, one of the followers must be promoted by hand (which, in a way, solves the split brain problem). There are however both open-source and commercial solutions, which provide modules for automatic failover.

In terms of performance, a baseline single-thread setup achieves around 3 800 msgs/s sent and received. Such a queue can handle at most 23 000 msgs/s sent and received using 5 threads on 2 sending and 4 receiving nodes:

PostgreSQL grafana

Increasing concurrency above that causes receive performance to degrade:


Send latency is usually at 48ms. However, total processing latency is quite poor. As with Mongo, taking into account only the tests where the send throughput was on par with receive throughput, processing latency varied from 1172 ms to 17 975 ms. Here are the results in full:

ThreadsSender nodesReceiver nodesSend msgs/sReceive msgs/sProcessing latencySend latency
1124 034,003 741,0060000,0047,00
51215 073,0015 263,005738,0048,00
25122 267,002 317,0017957,00846,00
1246 648,007 530,0060000,0048,00
52423 220,0023 070,001172,0048,00
252428 867,0020 492,0060000,0049,00
14811 621,0011 624,001372,0048,00
54823 300,0023 216,004730,0048,00
84822 300,0020 869,0060000,0049,00

Same as with MongoDB, if you require a very basic message queue implementation without a lot of traffic, and already have a replicated PostgreSQL instance in your deployment, such an implementation might be a good choice. Things to look out for in this case are long processing latencies and manual failover, unless third-party extensions are used.

Event Store

Version20.6.1, JVM client 7.3.0
Replication typeactive-passive

EventStore is first and foremost a database for event sourcing and complex event processing. However, it also supports the competing consumers pattern, or as we know it: message queueing. How does it stack up comparing to other message brokers?

EventStore offers a lot in terms of creating event streams, subscribing to them and transforming through projections. In the tests we'll only be writing events to a stream (each message will become an event), and create persistent subscriptions (that is, subscriptions where the consumption state is stored on the server) to read events on the clients. You can read more about event sourcing, competing consumers and subscription types in the docs.

To safely replicate data, EventStore uses quorum-based replication, using a gossip protocol to disseminate knowledge about the cluster state. A majority of nodes has to confirm every write for it to be considered successful. That’s also how resilience against split brain is implemented.

As all of the tests are JVM-based, we'll be using the JVM client, which is built on top of Akka and hence fully non-blocking. However, the test framework is synchronous - because of that, the EventStoreMq implementation hides the asynchronous nature behind synchronous sender and receiver interfaces. Even though the tests will be using multiple threads, all of them will be using only one connection to EventStore per node.

Comparing to the default configuration, the client has a few modified options:

  • readBatchSize, historyBufferSize and maxCheckPointSize are all bumped to 1000 to allow more messages to be pre-fetched
  • the in-flight messages buffer size is increased from the default 10 to a 1000. As this is by default not configurable in the JVM client, we had to copy some code from the driver and adjust the properties (see the MyPersistentSubscriptionActor class)

How does EventStore perform? A baseline setup achieves 793 msgs/s, and when using 3 sender with 25 threads, and 4 receiver nodes, in the tests we have achieved a throughput of 33 427 msgs/s with the 95th percentile of processing latency being at most 251 ms and the send latency 49 ms. Receive rates are stable, as are the latencies:

EventStore grafna

Here's a summary of the EventStore tests that we've run:

ThreadsSender nodesReceiver nodesSend msgs/sReceive msgs/sProcessing latencySend latency
251216 243,0016 241,00123,0048,00
252429 076,0029 060,00128,0048,00
253433 427,0033 429,00251,0049,00
254429 321,0027 792,0060000,0048,00


EventStore also provides a handy web console. Comparing to the other general-purpose databases (PostgreSQL and MongoDB), EventStore offers the best performance, but it’s also the most specialised, oriented towards working with event streams in the first place.


VersionAmazon Java SDK 1.11.797
Replication type?

SQS, Simple Message Queue, is a message-queue-as-a-service offering from Amazon Web Services. It supports only a handful of messaging operations, far from the complexity of e.g. AMQP, but thanks to the easy to understand interfaces, and the as-a-service nature, it is very useful in a number of situations.

The primary interface to access SQS and send messages is using an SQS-specific HTTP API. SQS provides at-least-once delivery. It also guarantees that if a send completes, the message is replicated to multiple nodes; quoting from the website:

"Amazon SQS runs within Amazon's high-availability data centers, so queues will be available whenever applications need them. To prevent messages from being lost or becoming unavailable, all messages are stored redundantly across multiple servers and data centers."

When receiving a message, it is blocked from other receivers for a period of time called the visibility timeout. If the message isn’t deleted (acknowledged) before that time passes, it will be re-delivered, as the system assumes that previous processing has failed. SQS also offers features such as deduplication ids and FIFO queues. For testing, the ElasticMQ projects offers an in-memory implementation.

We don't really know how SQS is implemented, but it most probably spreads the load across many servers, so including it here is a bit of an unfair competition: the other systems use a single fixed 3-node replicated cluster, while SQS can employ multiple replicated clusters and route/balance the messages between them. Still, it might be interesting to compare to self-hosted solutions.

A baseline single thread setup achieves 592 msgs/s sent and the same number of msgs received, with a processing latency of 113 ms and send latency of 49 ms.

These results are not impressive, but SQS scales nicely both when increasing the number of threads, and the number of nodes. On a single node, with 50 threads, we can send up to 22 687 msgs/s, and receive on two nodes up to 14 423 msgs/s.

With 12 sender and 24 receiver nodes, these numbers go up to 130 956 msgs/s sent, and 130 976 msgs/s received! However, at these message rates, the service costs might outweigh the costs of setting up a self-hosted message broker.


As for latencies, SQS can be quite unpredictable than other queues which we’ll cover later. We've observed processing latency from 94 ms up to 1 960 ms. Send latencies are more constrained and are usually around 50 ms.

Here's the dashboard for the test using 4 nodes, each running 5 threads.

SQS grafana

And full test results:

ThreadsSender nodesReceiver nodesSend msgs/sReceive msgs/sProcessing latencySend latency
5122 680,002 682,00277,0049,00
251213 345,0013 370,001 960,0049,00
501222 687,0014 423,0060 000,0049,00
1241 068,001 068,0097,0049,00
5245 383,005 377,00106,0049,00
252426 576,0026 557,00302,0049,00
1482 244,002 245,0097,0048,00
54811 353,0011 356,0090,0048,00
254844 586,0044 590,00256,0050,00
18163 651,003 651,0097,0050,00
581617 575,0017 577,0094,0050,00
2581684 512,0084 512,00237,0050,00
112245 168,005 168,0096,0050,00
5122425 735,0025 738,0094,0050,00
251224130 956,00130 976,00213,0050,00


Version3.8.5-1, java amqp client 5.9.0
Replication typeactive-passive

RabbitMQ is one of the leading open-source messaging systems. It is written in Erlang, implements AMQP and is a very popular choice when messaging is involved; using RabbitMQ it is possible to define complex message delivery topologies. It supports both message persistence and replication.

We'll be testing a 3-node Rabbit cluster, using quorum queues, which are a relatively new addition to what RabbitMQ offers. Quorum queues are based on the Raft consensus algorithm; a leader is automatically elected in case of node failure, as long as a majority of nodes are available. That way, data is safe also in case of network partitions.

To be sure that sends complete successfully, we'll be using publisher confirms, a Rabbit extension to AMQP, instead of transactions:

"Using standard AMQP 0-9-1, the only way to guarantee that a message isn't lost is by using transactions -- make the channel transactional, publish the message, commit. In this case, transactions are unnecessarily heavyweight and decrease throughput by a factor of 250. To remedy this, a confirmation mechanism was introduced."

A message is confirmed after it has been replicated to a majority of nodes (this is where Raft is used). Moreover, messages have to be written to disk and fsynced.

Such strong guarantees are probably one of the reasons for mediocre performance. A basic single-thread setup achieves around 2 000 msgs/s sent&received, with a processing latency of 18 000 ms and send latency of 48 ms. The queue can be scaled up to 19 000 msgs/s using 25 threads, 4 sender nodes and 8 receiver nodes:

ThreadsSender nodesReceiver nodesSend msgs/sReceive msgs/sProcessing latencySend latency
1122 064,001 991,0018 980,0048,00
5128 146,008 140,0098,0048,00
251217 334,0017 321,00122,0048,00
1243 994,003 983,001 452,0048,00
52412 714,0012 730,0099,0048,00
252419 120,0019 126,00142,0048,00
1486 939,006 941,0098,0048,00
54816 687,0016 685,00116,0048,00
254819 035,0019 034,00190,0071,00


Let’s take a closer look at the test which achieves highest performance:

RabbitMQ grafana

As you can see, the receive rate, send and processing latencies are quite stable - which is also an important characteristic to examine under load. The processing latency is around 190 ms, while the send latency in this case is 71 ms.

Note that messages are always sent and received at the same rate, which would indicate that message sending is the limiting factor when it comes to throughput. Rabbit's performance is a consequence of some of the features it offers, for a comparison with Kafka see for example this Quora question.

The RabbitMq implementation of the Mq interface is again pretty straightforward. We are using the mentioned publisher confirms, and setting the quality-of-service when receiving so that at most 100 messages are delivered unconfirmed (in-flight).

An important side-node: RabbitMQ has a great web-based console, available with almost no setup, which offers some very good insights into how the queue is performing.

ActiveMQ Artemis

Version2.15.0, java driver 2.15.0
Replication typeactive-passive

Artemis is the successor to popular ActiveMQ 5 (which hasn’t seen any significant development lately). Artemis emerged from a donation of the HornetQ code to Apache, and is being developed by both RedHat and ActiveMQ developers. Like RabbitMQ, it supports AMQP, as well as other messaging protocols, for example STOMP and MQTT.

Artemis supports a couple of high availability deployment options, either using replication or a shared store. We’ll be using the over-the-network setup, that is replication.

Unlike other tested brokers, Artemis replicates data to one backup node. The basic unit here is a live-backup pair. The backup happens synchronously, that is a message is considered sent only when it is replicated to the other server. Failover and failback can be configured to happen automatically, without operator intervention.

Moreover, queues in Artemis can be sharded across multiple live-backup pairs. That is, we can deploy a couple of such pairs and use them as a single cluster. As we aren’t able to create a three-node cluster, instead we’ll use a six-node setup in a “star” configuration: three live (leader) servers, all of which serve traffic of the queue used for tests. Each of them has a backup server.

Split-brain issues are addressed by an implementation of quorum voting. This is similar to what we’ve seen e.g. in the RabbitMQ implementation.

The Artemis test client code is based on JMS, and doesn’t contain any Artmis-specific code - uses only standard JMS concepts - sending messages, receiving and transactions. We only need to use an Artemis-specific connection factory, see ArtemisMq.

The configuration changes comparing to the default are:

  • the Xmx java parameter bumped to 48G
  • in broker.xml, the global-max-size setting changed to 48G
  • journal-type set to MAPPED
  • journal-datasync, journal-sync-non-transactional and journal-sync-transactional all set to false

Performance wise, Artemis does very well. Our baseline single-thread setup achieves 13 650 msgs/s. By adding nodes, we can scale that result to 52 820 msgs/s using 4 sending nodes, 8 receiver nodes each running 25 threads:


In that last case, the 95th percentile of send latency is a stable 48 ms and maximum processing latency of 49 ms:

Artemis grafana

Here are all of the results:

ThreadsSender nodesReceiver nodesSend msgs/sReceive msgs/sProcessing latencySend latency
11213 647,0013 648,0048,0044,00
51236 035,0036 021,0047,0046,00
251243 643,0043 630,0048,0048,00
12421 380,0021 379,0047,0045,00
52439 320,0039 316,0048,0047,00
252451 089,0051 090,0048,0048,00
14835 538,0035 525,0047,0046,00
54842 881,0042 882,0048,0047,00
254852 820,0052 826,0049,0048,00
561244 435,0044 436,0048,0048,00
2561249 279,0049 278,0077,0048,00

Artemis also offers a web console which helps to visualise the current cluster state.

Artemis dashboard

NATS Streaming

Version0.19.0, java driver 2.2.3
Replication typeactive-passive

NATS is a lightweight messaging system, popular especially in the domain of IoT applications. It supports a number of communication patterns, such as request-reply, publish-subscribe and wildcard subscriptions. NATS offers clients in most popular languages, as well as integrations with many external systems. However, in itself, NATS doesn’t offer clustering, replication or message acknowledgments.

For that purpose, NATS Streaming builds upon NATS, providing support for replicated, persistent messaging, durable subscriptions and, using acknowledgements, guarantees at-least-once delivery. It embeds a NATS server, extending its protocol with additional capabilities.

A NATS Streaming server stores an ever-growing log of messages, which are deleted after reaching the configured size, message count or message age limit (in this respect, the design is similar to a Kafka topic’s retention policy). The server is simple to setup - not a lot of configuration is needed. The client APIs are similarly straightforward to use.

As with other queue implementations discussed previously, NATS Streaming uses the Raft protocol for replicating data in a cluster. A write is successful only after a successful consensus round - when the majority of nodes accept it. Hence, this design should be resilient against split-brain scenarios.

There’s a single leader node, which accepts writes. This means (as the documentation emphasises), that this setup isn’t horizontally scalable. An alternate version of a NATS-based clustered system - JetStream is being developed, which promises horizontal scalability.

What’s interesting is a whole section in the docs dedicated to the use-cases of at-least-once, persistent messaging - when to use it, and more importantly, when not to use it:

Just be aware that using an at least once guarantee is the facet of messaging with the highest cost in terms of compute and storage. The NATS Maintainers highly recommend a strategy of defaulting to core NATS using a service pattern (request/reply) to guarantee delivery at the application level and using streaming only when necessary.

It’s always good to consider your architectural requirements, but in our tests of course we’ll focus on the replicated & persistent setup. Speaking of tests, our baseline test achieved 1 725 msgs/s. This scales up to 27 400 msgs/s when using 25 threads on 6 senders nodes, and 12 receiver nodes.

NATS Streaming

Latencies are also looking good, with 95th send percentile being at most 95 ms, while messages have been usually processed within 148 ms.

NATS Streaming grafana

Here’s a summary of the test runs:

ThreadsSender nodesReceiver nodesSend msgs/sReceive msgs/sProcessing latencySend latency
1121 725,001 725,00105,0048,00
5123 976,003 995,00142,0048,00
251210 642,0010 657,00145,0048,00
1241 870,001 870,00138,0048,00
5245 958,005 957,00143,0048,00
252418 023,0018 026,00145,0048,00
1483 379,003 377,00143,0048,00
54810 014,0010 015,00145,0048,00
254824 834,0024 828,00146,0048,00
2561227 392,0027 388,00147,0075,00
2581626 699,0026 696,00148,0095,00


Replicationconfigurable, asynchronous & synchronous
Replication typeactive-active

Apache Pulsar is a distributed streaming and messaging platform. It is often positioned in a similar segment as Apache Kafka, and the two platforms are often compared and contrasted.

Pulsar was initially developed at Yahoo!, and now continues to evolve as an open-source project. It builds upon two other Apache projects:

A Pulsar deployment consists of nodes which take on one of three nodes:

  • bookie: handles persistent storage of messages
  • broker: a stateless service which accepts messages from producers, dispatches messages to consumers and communicates with bookies to store data
  • zookeeper: which provides coordination services for the above two

Hence, a minimal deployment should in fact consist of more than 3 nodes (although we can colocate a couple of roles on a single machine). For our tests we have decided to use separate machines for separate roles, and hence we ended up with 3 zookeeper nodes, 3 bookie nodes and 2 broker nodes.

When working with Pulsar, we’re dealing with three main concepts: messages, topics and subscriptions. Producers send messages to topics, either individually or in batches. Consumers can subscribe to a topic in four modes: exclusive, failover, shared and key_shared, providing a subscription name.

Combining a shared or unique subscription name, with one of the four consumption modes, we can achieve pub-sub topics, message queues, or a combination of these behaviours. Pulsar is very flexible in this regard.

Messages in Pulsar are deleted after they are acknowledged, and this is tracked per-subscription. That is, if there are no subscribers to a topic, messages will be marked for deletion right after being sent. Acknowledging a message in one subscription doesn’t affect other subscriptions. Additionally, we can specify a message retention policy, to keep messages for a longer time.

Moreover, topics can be partitioned. Behind the scenes, Pulsar creates an internal topic for every partition (these partitions are something quite different than in Kafka!). However, from the producers and consumers point of view such a topic behaves as a single one. As a single topic is always handled by a single broker, increasing the number of partitions, we can increase throughput by allowing multiple brokers to accept and dispatch messages.

As mentioned above, all storage is handled by Apache BookKeeper. Entries (messages) are stored in sequences called ledgers. We can configure how many copies of a ledger are created (managedLedgerDefaultEnsembleSize), in how many copies a message is stored (managedLedgerDefaultWriteQuorum) and how many nodes have to acknowledge a write (managedLedgerDefaultAckQuorum). Following our persistency requirements, we’ve been using 3 ledger copies, and requiring at least 2 copies of each message.

The setting above corresponds to synchronous replication, but by setting the quorum to 1 or 0, we would get an asynchronous one.

Unlike previously discussed queues, pulsar is an active-active system: that is, every node is equal and can handle user requests. Coordination is performed via Zookeeper, which also secures the cluster against split-brain problems.

Pulsar offers a number of additional features, such as Pulsar Functions, SQL, transactions, geo replication, multi-tenancy, connectors to many popular data processing systems (Pulsar IO), a schema registry and others.

Performance-wise, it shows that each node can handle messaging traffic. A baseline setup using a single partition achieves 1 300 msgs/s. Using 8 sender and 16 receiver nodes, each running 25 threads, we get 147 000 msgs/s.

However, we can also increase the number of partitions, thus increasing concurrency. We achieved the best results using 4 partitions (that is, a single broker was handling 2 partitions on average); adding more partitions didn’t further increase performance. Here, we got up to 358 000 msgs/s using 8 sender nodes each running 100 threads, and 16 receiver nodes each runnin 25 threads.


Send latencies are stable, and the 95th percentile is 48 ms. Processing latencies vary from 48 ms, to at most 214 ms in the test which achieved highest throughput.

Pulsar grafana

Here are the full test results, for 1 partition:

ThreadsSender nodesReceiver nodesSend msgs/sReceive msgs/sProcessing latencySend latency
1121 298,001 298,0048,0048,00
5126 711,006 711,00112,0048,00
251231 497,0031 527,0048,0048,00
1242 652,002 652,0048,0048,00
52412 787,0012 789,00107,0048,00
252455 621,0055 677,0050,0048,00
1485 156,005 156,0072,0048,00
54824 048,0024 054,0094,0048,00
254896 154,0096 272,0050,0048,00
25612124 152,00124 273,0050,0048,00
50612160 237,00160 254,00102,0048,00
25816147 348,00147 405,0050,0048,00

And using 4 partitions:

ThreadsSender nodesReceiver nodesSend msgs/sReceive msgs/sProcessing latencySend latency
1485 248,005 237,0061,0048,00
2548102 821,00102 965,0050,0048,00
25612141 462,00141 977,0050,0048,00
50612228 875,00228 958,0073,0048,00
25816176 439,00176 388,0050,0048,00
50816259 133,00259 203,0064,0048,00
75/25816333 622,00333 643,0065,0048,00
100/25816358 323,00358 648,00214,0049,00
501020260 070,00260 165,0094,0048,00
100/251016320 853,00320 315,002 698,0049,00


Replicationconfigurable, asynchronous & synchronous
Replication typeactive-passive

RocketMQ is a unified messaging engine and lightweight data processing platform. The message broker was initially created as a replacement for ActiveMQ 5 (not the Artemis version we discussed before, but its predecessor). It aims to support similar use-cases, provides JMS and native interfaces (among others), and puts a focus on performance.

There are three node types from which a RocketMQ cluster is created:

  • broker master, which accepts client connections, receives and sends messages
  • broker slave, which replicates data from the master
  • name server, which provides service discovery and routing

Each broker cluster can work in synchronous or asynchronous replication modes, which is configured on the broker level. In our tests, we’ve been using synchronous replication.

In theory, it should be possible to deploy a broker cluster with a single master and two slaves, to achieve a replication factor of 3. However, we couldn’t get this setup to work. Hence instead, we’ve settled on a similar configuration as with ActiveMQ Artemis - three copies of master-slave pairs. Like with Artemis, a queue can be deployed on multiple brokers, and the messages are sharded/load-balanced when producing and consuming from the topic.

Additionally, we’ve deployed a single name server, but in production deployments, this component should be clustered as well, with a minimum of three nodes.

Speaking of topics, RocketMQ supports both pub-sub topics, as well as typical message queues, where each message is consumed by a single consumer. This corresponds to BROADCAST and CLUSTERING message consumption modes. Moreover, messages can be consumed in-order, or concurrently (we’ve been using the latter option).

Messages are consumed and acknowledged per consumer-group, which is specified when configuring the consumer. When creating a new consumer group, historical messages can be received, as long as they are still available; by default, RocketMQ retains messages for 2 days.

RocketMQ supports transactions, however there’s no built-in deduplication. Moreover, the documentation is quite basic, making this system a bit challenging to setup and understand. There’s no mention if and which consensus algorithm is used, and if split-brain scenarios are in any way addressed; however, there is a recommendation to deploy at least 3 name servers, which would hint at a quorum-based approach.

However, RocketMQ definitely makes up for these deficiencies in performance. Our baseline test with a single sender and 1 thread achieved 13 600 msgs/s. However, processing latency was quite large in that particular test - 37 seconds. It’s quite easy to overwhelm RocketMQ with sends so that the receiver threads can’t keep up. The most we’ve been able to achieve where sends are receives are on par is with 4 sender nodes, 4 receiver nodes running 25 threads each. In that case, the broker processed 485 000 msgs/s.


Send latencies are always within 44-47ms, however as mentioned, processing latencies get high pretty quickly. The highest throughput with reasonable processing latencies (162 ms) achieved 129 100 msgs/s.

RocketMQ grafana

Here’s a summary of our tests:

ThreadsSender nodesReceiver nodesSend msgs/sReceive msgs/sProcessing latencySend latency
11213 605,0014 183,0037 056,0044,00
51264 638,0064 635,0094,0044,00
2512260 093,00252 308,0018 859,0045,00
12429 076,0029 075,00135,0043,00
524129 106,00129 097,00162,0044,00
2524411 923,00410 891,0017 436,0046,00
2536451 454,00422 619,0060 000,0046,00
14855 662,0055 667,00960,0044,00
548202 110,00147 859,0060 000,0045,00
2544485 322,00416 900,0060 000,0047,00


Replicationconfigurable, asynchronous & synchronous
Replication typeactive-active

Kafka is a distributed event-streaming platform. It is widely deployed and has gained considerable popularity in recent years. Originally developed at LinkedIn, it is now an open-source project, with commercial extensions and support offered by Confluent.

A Kafka cluster consists of a number of broker nodes, which handle persistence, replication, client connections: they both accept and send messages. In addition, there’s a ZooKeeper cluster which is used for service discovery and coordination. However, there are plans to replace that component with one built directly into the Kafka broker.

Kafka takes a different approach to messaging, compared to what we’ve seen before. The server itself is a streaming publish-subscribe system, or at an even more basic level, a distributed log. Each Kafka topic can have multiple partitions; by using more partitions, the consumers of the messages (and the throughput) may be scaled and concurrency of processing increased. It’s not uncommon for a topic to have 10s or 100s of partitions.

On top of the publish-subscribe system, which persists messages within partitions, point-to-point messaging (queueing) is built, by putting a significant amount of logic into the consumers. This again contrasts Kafka when comparing with other messaging systems we've looked at: there, usually it was the server that contained most of the message-consumed-by-one-consumer logic. Here it's the consumer.

Each consumer in a consumer group reads messages from a number of dedicated partitions; hence it doesn't make sense to have more consumer threads than partitions. Or in other words, a single partition is consumed by exactly one consumer within a consumer group (as long as there are any consumers).

Messages aren't acknowledged on the server (which is a very important design difference!), but instead processed message offsets are managed by consumers and written per-parition back to a special Kafka store (or a client-specific store), either automatically in the background, or manually. This allows Kafka to achieve much better performance.

Such a design has a couple of consequences:

  • only messages from each partition are processed in-order. A custom partition-routing strategy can be defined
  • all consumers should consume messages at the same speed. Messages from a slow consumer won't be "taken over" by a fast consumer
  • messages are acknowledged "up to" an offset. That is messages can't be selectively acknowledged.
  • no "advanced" messaging options are available, such as routing or delaying message delivery.

You can read more about the design of the consumer in Kafka's docs, which are quite comprehensive and provide a good starting point when setting up the broker.

It is also possible to add a layer on top of Kafka to implement individual message acknowledgments and re-delivery, see our article on the performance of kmq and the KMQ project. This scheme uses an additional topic to track message acknowledgements. In case a message isn’t acknowledged within specified time, it is re-delivered. This is quite similar to how SQS works. When testing Kafka, we’ve primarily tested “vanilla” Kafka, but also included a KMQ test for comparison.

To achieve guaranteed sends and at-least-once delivery, we used the following configuration (see the KafkaMq class):

  • topic is created with a replication-factor of 3
  • for the sender, the request.required.acks option is set to 1 (asynchronous replication - a send request blocks until it is accepted by the partition leader) or -1 (synchronous replication; in conjunction with min.insync.replicas topic config set to 2 a send request blocks until it is accepted by at least 2 replicas - a quorum when we have 3 nodes in total)
  • consumer offsets are committed every 10 seconds manually; during that time, message receiving is blocked (a read-write lock is used to assure that). That way we can achieve at-least-once delivery (only committing when messages have been "observed").

It’s important to get the above configuration right. You can read more about proper no-data-loss Kafka configuration on our blog, as well as how to guarantee message ordering: by default, even within a partition, messages might be reordered!

As Kafka uses ZooKeeper, network partitions are handled at that level. Kafka has a number of features which are useful when designing a messaging or data streaming system, such as deduplication, transactions, a SQL interface, connectors to multiple popular data processing systems, a schema registry and a streaming framework with in-Kafka exactly-once processing guarantees.

Let’s look at the performance tests. Here, Kafka has no equals, the numbers are impressive. A baseline test achieved around 7 000 msgs/s. Using 8 sender nodes and 16 receiver nodes, running 25 threads each, we can achieve 270 000 msgs/s.

However, we didn’t stop here. It turns out that the sending part is the bottleneck (and it might not be surprising, as that’s where most coordination happens: we wait for messages to be persisted and acknowledged; while on the receiver side, we allow asynchronous, periodic offset commits). By using 200 threads on 16 sender nodes, with 16 receiver nodes, but running only 5 threads each, we achieved 828 000 msgs/s.


We’ve been using at least 64 partitions for the tests, scaling this up if there were more total receiver threads, to 80 or 100 partitions.

What about latencies? They are very stable, even under high load. 95th percentile of both send and receives latencies is steadily at 48 ms. Here’s the dashboard from the test run with the biggest throughput:

Kafka grafana

As mentioned before, we’ve also tested a setup with selective message acknowledgments, using KMQ (the implementation is here). Adding another topic for tracking redeliveries, and performing additional message marker sends did impact performance, but not that much. Using 100 threads on 20 senders, and 5 threads on 16 senders, we’ve achieved a throughput of 676 800 msgs/s. However, processing latencies went up to about 1 812 ms:

KMQ grafana

Finally, here are our Kafka test results in full:

ThreadsSender nodesReceiver nodesSend msgs/sReceive msgs/sProcessing latencySend latency
1127 458,007 463,0047,0047,00
51231 350,0031 361,0047,0047,00
251292 373,0092 331,0081,0047,00
12415 184,0015 175,0047,0047,00
52455 402,0055 355,0047,0047,00
2524127 274,00127 345,0050,0048,00
14827 044,0027 045,0047,0047,00
54884 234,0084 223,0048,0047,00
2548188 557,00188 524,0048,0048,00
25612233 379,00233 228,0048,0048,00
25816272 828,00272 705,0048,0048,00
25/5816235 782,00235 802,0048,0048,00
50/5816338 591,00338 614,0048,0048,00
75/5816432 049,00432 071,0048,0048,00
100/5816498 528,00498 498,0048,0048,00
25/51020245 284,00245 304,0048,0048,00
50/51616507 393,00507 475,0048,0048,00
100/51616678 255,00678 279,0048,0048,00
150/51616745 203,00745 163,0049,0049,00
200/51616828 836,00828 827,0049,0049,00
200/52016810 555,00810 553,0077,0077,00

Summary of features

Below you can find a summary of some of the characteristics of the queues that we’ve tested. Of course this list isn’t comprehensive, rather it touches on areas that we’ve mentioned above, and which are important when considering replication, message persistence and data safety. However, each system has a number of unique features, which are out of scope here.

Summary of features

Which queue to choose?

It depends! Unfortunately, there are no easy answers to such a question :).

As always, which message queue to choose depends on specific project requirements. All of the above solutions have some good sides:

  • SQS is an as-a-service offering, so especially if you are using the AWS cloud, it's an easy choice: good performance and no setup required. It's cheap for low to moderate workloads, but might get expensive with high load
  • if you are already using Mongo, PostgreSQL or EventStore, you can either use it as a message queue or easily build a message queue on top of the database, without the need to create and maintain a separate messaging cluster
  • if you want to have high persistence guarantees, RabbitMQ ensures replication across the cluster and on disk on message send. It's a very popular choice used in many projects, with full AMQP implementation and support for many messaging topologies
  • ActiveMQ Artemis is a popular, battle-tested and widely used messaging broker with wide protocol support and good performance
  • RocketMQ offers a JMS-compatible interface, with great performance
  • Pulsar builds provides a wide feature set, with many messaging schemes available. It’s gaining popularity, due to it’s flexible nature, accommodating for a wide range of use-cases, and great performance
  • Kafka offers the best performance and scalability, at the cost of feature set. It is the de-facto standard for processing event streams across enterprises.

Here’s a summary of the performance tests. First, zooming in on our database-based queues, Rabbit, NATS Streaming and Artemis, with SQS for comparison:

Summary mqs

And including all of the tested queues:

Summary throughput

Finally, the processing latency has a wider distribution across the brokers. Usually, it's below 150ms - with RocketMQ, PostgreSQL and KMQ faring worse under high load:

Summary processing latency

There are of course many other aspects besides performance, which should be taken into account when choosing a message queue, such as administration overhead, network partition tolerance, feature set regarding routing, documentation quality, maturity etc. While there's no message-queueing silver bullet, hopefully this summary will be useful when choosing the best system for your project!


The following team members contributed to this work: Grzegorz Kocur, Maciej Opała, Marcin Kubala, Krzysztof Ciesielski, Kasper Kondzielski, Tomasz Król, Adam Warski. Clebert Suconic, Michael André Pearce and Greg Young helped out with some configuration aspects of Artemis and EventStore. Thanks!

Blog Comments powered by Disqus.
Find more articles like this in Blog section