What I have learned from the Signal server source code
This year, the effects of the WhatsApp privacy controversy were described as “the largest digital migration in human history”. People have started to care about their data and how it can be processed by application owners. Among the biggest winners of this exodus are IMs like Telegram and Signal. However, Telegram is not perfect either.
Signal seems to be the ONE that really cares and provides end-to-end encryption for the data it sends. What is more, the application code can be found on GitHub under the Signal organization and has been independently audited. The service has around 50 million users, and even other IMs leverage its protocol!
The Signal Protocol can be used for end-to-end encryption of all messages, and also of video calls. It has passed security audits. An interesting fact is that WhatsApp, Facebook Messenger, and Skype all decided to use the Signal Protocol! However, only WhatsApp enables it by default; Facebook Messenger and Skype leave it disabled in most cases and enable it only for specific private conversations.
Quite recently, for one of our Architecture Discussion Group meetings, I did a quick analysis of the Signal Server code in order to understand how it works underneath. Today, I’d like to share the interesting things I found there and what I’ve learned from them.
About Signal Server
In this article, I will focus on the Signal Server. This is the main part of the backend architecture: it provides a REST API and WebSockets, and is responsible for passing messages to the appropriate users and storing them in the database. Keep in mind that it is not the only piece of the system.
The Signal server is implemented in Java. It is built with Maven and targets JDK 11. Among its dependencies, you can find:
- Dropwizard - a framework for developing Java services. Signal depends on several of its modules, e.g. Dropwizard JDBI3 for SQL databases.
- resilience4j - a fault tolerance library implementing features like circuit breaker, retry and rate limiting.
- curve25519-java - Signal’s implementation of Curve25519 for Java
- Bouncy Castle - a cryptography library
- AWS S3, SQS and DynamoDB SDKs
- Jedis - a Redis Java client
- Lettuce - another Redis Java client, allowing for non-blocking operations
- Liquibase - a library for managing database schema changes
- libphonenumber - Google’s library for parsing and validating phone numbers
- Mockito - a mocking framework
- WireMock - allows for mocking HTTP APIs
- AssertJ - fluent assertions for Java
Interactions with other modules and services
By looking at the sample configuration, we can clearly see what the peers of the Signal server are. At the top, we can find a twilio module, which is related to sending SMS messages. What is more, there are a few blocks where Redis or PostgreSQL configs can be defined. This suggests that the service does not have to depend on a single (e.g. Redis) cluster and that there can be more of them, although this is only speculation. The other most important config keys are related to AWS S3, GCP Storage, Apple Push and GCM.
Learned Fact #1 - Simplicity
The server code is quite simple. You won’t find the now-popular “reactive” approach there. There is no clustering and no actor model. In most cases, you won’t even find Java futures, although their usage has increased over the last year. I found one comment on the Signal forums claiming that you can spin it up in just 5 minutes. Does it scale? Yes! Signal heavily leverages Redis clusters (see the next learned facts for more details!).
Learned Fact #2 - Redis Pub/Sub
Redis is known mostly as a simple key-value store. It offers clustering capabilities and is available from cloud providers. At Signal, you can see usage of other, more advanced features. The Pub/Sub feature, which, as the name suggests, is a simple publish-subscribe messaging model, is a critical piece of the Signal architecture. When a Signal client opens a websocket connection, the connection is stored in a special map keyed by the device identifier. When a new message is sent and the device is online, the message is added to Redis. The proper pub/sub subscription reads it and checks whether the special map contains a websocket connection for the given device. If it does, the message is delivered; otherwise it is ignored. This approach handles situations in which there is more than one active server (High Availability!): every server receives the notification, but only the one holding the connection for that device delivers the message.
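To make the flow above more concrete, here is a minimal sketch in plain Java. It is not Signal's code: the Broker class is an in-memory stand-in for a Redis Pub/Sub channel, and all class and method names (Broker, Server, Connection, connect, publish) are made up for illustration. It only shows the idea that every server receives every notification, but only the server holding the websocket for the device delivers it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.BiConsumer;

class PubSubSketch {

    /** Hypothetical in-memory stand-in for a Redis Pub/Sub channel shared by all servers. */
    static class Broker {
        private final List<BiConsumer<String, String>> subscribers = new CopyOnWriteArrayList<>();

        void subscribe(BiConsumer<String, String> subscriber) {
            subscribers.add(subscriber);
        }

        void publish(String deviceId, String message) {
            // Every subscribed server sees every published message.
            for (BiConsumer<String, String> subscriber : subscribers) {
                subscriber.accept(deviceId, message);
            }
        }
    }

    /** Stand-in for an open websocket; it only records what was delivered. */
    static class Connection {
        final List<String> delivered = new ArrayList<>();

        void send(String message) {
            delivered.add(message);
        }
    }

    /** One server instance holding the connections of its own clients. */
    static class Server {
        // The "special map": device identifier -> open connection.
        private final Map<String, Connection> connections = new ConcurrentHashMap<>();

        Server(Broker broker) {
            broker.subscribe(this::onMessage);
        }

        Connection connect(String deviceId) {
            Connection connection = new Connection();
            connections.put(deviceId, connection);
            return connection;
        }

        private void onMessage(String deviceId, String message) {
            Connection connection = connections.get(deviceId);
            if (connection != null) {
                connection.send(message); // the device is connected to this server
            }
            // Otherwise the notification is ignored; another server may hold the connection.
        }
    }

    public static void main(String[] args) {
        Broker broker = new Broker();
        Server serverA = new Server(broker);
        Server serverB = new Server(broker);
        Connection bob = serverB.connect("bob-device-1");

        // The sender may be handled by serverA; the broker fans the message out to both servers.
        broker.publish("bob-device-1", "hello");
        System.out.println(bob.delivered); // prints [hello]
    }
}
```

In a real deployment the broker would be Redis itself, reached through a client such as Lettuce or Jedis; the map lookup on each server would stay the same.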
Learned Fact #3 - Redis Lua
Redis again! Redis includes a Lua interpreter. It is possible to define scripts that are executed on the server side. You can think of them as roughly equivalent to stored procedures.
Multiple things happen in the script that Signal uses to add a message to a device queue. Among them you can find:
- At the top, script arguments are assigned to local variables.
- If the given guid already exists, the script fetches the related messageId.
- A new messageId is created by incrementing the “counter” value.
- ZADD adds the messageId and the message itself to the queue representing the given user device.
- The metadata queue includes info about the sender and the guid.
- EXPIRE is called, so that the values are removed after the given timeout.
This script is called for every message that needs to be queued because the device was offline. Later, another worker executes different scripts to fetch the stored values and save them in DynamoDB asynchronously.
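The steps of the queueing script can be modelled in plain Java without Redis. The sketch below is only an illustration of the logic described above, not a translation of the real Lua script: the DeviceQueue class and its field names are hypothetical, a TreeMap stands in for the sorted set filled by ZADD, and the synchronized method stands in for the fact that Redis runs a script atomically. Returning early when the guid is already known is my assumption about what happens after the messageId is fetched.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

class QueueInsertSketch {

    /** Hypothetical in-memory model of the queue and metadata kept for one device. */
    static class DeviceQueue {
        private final long ttlMillis;
        private long counter = 0;
        private long expiresAtMillis = Long.MAX_VALUE;
        // Models the sorted set: the messageId plays the role of the score.
        private final TreeMap<Long, String> messages = new TreeMap<>();
        // Models the metadata: guid -> messageId and messageId -> sender.
        private final Map<String, Long> idByGuid = new HashMap<>();
        private final Map<Long, String> senderById = new HashMap<>();

        DeviceQueue(long ttlMillis) {
            this.ttlMillis = ttlMillis;
        }

        /** Queues a message and returns its messageId; a repeated guid returns the old id. */
        synchronized long insert(String message, String sender, String guid, long nowMillis) {
            Long existing = idByGuid.get(guid);
            if (existing != null) {
                return existing; // assumption: a known guid is not queued twice
            }
            long messageId = ++counter;         // increment the "counter" value
            messages.put(messageId, message);   // ZADD equivalent
            idByGuid.put(guid, messageId);      // metadata: guid
            senderById.put(messageId, sender);  // metadata: sender
            expiresAtMillis = nowMillis + ttlMillis; // EXPIRE equivalent
            return messageId;
        }

        synchronized boolean isExpired(long nowMillis) {
            return nowMillis >= expiresAtMillis;
        }

        synchronized int size() {
            return messages.size();
        }
    }

    public static void main(String[] args) {
        DeviceQueue queue = new DeviceQueue(60_000L);
        long first = queue.insert("m1", "alice", "guid-1", 0L);
        long duplicate = queue.insert("m1", "alice", "guid-1", 5L);
        long second = queue.insert("m2", "alice", "guid-2", 10L);
        System.out.println(first + " " + duplicate + " " + second); // prints 1 1 2
    }
}
```

Running this logic inside Redis rather than in the application means the check, the counter increment and the insert cannot interleave with another server queueing a message for the same device.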
Learned Fact #4 - Rate limiters
Signal REST endpoints include protection against too many requests. The source code includes rate limiter mechanisms, which make it possible to limit e.g. the number of endpoint calls per message sender. Guess what? This mechanism is also based on Redis.
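As an illustration of the idea, here is a minimal per-key rate limiter in plain Java. It is a simple fixed-window counter kept in memory, chosen for brevity; it is not necessarily the algorithm Signal uses, and the RateLimiter class, its constructor parameters and tryAcquire are hypothetical names. With Redis, a counter like this is commonly built from INCR and EXPIRE on a key per sender, which lets all server instances share the same limits.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.LongSupplier;

class RateLimiterSketch {

    /** Hypothetical fixed-window limiter: at most maxRequests per key in each window. */
    static class RateLimiter {
        private final int maxRequests;
        private final long windowMillis;
        private final LongSupplier clock; // injectable so the behaviour can be tested
        // key -> { window start in millis, requests seen in that window }
        private final Map<String, long[]> windows = new HashMap<>();

        RateLimiter(int maxRequests, long windowMillis, LongSupplier clock) {
            this.maxRequests = maxRequests;
            this.windowMillis = windowMillis;
            this.clock = clock;
        }

        /** Returns true if the request is allowed, false if the key is over its limit. */
        synchronized boolean tryAcquire(String key) {
            long now = clock.getAsLong();
            long[] window = windows.get(key);
            if (window == null || now - window[0] >= windowMillis) {
                window = new long[] {now, 0}; // start a fresh window for this key
                windows.put(key, window);
            }
            if (window[1] >= maxRequests) {
                return false;
            }
            window[1]++;
            return true;
        }
    }

    public static void main(String[] args) {
        // A fixed clock keeps all three calls in the same window.
        RateLimiter limiter = new RateLimiter(2, 60_000L, () -> 0L);
        System.out.println(limiter.tryAcquire("sender-1") + " "
                + limiter.tryAcquire("sender-1") + " "
                + limiter.tryAcquire("sender-1")); // prints true true false
    }
}
```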
Learned Fact #5 - Cache!
Signal uses both DynamoDB and Redis. There is, however, one caveat: direct interactions with DynamoDB are quite uncommon. Most writes go to Redis first. Later, a background processor fetches them and saves them to the database. That is why, when pending messages for a given user are fetched, the flow queries both the database and the cache, so that it includes the messages that haven’t been written to DynamoDB yet. What is the motivation here? Performance and latency, of course.
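This pattern is often called a write-behind cache. The sketch below shows it in plain Java with two in-memory maps standing in for Redis and DynamoDB; the MessageStore class and its methods (write, persist, pending) are hypothetical names, and the background processor is reduced to a method that is called by hand.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class WriteBehindSketch {

    /** Hypothetical store: writes land in the cache, a later step moves them to the database. */
    static class MessageStore {
        private final Map<String, List<String>> cache = new HashMap<>();    // stands in for Redis
        private final Map<String, List<String>> database = new HashMap<>(); // stands in for DynamoDB

        /** Fast path: the write only touches the cache. */
        synchronized void write(String device, String message) {
            cache.computeIfAbsent(device, k -> new ArrayList<>()).add(message);
        }

        /** What the background processor would do periodically. */
        synchronized void persist() {
            for (Map.Entry<String, List<String>> entry : cache.entrySet()) {
                database.computeIfAbsent(entry.getKey(), k -> new ArrayList<>())
                        .addAll(entry.getValue());
            }
            cache.clear();
        }

        /** Reads combine both sources so that not-yet-persisted messages are included. */
        synchronized List<String> pending(String device) {
            List<String> result = new ArrayList<>(database.getOrDefault(device, List.of()));
            result.addAll(cache.getOrDefault(device, List.of()));
            return result;
        }
    }

    public static void main(String[] args) {
        MessageStore store = new MessageStore();
        store.write("bob-device-1", "first");
        store.persist();                       // "first" is now in the database
        store.write("bob-device-1", "second"); // "second" is still only in the cache
        System.out.println(store.pending("bob-device-1")); // prints [first, second]
    }
}
```

The price of the fast write path is the extra work on reads, which have to merge both sources as pending does here.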
Learned Fact #6 - Postgres rules
Warning: the Signal server code on GitHub was out of date for a year and was only recently updated. This lesson concerns the old code; the new one replaced PostgreSQL with DynamoDB as the message store.
Users' messages (encrypted, of course) were previously stored in PostgreSQL in the messages table. The Liquibase file messagedb.xml contained the database and table definitions. I found quite an interesting line there:
CREATE RULE bounded_message_queue AS ON INSERT TO messages DO ALSO DELETE FROM messages WHERE id IN (SELECT id FROM messages WHERE destination = NEW.destination AND destination_device = NEW.destination_device ORDER BY timestamp DESC OFFSET 1000 );
This code leveraged the PostgreSQL Rules mechanism. On every insert into the messages table (that is, on every new message), messages beyond the 1000 newest for the same destination and device were deleted. In other words, Signal kept only the 1000 latest messages per destination device in the database. Why? Probably to be even more secure and private. Theoretically, we can assume that a 1000-message buffer is enough to deliver messages to the proper devices. Currently, the DynamoDB-based version leverages a 7-day TTL.
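The effect of the rule can be mimicked in a few lines of Java. This is only an in-memory illustration of the "insert, then trim to the newest N" behaviour; the BoundedStore and Message classes are hypothetical and have nothing to do with the actual Signal code.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class BoundedQueueSketch {

    static class Message {
        final long timestamp;
        final String body;

        Message(long timestamp, String body) {
            this.timestamp = timestamp;
            this.body = body;
        }
    }

    /** Hypothetical store keeping at most `limit` newest messages per destination and device. */
    static class BoundedStore {
        private final int limit;
        private final Map<String, List<Message>> queues = new HashMap<>();

        BoundedStore(int limit) {
            this.limit = limit;
        }

        void insert(String destination, int device, long timestamp, String body) {
            List<Message> queue =
                    queues.computeIfAbsent(destination + ":" + device, k -> new ArrayList<>());
            queue.add(new Message(timestamp, body));
            // Same ordering as the rule: ORDER BY timestamp DESC ...
            queue.sort(Comparator.comparingLong((Message m) -> m.timestamp).reversed());
            // ... and everything past OFFSET limit is deleted.
            if (queue.size() > limit) {
                queue.subList(limit, queue.size()).clear();
            }
        }

        /** Message bodies for one destination device, newest first. */
        List<String> bodies(String destination, int device) {
            List<String> result = new ArrayList<>();
            for (Message m : queues.getOrDefault(destination + ":" + device, List.of())) {
                result.add(m.body);
            }
            return result;
        }
    }

    public static void main(String[] args) {
        BoundedStore store = new BoundedStore(2);
        store.insert("bob", 1, 1L, "oldest");
        store.insert("bob", 1, 2L, "middle");
        store.insert("bob", 1, 3L, "newest");
        System.out.println(store.bodies("bob", 1)); // prints [newest, middle]
    }
}
```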
Signal is a great IM offering private conversations. However, the source code in the GitHub repository was not updated for nearly a year; it was only updated at the beginning of April. This should not affect the safety of the data, since it is end-to-end encrypted and the server can’t look at the content.
Reading other people's code lets you learn various interesting facts and techniques. That is also the power of open source: you can dive into topics totally different from those you meet on a daily basis. I encourage you to take a look, maybe at Signal, maybe at a different project.