Caches in Microservice architecture
Probably no article about caching can do without a famous quote about two hard things in Computer Science, so I will also start with it:
There are only two hard things in computer science: cache invalidation and naming things.
— Phil Karlton
So if doing caching right is that troublesome, why should we care? Maybe the solution would be to just avoid doing it at any cost? Unfortunately, that’s not that easy, primarily if we aim for microservices architecture. Building a performant, distributed system with no caches could be very tricky.
Caching is a very broad term itself. It can mean caching HTTP responses by reverse proxy on the front end of our application or database calls on the deep backend. Let’s dive deeper into the topic.
Why caching matters
A microservice is an autonomous part of a software system encapsulating a chunk of its logic. Usually, it is also liable for managing some part of the system’s data (for instance, by ensuring it won’t get corrupted by invalid updates). This responsibility of service is often called data ownership. Still, various services in the system need to exchange data they own. Commonly, services expose it via public, synchronous interfaces like REST or gRPC API.
Performing a synchronous request inherently involves a network call. Even in a fast internal network, it takes significantly more time to transmit data than share it in the local process in a monolithic application. Additionally, network roundtrips introduce temporal coupling: both the queried and querying services have to be available for the whole duration of the call. This may impact negatively the overall availability of our system.
Such a way of communication also gives us some benefits. Most importantly we always get the newest version of the data from queried service. This gives us a strongly consistent view of stored information. It’s crucial if we can’t afford to operate on even slightly outdated data. But the biggest flaw of synchronous communication is that it is tricky to scale while retaining acceptable performance. Luckily, we can do better with proper caching.
It’s quite common that some service repeatedly asks another for the same chunks of information. To avoid redundant network calls we can simply put the data in the in-memory, in-process cache. After performing the first call to fetch data the following requests can simply retrieve the cached entity. This caching strategy is called read-through since after the read data goes through the cache.
Although caching greatly decreases the time of subsequent data accesses, it also means that the first readout will still require a slow network call. We also need to consider how to find out whether cached data is still up-to-date. We’ll get to that later.
The cache can be also updated when we push data to another service or save it to the datastore. Intivuitely it makes a lot of sense: if something was recently created or edited, there’s a good chance it would be read back. The approach when cache content is updated after the request (or parallel to it) is called write-through. Write-back strategy means that we first update the cache and only then perform the request to another service to update the entity.
Both solutions will only work if all modifications go through a single service instance. Caches will quickly go out of sync if saves are made on various replicas. This notion of only an individual node being responsible for updates of groups (shards) of certain entities is called the Single Writer Principle (SWP). A notable example of a framework implementing SWP is Akka Cluster.
When we store data in the in-process cache, we increase the memory demand of our service. If the cache has the unbounded size and we never remove items from it, we will quickly get into out-of-memory errors. That’s why caches usually have a limited size and use some kind of eviction strategy in case of overflow:
- Least-recently-used (LRU) is one of the most common strategies. When the cache needs to drop an item, it chooses the one that was accessed least recently. It’s based on real-world observation, that when a particular item was accessed recently it has a high chance of being requested again. In computer science, this phenomenon is called temporal locality of reference.
- The downside of LRU is that it might drop even frequently accessed items when there are many new items cached at once. Even if these new entries won’t be accessed ever again, they will successfully push other hot entries from the cache. LFU (least frequently used) fixes the issue by dropping only items that were used the least regularly.
There are many more possible approaches (like the most recently used - MRU or first-in-first-out - FIFO) which might be appropriate for certain scenarios.
We should always confirm if we’re using the correct caching strategy by checking metrics.
One of the best indicators of cache efficiency is the hits (successful retrievals of data from cache) and miss ratio.
Too many misses might be an indicator of an unsuitable eviction strategy.
Cold and warm cache
An in-memory cache of a newly booted process starts empty (cold). It gets filled only after the application performs a few reads and writes. We say that this way cache warms up. Because of that, a service might be a little bit sluggish on startup and only after some time reach peak performance. A neat solution to this problem is pre-warming: loading content of the cache asynchronously as soon as the service boots up. The catch is that we don’t always know in advance what data should be cached.
Another problem can happen when the service is scaled up to several instances. Every running application process has its independent in-process cache, which inevitably leads to a situation where some entities are cached in one instance, but not in another. So whenever a request would access the wrong instance it will get the cache miss. To avoid this problem we could use sticky sessions or sharding with consistent hashing. An even better solution is a standalone cache.
The standalone cache is a specialized infrastructure component shared among the application’s instances. It solves some of the problems described in the previous paragraph. The cache is separate from the application, so its content is no longer lost after restarts. Since all cached data is shared among the application’s instances, so we no longer need to worry about reaching the right node.
Popular solutions for standalone caches are Redis and Memcached. There are also more specialized variants like in-memory-data-grids (for example Hazelcast or Apache Ignite) which in addition support performing distributed computations on stored data.
Sometimes the amount of data we want to cache is way beyond the capacity of a single machine. By employing sharding we can save terabytes of data in the memory of multiple nodes. This kind of sharding is transparent from the application point of view - the cache is responsible for routing to the instance holding the right entry.
The standalone cache can become a single point of failure if it runs in a single instance. To achieve high availability of the system the cache component should either run as a distributed cluster or have failover nodes. For instance, Redis has support for both clustering (Redis Cluster) and for promoting replicas in case of main node failure (Redis Sentinel).
The standalone cache doesn’t necessarily need to be deployed as a separate component. In containerized systems (like those managed by Kubernetes) It could also be run as a sidecar. The sidecar container is bounded to a specific instance and it shares its lifecycle (like an in-process cache). On the other hand, it gives access to all features offered by cache systems. In a nutshell, it’s a mid-way solution between a full-fledged distributed cache server and an in-memory cache.
The disadvantage of having a cache as a separate component is that it needs to be accessed via a network and therefore it will never be as fast as the in-memory variant. Interestingly, in-memory caches can be distributed as well. A great example is Hazelcast IMDG. It can work as a local cache on individual JVM processes but also replicate data to other instances. This way all reads are handled by the in-memory cache and only writes are burdened with an additional cost of network roundtrips needed for replication.
Cache expiration and invalidation
After some time data in the cache will inevitably drift apart from its source. Sometimes it might be just confusing for users, for example, when they did update but still, see the old version taken from the cache. In worse cases, it can lead to bugs and data corruption.
The most straightforward tactic for getting rid of stale items in the cache is based on time. We need to simply define TTL (time to live) for the items inserted into the cache. After TTL elapses, the item is purged. If there is another request for that item, it is again fetched from the source and then cached again. The whole process is called cache eviction.
This approach has some flaws. If queried data is changing unpredictably (for some entities it’s more dynamic or for others it’s static), it could be challenging to come up with the right value for TTL. If we set up TTL as the too high value we’ll be often serving stale data and if it’s too short, we’ll be evicting items from the cache faster than it’s necessary. Moreover, after the item is removed from the cache, the next request for that entry will have to do a potentially slow network roundtrip to fetch it again.
To avoid this problem we could refresh cached entries periodically by a scheduled job. It is a great way to avoid latency spikes when a hot cache item is evicted. This approach is called refresh-ahead. The drawback is that if data is not changing very often we’ll be doing many redundant requests.
We could also try to make our cache-refreshing procedure a little bit smarter. First of all, queried service should return some hash code calculated from returned data. The recipient could store that data in the cache along with the hash and from now on include it in every request for the same item. Queried service can now use it and check if data has changed. If it did not, the service could simply return an empty response. If the entry changed, it should return a response with data. This way we still need to do a network call, but will be usually very lightweight. HTTP standard has a dedicated mechanism to do this kind of “optimistic” caching (ETag header), but it can be easily implemented with other means of communication, like gRPC.
In contrast, If we update the in-memory cache during writes we have a guarantee that the reads will get the newest version of the item, but this only applies to the instance that does the save (single writer). Reader nodes will still need to invalidate their content by other means. If we’re using a distributed cache it’s also not a problem: all instances will see the current updates. We have to keep in mind that saving something in the database (or performing a request) and updating a cache is not an atomic operation. If service fails in the middle of an update, cache and original data could potentially get out of sync (you can read more about issues of integrating non-transactional systems in my article about messaging).
Caching unavoidably leads to a state of the system where some of its parts are seeing outdated data. They no longer preserve strong consistency guarantees and become eventually consistent. It’s a price paid for superior performance and resilience.
The rule of thumb for microservices architecture is to embrace eventual consistency everywhere it is feasible. Strong, synchronous coupling should be used only where it is necessary.
Invalidation of the items can be also triggered by events. This is quite a sleek solution, especially with event-driven architecture. After the original item changes, all caching services could be notified by a message published by the data owner and eventually refresh their caches.
Sometimes caching other services' responses “as is” is not an optimal solution. Instead, microservices can build complex data models derived from outer data sources. They can gather pieces of information from various parts of the system and save its denormalized version in a local datastore. With an event-driven approach, asynchronous messages can again serve as a way to communicate that the original data was altered. We can think of it as another means of caching but this time data is stored persistently in the querying service’s database.
This article is meant to be an introductory description of various caching-related techniques. I managed to only barely scratch the surface of this complex topic and I leave further exploration to you. The bottom line is that caching can tremendously improve the performance of our system.
It’s a powerful tool, but if used improperly it can cause hard-to-find errors.
I hope you found this article useful. Take care!