Designing a (yet another) retry API
Ox is a toolkit for safe direct-style concurrency and resiliency for Scala on the JVM. We have recently released a new version that addresses the resiliency aspect – by adding a retry mechanism.
In this article, I’d like to walk you through the design process of Ox retries – the goals, the choices and the possible future work.
Background
Failures are inevitable. It’s a no-brainer that you should have proper error handling in your code – but should you give up as soon as an error occurs? Not necessarily, since failures are often temporary. For example, a service you’re interacting with might be under high load at the moment, resulting in a timeout, but the load is likely to decrease shortly, after which the service would again respond in a timely manner.
Adding retries – one of the resiliency patterns – allows you to avoid giving up too early, and to make some more attempts before your code ultimately fails. In some scenarios you might want to retry immediately (think a failed transaction in a concurrent environment), while in others (like a timed-out operation) it might be reasonable to wait a bit before the next attempt. When waiting, the subsequent delays can be constant or increase over time (more on this later).
Retries in Ox
SoftwareMill already maintains a Scala library for retries, named, well, retry. So why reinvent the wheel? The main reason is that the retry library is Future-oriented, while Ox uses direct-style concurrency. We also decided to make the API a bit different.
Our main goal was for the API to be developer-friendly, i.e., simple and intuitive. There’s a single retry function:
retry(operation)(policy)
which can alternatively be used as an extension method:
import ox.syntax.*
operation.retry(policy)
Defining the operation
The operation is defined as a by-name parameter in one of the following variants:
- returning a value directly, i.e., operation: => T,
- returning a Try[T], i.e., operation: => Try[T],
- returning an Either[E, T], i.e., operation: => Either[E, T].
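One way to picture how three variants can share a single retry implementation is to normalize each of them to an Either. The sketch below is only an illustration of that idea, not Ox's actual code; the helper names runDirect, runTry and runEither are hypothetical.

```scala
import scala.util.Try

// Hypothetical helpers (not part of Ox). Each one evaluates the by-name
// operation exactly once and normalizes the outcome to an Either.

// Either variant: the error type E can be arbitrary.
def runEither[E, T](operation: => Either[E, T]): Either[E, T] = operation

// Try variant: the error type is fixed to Throwable.
def runTry[T](operation: => Try[T]): Either[Throwable, T] = operation.toEither

// Direct variant: non-fatal exceptions thrown by the operation are captured.
def runDirect[T](operation: => T): Either[Throwable, T] = Try(operation).toEither
```

Because the parameter is by-name, the operation is not evaluated at the call site; it runs each time the helper body refers to it, which is what makes re-running it on a retry possible.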
Defining the policy
A retry policy consists of two parts:
- a Schedule, which indicates how many times and with what delay we should retry the operation after an initial failure,
- a ResultPolicy, which indicates whether:
  - a non-erroneous outcome of the operation should be considered a success (if not, the operation would be retried) – e.g., an HTTP call might return status code 200, but errors might still be indicated in the response body,
  - an erroneous outcome of the operation should be retried or fail fast.
The available schedules are defined in the Schedule object. Each schedule has a finite and an infinite variant.
Finite schedules
Finite schedules have a common maxRetries: Int parameter, which determines how many times the operation would be retried after an initial failure. This means that the operation could be executed at most maxRetries + 1 times.
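The maxRetries + 1 bound can be checked with a small counting experiment. This is a minimal sketch with hypothetical names (retryImmediately, countAttempts), written as a plain loop and not taken from Ox:

```scala
// Hypothetical sketch (not Ox's implementation): retry without delay,
// at most `maxRetries` times after the initial attempt.
def retryImmediately[T](maxRetries: Int)(operation: => Either[String, T]): Either[String, T] =
  var result = operation // the initial attempt
  var retriesLeft = maxRetries
  while result.isLeft && retriesLeft > 0 do
    result = operation // one retry
    retriesLeft -= 1
  result

// Counts how many times an always-failing operation gets executed.
def countAttempts(maxRetries: Int): Int =
  var attempts = 0
  retryImmediately[Int](maxRetries) {
    attempts += 1
    Left("always fails")
  }
  attempts
```

With an operation that never succeeds, countAttempts(3) runs the operation four times: the initial attempt plus three retries.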
Infinite schedules
Each finite schedule has an infinite variant, whose settings are similar to those of the respective finite schedule, but without the maxRetries setting. Using the infinite variant can lead to a possibly infinite number of retries (unless the operation starts to succeed again at some point). The infinite schedules are created by calling .forever on the companion object of the respective finite schedule (see examples below).
Schedule types
The supported schedules (specifically – their finite variants) are:
- Immediate(maxRetries: Int) – retries up to maxRetries times without any delay between subsequent attempts.
- Delay(maxRetries: Int, delay: FiniteDuration) – retries up to maxRetries times, sleeping for delay between subsequent attempts.
- Backoff(maxRetries: Int, initialDelay: FiniteDuration, maxDelay: FiniteDuration, jitter: Jitter) – retries up to maxRetries times, sleeping for initialDelay before the first retry, increasing the sleep between subsequent attempts exponentially (with base 2), up to an optional maxDelay (default: 1 minute).
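To make the Backoff arithmetic concrete, here is a rough sketch of how the delay before each retry could be computed. The function backoffDelay and its 0-based attempt parameter are assumptions made for illustration; Ox's internal computation may differ in details such as rounding.

```scala
import scala.concurrent.duration.*

// Hypothetical helper (not Ox's API): the delay before retry number `attempt`
// (0-based, assumed non-negative). It doubles with every attempt and is
// capped at maxDelay. Works with millisecond precision.
def backoffDelay(
    attempt: Int,
    initialDelay: FiniteDuration,
    maxDelay: FiniteDuration = 1.minute
): FiniteDuration =
  // Computed in Double so that a large exponent cannot overflow a Long.
  val uncapped = initialDelay.toMillis.toDouble * math.pow(2.0, attempt.toDouble)
  math.min(uncapped, maxDelay.toMillis.toDouble).toLong.millis
```

For initialDelay = 100.millis, the successive delays would be 100 ms, 200 ms, 400 ms, 800 ms and so on, until they reach the one-minute cap.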
Jitter
In the Backoff variant, a random factor (jitter) can be optionally used when calculating the delay before the next attempt. The purpose of jitter is to avoid clustering of subsequent retries, i.e., to reduce the number of clients calling a service exactly at the same time, which can result in subsequent failures, contrary to what you would expect from retrying. By introducing randomness to the delays, the retries become more evenly distributed over time. See the AWS Architecture Blog article on backoff and jitter for a more in-depth explanation.
The following jitter strategies are available (defined in the Jitter enum):
- None – the default one, when no randomness is added, i.e., a pure exponential backoff is used,
- Full – picks a random value between 0 and the exponential backoff calculated for the current attempt,
- Equal – similar to Full, but prevents very short delays by always using a half of the original backoff and adding a random value between 0 and the other half,
- Decorrelated – uses the delay from the previous attempt (lastDelay) and picks a random value between the initialDelay and 3 * lastDelay.
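The four strategies can be sketched as follows. This is an illustrative re-creation with hypothetical names (JitterSketch, jittered), working on plain millisecond values; it assumes that lastDelay is never smaller than initialDelay and that the caller applies the maxDelay cap afterwards.

```scala
import scala.util.Random

// Hypothetical mirror of the strategies described above (not Ox's enum).
enum JitterSketch:
  case None, Full, Equal, Decorrelated

// All values are in milliseconds. `backoff` is the pure exponential delay
// calculated for the current attempt.
def jittered(
    jitter: JitterSketch,
    backoff: Long,
    initialDelay: Long,
    lastDelay: Long,
    random: Random = new Random()
): Long =
  jitter match
    // No randomness: pure exponential backoff.
    case JitterSketch.None => backoff
    // Uniform value in [0, backoff].
    case JitterSketch.Full => random.between(0L, backoff + 1)
    // Half of the backoff, plus a uniform value in [0, backoff / 2].
    case JitterSketch.Equal => backoff / 2 + random.between(0L, backoff / 2 + 1)
    // Uniform value in [initialDelay, 3 * lastDelay].
    case JitterSketch.Decorrelated => random.between(initialDelay, 3 * lastDelay + 1)
```

Passing the Random instance as a parameter keeps the sketch testable with a fixed seed.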
Result policies
A result policy allows you to customize how the results of the operation are treated. It consists of two predicates:
- isSuccess: T => Boolean (default: true) – determines whether a non-erroneous result of the operation should be considered a success. When it evaluates to true, no further attempts are made; otherwise, we keep retrying. With finite schedules (i.e., those with maxRetries defined), if isSuccess keeps returning false when maxRetries is reached, the result is returned as-is, even though it's considered unsuccessful.
- isWorthRetrying: E => Boolean (default: true) – determines whether another attempt would be made if the operation results in an error E. When it evaluates to true, we keep retrying; otherwise, we fail fast with the error.
The ResultPolicy[E, T] is generic over both the error (E) and result (T) types. Note, however, that for the direct and Try variants of the operation, the error type E is fixed to Throwable, while for the Either variant, E can be an arbitrary type.
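The interplay of the two predicates can be summarized in a few lines. The following is a simplified model with hypothetical names (ResultPolicySketch, shouldRetry), not the library's source; it only shows how an outcome could be mapped to a "retry or stop" decision.

```scala
// Hypothetical re-creation of the idea (not Ox's source code).
// Both predicates default to `true`, as described above.
case class ResultPolicySketch[E, T](
    isSuccess: T => Boolean = (_: T) => true,
    isWorthRetrying: E => Boolean = (_: E) => true
)

// Decides whether another attempt should follow the given outcome
// (ignoring the schedule, which may independently run out of retries).
def shouldRetry[E, T](policy: ResultPolicySketch[E, T], outcome: Either[E, T]): Boolean =
  outcome match
    case Right(value) => !policy.isSuccess(value)      // unsuccessful value: retry
    case Left(error)  => policy.isWorthRetrying(error) // otherwise fail fast
```

With the defaults, every error is retried and every value counts as a success.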
API shorthands
When you don't need to customize the result policy (i.e., use the default one), you can use one of the following shorthands to define a retry policy with a given schedule (note that the parameters are the same as when manually creating the respective Schedule):
- RetryPolicy.immediate(maxRetries: Int),
- RetryPolicy.immediateForever,
- RetryPolicy.delay(maxRetries: Int, delay: FiniteDuration),
- RetryPolicy.delayForever(delay: FiniteDuration),
- RetryPolicy.backoff(maxRetries: Int, initialDelay: FiniteDuration, maxDelay: FiniteDuration, jitter: Jitter),
- RetryPolicy.backoffForever(initialDelay: FiniteDuration, maxDelay: FiniteDuration, jitter: Jitter).
If you only need to partially customize the result policy, you can use the following shorthands:
- ResultPolicy.default[E, T] – uses the default settings,
- ResultPolicy.successfulWhen[E, T](isSuccess: T => Boolean) – uses the default isWorthRetrying and the provided isSuccess,
- ResultPolicy.retryWhen[E, T](isWorthRetrying: E => Boolean) – uses the default isSuccess and the provided isWorthRetrying,
- ResultPolicy.neverRetry[E, T] – uses the default isSuccess and fails fast on any error.
Examples
Below are some examples of typical use cases. Feel free to experiment and share your feedback! All examples assume the following imports and operations:
import ox.retry
import scala.concurrent.duration.*
import scala.util.Try
def directOperation: Int = ???
def eitherOperation: Either[String, Int] = ???
def tryOperation: Try[Int] = ???
No matter what the operation returns, you can use a unified syntax:
retry(directOperation)(RetryPolicy.immediate(3))
retry(eitherOperation)(RetryPolicy.immediate(3))
retry(tryOperation)(RetryPolicy.immediate(3))
You can use the shorthands to define various schedules while keeping the default result policy:
retry(directOperation)(RetryPolicy.delay(3, 100.millis))
// defaults: maxDelay = 1.minute, jitter = Jitter.None
retry(directOperation)(RetryPolicy.backoff(3, 100.millis))
retry(directOperation)(
  RetryPolicy.backoff(3, 100.millis, 5.minutes, Jitter.Equal)
)
Or with the infinite variants:
retry(directOperation)(RetryPolicy.delayForever(100.millis))
retry(directOperation)(
  RetryPolicy.backoffForever(100.millis, 5.minutes, Jitter.Full)
)
Finally, you can customize the result policies as well:
// custom success
retry(directOperation)(RetryPolicy(
  Schedule.Immediate(3),
  ResultPolicy.successfulWhen(_ > 0)
))
// fail fast on certain errors
retry(directOperation)(RetryPolicy(
  Schedule.Immediate(3),
  ResultPolicy.retryWhen(_.getMessage != "fatal error")
))
retry(eitherOperation)(RetryPolicy(
  Schedule.Immediate(3),
  ResultPolicy.retryWhen(_ != "fatal error")
))
Feel free to have a look at the tests for full working examples.
Implementation details
There are two small areas worth a look from the implementation perspective: direct-style concurrency and stack safety.
Direct-style concurrency
Since Ox leverages the direct style for writing concurrent programs, we’re free to use the good old concurrency primitives. Forget functional effect systems – with direct style, it’s perfectly fine to just use Thread.sleep to introduce a delay between subsequent retry attempts.
This is thanks to Java 21 and its virtual threads – Thread.sleep knows whether it’s running on a platform or a virtual thread, and behaves as expected on a virtual one: only the virtual thread is suspended, not the underlying platform thread.
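A quick way to see this for yourself is to put many virtual threads to sleep at once. The sketch below uses only the JDK's Thread.startVirtualThread and therefore requires Java 21 or later; the function name sleepOnVirtualThreads is hypothetical, and the code is independent of Ox.

```scala
import java.util.concurrent.atomic.AtomicInteger

// Requires Java 21+. Starts `count` virtual threads that each sleep for
// `millis` milliseconds, waits for all of them, and returns how many finished.
def sleepOnVirtualThreads(count: Int, millis: Long): Int =
  val finished = new AtomicInteger(0)
  val threads = (1 to count).map { _ =>
    Thread.startVirtualThread(() => {
      Thread.sleep(millis) // suspends only this virtual thread
      finished.incrementAndGet()
      ()
    })
  }
  threads.foreach(_.join())
  finished.get()
```

Sleeping in thousands of virtual threads at once is cheap, since a sleeping virtual thread does not hold on to a platform thread. This is why a plain Thread.sleep between retry attempts is a reasonable choice here.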
Stack safety
Since we support infinite schedules, the retry logic must be able to run indefinitely, and there are two ways to implement that: either with a loop, or with recursion.
The latter approach was our choice due to subjectively better readability – however, infinite recursion is tricky, as it leads to a stack overflow sooner or later. This can be solved with tail-call optimization, thanks to which the compiler can rewrite the recursive code into a loop, which doesn’t consume the stack indefinitely.
Tail-call optimization requires the code to have a specific structure (the recursive call must be the last operation, i.e., in tail position). In Scala, this is verified at compile time as long as you use the @tailrec annotation. With our retry implementation, structuring the code properly came at the cost of leaving some of the code duplicated – but we found readability more important.
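As an illustration of the technique (not the actual Ox implementation), here is a minimal tail-recursive retry loop with a fixed delay. The names retryForever and loop are hypothetical, and the attempt count is returned only to make the behavior observable.

```scala
import scala.annotation.tailrec

// Hypothetical sketch: retries forever with a fixed delay until the operation
// succeeds. Returns the successful value and the number of attempts made.
def retryForever[E, T](delayMillis: Long)(operation: => Either[E, T]): (T, Int) =
  // @tailrec makes the compiler reject `loop` unless the recursive call is in
  // tail position, so the stack does not grow with the number of attempts.
  @tailrec
  def loop(attempt: Int): (T, Int) =
    operation match
      case Right(value) => (value, attempt)
      case Left(_) =>
        Thread.sleep(delayMillis)
        loop(attempt + 1) // the last operation in this branch
  loop(1)
```

If the recursive call were wrapped in further computation (say, loop(attempt + 1) followed by a transformation of its result), the @tailrec annotation would turn that into a compile-time error, which is what makes the annotation useful as a safety net.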
Future work
The current implementation of retries is more of an MVP, and we already see some features that are missing, but might be desired. Those include:
- multi-policies – so that you’re able to assign different retry strategies to different errors, e.g., some errors are retried immediately, while others use exponential backoff,
- side-effecting callbacks for retry attempts – so that you can log errors or update metrics after a failure,
- composable policies – so that you can fall back to another policy when the original one eventually fails – a possible use case might be retrying immediately a couple of times before introducing a delay between subsequent attempts,
- a syntax for repeats – so that you can run a succeeding operation multiple times according to a Schedule, with a possible additional stop condition.
Your feedback is invaluable and would help us make Ox truly developer-friendly. Feel free to share your thoughts and suggestions – softwaremill.community and GitHub issues are great ways to do so.