Designing a (yet another) retry API

Jacek Kunicki

15 Dec 2023 · 8 minutes read

Ox is a toolkit for safe direct-style concurrency and resiliency for Scala on the JVM. We have recently released a new version that addresses the resiliency aspect – by adding a retry mechanism.

In this article, I’d like to walk you through the design process of Ox retries – the goals, the choices and the possible future work.

Background

Failures are inevitable. It’s a no-brainer that you should have proper error handling in your code – but should you give up as soon as an error occurs? Not necessarily, since failures are often temporary. For example, a service you’re interacting with might be under high load at the moment, resulting in a timeout, but the load is likely to decrease shortly, after which the service would again respond in a timely manner.

Adding retries – one of the resiliency patterns – allows you to avoid giving up too early, and to make some more attempts before your code ultimately fails. In some scenarios you might want to retry immediately (think a failed transaction in a concurrent environment), while in others (like a timed-out operation) it might be reasonable to wait a bit before the next attempt. When waiting, the subsequent delays can be constant or increase over time (more on this later).

Retries in Ox

SoftwareMill already maintains a Scala library for retries, named, well, retry. So why reinvent the wheel at all? The main reason is that the retry library is Future-oriented, while Ox uses direct-style concurrency. We also decided to make the API a bit different.

Our main goal was for the API to be developer-friendly, i.e., simple and intuitive. There’s a single retry function:

retry(operation)(policy)

which can alternatively be used as an extension method:

import ox.syntax.*

operation.retry(policy)

Defining the operation

The operation is defined as a by-name parameter in one of the following variants:

  • returning a value directly, i.e., operation: => T,
  • returning a Try[T], i.e., operation: => Try[T],
  • returning an Either[E, T], i.e., operation: => Either[E, T].
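
To illustrate how three operation shapes can share a single retry mechanism, here is a minimal sketch that adapts each variant to a by-name Either. The helper names (fromDirect, fromTry, fromEither) are hypothetical and are not part of the Ox API; this is only one possible way to unify the variants.

```scala
import scala.util.Try

// Hypothetical helpers (not Ox's actual code): each variant is adapted
// to an Either, so that a single retry loop could handle all of them.

// Direct variant: a thrown (non-fatal) exception becomes a Left.
def fromDirect[T](operation: => T): Either[Throwable, T] =
  Try(operation).toEither

// Try variant: Failure becomes Left, Success becomes Right.
def fromTry[T](operation: => Try[T]): Either[Throwable, T] =
  operation.toEither

// Either variant: already in the target shape, with an arbitrary error type E.
def fromEither[E, T](operation: => Either[E, T]): Either[E, T] =
  operation
```

Note how the error type ends up as Throwable for the first two variants, while the Either variant keeps whatever error type the operation declares.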

Defining the policy

A retry policy consists of two parts:

  • a Schedule, which indicates how many times and with what delay we should retry the operation after an initial failure,
  • a ResultPolicy, which indicates whether:
    • a non-erroneous outcome of the operation should be considered a success (if not, the operation would be retried) – e.g., an HTTP call might return status code 200, but errors might still be indicated in the response body,
    • an erroneous outcome of the operation should be retried or fail fast.

The available schedules are defined in the Schedule object. Each schedule has a finite and an infinite variant.

Finite schedules
Finite schedules have a common maxRetries: Int parameter, which determines how many times the operation would be retried after an initial failure. This means that the operation could be executed at most maxRetries + 1 times.

Infinite schedules
Each finite schedule has an infinite variant, whose settings are similar to those of the respective finite schedule, but without the maxRetries setting. Using the infinite variant can lead to a possibly infinite number of retries (unless the operation starts to succeed again at some point). The infinite schedules are created by calling .forever on the companion object of the respective finite schedule (see examples below).

Schedule types
The supported schedules (specifically – their finite variants) are:

  • Immediate(maxRetries: Int) – retries up to maxRetries times without any delay between subsequent attempts.
  • Delay(maxRetries: Int, delay: FiniteDuration) – retries up to maxRetries times, sleeping for delay between subsequent attempts.
  • Backoff(maxRetries: Int, initialDelay: FiniteDuration, maxDelay: FiniteDuration, jitter: Jitter) – retries up to maxRetries times, sleeping for initialDelay before the first retry and increasing the sleep between subsequent attempts exponentially (with base 2), up to an optional maxDelay (default: 1 minute). An optional jitter (default: none) can be applied to the delays.
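
As a rough illustration of the exponential backoff described above, the delay before each retry could be computed as follows. The function backoffDelay is a hypothetical helper written for this article, not Ox's implementation; it assumes retries are numbered from 1 and ignores jitter.

```scala
import scala.concurrent.duration.*

// Hypothetical helper, not part of Ox: the delay before the given retry
// (1-based), doubling from initialDelay and capped at maxDelay.
def backoffDelay(
    retry: Int,
    initialDelay: FiniteDuration,
    maxDelay: FiniteDuration = 1.minute
): FiniteDuration =
  // Computed as a Double so that large exponents saturate instead of overflowing a Long.
  val uncapped = initialDelay.toMillis * math.pow(2, retry - 1)
  math.min(uncapped, maxDelay.toMillis.toDouble).toLong.millis
```

With initialDelay = 100.millis, the first three retries would wait 100, 200 and 400 milliseconds, and from the eleventh retry on the delay would stay at the default cap of 1 minute.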

Jitter
In the Backoff variant, a random factor (jitter) can be optionally used when calculating the delay before the next attempt. The purpose of jitter is to avoid clustering of subsequent retries, i.e., to reduce the number of clients calling a service exactly at the same time, which can result in subsequent failures, contrary to what you would expect from retrying. By introducing randomness to the delays, the retries become more evenly distributed over time. See the AWS Architecture Blog article on backoff and jitter for a more in-depth explanation.

The following jitter strategies are available (defined in the Jitter enum):

  • None – the default one, when no randomness is added, i.e., a pure exponential backoff is used,
  • Full – picks a random value between 0 and the exponential backoff calculated for the current attempt,
  • Equal – similar to Full, but prevents very short delays by always using a half of the original backoff and adding a random value between 0 and the other half,
  • Decorrelated – uses the delay from the previous attempt (lastDelay) and picks a random value between the initialDelay and 3 * lastDelay.
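
The four strategies can be sketched in a few lines. The enum below only mirrors the strategy names listed above; the jittered function, its parameters and the use of milliseconds are assumptions made for illustration and do not reflect Ox's internals.

```scala
import scala.util.Random

// Mirrors the strategy names from the list above; the code is an illustrative sketch.
enum Jitter:
  case None, Full, Equal, Decorrelated

// All values are in milliseconds. `backoff` is the exponential delay calculated
// for the current attempt, `lastDelay` is the delay used before the previous one.
def jittered(
    jitter: Jitter,
    backoff: Long,
    initialDelay: Long,
    lastDelay: Long,
    random: Random = Random
): Long = jitter match
  case Jitter.None => backoff
  // Upper bounds are exclusive in Random.between, hence the "+ 1".
  case Jitter.Full => random.between(0L, backoff + 1)
  case Jitter.Equal =>
    val half = backoff / 2
    half + random.between(0L, half + 1)
  case Jitter.Decorrelated =>
    // Guard against an empty range when 3 * lastDelay is below initialDelay.
    val upper = math.max(initialDelay, 3 * lastDelay)
    random.between(initialDelay, upper + 1)
```

Passing the Random instance as a parameter is a design choice of this sketch: it makes the randomness seedable, which is handy in tests.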

Result policies
A result policy lets you customize how the results of the operation are treated. It consists of two predicates:

  • isSuccess: T => Boolean (default: always true) – determines whether a non-erroneous result of the operation should be considered a success. When it evaluates to true, no further attempts are made; otherwise, we keep retrying.

    With finite schedules (i.e., those with maxRetries defined), if isSuccess still returns false once maxRetries is reached, the result is returned as-is, even though it's considered unsuccessful.

  • isWorthRetrying: E => Boolean (default: always true) – determines whether another attempt is made if the operation results in an error E. When it evaluates to true, we keep retrying; otherwise, we fail fast with the error.

The ResultPolicy[E, T] is generic over both the error (E) and the result (T) type. Note, however, that for the direct and Try variants of the operation, the error type E is fixed to Throwable, while for the Either variant, E can be an arbitrary type.
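
The way the two predicates drive the retry decision can be modeled as below. ResultPolicySketch, Outcome and decide are hypothetical names introduced only for this illustration; the real ResultPolicy in Ox may be structured differently.

```scala
// A simplified model of the two predicates, both defaulting to "always true".
final case class ResultPolicySketch[E, T](
    isSuccess: T => Boolean = (_: T) => true,
    isWorthRetrying: E => Boolean = (_: E) => true
)

// What a retry loop could do after a single attempt (hypothetical type).
enum Outcome:
  case Done, Retry, FailFast

def decide[E, T](policy: ResultPolicySketch[E, T], result: Either[E, T]): Outcome =
  result match
    // A non-erroneous result is final only when it counts as a success.
    case Right(value) =>
      if policy.isSuccess(value) then Outcome.Done else Outcome.Retry
    // An error is retried only when the policy considers it worth retrying.
    case Left(error) =>
      if policy.isWorthRetrying(error) then Outcome.Retry else Outcome.FailFast
```

The sketch ignores the schedule: with a finite schedule, an Outcome.Retry would additionally be subject to the remaining number of retries.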

API shorthands
When you don't need to customize the result policy (i.e., use the default one), you can use one of the following shorthands to define a retry policy with a given schedule (note that the parameters are the same as when manually creating the respective Schedule):

  • RetryPolicy.immediate(maxRetries: Int),
  • RetryPolicy.immediateForever,
  • RetryPolicy.delay(maxRetries: Int, delay: FiniteDuration),
  • RetryPolicy.delayForever(delay: FiniteDuration),
  • RetryPolicy.backoff(maxRetries: Int, initialDelay: FiniteDuration, maxDelay: FiniteDuration, jitter: Jitter),
  • RetryPolicy.backoffForever(initialDelay: FiniteDuration, maxDelay: FiniteDuration, jitter: Jitter).

If you only need to partially customize the result policy, you can use the following shorthands:

  • ResultPolicy.default[E, T] – uses the default settings,
  • ResultPolicy.successfulWhen[E, T](isSuccess: T => Boolean) – uses the default isWorthRetrying and the provided isSuccess,
  • ResultPolicy.retryWhen[E, T](isWorthRetrying: E => Boolean) – uses the default isSuccess and the provided isWorthRetrying,
  • ResultPolicy.neverRetry[E, T] – uses the default isSuccess and fails fast on any error.

Examples

Below are some examples of typical use cases. Feel free to experiment and share your feedback! All examples assume the following imports and operations:

import ox.retry
import scala.concurrent.duration.*
import scala.util.Try

def directOperation: Int = ???
def eitherOperation: Either[String, Int] = ???
def tryOperation: Try[Int] = ???

No matter what the operation returns, you can use a unified syntax:

retry(directOperation)(RetryPolicy.immediate(3))

retry(eitherOperation)(RetryPolicy.immediate(3))

retry(tryOperation)(RetryPolicy.immediate(3))

You can use the shorthands to define various schedules while keeping the default result policy:

retry(directOperation)(RetryPolicy.delay(3, 100.millis))

// defaults: maxDelay = 1.minute, jitter = Jitter.None
retry(directOperation)(RetryPolicy.backoff(3, 100.millis))

retry(directOperation)(
  RetryPolicy.backoff(3, 100.millis, 5.minutes, Jitter.Equal)
)

Or with the infinite variants:

retry(directOperation)(RetryPolicy.delayForever(100.millis))

retry(directOperation)(
  RetryPolicy.backoffForever(100.millis, 5.minutes, Jitter.Full)
)

Finally, you can customize the result policies as well:

// custom success
retry(directOperation)(RetryPolicy(
  Schedule.Immediate(3), 
  ResultPolicy.successfulWhen(_ > 0)
))

// fail fast on certain errors
retry(directOperation)(RetryPolicy(
  Schedule.Immediate(3), 
  ResultPolicy.retryWhen(_.getMessage != "fatal error")
))

retry(eitherOperation)(RetryPolicy(
  Schedule.Immediate(3), 
  ResultPolicy.retryWhen(_ != "fatal error")
))

Feel free to have a look at the tests for full working examples.

Implementation details

Two small areas are worth a closer look from the implementation perspective: direct-style concurrency and stack safety.

Direct-style concurrency

Since Ox leverages the direct style for writing concurrent programs, we can rely on the good old concurrency primitives. Forget functional effect systems – with direct style, it’s perfectly fine to just use a Thread.sleep to introduce a delay between subsequent retry attempts.

This is thanks to the underlying Java 21 with its Virtual Threads: Thread.sleep knows whether it’s running on a platform or on a virtual thread, and behaves as expected when on a virtual one.
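
A minimal sketch of that behavior, assuming the code runs on JDK 21 or newer (the function name sleepsOnVirtualThread is made up for this example):

```scala
// Requires JDK 21+. Thread.sleep is called exactly as it would be on a platform thread.
def sleepsOnVirtualThread(millis: Long): Boolean =
  var wasVirtual = false
  val thread = Thread.startVirtualThread { () =>
    Thread.sleep(millis) // suspends only this virtual thread
    wasVirtual = Thread.currentThread().isVirtual
  }
  thread.join() // join guarantees that the write above is visible here
  wasVirtual
```

Calling sleepsOnVirtualThread(100) should return true, confirming that the sleep happened on a virtual thread without any special API.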

Stack safety

Since we support infinite schedules, there are two ways to implement those: either with a loop, or with recursion.

The latter approach was our choice due to subjectively better readability – however, infinite recursion is tricky, as it leads to a stack overflow sooner or later. This can be solved with tail-call optimization, thanks to which the compiler can rewrite the recursive code into a loop, which doesn’t consume the stack indefinitely.

Tail-call optimization requires the code to have a specific structure: the recursive call must be in tail position, i.e., it must be the last operation the function performs. In Scala, this is verified at compile time as long as you use the @tailrec annotation. With our retry implementation, structuring the code properly came at the cost of leaving some of the code duplicated, but we found readability more important.
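
To make the idea concrete, here is a simplified, hypothetical retry loop with a constant delay. It is not the actual Ox code: retryEither and its parameters are invented for this sketch, and maxRetries = None is used to model an infinite schedule.

```scala
import scala.annotation.tailrec
import scala.concurrent.duration.*

// Hypothetical, simplified retry with a fixed delay; not Ox's implementation.
def retryEither[E, T](operation: => Either[E, T])(
    maxRetries: Option[Int], // None means "retry forever"
    delay: FiniteDuration
): Either[E, T] =
  @tailrec // compilation fails if the recursive call is not in tail position
  def loop(remaining: Option[Int]): Either[E, T] =
    operation match
      case Left(_) if remaining.forall(_ > 0) =>
        if delay.toMillis > 0 then Thread.sleep(delay.toMillis)
        loop(remaining.map(_ - 1)) // tail call, compiled into a loop
      case result => result // a success, or an error with no retries left
  loop(maxRetries)
```

With maxRetries = Some(3), a permanently failing operation is executed four times in total (the initial attempt plus three retries), and with None the loop can run indefinitely without growing the stack.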

Future work

The current implementation of retries is more of an MVP, and we already see some features that are missing, but might be desired. Those include:

  • multi-policies – so that you can assign different retry strategies to different errors, e.g., retry some errors immediately while using exponential backoff for others,
  • side-effecting callbacks for retry attempts – so that you can log errors or update metrics after a failure,
  • composable policies – so that you can fall back to another policy when the original one eventually fails; a possible use case might be retrying immediately a couple of times before introducing a delay between subsequent attempts,
  • a syntax for repeats – so that you can run a succeeding operation multiple times according to a Schedule, with a possible additional stop condition.

Your feedback is invaluable and would help us make Ox truly developer-friendly. Feel free to share your thoughts and suggestions – softwaremill.community and GitHub issues are great ways to do so.

Reviewed by: Michał Matłoka
