On randomness in data

Picking a random number might seem like a no-brainer for us humans. We just close our eyes and say the first number that comes to mind. But is it that simple for computers? By design, they are supposed to be predictable: given the same input data and program, a computer should always yield identical results.

Yet randomness is crucial in IT. For instance, random data is used to generate certificate keys and access tokens. If attackers could predict the generated content, the security of the application would be compromised.

So how is random data actually generated?

Pseudorandom numbers

You might have heard of pseudorandom number generators (PRNGs). A PRNG is a deterministic algorithm that generates a sequence of numbers that only looks random.

A PRNG is initialized with a starting value called the seed. A PRNG started with a given seed will always yield the same series of numbers. This is sometimes a very handy property, for instance in unit testing. If input data generated by a PRNG causes a test failure, we can deterministically reproduce the error by passing the same seed.
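
The reproducibility described above can be sketched in a few lines of Python. This is only an illustration: the helper name sample_inputs and the seed values are made up for the example, and random.Random is the standard library's general-purpose (non-cryptographic) generator.

```python
import random

def sample_inputs(seed, count=5):
    """Return `count` pseudorandom integers from a generator started with `seed`."""
    rng = random.Random(seed)  # private generator, independent of the global one
    return [rng.randint(0, 999) for _ in range(count)]

first_run = sample_inputs(seed=1234)
second_run = sample_inputs(seed=1234)

# The same seed reproduces exactly the same "random" test data.
print(first_run == second_run)  # True
```

A failing test that logs its seed can therefore be replayed later with identical input data.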

Numbers derived from PRNGs need to have appropriate statistical properties, and in practice this is not always the case. Typical shortcomings of an algorithm are unwanted correlations between generated numbers or an uneven distribution of them.
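
To make the idea of an unwanted correlation concrete, here is a small sketch of a linear congruential generator, one of the simplest PRNG designs. It is not taken from any particular library. The multiplier, increment, and modulus are just commonly quoted textbook constants chosen for illustration. With a power-of-two modulus and odd constants, the lowest bit of the output does not look random at all.

```python
def lcg(seed, a=1103515245, c=12345, m=2**31):
    """Yield an endless stream of values from a linear congruential generator."""
    state = seed
    while True:
        state = (a * state + c) % m
        yield state

gen = lcg(seed=42)
low_bits = [next(gen) & 1 for _ in range(12)]

# The lowest bit simply flips on every step: a strong, unwanted pattern.
print(low_bits)  # [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
```

Anyone who used only the lowest bit of such a generator, for example to simulate a coin toss, would get a perfectly predictable sequence.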

For a PRNG algorithm to be suitable for cryptographic purposes, it needs to pass stringent statistical tests. More precisely, a cryptographically safe pseudorandom generator must give an attacker only a negligible advantage in distinguishing its output sequence from a truly random sequence.

Still, values generated by such a source are only safe if the adversary doesn’t know the seed. Using predictable seeds can cause serious security problems. One example of such a vulnerability comes from the early days of the Internet. Back in 1994, the SSL encryption in the Netscape Navigator browser relied on numbers generated by a PRNG. The implementation seeded the algorithm from three sources: the ID of the process, the ID of the parent process, and the current time. These values turned out to be easily guessable. By working out the seed used in the PRNG, an attacker could decrypt the traffic.
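
The following sketch shows why guessable seed ingredients are fatal. It is a toy model, not a reconstruction of the actual Netscape code. The functions make_key and guess_seed, the way the three values are packed into a seed, and the small search ranges are all assumptions made for the example.

```python
import random

def make_key(timestamp, pid, ppid):
    """Hypothetical weak scheme: derive a 128-bit key from guessable values."""
    seed = (timestamp << 32) | (pid << 16) | ppid
    return random.Random(seed).getrandbits(128)

def guess_seed(observed_key, approx_time, slack=2, max_pid=300, max_ppid=30):
    """Try every plausible (timestamp, pid, ppid) until the key matches."""
    for timestamp in range(approx_time - slack, approx_time + slack + 1):
        for pid in range(1, max_pid + 1):
            for ppid in range(1, max_ppid + 1):
                if make_key(timestamp, pid, ppid) == observed_key:
                    return timestamp, pid, ppid
    return None

# The "victim" creates a key; the attacker only knows roughly when.
victim_key = make_key(timestamp=800_000_001, pid=217, ppid=12)
print(guess_seed(victim_key, approx_time=800_000_000))
```

The key is 128 bits long, yet the attacker needs at most a few tens of thousands of guesses, because the effective search space is the seed and not the key.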

It seems, then, that to safely initialize a random generator we first need a random number. How can we get one?

Cryptographically-safe random generators

Interestingly, human activity is a very good source of randomness. Mouse movements and keystrokes happen at irregular intervals, so their timing, if measured, can be a source of random values. Linux samples such events (along with other hard-to-predict ones, such as interrupt timings), turns the measurements into bytes, and mixes them into the so-called entropy pool. As a rule of thumb, the more entropy is gathered in the pool, the more robust the random generator becomes.
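
As a rough mental model of an entropy pool, consider the toy class below. It is not how the Linux kernel implements its pool. The class name ToyEntropyPool and its methods are invented for this sketch, and a real implementation also estimates how much entropy each event actually contributes.

```python
import hashlib
import time

class ToyEntropyPool:
    """Illustrative only: mixes event timestamps into a running hash."""

    def __init__(self):
        self._state = hashlib.sha256()
        self.events_mixed = 0

    def add_event(self, timestamp_ns):
        """Mix one event timestamp (in nanoseconds) into the pool."""
        masked = timestamp_ns & (2**64 - 1)  # keep it within 8 bytes
        self._state.update(masked.to_bytes(8, "big"))
        self.events_mixed += 1

    def extract(self):
        """Return 32 bytes derived from everything mixed in so far."""
        return self._state.copy().digest()

pool = ToyEntropyPool()
for _ in range(5):
    # Stand-in for "a key was pressed right now".
    pool.add_event(time.perf_counter_ns())
print(pool.extract().hex())
```

Every additional event changes the extracted bytes, so an attacker would have to know the exact timing of all events to reproduce the output.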

The random bytes can later be read from the special file /dev/random. Traditionally, if the pool hasn’t gathered sufficient entropy, a read from /dev/random blocks. This can happen when the system has just restarted and hasn’t yet been able to fill the pool, or when values are read faster than they’re produced and the pool is depleted. (Since version 5.6, the Linux kernel blocks such reads only until the pool has been initialized.)

Linux also offers a non-blocking alternative called /dev/urandom (often read as “unlimited random”), which reuses the internal pool to produce more pseudorandom bits. A read from urandom therefore never blocks, but the output may contain less entropy. In theory, that makes it less safe, but at the time I write this article, no practical attack is known. Some other Unix-like systems also provide /dev/arandom, which blocks only until there’s enough entropy to initialize the seed and then never blocks again, because the following bytes are generated by a PRNG algorithm.
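
From a program, the operating system's generator can be reached without much ceremony. The sketch below uses Python's os.urandom(), which asks the OS for random bytes, and then reads the device file directly. The helper read_device is a made-up name, and the direct read is guarded because /dev/urandom exists only on Unix-like systems.

```python
import os

key = os.urandom(16)  # 16 bytes from the operating system's generator
print(key.hex())

def read_device(path="/dev/urandom", count=16):
    """Read `count` bytes directly from a random device file."""
    with open(path, "rb") as device:
        return device.read(count)

if os.path.exists("/dev/urandom"):
    print(read_device().hex())
```

In application code, the first form is usually preferable, since it is portable and leaves the choice of the underlying source to the platform.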

Random values can also come from hardware sources, so-called hardware random number generators (HRNGs) or true random number generators (TRNGs). It turns out that observing physical events happening all around us is enough to get random samples of data. It doesn’t matter whether it is electrical noise or the jitter of a clock chip: if it’s hard to predict, it’s a viable source of randomness. The HRNG’s job is just to measure one of these natural phenomena and fill its internal entropy pool. Intel processors, starting with Ivy Bridge, include a built-in HRNG that samples thermal noise within the processor. The values can then be retrieved with the RDRAND instruction and, on newer processors, RDSEED.

Another amazing example of using measurements of physical events to gather entropy comes from Cloudflare. The company keeps a set of lava lamps on shelves. A camera photographs the wall at regular intervals, and the digitized image is used as a source of random bytes. Since the movement of the fluid inside a lamp is practically impossible to predict, the lamps are a very good source of randomness. Cloudflare’s employees call it a “wall of entropy”.

HRNGs are very useful in environments with little or no noise coming from user interaction, such as servers. They tend to be slower than PRNGs, so a value from the hardware generator is very often used only to seed a pseudorandom algorithm, especially when a high volume of random bytes is required. To increase safety, the seed can be rotated every once in a while.
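
The seed-then-expand pattern with periodic reseeding might look roughly like the sketch below. It is an illustration of the idea only and not a vetted design: the class ReseedingGenerator is hypothetical, os.urandom() merely stands in for a slow hardware source, and hashing a seed together with a counter is a simplification of what standardized generators do. Real applications should rely on the generator shipped with the operating system or a reviewed library.

```python
import hashlib
import os

class ReseedingGenerator:
    """Illustrative sketch: fast hash-based stream, reseeded from a slow source."""

    def __init__(self, reseed_after=1024, entropy_source=os.urandom):
        self._entropy_source = entropy_source  # stand-in for a hardware RNG
        self._reseed_after = reseed_after      # bytes to emit before rotating the seed
        self.reseed_count = 0
        self._reseed()

    def _reseed(self):
        """Fetch a fresh 32-byte seed and restart the counter."""
        self._seed = self._entropy_source(32)
        self._counter = 0
        self._bytes_since_reseed = 0
        self.reseed_count += 1

    def random_bytes(self, count):
        """Return `count` bytes, rotating the seed when the budget is used up."""
        out = bytearray()
        while len(out) < count:
            if self._bytes_since_reseed >= self._reseed_after:
                self._reseed()
            block_input = self._seed + self._counter.to_bytes(8, "big")
            block = hashlib.sha256(block_input).digest()
            self._counter += 1
            self._bytes_since_reseed += len(block)
            out += block
        return bytes(out[:count])

gen = ReseedingGenerator(reseed_after=64)
data = gen.random_bytes(200)
print(len(data), gen.reseed_count)
```

The slow source is consulted only once per reseed_after bytes, while the cheap hash function produces the bulk of the output.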

Why should we care?

It might seem that the quality of random data doesn’t concern us unless we deal directly with cryptography, but that is not necessarily the case.

Using a safe random generator is important whenever you generate a value that should not be guessable, such as a password reset token. By using weak PRNGs or predictable seeds, you can introduce security vulnerabilities into your web app. For such applications, remember to use a cryptographically secure generator in Java, os.urandom() in Python, or similar functions in other languages.
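
For the password reset case, a minimal Python sketch could look as follows. It uses the standard library's secrets module, which draws from the operating system's generator. The token length and the helper tokens_match are choices made for this example, not a prescription.

```python
import secrets

# 32 random bytes, encoded as URL-safe text that fits into a reset link.
reset_token = secrets.token_urlsafe(32)
print(reset_token)

def tokens_match(expected, provided):
    """Compare tokens in constant time to avoid leaking information via timing."""
    return secrets.compare_digest(expected.encode(), provided.encode())

print(tokens_match(reset_token, reset_token))  # True
```

By contrast, a token built with the general-purpose random module would depend on a seed that an attacker might be able to reconstruct.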

Take care!
