What is blocking in Loom?

Adam Warski

10 Jul 2023 · 15 minutes read

Project Loom introduces the concept of Virtual Threads to Java's runtime; these will be available as a stable feature in JDK 21 this September. Project Loom aims to combine the performance benefits of asynchronous programming with the simplicity of a direct, "synchronous" programming style.

To achieve the performance goals, any blocking operations need to be handled by Loom's runtime in a special way. Let's investigate how this special handling works and if there are any corner cases when programming using Loom.

Before we begin, it might be helpful to define what we mean by blocking: it's the state of our program's thread, where it doesn't perform any meaningful work (doesn't consume the CPU) but waits for some external condition to occur. Examples include waiting on a semaphore or for data to become available in a socket.
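
To make this concrete, here's a minimal sketch: one thread blocks on a semaphore, consuming no CPU, until another thread releases a permit.

import java.util.concurrent.Semaphore;

var semaphore = new Semaphore(0); // no permits available initially
var t = new Thread(() -> {
    try {
        semaphore.acquire(); // blocks: the thread waits without consuming CPU
    } catch (InterruptedException ex) {
        Thread.currentThread().interrupt();
    }
});
t.start();
Thread.sleep(1000);  // t stays blocked for this entire second
semaphore.release(); // the external condition occurs; acquire() returns
t.join();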

Carrier & virtual threads

As a start, here's a short introduction to the main concepts of Loom. At the basis, we've got platform threads, also known as kernel threads. These are the threads that have been present in Java for a long time; up until now, each running Thread instance corresponds to a single kernel thread. These threads are heavy-weight: expensive to create and expensive to switch between. They are a scarce resource that needs to be carefully managed, e.g., by using a thread pool.

Virtual threads are a new concept that Loom introduces. They are lightweight and cheap to create, both in terms of memory and the time needed to switch contexts. In a JDK with virtual threads enabled, a Thread instance can represent either a platform thread or a virtual one. The API is the same—but the cost of running each varies significantly.
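
For example (a small sketch using the JDK 21 API), both kinds of threads are created through the familiar Thread class:

// same API, very different cost profile:
Thread platform = Thread.ofPlatform().start(() -> System.out.println("on a platform thread"));
Thread virtual = Thread.ofVirtual().start(() -> System.out.println("on a virtual thread"));
platform.join();
virtual.join();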

Behind the scenes, the JVM+Loom runtime keeps a pool of platform threads, called carrier threads, on top of which virtual threads are multiplexed. That is, a small number of platform threads is used to run many virtual threads. Whenever a virtual thread invokes a blocking operation, it should be "put aside" until whatever condition it's waiting for is fulfilled, and another virtual thread can be run on the now-freed carrier thread.

The idea is not new: it's what has been implemented using thread pools and non-blocking I/O in various reactive libraries. However, here this is done at the level of the runtime instead of a library, which gives us much cleaner syntax and useful stack traces, and avoids the virality of, e.g., the Future wrapper.

We'll stop here when it comes to introducing Loom; if you'd like a deeper dive, I can recommend The Ultimate guide to Virtual Threads on Rock the JVM's blog.

One important thing is that for a system to make steady progress (when a larger number of virtual threads are used), the carrier threads have to become free frequently so that virtual threads might be scheduled onto them. Hence, the biggest gains should be seen in I/O-heavy systems, while CPU-heavy applications won't see much improvement from using Loom.

That's under the assumption that any blocking operation frees up the carrier thread so that another virtual thread can take over. But is that so?

Retrofitting blocking

To implement virtual threads, as mentioned above, a large part of Project Loom's contribution is retrofitting existing blocking operations so that they are virtual-thread-aware. That way, when they are invoked, they free up the carrier thread to make it possible for other virtual threads to resume.

For example, whenever code blocks on a semaphore, lock, or another Java concurrency primitive, this won't block the underlying carrier thread but only signal to the runtime that it should capture the continuation of the current virtual thread, put it in a waiting queue and resume once the condition on which the blocking happened is resolved.
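
For instance, in this sketch, a thousand virtual threads all block on a single semaphore; their continuations are parked, and no carrier thread is held while they wait:

import java.util.concurrent.Semaphore;

var semaphore = new Semaphore(0);
for (int i = 0; i < 1000; i++) {
    Thread.ofVirtual().start(() -> {
        try {
            semaphore.acquire(); // parks the virtual thread, frees the carrier
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt();
        }
    });
}
semaphore.release(1000); // all 1000 virtual threads can now resume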

Similarly, Thread.sleep now only blocks the virtual thread, not the carrier thread. This means that the code below (also on GitHub) will complete in 4 seconds, even though only about 15 platform threads are started (the size of the carrier thread pool used to run the virtual threads depends on the number of cores available to the JVM):

var e = Executors.newVirtualThreadPerTaskExecutor();
for (int i = 0; i < 1000; i++) {
    // sleep is a helper wrapping Thread.sleep, handling InterruptedException
    e.submit(() -> { sleep(4000); });
}
e.shutdown();
e.awaitTermination(1, TimeUnit.DAYS);

Synchronization problems

However, not all Java constructs have been retrofitted that way. Such operations include synchronized methods and code blocks. Using them causes the virtual thread to become pinned to the carrier thread. When a thread is pinned, blocking operations will block the underlying carrier thread—precisely as it would happen in pre-Loom times.

For example:

var e = Executors.newVirtualThreadPerTaskExecutor();
for (int i = 0; i < 1000; i++) {
    e.submit(() -> { new Test().test(); });
}
e.shutdown();
e.awaitTermination(1, TimeUnit.DAYS);

class Test {
    // synchronized pins the virtual thread to its carrier for the whole sleep
    synchronized void test() {
        sleep(4000);
    }
}

Note that each submitted task creates a new instance of Test, and hence uses a different monitor for the synchronized void test() method invocation. However, because we use synchronized, the thread becomes pinned to the carrier thread. If the carrier thread pool has 5 threads, only 5 tasks will run simultaneously (as they will block the entire pool). Hence, this code, although not that much different from what we've seen before, will take 1000/5*4 = 800 seconds to complete.

The solution is to use locks instead of the synchronized keyword:

var e = Executors.newVirtualThreadPerTaskExecutor();
for (int i = 0; i < 1000; i++) {
    e.submit(() -> { new Test().test(); });
}
e.shutdown();
e.awaitTermination(1, TimeUnit.DAYS);

class Test {
    private final ReentrantLock lock = new ReentrantLock();

    void test() {
        lock.lock(); // parks only the virtual thread, not the carrier
        try {
            sleep(4000);
        } finally {
            lock.unlock();
        }
    }
}

This will once again complete in 4 seconds when using Loom.

Replacing synchronized blocks with locks within the JDK (where possible) is one more area within the scope of Project Loom, and part of what will be released in JDK 21. Various Java and JVM libraries have already implemented similar changes or are in the process of doing so (e.g., JDBC drivers). However, application code that uses synchronized will need extra care.

Sockets

Are there more areas such as synchronized that cause thread pinning and hence might be detrimental to the performance of our application?

Luckily, one of the most common sources of I/O—network communication using TCP sockets—is fully Loom-aware. This shouldn't be a surprise: socket communication can be performed in a non-blocking manner (either completely non-blocking or using a single "manager" thread which manages the execution of multiple I/O tasks) with mechanisms such as select() or epoll().

And that's what Project Loom uses under the hood to provide a virtual-thread-friendly implementation of sockets. The non-blocking I/O details are hidden, and we get a familiar, synchronous API. A full example of using a java.net.Socket directly would take a lot of space, but if you're curious, here's an example which runs multiple requests concurrently, calling a server which responds after 3 seconds.
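
In condensed form, the pattern looks roughly like this (a sketch, not the linked example itself; the host, port, and request count are placeholders):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.Socket;
import java.util.concurrent.Executors;

try (var executor = Executors.newVirtualThreadPerTaskExecutor()) {
    for (int i = 0; i < 50; i++) {
        executor.submit(() -> {
            try (var socket = new Socket("localhost", 8080);
                 var in = new BufferedReader(new InputStreamReader(socket.getInputStream()))) {
                return in.readLine(); // blocking read: suspends only the virtual thread
            }
        });
    }
} // in JDK 21, close() waits for all submitted tasks to complete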

If you look closely, you'll see InputStream.read invocations wrapped with a BufferedReader, which reads from the socket's input. That's the blocking call, which causes the virtual thread to become suspended. Using Loom, the test completes in 3 seconds, even though we only ever start 16 platform threads in the entire JVM and run 50 concurrent requests.

Things are different, however, with datagram sockets (using the UDP protocol). Once again, we've got a server that responds with a UDP packet after 3 seconds, but if you try to run the client, you'll see that DatagramSocket.receive causes thread pinning, and hence only a handful of clients (up to the size of the carrier thread pool) will make progress at any point in time. The entire test will thus take longer.
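
The problematic call boils down to something like this sketch:

import java.net.DatagramPacket;
import java.net.DatagramSocket;

try (var socket = new DatagramSocket()) {
    var buf = new byte[1024];
    var packet = new DatagramPacket(buf, buf.length);
    socket.receive(packet); // pins: blocks the carrier thread until a packet arrives
}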

Hence we've encountered at least two sources of thread pinning: synchronized blocks and datagram sockets. Are there more?

Files & DNS

Yes, there are more APIs that cause thread pinning and blocking of the carrier thread, namely all file operations (such as reading from a FileInputStream, writing to a file, listing directories, etc.), as well as resolving domain names to IP addresses using InetSocketAddress.
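
For example, each of these seemingly innocent calls blocks the carrier thread (a sketch; the paths and host name are placeholders):

import java.io.File;
import java.io.FileInputStream;
import java.net.InetSocketAddress;

try (var in = new FileInputStream("/tmp/data.bin")) {
    byte[] data = in.readAllBytes();                 // file read: blocks the carrier
}
String[] entries = new File("/tmp").list();          // directory listing: blocks the carrier
var addr = new InetSocketAddress("example.org", 80); // DNS resolution: blocks the carrier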

That might look worrying, but Loom does take some remedial steps. If you take a look at the source code of FileInputStream, InetSocketAddress or DatagramSocket, you'll notice usages of the jdk.internal.misc.Blocker class. Invocations to its begin()/end() methods surround any carrier-thread-blocking calls.

The JavaDoc reveals its purpose:

/**
 * Defines static methods to mark the beginning and end of a possibly blocking
 * operation. The methods are intended to be used with try-finally as follows:
 * {@snippet lang=java :
 *  long comp = Blocker.begin();
 *  try {
 *    // blocking operation
 *  } finally {
 *    Blocker.end(comp);
 *  }
 * }
 * If invoked from a virtual thread and the underlying carrier thread is a
 * CarrierThread then the code in the block runs as if it were in run in
 * ForkJoinPool.ManagedBlocker. This means the pool can be expanded to support
 * additional parallelism during the blocking operation.
 */
public class Blocker

In other words, the carrier thread pool might be expanded when a blocking operation is encountered to compensate for the thread-pinning that occurs. A new carrier thread might be started, which will be able to run virtual threads.

By the way: looking at usages of the Blocker class is also a good way to discover which APIs cause thread pinning in a particular version of the JDK.

Discovering pinned threads and carrier pool configuration

If thread pinning causes discomfort, you are not alone—indeed, that's something to look out for. Luckily, the JVM allows us to report any instances of this happening using the -Djdk.tracePinnedThreads=short flag (there's also a full option that gives the complete stack trace).

This gives us messages such as:

Thread[#91,ForkJoinPool-1-worker-7,5,CarrierThreads]
    java.base/sun.nio.ch.DatagramSocketAdaptor.receive(DatagramSocketAdaptor.java:241) <== monitors:1

Moreover, you can control the initial and maximum size of the carrier thread pool using the jdk.virtualThreadScheduler.parallelism, jdk.virtualThreadScheduler.maxPoolSize and jdk.virtualThreadScheduler.minRunnable configuration options. These are directly translated to constructor arguments of the ForkJoinPool.
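
For example, to start an application with a carrier pool of 4 threads, allowed to temporarily grow to at most 16 (the values and the MyApp class name are arbitrary placeholders):

java -Djdk.virtualThreadScheduler.parallelism=4 \
     -Djdk.virtualThreadScheduler.maxPoolSize=16 \
     -Djdk.tracePinnedThreads=short \
     MyApp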

While this won't let you avoid thread pinning, you can at least identify when it happens and, if needed, adjust the problematic code paths accordingly.

Are we doomed to blocking?

In a way, yes—some operations are inherently blocking due to how our operating systems are designed.

For example, page faults: when our application references memory that has been swapped out to disk, this causes a page fault. The operating system must fetch the memory content from the disk, and only then can our application continue. While this fetching takes place, the thread accessing the given page is blocked. This is outside the control of the JVM or any other runtime: such low-level memory management is the sole responsibility of the kernel and is completely transparent to the application. Hence, we won't be able to entirely avoid blocking, even if it's only in the form of page faults.

Secondly, what about files? For the kernel, reading from a socket might block, as data in the socket might not yet be available (the socket might not be "ready"). When we try to read from a socket, we might have to wait until data arrives over the network. The situation is different with files, which are read from locally available block devices. There, data is always available; it might only be necessary to copy it from the disk to memory.

In a way, from the kernel's perspective, file operations never block in the way that socket operations do. Reading files is somewhat similar to page faults. And because of that, all kernel APIs for accessing files are ultimately blocking (in the sense we defined at the beginning).

select, which solved our problems with sockets, won't help when used with descriptors corresponding to local files: they will immediately be reported as "ready". As for epoll/aio, you can't use these APIs with local files at all. There's nothing Java, or any other runtime, can do to make these calls non-blocking.

Will io_uring save the day?

You might have heard about io_uring: a new addition to the kernel which specifically targets providing a uniform asynchronous API for all kinds of I/O operations, including, first and foremost, operations on local files. Can't Loom use that?

It can, and it probably will (probably only for local files, as io_uring's performance gains over epoll aren't consistent, and the implementation itself frequently has security vulnerabilities). However, this only shifts the problem instead of fully solving it.

From the application's perspective, we get a non-blocking, asynchronous API for file access. But if we look at what happens under the covers in io_uring, we'll discover that it manages a thread pool for blocking operations, such as these on local files. Hence instead of running compensating threads in the JVM, we'll get threads run and managed by io_uring.

This has some configuration implications. If you'd like to set an upper bound on the number of kernel threads used by your application, you'll now have to configure both the JVM with its carrier thread pool, as well as io_uring, to cap the maximum number of threads it starts. Luckily, there's a great article describing how to do precisely that.

Do we want a uniform I/O interface?

We can see now that not all I/O operations are the same: the two most common ones, operating on network sockets and local files, have different behavior at the kernel level. Initiatives such as io_uring might try to provide a uniform interface, but deep down, one type of I/O is non-blocking, while the other is blocking and will be run on a thread pool.

Since the characteristics of both types of I/O operations differ, does a uniform interface make sense? Isn't it a leaky abstraction?

There's an interesting Mastodon thread on exactly that topic by Daniel Spiewak. Daniel argues that because the blocking behavior is different in the case of files and sockets, this should not be hidden behind an abstraction layer such as io_uring or Loom's virtual threads but instead exposed to the developer. That's because their usage patterns should be different, and any blocking calls should be batched & protected using a gateway, such as with a semaphore or a queue.

On the other hand, I would argue that even if I/O is non-blocking, such as in the case of sockets, it's still not free. It might be cheaper to use than blocking I/O, but in our code, we should properly gate usage of all kinds of I/O. The specific limits on how much concurrency we allow for each type of operation might be different, but they still should be there.
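
Such gating might look as follows (a sketch; the limit of 10 concurrent reads is an arbitrary placeholder, not a recommendation):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.Semaphore;

class GatedReads {
    private final Semaphore gate = new Semaphore(10); // at most 10 reads in flight

    byte[] read(Path path) throws IOException, InterruptedException {
        gate.acquire(); // virtual-thread-friendly: parks only the virtual thread
        try {
            return Files.readAllBytes(path); // blocks (and pins) a carrier thread
        } finally {
            gate.release();
        }
    }
}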

It's crucial to know in advance, of course, what the characteristics of each type of I/O operation are. Hence, Daniel is right that creating such a uniform interface can make things harder: it's not apparent at first glance (e.g., no help from the type system) how a given operation might behave at runtime.

How much control do we want?

Daniel also correctly observes that this blocking/non-blocking distinction only really matters in high-load scenarios; but that's also the scale at which approaches such as async, Loom, or io_uring matter at all. And that's where you might want to know exactly what the consequences of invoking a given method might be.

He finishes with a philosophical question as to how strictly the runtime should adhere to the user's requests. Should we be able to say: "run this I/O operation in a non-blocking way"—and if that's not possible (e.g., because we are operating on a local file)—should the entire invocation fail? Or should the runtime always strive to complete our requests, regardless of the cost?

The answer here might be some kind of hierarchical capability system—so that we might have both fine-grained BlockingIO and NonBlockingIO capabilities, as well as more coarse-grained IO ones. That way, the type system could guide us so that it would be possible to properly handle each case: both its failure modes and runtime characteristics.
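
Purely as a thought experiment, such a hierarchy might be sketched in Java as follows (none of these types exist in any library; this is a hypothetical design):

sealed interface IO permits BlockingIO, NonBlockingIO {}
non-sealed interface BlockingIO extends IO {}
non-sealed interface NonBlockingIO extends IO {}

// an API could then declare exactly which capability it needs:
interface Storage {
    byte[] readFile(BlockingIO capability, String path);         // may block a kernel thread
    byte[] readSocket(NonBlockingIO capability, String address); // suspends the virtual thread only
}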

Summing up

There are a couple of things that we've learned along the way:

  • The distinction between carrier and virtual threads and that Loom runs many virtual threads on a small pool of carrier threads.
  • Loom includes retrofitting blocking operations so that they are virtual-thread-aware.
  • Concurrency primitives, such as locks, semaphores, and socket-based networking, have been migrated to use virtual threads properly.
  • However, that's not true for all blocking APIs in the JDK. When a blocking call doesn't release its carrier thread, we deal with thread pinning.
  • Thread pinning occurs when using the synchronized keyword and when doing file I/O, using datagram sockets, or resolving domain names.
  • Some solutions help avoid thread pinning: migrating from synchronized to using locks and spawning compensating carrier threads. The latter is done automatically (up to the configured limit) by the JDK when a carrier-thread blocking I/O operation is invoked.
  • Some operations are inherently blocking at the kernel level (such as file I/O), and it won't be possible for Loom to implement them in a non-blocking way.
  • io_uring is a partial solution to the problem, as internally, it uses a thread pool for running blocking operations (such as file I/O).

We finished with an open question: given that no runtime or API can truly provide universally non-blocking I/O, is an abstraction that tries to provide a uniform interface for all blocking and non-blocking operations a good idea?

The answer may well be, "it depends". At a certain scale, these things start to matter. At smaller scales, they might not matter at all. Abstractions such as Loom or io_uring are leaky and might be misleading. Maybe it would be good to have some guidance from the type system, or to convey this kind of information in another way. Finally, we might want to have a way to instruct our runtimes to fail if an I/O operation can't be run in a given way.

Loom does push the JVM forward significantly, and delivers on its performance goals, along with a simplified programming model; but we can't blindly trust it to remove all sources of kernel thread blocking from our applications. Potentially, this might introduce a new source of performance-related problems, while solving other ones.

What are your thoughts on the subject—should blocking and non-blocking operations be distinguished? If so, how—at the type level, using a naming convention, or maybe otherwise? What kind of configuration should our runtimes provide so that we can be sure how a given I/O operation will be executed?
