The Long Journey - How Byte Code Gets Interpreted by the Machine

I used to hear that the JIT (Just-In-Time) compiler can compile bytecode at multiple levels. However, I had never thought about what that means in practice. In mid-October, I attended a conference where I had the pleasure of listening to a talk about 'lock-free programming'. During this talk, the speakers briefly mentioned JVM internals, which inspired me to dig deeper into the topic. Since one of the best ways to check whether you understand a topic well is to try to explain it to others, I decided to write this article.

Tiered compilation

One of the main differences between Java and languages like C++ is how they are compiled. A C++ compiler translates code directly into machine code that the CPU can execute. In contrast, Java compiles its code into an intermediate form known as bytecode. This bytecode cannot be executed directly by the CPU. It requires additional processing. The JVM serves as an application that translates this bytecode into machine code, enabling it to be executed by the CPU.

It's important to note that the code we write in any JVM-compatible language serves as a recipe for how we want our program to execute.

The JVM can compile our bytecode at five different levels: level 0 (the interpreter), levels 1-3 (the C1 compiler), and level 4 (the C2 compiler). I will illustrate this with an image:

five levels of compilation in JVM

I will take a very simple method as an example to illustrate the difference between each level.

Bytecode (.class)

This is the definition of a method in Java:

private static int resolveNumber(int i) {
    if (i < 20000) return i;
    else return i * 2;
}

After compilation, we can examine the generated bytecode using the following command:

javap -c -p target/classes/org/zygiert/Main.class
  • javap is the Java Class File Disassembler that ships with the JDK.
  • The -c option prints the disassembled code.
  • The -p option displays private members as well (this code piece is a fragment of the Main class).

The produced bytecode is a set of operation codes for the JVM that specify the operations to be executed. Below is the bytecode representation of the method:

         0: iload_0
         1: sipush        20000
         4: if_icmpge     9
         7: iload_0
         8: ireturn
         9: iload_0
        10: iconst_2
        11: imul
        12: ireturn

Here's what actually happens:

  1. iload_0 loads the integer from local variable 0 (the method parameter i) onto the stack.
  2. sipush pushes the short integer 20000 onto the stack.
  3. if_icmpge pops and compares the two topmost integer values. If the first integer is greater than or equal to 20000, execution jumps to the instruction at byte offset 9; if not, it continues with the next instruction.

If the integer is less than 20000:

  • iload_0 loads the parameter onto the stack again.
  • ireturn returns the integer from the stack.

If the integer is greater than or equal to 20000:

  • iload_0 loads the parameter onto the stack again.
  • iconst_2 pushes the constant integer 2 onto the stack.
  • imul multiplies the two topmost integers on the stack, leaving the result on the stack.
  • ireturn returns the resulting integer from the stack.

One word of clarification: the numbers on the left are not instruction numbers; they are byte offsets. For example, the iload_0 instruction, which has a byte offset of 0, requires only 1 byte for its operation code. Therefore, immediately following it, at byte offset 1, is the sipush operation code. The sipush instruction needs 3 bytes: 1 for the operation code and 2 additional bytes for the short integer that it pushes onto the stack. This explains why the if_icmpge operation code has a byte offset of 4.
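The offsets above can be recomputed from the instruction lengths given in the JVM specification (one byte for the opcode plus any operand bytes). The following sketch is my own illustration, not part of the article's repository:

```java
import java.util.Arrays;

// Recomputes the byte offsets shown in the javap output from each
// instruction's length in bytes: one byte for the opcode plus operand bytes.
public class BytecodeOffsets {

    public static int[] offsets(int[] instructionLengths) {
        int[] offsets = new int[instructionLengths.length];
        int current = 0;
        for (int i = 0; i < instructionLengths.length; i++) {
            offsets[i] = current;             // offset of instruction i
            current += instructionLengths[i]; // next instruction starts right after
        }
        return offsets;
    }

    public static void main(String[] args) {
        // Instructions of resolveNumber with their sizes:
        // iload_0 (1), sipush (1 opcode + 2 operand bytes = 3),
        // if_icmpge (1 + 2 = 3), iload_0 (1), ireturn (1),
        // iload_0 (1), iconst_2 (1), imul (1), ireturn (1)
        int[] lengths = {1, 3, 3, 1, 1, 1, 1, 1, 1};
        System.out.println(Arrays.toString(offsets(lengths)));
        // prints [0, 1, 4, 7, 8, 9, 10, 11, 12] - matching the javap listing
    }
}
```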

Here are just a few examples of available bytecode operation codes. You can find the full list of them in the JVM specification. However, it's important to note that this is merely an intermediate representation of our code, which must be translated into machine code. This is where the interpreter comes into play.

Interpreter

Nowadays, the JVM uses a template interpreter to convert each bytecode instruction into machine code. Every time you launch an application, the template interpreter generates native code for each bytecode instruction, taking into account factors such as the CPU architecture and the underlying operating system. This native code is then stored in the code cache.

To check how long it takes to generate native code, you can run the following command:

java -Xlog:startuptime -jar target/app.jar 10

In the printed log, you will find a line similar to this:

[0.019s][info][startuptime] Interpreter generation, 0.0014898 secs

As we can see, the interpreter generation process is quite fast.

From this point, the JVM has native code linked to each bytecode instruction. Consequently, when a method is executed in this state, the JVM iterates through its bytecode and executes the corresponding native code for each instruction. This approach allows our application to start relatively quickly, but it is not the most efficient way to run code.
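The iterate-and-dispatch idea can be shown with a toy sketch. This is my own simplified illustration (real interpreters dispatch on opcodes read from the class file, not on hard-coded offsets); it executes the bytecode of resolveNumber on a simulated operand stack:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// A toy interpreter loop: walk the instructions one by one (pc holds the byte
// offset of the current instruction) and run a handler for each of them.
public class ToyInterpreter {

    public static int resolveNumber(int arg) {
        Deque<Integer> stack = new ArrayDeque<>();
        int pc = 0;
        while (true) {
            switch (pc) {
                case 0:  stack.push(arg);   pc = 1;  break; // iload_0
                case 1:  stack.push(20000); pc = 4;  break; // sipush 20000
                case 4: {                                   // if_icmpge 9
                    int value2 = stack.pop();
                    int value1 = stack.pop();
                    pc = (value1 >= value2) ? 9 : 7;
                    break;
                }
                case 7:  stack.push(arg); pc = 8;  break;   // iload_0
                case 8:  return stack.pop();                // ireturn
                case 9:  stack.push(arg); pc = 10; break;   // iload_0
                case 10: stack.push(2);   pc = 11; break;   // iconst_2
                case 11:                                    // imul
                    stack.push(stack.pop() * stack.pop());
                    pc = 12;
                    break;
                case 12: return stack.pop();                // ireturn
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(resolveNumber(10));    // prints 10
        System.out.println(resolveNumber(30000)); // prints 60000
    }
}
```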

If you wish to analyze the native code generated by the interpreter, you can run the following command:

java -XX:+UnlockDiagnosticVMOptions -XX:+PrintInterpreter -jar target/app.jar 10

This command will produce a lengthy output, but it includes operation code names for each piece of native code, making it easier to identify what you are looking for. Here is an example for the iload_0 operation code:

iload_0  26 iload_0  [0x000071666f9465e0, 0x000071666f946628]  72 bytes
[MachCode]
  0x000071666f9465e0: 4883 ec08 | c5fa 1104 | 24eb 1f48 | 83ec 10c5 | fb11 0424 | eb14 4883 | ec10 4889 | 0424 48c7 
  0x000071666f946600: 4424 0800 | 0000 00eb | 0150 418b | 0641 0fb6 | 5d01 49ff | c549 ba80 | 1237 8666 | 7100 0041 
  0x000071666f946620: ff24 da0f | 1f44 0000 
[/MachCode]

It is also worth mentioning that there is still a C++ interpreter within the JDK, although it is not used by default in modern versions of the JDK.

C1 compiler

The C1 compiler, also known as the client compiler, is part of the JIT compilation process. It activates when the JVM's profiling mechanism detects that certain methods are executed frequently. The C1 compiler operates at three different levels.

Tier 1

This tier is reserved for trivial methods, such as getters, setters, or methods that return constant values. Optimizing these methods further would not be beneficial, so no profiling data is collected at this tier.
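Typical Tier 1 candidates look like the methods below (an illustrative example of mine, not taken from the article's repository):

```java
// Methods so small that further optimization would not pay off:
// plain getters, setters, and constant-returning methods.
public class Config {
    private int port = 8080;

    public int getPort() { return port; }                // getter
    public void setPort(int port) { this.port = port; }  // setter
    public static int maxRetries() { return 3; }         // returns a constant
}
```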

Tier 2

At this level, the compiler uses basic profiling information associated with the method. As a result, the compiled method is smaller in size compared to one compiled at Tier 3. Typically, when the JVM detects that a method is hot, it compiles it to Tier 3 and then to Tier 4.

It's important to clarify what defines a hot method. A method becomes hot in two ways:

  1. Its invocation count exceeds TierXInvocationThreshold multiplied by CompileThresholdScaling.
  2. Its invocation count exceeds TierXMinInvocationThreshold multiplied by CompileThresholdScaling, and the sum of its invocation count and its back-edge count (the number of loop iterations within it) exceeds TierXCompileThreshold multiplied by CompileThresholdScaling.
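The two conditions can be sketched as follows. The flag names mirror the real HotSpot ones (e.g. Tier3InvocationThreshold), but the class itself is my own simplification; the actual decision logic in the JVM takes more inputs into account, and the default values used below should be verified on your build with java -XX:+PrintFlagsFinal -version:

```java
// A simplified model of the two hotness conditions for tiered compilation.
public class HotnessCheck {

    public static boolean isHot(long invocations, long backEdges,
                                long invocationThreshold,
                                long minInvocationThreshold,
                                long compileThreshold,
                                double compileThresholdScaling) {
        // Condition 1: the method has simply been called often enough.
        if (invocations > invocationThreshold * compileThresholdScaling) {
            return true;
        }
        // Condition 2: a minimum number of calls plus many loop iterations.
        return invocations > minInvocationThreshold * compileThresholdScaling
                && invocations + backEdges > compileThreshold * compileThresholdScaling;
    }

    public static void main(String[] args) {
        // Assumed Tier 3 defaults (verify with -XX:+PrintFlagsFinal):
        long invocationThreshold = 200;
        long minInvocationThreshold = 100;
        long compileThreshold = 2000;
        double scaling = 1.0;

        System.out.println(isHot(150, 0, invocationThreshold,
                minInvocationThreshold, compileThreshold, scaling));    // too few calls
        System.out.println(isHot(150, 5000, invocationThreshold,
                minInvocationThreshold, compileThreshold, scaling));    // loop-heavy method
        System.out.println(isHot(250, 0, invocationThreshold,
                minInvocationThreshold, compileThreshold, scaling));    // called often
    }
}
```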

However, if the compilation queue for Tier 4 is excessively long, it may not be efficient to wait. Additionally, keeping a method compiled at Tier 3 can be inefficient due to the increased size and complexity associated with the additional profiling data. In such cases, the method may be recompiled to Tier 2, and it could later be compiled again to Tier 3 and then to Tier 4.

Tier 3

After a method is identified as hot, it is usually compiled to Tier 3, which includes full profiling data. This Tier often serves as a temporary step before progressing to Tier 4. However, as mentioned earlier, there are cases where the method may be compiled to Tier 2 instead of advancing directly to Tier 4.

Check compiled code details

Previously, I mentioned that methods compiled by the C1 compiler differ based on the amount of profiling data they receive, as well as in size. Now, let's explore this in practice.

To ensure the JVM stops compilation at a specific Tier and prints information about method compilation, you can use the following command to target the C1 Tier:

java -XX:+PrintCompilation -XX:TieredStopAtLevel=1 -jar target/app.jar 15000
  • -XX:+PrintCompilation enables printing of compilation details.
  • -XX:TieredStopAtLevel=1 instructs the JVM to stop compilation at Tier 1.

Among other methods listed, you will find the resolveNumber method:

46   20       1       org.zygiert.Main::resolveNumber (13 bytes)
  • 46 is the timestamp in milliseconds since the program started.
  • 20 is the compilation ID.
  • 1 is the compilation Tier.
  • org.zygiert.Main::resolveNumber is the method signature.
  • (13 bytes) indicates the bytecode size.

Having forced the C1 compiler to compile our method to Tier 1, the next step is to check the actual native code generated for it, along with some additional details. To do this, I will use two JVM options:

  • -XX:+UnlockDiagnosticVMOptions, which unlocks JVM diagnostic options.
  • -XX:+PrintAssembly, which prints the native code.

The full command looks as follows:

java -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly -XX:TieredStopAtLevel=1 -jar target/app.jar 15000

Although the output is extensive, it is possible to locate the resolveNumber method within it. Here’s how the top of the printed details appears:

Compiled method (c1) 122   20       1       org.zygiert.Main::resolveNumber (13 bytes)
 total in heap  [0x0000719a289a3888,0x0000719a289a3a60] = 472
 main code      [0x0000719a289a3980,0x0000719a289a3a28] = 168
 stub code      [0x0000719a289a3a28,0x0000719a289a3a58] = 48
 oops           [0x0000719a289a3a58,0x0000719a289a3a60] = 8
 mutable data   [0x00007199d8024530,0x00007199d8024568] = 56
 relocation     [0x00007199d8024530,0x00007199d8024560] = 48
 metadata       [0x00007199d8024560,0x00007199d8024568] = 8
 immutable data [0x00007199d80244b0,0x00007199d8024520] = 112
 dependencies   [0x00007199d80244b0,0x00007199d80244b8] = 8
 scopes pcs     [0x00007199d80244b8,0x00007199d8024508] = 80
 scopes data    [0x00007199d8024508,0x00007199d8024520] = 24

Currently, the most interesting information for me is the total in heap. Although the name of this property may be misleading, it actually indicates the amount of memory in the code cache occupied by this compiled method. For this method, the total size is 472 bytes, while the actual implementation of the method is 168 bytes.

Next, we can repeat the same procedure for Tier 2 and Tier 3. I won’t include the output here, but I will present the sizes for each tier.

                  Tier 1   Tier 2   Tier 3
 total size          472      536      576
 main code size      168      232      272

These results demonstrate why, when it’s not possible to compile a method from Tier 3 to Tier 4, it is reverted to Tier 2. The size is smaller due to the limited profiling data.

C2 compiler

This is the final stage of compilation where hot methods are placed. This stage is often referred to as the server compiler, and it implements several aggressive optimizations, including branch prediction, vectorization, and loop unrolling, among others.

Many of these optimization techniques rely on predictions. This means that, based on profiling data, the JVM assumes a method will be executed in the same manner as it has in the past. However, this assumption may not always hold true. In such cases, the JVM detects the discrepancy and resorts to using interpreted code for that method. Profiling data collection for this method then starts again, and the entire compilation cycle may repeat.

As mentioned earlier, C2-compiled methods are highly optimized and do not contain any profiling data. To view the details, you can run the following command:

java -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly -XX:-BackgroundCompilation -jar target/app.jar 15000

In this command, I included the option:

  • -XX:-BackgroundCompilation which forces the JIT compiler to compile methods synchronously rather than in the background. I found this option necessary because my program finishes too quickly, and without it, the compilation may not complete before the program ends.

With this information, we can extend our previous table by adding the C2-compiled version of the resolveNumber method.

                  Tier 1   Tier 2   Tier 3   Tier 4
 total size          472      536      576      392
 main code size      168      232      272      112

The C2 compiler produces the smallest compiled method, demonstrating that even for simple methods, C2 compilation offers significant advantages. However, a notable drawback is that it is time- and resource-consuming.

It's important to note that we can control the number of threads used by the JIT compiler using the -XX:CICompilerCount option. Additionally, there is another option called -XX:CICompilerCountPerCPU. When enabled, this option allows the JVM to automatically determine the number of JIT compiler threads based on the number of CPU cores available. For instance, with a CPU that has 4 cores, you can expect to see 2 threads for the C2 compiler and 1 thread for the C1 compiler.

There are also several options available to manage the thresholds for each compilation tier. While I won’t go into detail about these options here, you can list them by running a specific command:

java -XX:+PrintFlagsFinal | grep Tier

Code cache

There is one more component that I would like to discuss: the code cache. Once code is compiled, it needs to be stored in a place where it can be easily accessed. The code cache is organized into three segments by default:

  • Non-method (JVM internal) code - this segment stores, among other things, the interpreter's code, which remains in the cache indefinitely. You can control the size of this segment using the option -XX:NonNMethodCodeHeapSize.
  • Profiled code - this segment holds lightly optimized code compiled by the C1 compiler, which tends to have a short lifespan. To set the size of this segment, use the option -XX:ProfiledCodeHeapSize.
  • Non-profiled code - this segment contains fully optimized code compiled by the C2 compiler. You can manage its size with the option -XX:NonProfiledCodeHeapSize.
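You can inspect these segments at runtime through the standard MemoryPoolMXBean API. With the segmented code cache (the default since JDK 9), the pool names on my setup are "CodeHeap 'non-nmethods'", "CodeHeap 'profiled nmethods'", and "CodeHeap 'non-profiled nmethods'"; exact names and sizes may differ depending on your JVM and its flags:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;

// Lists the code cache segments of the running JVM by filtering the
// standard memory pools for the "CodeHeap" prefix.
public class CodeCachePools {
    public static void main(String[] args) {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getName().startsWith("CodeHeap")) {
                System.out.printf("%-35s used=%d bytes, max=%d bytes%n",
                        pool.getName(),
                        pool.getUsage().getUsed(),
                        pool.getUsage().getMax());
            }
        }
    }
}
```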

Additionally, there are several other JVM code cache-related options that can be useful:

  • -XX:InitialCodeCacheSize 
  • -XX:ReservedCodeCacheSize
  • -XX:CodeCacheExpansionSize
  • -XX:+PrintCodeCache

Conclusion
In this blog post, I aimed to explain the fundamental concepts related to transforming bytecode into native code. This process is highly dynamic and can be extensively customized. Although it requires considerable time and resources, it can significantly enhance the performance of our programs in the long run.

All the examples presented in this article were run on: Ubuntu 24.04.3 LTS, x86_64, with OpenJDK Runtime Environment Temurin-25+36 (build 25+36-LTS). The source code for the presented examples can be found on GitHub. I hope this article helps you gain a better understanding of how the JVM generates native code.

Reviewed by: Dariusz Broda, Sebastian Rabiej
