JVM and Kubernetes walk into a bar
Running a service in a Kubernetes cluster looks like a pretty easy task nowadays. In the case of the JVM it looks like this:
- Pack the JAR files into the Docker image
- Create the deployment manifest as a YAML/JSON file
- Apply the manifest to the Kubernetes cluster.
Sounds easy, doesn't it?
But it turns out there are some traps, which in the worst case can make the production system unreliable, and at best cost some (or a lot of) money. What is the problem? TL;DR: resource (CPU and memory) management.
Let’s dive deeper into this topic.
What actually are containers?
The fundamental entity in Kubernetes is a Pod, and a pod is composed of one or more containers. So, to understand how Kubernetes manages resources, the important question is: what actually is a container?
According to Wikipedia:
Linux Containers (LXC) is an operating-system-level virtualization method for running multiple isolated Linux systems (containers) on a control host using a single Linux kernel.
The Linux kernel provides the cgroups functionality that allows limitation and prioritization of resources (CPU, memory, block I/O, network, etc.) without the need for starting any virtual machines, and also the namespace isolation functionality that allows complete isolation of an application's view of the operating environment, including process trees, networking, user IDs and mounted file systems.
Sounds a bit complicated, but from the user's perspective a container is simply a process, isolated from the host (namespaced) together with its dependencies.
Let’s focus on cgroups (short for control groups). How do limitation and prioritization work?
Cgroups are hierarchical, and the child cgroups inherit some attributes from their parents. Cgroups contain many subsystems, but for us, two of them are the most interesting: CPU and memory.
CPU
The CPU subsystem restricts access to the physical CPU with several parameters:
- CPU shares - describes the relative priority of a process. The default is 1024. If you have 3 processes on the same level in the hierarchy, A, B, and C, with CPU shares set to:
A - 1024
B - 512
C - 512
then process A will get twice as much CPU time as processes B and C, while B and C will get the same amount.
- CPU period - works together with CPU quota; it slices CPU time into small pieces.
- CPU quota - decides how much of each CPU period the process gets. For example, if the CPU period is set to 100,000 microseconds and the CPU quota to 50,000 microseconds, the process will get up to 0.5 of a CPU core. The quota can be higher than the CPU period, which makes sense on systems with multiple CPU cores.
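On a node using cgroup v1 (as in the examples later in this post) these knobs are just files, so you can peek at them directly; the paths differ under cgroup v2:

cat /sys/fs/cgroup/cpu/cpu.shares          # relative priority (default 1024)
cat /sys/fs/cgroup/cpu/cpu.cfs_period_us   # length of one period, in microseconds
cat /sys/fs/cgroup/cpu/cpu.cfs_quota_us    # allowed CPU time per period, -1 means no limit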
Memory
The memory subsystem is simpler. From the user's perspective, it limits the amount of memory a process can allocate. If the process tries to allocate more memory than the limit, it is killed by the OS.
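The corresponding memory knob (again assuming cgroup v1) is a single file:

cat /sys/fs/cgroup/memory/memory.limit_in_bytes   # maximum memory for this cgroup, in bytes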
Important to remember from this part: on Kubernetes everything runs in a container, each container is backed by cgroups, and so its access to the CPU and memory can be limited.
Resource management in Kubernetes
Let’s go back to Kubernetes. Access to resources is managed for individual containers; it cannot be set at the pod level, and the pod's total footprint is simply the sum over all of its containers. To make things a bit simpler, let’s consider a pod with only one container:
---
apiVersion: v1
kind: Pod
metadata:
  name: frontend
spec:
  containers:
  - name: app
    image: images.my-company.example/app:v4
    resources:
      requests:
        memory: "64Mi"
        cpu: "250m"
      limits:
        memory: "128Mi"
        cpu: "500m"
As you can see, for each resource (CPU and memory) it’s possible to set two parameters: requests and limits.
Requests are the amount of CPU and memory needed for normal operation. Limits are hard limits - the container will never exceed these values. But the behavior on hitting the limit is completely different for CPU and memory: the CPU is throttled and the process continues its work, while for memory the system is more brutal - it kills the process.
The requests matter to the Kubernetes scheduler: the pod will be placed on a node which has enough free resources. If there is no such node, the pod will remain in the Pending state.
Also very important to remember: tmpfs-backed filesystems, such as an emptyDir volume with medium: Memory, count against the container's memory usage, so it’s not a good idea to store a lot of data there.
QoS on Kubernetes
Kubernetes uses QoS classes to make decisions about scheduling and evicting Pods.
When Kubernetes creates a Pod it assigns one of these QoS classes to the Pod:
- Guaranteed
- Burstable
- BestEffort
Guaranteed
For a Pod to be given a QoS class of Guaranteed:
- Every Container in the Pod must have a memory limit and a memory request.
- For every Container in the Pod, the memory limit must equal the memory request.
- Every Container in the Pod must have a CPU limit and a CPU request.
- For every Container in the Pod, the CPU limit must equal the CPU request.
Such containers won’t be killed unless they try to exceed their memory limits.
Protip: you don’t have to set the requests if you set the limits - if there are limits and no requests, the requests are automatically set to the same values as the limits.
Burstable
A Pod is given a QoS class of Burstable if:
- The Pod does not meet the criteria for QoS class Guaranteed.
- At least one Container in the Pod has a memory or CPU request or limit.
Under memory pressure, such containers are more likely to be killed once they exceed their requests, provided no BestEffort pods exist (those are evicted first).
BestEffort
If no container in the pod has any requests or limits set, the pod is given the BestEffort class. Such a pod is the first candidate to be killed if the node is under memory pressure.
How do requests and limits translate for containers?
After this short recap of cgroups and of resource management in Kubernetes, let’s ask an important question: how are the two related?
To check this let’s create a few pods and set some requests and limits.
First: pod with requests and without limits
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    run: busybox
  name: busybox
spec:
  containers:
  - command:
    - /bin/sh
    - -c
    - sleep 6000
    image: busybox
    name: busybox
    resources:
      requests:
        cpu: 100m
        memory: 250Mi
  restartPolicy: Always
status: {}
Let’s inspect this container on the node:
"linux": {
"resources": {
"devices": [
{
"allow": false,
"access": "rwm"
}
],
"memory": {},
"cpu": {
"shares": 102,
"period": 100000
}
},
As we can see, the memory cgroup settings are empty (so the pod can use all of the node's memory), while the CPU cgroup has cpu.shares set to 102 and the period set to 100,000 microseconds.
Why 102 shares? We requested 100m (millicores), which means 1/10 of one core. One full core corresponds to the default 1024 shares, so 1024/10, truncated to an integer, gives 102.
It’s also important to note that there is no cpu.quota, which means this process can use up to the whole CPU power of the node.
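You can verify this from inside the container as well - assuming the node uses cgroup v1, the value is just a file:

$ kubectl exec busybox -- cat /sys/fs/cgroup/cpu/cpu.shares
102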
Note: since we set the requests but didn’t set the limits, the QoS class is Burstable:
$ kubectl get pods busybox -o jsonpath='{.status.qosClass}'
Burstable
Let’s try to set limits (and implicit requests):
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    run: busybox
  name: busybox
spec:
  containers:
  - command:
    - /bin/sh
    - -c
    - sleep 6000
    image: busybox
    name: busybox
    resources:
      limits:
        cpu: 100m
        memory: 250Mi
  restartPolicy: Always
status: {}
How does it look now inside the container?
"linux": {
"resources": {
"devices": [
{
"allow": false,
"access": "rwm"
}
],
"memory": {
"limit": 262144000
},
"cpu": {
"shares": 102,
"quota": 10000,
"period": 100000
}
},
The memory limit is set to 262144000 bytes (250 MiB - the 250Mi from the manifest). The CPU settings look similar to the previous example, but this time the quota is set to 10,000, so the process can use at most 1/10 of a core.
If we check the QoS class again, this pod is given the Guaranteed class, since all containers have requests and limits set to the same values:
$ kubectl get pods busybox -o jsonpath='{.status.qosClass}'
Guaranteed
How many CPU cores do I have on a node with 8 cores? Container awareness in JVM
Now the interesting part - how does all this relate to the JVM?
Since Java 10 (backported to 8u191 and higher) the JVM “knows” that it lives in a container. What does that mean? Some defaults are calculated based on the container settings, not on the physical machine's resources. This is especially important for the CPU. Why? Because the number of available cores is calculated based on the… CPU shares.
Let’s translate it to real life.
Imagine you have a node with 8 cores. You set the CPU request to 500m (millicores, 0.5 of the core) because it’s a normal demand for your application. But it's a JVM, so you’d like to use all 8 cores during startup (as you know, it can take soooomeeee time to start a JVM). So you don’t set a limit.
Let’s try this out:
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    run: jvm
  name: jvm
spec:
  containers:
  - command:
    - /bin/sh
    - -c
    - sleep 6000
    image: eclipse-temurin:17.0.4_8-jdk-jammy
    name: jvm
    resources:
      requests:
        cpu: 500m
        memory: 250Mi
  restartPolicy: Always
status: {}
Let’s open a shell in the container and run some Java code in jshell:
$ kubectl exec -ti jvm -- bash
root@jvm:/# jshell
Nov 05, 2022 3:15:12 PM java.util.prefs.FileSystemPreferences$1 run
INFO: Created user preferences directory.
| Welcome to JShell -- Version 17.0.4
| For an introduction type: /help intro
jshell> Runtime.getRuntime().availableProcessors()
$1 ==> 1
But… how? We didn’t set a limit, so why does our JVM “see” only 1 processor? Because the value is calculated from cpu.shares: the shares are divided by 1024 and the result is rounded up to a whole number. And remember - the shares are derived from resources.requests even if no limit is set!
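A quick back-of-the-envelope check of that arithmetic in jshell (the numbers are just illustrations of the rule above, continuing the session from before):

jshell> (int) Math.ceil(512 / 1024.0)   // 500m request -> 512 shares -> 1 core
$2 ==> 1
jshell> (int) Math.ceil(2560 / 1024.0)  // 2500m request -> 2560 shares -> 3 cores
$3 ==> 3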
And what about memory? To check this, let’s create a pod with a memory limit set:
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    run: jvm
  name: jvm
spec:
  containers:
  - command:
    - /bin/sh
    - -c
    - sleep 6000
    image: eclipse-temurin:17.0.4_8-jdk-jammy
    name: jvm
    resources:
      limits:
        cpu: 500m
        memory: 1Gi
  dnsPolicy: ClusterFirst
  restartPolicy: Always
status: {}
Let’s check the maximum heap size inside the container:
java -XX:+PrintFlagsFinal -version | grep MaxHeapSize
size_t MaxHeapSize = 268435456
What does it mean? By default, the JVM sets the maximum heap size to 25% of the available memory - and 268435456 bytes is exactly 256 MiB, 25% of our 1Gi limit. That's quite conservative, and you probably don’t want to run a JVM that uses only 256 MiB for the heap in a container with 1 GiB of memory.
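You can see where this default comes from by querying the flag itself (output trimmed; the exact formatting may differ between JDK versions):

root@jvm:/# java -XX:+PrintFlagsFinal -version | grep MaxRAMPercentage
   double MaxRAMPercentage = 25.000000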
What to do with all this knowledge? How to live?
CPU
The JVM doesn’t run well on a single core. We already know we can increase the number of available processors by increasing resources.requests for the container. But this can be costly - if you request four CPUs on an 8-core machine, you will fit only one such pod on the node (the kubelet reserves some CPU power for itself), even if the pod uses only a small fraction of that CPU at runtime.
There is actually a better way - you can set an explicit number of available cores with the -XX:ActiveProcessorCount flag.
Let’s check it, this time using another Java flag, which prints a lot of information about the container the JVM runs in:
root@jvm:/# java -XX:ActiveProcessorCount=4 -XshowSettings:system -version
Operating System Metrics:
    Provider: cgroupv1
    Effective CPU Count: 4
    CPU Period: 100000us
    CPU Quota: 50000us
    CPU Shares: 512us
Memory
As already said, 25% of the memory available to the container doesn’t sound like a proper value for the heap size.
There are two options:
- Set the maximum heap size with the -Xmx flag.
- Change the 25% default to a more reasonable value.
In my opinion, the latter is the better option - the heap size then adjusts dynamically to the memory available to the container. This can be achieved by setting the -XX:MaxRAMPercentage flag.
There is also -XX:MinRAMPercentage, but beware - it’s not what you think. It is not the initial heap size; it’s the maximum heap size on systems with a small amount of memory. Pretty tricky :). The initial heap size can be set with -XX:InitialRAMPercentage.
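For example, in the container with the 1Gi limit from the previous manifest (the 60 and 50 below are just illustrative values):

root@jvm:/# java -XX:MaxRAMPercentage=60 -XX:InitialRAMPercentage=50 \
    -XX:+PrintFlagsFinal -version | grep -E 'InitialHeapSize|MaxHeapSize'

MaxHeapSize should now come out at around 60% of the 1Gi limit (a bit over 600 MiB) and InitialHeapSize at around 50% (about 512 MiB).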
Wrapping up
Some best practices when running the JVM on a Kubernetes cluster:
- Always set resources.requests for the container running the JVM, to help the scheduler place the pod on a suitable node.
- Always set resources.limits to ensure your application won’t misbehave and “eat” all the memory and/or CPU power of the node.
- If you want to be sure the pod won’t be evicted, set resources.requests and resources.limits to the same values so the pod is given the Guaranteed QoS class.
- It’s a good idea to explicitly set the number of available cores with the -XX:ActiveProcessorCount flag.
- Set the -XX:MaxRAMPercentage flag to give the maximum heap size a reasonable value. It's not possible to predict the memory consumption of the whole container - there is the heap, metaspace, off-heap memory, etc. - so 60 (which means 60%) seems to be a fair value.
- Never set -XX:-UseContainerSupport - it disables the support for containers (container awareness) completely.
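Putting the JVM-side recommendations together, a typical startup command could look something like this (the flag values and app.jar are placeholders, to be tuned per application):

java -XX:ActiveProcessorCount=4 \
     -XX:MaxRAMPercentage=60 \
     -jar app.jar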