JVM and Kubernetes walk into a bar

Running a service in a Kubernetes cluster looks like a pretty easy task nowadays. In the case of the JVM, it looks like this:

  1. Pack the JAR files into a Docker image.
  2. Create the deployment manifest as a YAML/JSON file.
  3. Apply the manifest to the Kubernetes cluster.

Sounds easy, doesn't it?

But it turns out there are some traps which, in the worst case, can make the production system unreliable and, at best, cost some (or a lot of) money. What is the problem? TL;DR: resource management (CPU and memory).

Let’s dive deeper into this topic.

What actually are containers?

The fundamental entity in Kubernetes is a Pod, and a pod is composed of one or more containers. So the key question for understanding how Kubernetes manages resources is: what actually is a container?

According to Wikipedia:

Linux Containers (LXC) is an operating-system-level virtualization method for running multiple isolated Linux systems (containers) on a control host using a single Linux kernel.

The Linux kernel provides the cgroups functionality that allows limitation and prioritization of resources (CPU, memory, block I/O, network, etc.) without the need for starting any virtual machines, and also the namespace isolation functionality that allows complete isolation of an application's view of the operating environment, including process trees, networking, user IDs and mounted file systems.

Sounds a bit complicated, but from the user's perspective, a container is simply a process that is isolated from the host (namespaced) together with its dependencies.

Let's focus on cgroups (short for control groups). How do limitation and prioritization work?

Cgroups are hierarchical, and child cgroups inherit some attributes from their parents. There are many cgroup subsystems, but two of them are the most interesting for us: CPU and memory.

CPU

The CPU subsystem restricts access to the physical CPU with several parameters:

  • CPU shares - describe the relative priority of the process. The default is 1024. If you have 3 processes on the same level of the hierarchy, A, B, and C, with CPU shares set to:
    A - 1024
    B - 512
    C - 512
    then process A will get twice as much CPU time as processes B and C, while B and C will get the same amount.
  • CPU period - works together with CPU quota; it slices CPU time into fixed intervals.
  • CPU quota - decides how much of each CPU period the process gets. For example, if the CPU period is set to 100,000 microseconds and the CPU quota to 50,000 microseconds, the process will get up to 0.5 of a CPU core. The quota can be higher than the CPU period, which makes sense on systems with multiple CPU cores (see the sketch below).
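
To make these knobs more tangible, here is a minimal Java sketch (assuming cgroup v1, which is what the cluster later in this post uses, mounted at the conventional /sys/fs/cgroup path) that reads the CPU settings the kernel exposes to a containerized process:

import java.nio.file.Files;
import java.nio.file.Path;

public class CgroupCpu {
    public static void main(String[] args) throws Exception {
        // cgroup v1 files; on cgroup v2, quota and period live together in a single cpu.max file
        long shares = Long.parseLong(Files.readString(Path.of("/sys/fs/cgroup/cpu/cpu.shares")).trim());
        long quota  = Long.parseLong(Files.readString(Path.of("/sys/fs/cgroup/cpu/cpu.cfs_quota_us")).trim());
        long period = Long.parseLong(Files.readString(Path.of("/sys/fs/cgroup/cpu/cpu.cfs_period_us")).trim());

        System.out.println("cpu.shares: " + shares);
        // a quota of -1 means "no limit" - the process may use all cores of the node
        System.out.println(quota < 0 ? "no CPU quota (unlimited)"
                                     : "CPU limit: " + (double) quota / period + " cores");
    }
}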

Memory

The memory subsystem is simpler. From the user's perspective, it can limit the amount of memory the process is allowed to allocate. If the process tries to allocate more memory than the limit, it is killed by the kernel's OOM killer.
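
A matching fragment for the memory subsystem (same cgroup v1 assumption as in the CPU sketch above):

// memory.limit_in_bytes holds the limit in bytes; a huge value means "effectively unlimited"
// (on cgroup v2 the file is memory.max and may literally contain "max")
long memoryLimit = Long.parseLong(
        Files.readString(Path.of("/sys/fs/cgroup/memory/memory.limit_in_bytes")).trim());
System.out.println("memory limit: " + memoryLimit / (1024 * 1024) + " MiB");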

The important thing to remember from this part: on Kubernetes, everything runs in a container, and each container uses cgroups, so its access to CPU and memory can be limited.

Resource management in Kubernetes

Let's go back to Kubernetes. It's possible to manage access to resources for individual containers. It's not possible to manage it on the pod level, but it's important to remember that the resources effectively allocated to a pod are the sum of the resources of all its containers. To make things a bit simpler, let's consider a pod with only one container:

---
apiVersion: v1
kind: Pod
metadata:
  name: frontend
spec:
  containers:
  - name: app
    image: images.my-company.example/app:v4
    resources:
      requests:
        memory: "64Mi"
        cpu: "250m"
      limits:
        memory: "128Mi"
        cpu: "500m"

As you can see, for each resource (CPU and memory) it's possible to set two parameters: requests and limits.

Requests are the amount of CPU and memory needed for normal operation. Limits are hard limits - the container will never exceed these values. But the behavior for CPU and memory is completely different: when the CPU limit is hit, the process is throttled and continues its work; for memory, the system is more brutal - it kills the process.

The requests are important to the Kubernetes scheduler: the pod will be placed on a node that has enough free resources. If there is no such node, the pod will remain in the Pending state.

Very important to remember: a tmpfs filesystem, such as an emptyDir volume backed by memory, is also counted as memory used by the container, so it's not a good idea to store a lot of data there.

QoS on Kubernetes

Kubernetes uses QoS classes to make decisions about scheduling and evicting Pods.

When Kubernetes creates a Pod it assigns one of these QoS classes to the Pod:

  • Guaranteed
  • Burstable
  • BestEffort

Guaranteed

For a Pod to be given a QoS class of Guaranteed:

  • Every Container in the Pod must have a memory limit and a memory request.
  • For every Container in the Pod, the memory limit must equal the memory request.
  • Every Container in the Pod must have a CPU limit and a CPU request.
  • For every Container in the Pod, the CPU limit must equal the CPU request.

Such containers won’t be killed unless they try to exceed their memory limits.

Pro tip: you don't have to set the requests if you set the limits. If there are limits and no requests, the requests are automatically set to the same values as the limits.

Burstable

A Pod is given a QoS class of Burstable if:

  • The Pod does not meet the criteria for QoS class Guaranteed.
  • At least one Container in the Pod has a memory or CPU request or limit.

Under node memory pressure, such containers are more likely to be killed once they exceed their requests, provided no BestEffort pods exist.

BestEffort

If no container in the pod has any requests or limits set, the pod is given the BestEffort class. Such a pod is the first candidate to be killed if the node is under resource pressure.

How do requests and limits translate to container settings?

After this short recap of cgroups and of resource management in the Kubernetes cluster, let's ask an important question: how are the two related?

To check this, let's create a few pods with different requests and limits.

First: a pod with requests and without limits.

apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    run: busybox
  name: busybox
spec:
  containers:
  - command:
    - /bin/sh
    - -c
    - sleep 6000
    image: busybox
    name: busybox
    resources:
      requests:
        cpu: 100m
        memory: 250Mi
  restartPolicy: Always
status: {}

Let's inspect this container's configuration on the node:

"linux": {
        "resources": {
          "devices": [
            {
              "allow": false,
              "access": "rwm"
            }
          ],
          "memory": {},
          "cpu": {
            "shares": 102,
            "period": 100000
          }
        },

As we can see, the memory section is empty (so the container can use all of the node's memory), and the CPU has cpu.shares set to 102 and the period to 100,000 microseconds.

Why 102 shares? We requested 100m (millicores), which is 1/10 of a core. One full core (1000m) corresponds to the default 1024 shares, so 1024/10 ≈ 102.
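
The conversion is easy to reproduce; here is a tiny sketch of the millicores-to-shares arithmetic (an illustration you can paste into jshell, not the Kubernetes source):

long millicoresToShares(long millicores) {
    // one full core (1000m) maps to 1024 shares; the kernel enforces a minimum of 2
    return Math.max(2, millicores * 1024 / 1000);
}

millicoresToShares(100)   // -> 102, as seen above
millicoresToShares(500)   // -> 512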

It's important to note that there is no CPU quota, which means this process can use up to the whole CPU power of the node.

Note that since we set the requests but didn't set the limits, the QoS class is Burstable:

$ kubectl get pods busybox -o jsonpath='{.status.qosClass}'
Burstable

Let’s try to set limits (and implicit requests):

apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    run: busybox
  name: busybox
spec:
  containers:
  - command:
    - /bin/sh
    - -c
    - sleep 6000
    image: busybox
    name: busybox
    resources:
      limits:
        cpu: 100m
        memory: 250Mi
  restartPolicy: Always
status: {}

How does it look now inside the container?

"linux": {
        "resources": {
          "devices": [
            {
              "allow": false,
              "access": "rwm"
            }
          ],
          "memory": {
            "limit": 262144000
          },
          "cpu": {
            "shares": 102,
            "quota": 10000,
            "period": 100000
          }
        },

The memory limit is set to 262144000 bytes (that is the 250Mi from the manifest). The CPU looks similar to the previous example, but this time the quota is set to 10,000, so the process can use up to 1/10 of a core.

If we check the QoS class again, this pod is given the Guaranteed class, since all containers have requests and limits set to the same values:

$ kubectl get pods busybox -o jsonpath='{.status.qosClass}'
Guaranteed

How many CPU cores do I have on a node with 8 cores? Container awareness in the JVM

Now for the interesting part - how is all of this related to the JVM?

From Java 11 (backported to 8u191 and higher), the JVM "knows" it lives in a container. What does that mean? Some defaults are calculated based on the container settings, not on the physical machine's resources. This is especially important for the CPU, because the number of available cores is calculated based on the… CPU shares.

Let’s translate it to real life.

Imagine you have a node with 8 cores. You set the CPU request to 500m (millicores, 0.5 of a core) because that's the normal demand of your application. But it's a JVM, so you'd like to use all 8 cores during startup (as you know, it can take soooomeeee time to start a JVM). So you don't set a limit.

Let’s try this out:

apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    run: jvm
  name: jvm
spec:
  containers:
  - command:
    - /bin/sh
    - -c
    - sleep 6000
    image: eclipse-temurin:17.0.4_8-jdk-jammy
    name: jvm
    resources:
      requests:
        cpu: 500m
        memory: 250Mi
  restartPolicy: Always
status: {}

Let's open a shell in the container and run some Java code in jshell:

$ kubectl exec -ti jvm -- bash
root@jvm:/# jshell
Nov 05, 2022 3:15:12 PM java.util.prefs.FileSystemPreferences$1 run
INFO: Created user preferences directory.
| Welcome to JShell -- Version 17.0.4
| For an introduction type: /help intro

jshell> Runtime.getRuntime().availableProcessors()
$1 ==> 1

But… how? We didn't set a limit, so why does our JVM "see" only 1 processor?

Because it's calculated based on cpu.shares. The cpu.shares value is divided by 1024 and rounded up to a whole number. And remember: the shares are set based on resources.requests even if no limit is set!
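
A rough sketch of that calculation (a simplification of what the JVM's container support does when only shares are available, not the actual JDK code):

int coresFromShares(long cpuShares) {
    // 1024 shares correspond to one full core; round up, but never report less than 1
    return Math.max(1, (int) Math.ceil(cpuShares / 1024.0));
}

coresFromShares(102)    // request of 100m  -> 1 visible core
coresFromShares(512)    // request of 500m  -> 1 visible core
coresFromShares(2048)   // request of 2000m -> 2 visible cores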

And what about memory? To check this, let's create a pod with a memory limit set:

apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    run: jvm
  name: jvm
spec:
  containers:
  - command:
    - /bin/sh
    - -c
    - sleep 6000
    image: eclipse-temurin:17.0.4_8-jdk-jammy
    name: jvm
    resources:
      limits:
        cpu: 500m
        memory: 1Gi
  dnsPolicy: ClusterFirst
  restartPolicy: Always
status: {}

Let's check the maximum heap size:

java -XX:+PrintFlagsFinal -version | grep MaxHeapSize
   size_t MaxHeapSize = 268435456

What does that mean? By default, the JVM sets the maximum heap size to 25% of the available memory. That's quite conservative - you probably don't want to run a JVM that uses only 256Mi for the heap in a container with 1Gi of memory.
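
You can double-check the effective value from inside the container, e.g. in jshell (Runtime.maxMemory() reports roughly the MaxHeapSize seen above):

// with the 1Gi limit and the default 25%, this prints a value around 256
Runtime.getRuntime().maxMemory() / (1024 * 1024)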

What to do with all this knowledge? How to live?

CPU

The JVM doesn't run well on a single core. We already know we can increase the number of available processors by increasing resources.requests for the container. But this can be costly: if you request four CPUs on an 8-core machine, you will fit only one such pod on the node (the kubelet reserves some CPU for itself), even if the pod uses only a small fraction of that CPU at runtime.

Actually, there is a better way - you can set the number of available cores explicitly with the -XX:ActiveProcessorCount flag.

Let's check it, this time using another Java flag, which prints a lot of information about the container the JVM runs in:

root@jvm:/# java -XX:ActiveProcessorCount=4 -XshowSettings:system -version
Operating System Metrics:
    Provider: cgroupv1
    Effective CPU Count: 4
    CPU Period: 100000us
    CPU Quota: 50000us
    CPU Shares: 512us

Memory

As already mentioned, 25% of the memory available to the container doesn't sound like the right value for the maximum heap size.

There are two options:

  1. Set the maximum heap size explicitly with the -Xmx flag.
  2. Change the 25% default to a more reasonable value.

In my opinion, the latter is the better option - the heap size then scales with the memory available to the container. This can be achieved with the -XX:MaxRAMPercentage flag.
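
For example, repeating the earlier check in the same 1Gi container, but with the flag set (the reported MaxHeapSize should now land around 60% of the container's memory limit):

java -XX:MaxRAMPercentage=60.0 -XX:+PrintFlagsFinal -version | grep MaxHeapSize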

There is also another flag, -XX:MinRAMPercentage, but beware - it's not what you might think. It does not set the initial heap size; it sets the maximum heap size on systems with a small amount of memory. Pretty tricky :). The initial heap size can be set with -XX:InitialRAMPercentage.
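
To see what all three flags are currently set to, the same -XX:+PrintFlagsFinal trick works:

java -XX:+PrintFlagsFinal -version | grep RAMPercentage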

Wrapping up

Some best practices for running the JVM on a Kubernetes cluster:

  • Always set resources.requests for containers running a JVM to help the scheduler place the pod on a suitable node.
  • Always set resources.limits to ensure your application won't misbehave and "eat" all the memory and/or CPU power of the node.
  • If you want to be sure the pod won't be evicted, set resources.requests and resources.limits to the same values so the pod is given the Guaranteed QoS class.
  • It's a good idea to explicitly set the number of available cores with the -XX:ActiveProcessorCount flag.
  • Set the -XX:MaxRAMPercentage flag to a reasonable value for the maximum heap size. It's hard to predict the memory consumption of the whole container - there's the heap, Metaspace, off-heap memory, etc. - so 60 (which means 60%) seems to be a fair value (see the example after this list).
  • Never set -XX:-UseContainerSupport - it disables container support (container awareness) completely.
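
Putting the JVM-side advice together, a container entry point could look roughly like this (app.jar and the concrete values are placeholders - tune them to your workload):

java -XX:ActiveProcessorCount=4 \
     -XX:MaxRAMPercentage=60.0 \
     -jar app.jar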
