Contents

Getting internal JVM metrics to GCP Monitoring with OpenTelemetry (and without writing the code)

Grzegorz Kocur

23 Oct 2023 · 8 minutes read


This article is a tutorial on how to get internal Java application metrics into the GCP Monitoring service.

I’ll use OpenTelemetry (deployed as the OpenTelemetry operator on GKE), and I promise not to write a single line of Java code ;).

What is OpenTelemetry?

Let me rephrase this question - what is OpenTelemetry not?

It’s not just another monitoring tool or library. In fact, it’s a standard way to gather, process, and export telemetry data like logs, metrics, and traces.

Assume we have the application instrumented with some tool, for example, Kamon. But what if we change our mind later and want to use Prometheus? We would need to rewrite a lot of the application code.

OpenTelemetry is a game changer; its common SDK and API allow developers to send signals almost anywhere. The only requirement is that the backend understands the OTLP protocol, or that a compatible OpenTelemetry exporter exists.

OpenTelemetry alternatives

If existing applications are already instrumented and produce Prometheus-compatible metrics, it’s possible to use them in GCP Monitoring with Managed Service for Prometheus. It’s a Prometheus-compatible backend and can be added to the GKE cluster by enabling it in the GKE configuration. When you enable it, it will create metric collectors, the Google Managed Prometheus controller (gmp-controller), and some CRDs (like PodMonitoring) that define which pods or services to monitor.
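
For illustration, a minimal PodMonitoring resource might look roughly like this (the application name, label, and port name here are just assumptions):

apiVersion: monitoring.googleapis.com/v1
kind: PodMonitoring
metadata:
  name: my-app
spec:
  selector:
    matchLabels:
      app: my-app          # hypothetical pod label
  endpoints:
    - port: metrics        # name of the port exposing Prometheus metrics
      interval: 30s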

Why use OpenTelemetry? Is GCP monitoring not enough?

GCP Monitoring works great for metrics produced by the GKE system itself, as well as some container metrics. Some, but not all. It’s not possible to get metrics from inside the application - for example, JVM internals, library-specific metrics, etc.

To get such metrics, we need something inside the JVM that publishes them to a collector.

Existing setup

Assume we already have an existing GKE cluster with several JVM-based microservices running on it. The GKE nodes already have write access to Monitoring, so we didn’t use Workload Identity Federation, which is the recommended way to grant GKE workloads access to GCP (a decision I made for simplicity).
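
Just for context, granting that access with Workload Identity would look roughly like the sketch below. It is not used in this tutorial, and all the names (PROJECT_ID, otel-writer, NAMESPACE, KSA_NAME) are placeholders:

# Hypothetical sketch only - not used in this tutorial.
# Grant a GCP service account permission to write metrics:
$ gcloud projects add-iam-policy-binding PROJECT_ID \
    --member "serviceAccount:otel-writer@PROJECT_ID.iam.gserviceaccount.com" \
    --role "roles/monitoring.metricWriter"
# Allow a Kubernetes service account to impersonate it via Workload Identity:
$ gcloud iam service-accounts add-iam-policy-binding \
    otel-writer@PROJECT_ID.iam.gserviceaccount.com \
    --role "roles/iam.workloadIdentityUser" \
    --member "serviceAccount:PROJECT_ID.svc.id.goog[NAMESPACE/KSA_NAME]"
# Annotate the Kubernetes service account:
$ kubectl -n NAMESPACE annotate serviceaccount KSA_NAME \
    iam.gke.io/gcp-service-account=otel-writer@PROJECT_ID.iam.gserviceaccount.com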

OpenTelemetry installation

Since we are in the context of the Kubernetes cluster, let's install OpenTelemetry as a Kubernetes operator. The OpenTelemetry operator consists of:

  1. OpenTelemetry controller
  2. CRD for OpenTelemetry collector
  3. CRD for OpenTelemetry instrumentation.

There are several ways to install the operator: you can use the operator manifest, the Helm chart, or the Operator Hub. I’ll use the Helm chart. To install it, first add the Helm repository:

$ helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
$ helm repo update

and install the chart:

$ helm install --namespace otel-system --create-namespace \
  opentelemetry-operator open-telemetry/opentelemetry-operator

In this example I used the otel-system namespace, but it can be anything of your choice.

Important note: the OpenTelemetry operator uses an admission webhook, which by default uses TLS. With the default configuration, the Helm chart expects cert-manager to already be installed on the cluster. There are alternative ways to install the OpenTelemetry operator chart without cert-manager, described in the chart’s readme file.
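
For example, the chart can be asked to generate a self-signed certificate instead of relying on cert-manager; the exact value names may differ between chart versions, so treat this as a sketch and check the readme:

$ helm install --namespace otel-system --create-namespace \
  --set admissionWebhooks.certManager.enabled=false \
  --set admissionWebhooks.autoGenerateCert.enabled=true \
  opentelemetry-operator open-telemetry/opentelemetry-operator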

Let’s confirm the chart was installed properly:

$ kubectl -n otel-system get deployments
NAME                            READY   UP-TO-DATE   AVAILABLE   AGE
opentelemetry-operator          1/1      1           1           4min

$  kubectl get opentelemetrycollectors
No resources found in default namespace.

Creating the collector

Before we start gathering metrics from the applications, we need an OpenTelemetry collector. Its job is to receive, process, and potentially export the metrics to an external system - in our case, the GCP Monitoring.

There are multiple ways to run the collector:

Deployment

In this mode, the collector runs as a simple application. It’s easy to scale it out and in, roll back to the previous version, and do other things you usually do with the application. Probably the easiest way to deploy the collector.

Daemonset

If you need the collector to run as an agent on the Kubernetes nodes, it’s possible to run it as a daemonset. It can be useful to gather host-level metrics from nodes with the Host Metrics Receiver - in such a case, the collector needs access to the nodes directly. It cannot be scaled individually, though.
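
As a sketch, a daemonset-mode collector using the Host Metrics Receiver could start from something like this (the scraper selection is just an example):

spec:
  mode: "daemonset"
  config: |
    receivers:
      hostmetrics:
        collection_interval: 30s
        scrapers:
          cpu:
          memory:
          filesystem: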

Statefulset

It’s useful if you need pods with stable, predictable host names.

Sidecar

In this mode the collector is injected into a pod as an additional container. The use case is to offload the telemetry data as soon as possible to the collector. Each pod has its own collector.
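
In this mode, injection is requested per pod with an annotation, roughly like this:

metadata:
  annotations:
    sidecar.opentelemetry.io/inject: "true"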

In our case, we used the deployment mode, as it’s the simplest option.

Our collector manifest:

apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: gcp
  namespace: otel-system
spec:
  mode: "deployment"
  config: |
    receivers:
      otlp:
        protocols:
          grpc:
          http:

    exporters:
      googlecloud:

    processors:

      batch:
        # batch metrics before sending to reduce API usage
        send_batch_max_size: 200
        send_batch_size: 200
        timeout: 5s

      memory_limiter:
        # drop metrics if memory usage gets too high
        check_interval: 1s
        limit_percentage: 65
        spike_limit_percentage: 20

      resourcedetection:
        # detect cluster name and location
        detectors: [gcp]
        timeout: 10s

    service:
      pipelines:
        metrics:
          receivers: [otlp]
          processors: [batch, memory_limiter, resourcedetection]
          exporters: [googlecloud]

Let’s break it down into parts.

So we create a collector named gcp in the otel-system namespace.

Its spec has two fields: the mode, set to deployment as mentioned earlier, and the config, which is a plain OpenTelemetry collector configuration file.

In the configuration, we define the receivers, processors, and exporters. At the end, we tie these three parts together into a pipeline.

We receive only data sent with the OTLP protocol, over gRPC or HTTP. Each of them has its own endpoint and port (4317 and 4318 by default).
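
Leaving the protocol sections empty means the defaults are used; writing them out explicitly would look roughly like this:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318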

After receiving the data we process it. First, we use a batch processor to create batches from individual signals. We set the batch size and max size to 200 (number of metrics) and the timeout to 5s - after this time the batch is sent to the next pipeline component regardless of the current size.

The next pipeline component is the memory_limiter. Its job is to prevent out-of-memory situations when the collector has to process a lot of data. We use relative values, so it dynamically sets the limits based on the available memory.

The last processor is resourcedetection - it detects resource information from the host system and injects it into the signals in the OpenTelemetry format. Thanks to this, we know the source pod name, its namespace, etc. We use the gcp detector, which gets information from the GCP metadata server and GCP-specific environment variables.

After processing the data, we export it with the googlecloud exporter, which basically means writing it as metrics in the GCP Monitoring system.

Let’s apply this manifest:

$ kubectl apply -f collector.yaml

After applying the manifest, we should get some output when listing the collectors:

$ kubectl -n otel-system get opentelemetrycollectors
NAME   MODE         VERSION   READY   AGE     IMAGE                                         MANAGEMENT
gcp    deployment   0.85.0    1/1     3d23h   otel/opentelemetry-collector-contrib:0.85.0   managed

What happens when we create a collector?

The OpenTelemetry controller will create a workload (in our case, a deployment) and some services used for receiving data and for monitoring the collector itself.

$ kubectl -n otel-system get services
NAME                             TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)             AGE
gcp-collector                    ClusterIP   10.200.179.87    <none>        4317/TCP,4318/TCP   3d23h
gcp-collector-headless           ClusterIP   None             <none>        4317/TCP,4318/TCP   3d23h
gcp-collector-monitoring         ClusterIP   10.200.47.54     <none>        8888/TCP            3d23h

Okay, we have a collector up and running - let’s send some data there.

Creating the instrumentation

The next step is to create an OpenTelemetry object called Instrumentation. It configures how auto instrumentation is enabled in the application.

For JVM applications, we use Java Autoinstrumentation, which works as a Java agent.

Our instrumentation:

apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: java-instrumentation
  namespace: default
spec:
  exporter:
    endpoint: http://gcp-collector.otel-system:4317
  java:
    env:
      - name: OTEL_METRICS_EXPORTER
        value: otlp
      - name: OTEL_TRACES_EXPORTER
        value: none
      - name: OTEL_LOGS_EXPORTER
        value: none

In our case we gather only metrics, so we set some environment variables to disable traces and logs.

We also set the address of the collector created in the previous step.

Now we have the collector and the auto instrumentation configuration, but still no data. Let’s enable instrumentation on some services.

Enabling the auto instrumentation

To enable the auto instrumentation, it’s enough to add an annotation to the pod.

Please note, it’s easy to make a mistake here and annotate the deployment itself. We want to annotate the pod, so we change the pod template:

spec:
  replicas: 2
  template:
    metadata:
      annotations:
        instrumentation.opentelemetry.io/inject-java: "true"

Possible values of the instrumentation.opentelemetry.io/inject-java annotation are:

  1. "true" - inject the instrumentation from the current namespace (this works when there is only one instrumentation in that namespace)
  2. "instrumentation-name" - the name of the instrumentation to use if there are multiple instrumentations in the namespace
  3. "namespace-name/instrumentation-name" - the name of an instrumentation from another namespace
  4. "false" - exclude this workload from instrumentation.

It is also possible to annotate the namespace so the instrumentation is injected into all pods in this namespace.
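
For example, annotating a namespace could look like this (the default namespace is just an example):

$ kubectl annotate namespace default instrumentation.opentelemetry.io/inject-java="true"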

How does instrumentation injection work? The operator adds an init container that copies the Java agent into a shared volume, mounts that volume into the application container, and injects some environment variables. It also injects or modifies the JAVA_TOOL_OPTIONS env variable, appending -javaagent:/otel-auto-instrumentation/javaagent.jar to it, which effectively attaches the Java agent to the JVM.
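
The relevant part of the injected pod spec ends up looking roughly like the sketch below (the container name, image, and paths are illustrative):

initContainers:
  - name: opentelemetry-auto-instrumentation
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-java:<version>
    command: ["cp", "/javaagent.jar", "/otel-auto-instrumentation/javaagent.jar"]
    volumeMounts:
      - name: opentelemetry-auto-instrumentation
        mountPath: /otel-auto-instrumentation
containers:
  - name: app                     # the application container
    env:
      - name: JAVA_TOOL_OPTIONS
        value: " -javaagent:/otel-auto-instrumentation/javaagent.jar"
    volumeMounts:
      - name: opentelemetry-auto-instrumentation
        mountPath: /otel-auto-instrumentation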

Important note - this process works only for newly created pods. So if we annotate the deployment first and create the instrumentation later, nothing gets injected - we’ll need to restart the pods manually.
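
The usual way to do that is to restart the workload, for example:

$ kubectl rollout restart deployment <deployment-name>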

Checking the metrics in GCP

Let’s try to find our metrics in GCP.

Thanks to the gcp resourcedetector, the monitoring system “knows” our JVM metrics are produced by Kubernetes pods, so in the Metrics explorer, we’ll find them under the Kubernetes Container resource type.

[Screenshot: JVM metrics shown in the GCP Metrics explorer]

Metrics are labeled with useful information like the service name, pod name, namespace, and all Kubernetes labels taken from the pod, so it’s quite easy to filter the data in the Metrics explorer.

Troubleshooting

The first place to check if something is not working correctly is the Metrics management page in the GCP Monitoring service. You can check the error rate and inspect the logs (this requires audit logs to be enabled).

Checking the logs produced by the collector and the operator controller can also be useful.
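
For example, assuming the names used in this tutorial:

$ kubectl -n otel-system logs deployment/gcp-collector
$ kubectl -n otel-system logs deployment/opentelemetry-operator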

Wrap Up

Thanks to OpenTelemetry auto instrumentation, it’s possible to gather a lot of useful metrics without writing a single line of code in the application.

Auto instrumentation is just the beginning; it’s probably a good idea to instrument the application with some custom and/or business metrics.

This article is just a GCP-specific tutorial; I recommend getting familiar with the OpenTelemetry documentation.
