Getting internal JVM metrics to GCP Monitoring with OpenTelemetry (and without writing the code)
This article is a tutorial on how to get internal Java application metrics into the GCP Monitoring service.
I’ll use OpenTelemetry (as the OpenTelemetry operator on GKE), and I promise not to write a single line of Java code ;).
What is OpenTelemetry?
Let me rephrase this question - what is OpenTelemetry not?
It’s not just another monitoring tool or library. In fact, it’s a standard way to gather, process, and export telemetry data like logs, metrics, and traces.
Assume we have an application instrumented with some tool, for example, Kamon. But what if we change our mind later and want to use Prometheus instead? We would need to rewrite a lot of application code.
OpenTelemetry is a game changer; its common SDK and API allow developers to send signals almost anywhere. The only requirement is that the backend understands the OTLP protocol, or that a compatible OpenTelemetry exporter exists.
If existing applications are already instrumented and produce Prometheus-compatible metrics, it’s possible to use them in GCP Monitoring with Managed Service for Prometheus. It’s a Prometheus-compatible backend and can be added automatically to the GKE cluster by enabling it in the GKE configuration. When you enable it, it will create metric collectors, the Google Managed Prometheus controller (gmp-controller), and some CRDs (like PodMonitoring) to define which pods or services to monitor.
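As an illustration, a minimal PodMonitoring resource could look like the sketch below (the name, label selector, and port are hypothetical placeholders; check the Managed Service for Prometheus documentation for the full schema):

```yaml
# hypothetical example: scrape the "metrics" port of pods labeled app: my-app
apiVersion: monitoring.googleapis.com/v1
kind: PodMonitoring
metadata:
  name: my-app-monitoring
spec:
  selector:
    matchLabels:
      app: my-app
  endpoints:
    - port: metrics
      interval: 30s
```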
Why use OpenTelemetry? Is GCP monitoring not enough?
GCP Monitoring works great for metrics produced by the GKE system itself, as well as some container metrics. Some, but not all. It’s not possible to get metrics from inside the application - for example, JVM internals, library-specific metrics, etc.
To get such metrics, we need something running inside the JVM that publishes them to a collector.
Assume we already have an existing GKE cluster with several JVM-based microservices running on it. The GKE nodes already have write access to Monitoring, so we didn’t use workload identity federation, which is the recommended way to grant GKE workloads access to GCP (I made this decision for simplicity).
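For comparison, the workload identity federation approach boils down to annotating the Kubernetes service account with a GCP service account that holds the monitoring-writer role. A sketch (the account and project names are hypothetical; the annotation key is the one documented for GKE Workload Identity):

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: otel-collector
  namespace: otel-system
  annotations:
    # hypothetical GCP service account with roles/monitoring.metricWriter
    iam.gke.io/gcp-service-account: otel-writer@my-project.iam.gserviceaccount.com
```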
Since we are in the context of the Kubernetes cluster, let's install OpenTelemetry as a Kubernetes operator. The OpenTelemetry operator consists of:
- OpenTelemetry controller
- CRD for OpenTelemetry collector
- CRD for OpenTelemetry instrumentation.
There are several ways to install the operator: the operator manifest, the Helm chart, or the Operator Hub. I’ll use the Helm chart. To install it, first add the Helm repository:
$ helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
$ helm repo update
and install the chart:
$ helm install --namespace otel-system --create-namespace \
    opentelemetry-operator open-telemetry/opentelemetry-operator
In this example I used the otel-system namespace, but it can be any name of your choice.
Important note: the OpenTelemetry operator uses an admission webhook, which by default uses TLS. With the default configuration, Helm expects cert-manager to be already installed on the cluster. Alternative ways to install the OpenTelemetry operator chart without cert-manager are described in the chart’s readme file.
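For example, if I read the chart readme correctly, recent chart versions can generate a self-signed webhook certificate themselves; a sketch of the values that disable the cert-manager dependency might look like this (verify the exact value names against the readme of your chart version):

```yaml
# values.yaml - sketch: skip cert-manager and let the chart generate the webhook cert
admissionWebhooks:
  certManager:
    enabled: false
  autoGenerateCert:
    enabled: true
```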
Let’s confirm the chart was installed properly:
$ kubectl -n otel-system get deployments
NAME                     READY   UP-TO-DATE   AVAILABLE   AGE
opentelemetry-operator   1/1     1            1           4m
$ kubectl get opentelemetrycollectors
No resources found in default namespace.
Creating the collector
Before we start gathering metrics from the applications, we need an OpenTelemetry collector. Its job is to receive, process, and potentially export the metrics to an external system - in our case, the GCP Monitoring.
There are multiple ways to run the collector:
- Deployment - the collector runs as a regular application. It’s easy to scale it out and in, roll back to a previous version, and do other things you usually do with an application. Probably the easiest way to deploy the collector.
- Daemonset - if you need the collector to run as an agent on the Kubernetes nodes, it’s possible to run it as a daemonset. This can be useful for gathering host-level metrics from nodes with the Host Metrics Receiver - in such a case, the collector needs direct access to the nodes. It cannot be scaled individually, though.
- Statefulset - useful if you need pods with stable, predictable host names.
- Sidecar - the collector is injected into a pod as an additional container. The use case is to offload the telemetry data to the collector as soon as possible. Each pod has its own collector.
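For example, a daemonset-mode collector gathering node-level metrics with the Host Metrics Receiver might be sketched like this (a rough sketch, not a production config - the hostmetrics receiver ships with the contrib distribution and typically also needs host filesystem mounts to see the node rather than the container):

```yaml
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: host-metrics
spec:
  mode: daemonset
  config: |
    receivers:
      hostmetrics:
        collection_interval: 30s
        scrapers:
          cpu:
          memory:
          filesystem:
    exporters:
      googlecloud:
    service:
      pipelines:
        metrics:
          receivers: [hostmetrics]
          exporters: [googlecloud]
```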
In our case, we use the deployment mode as the simplest option.
Our collector manifest:
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: gcp
  namespace: otel-system
spec:
  mode: "deployment"
  config: |
    receivers:
      otlp:
        protocols:
          grpc:
          http:
    exporters:
      googlecloud:
    processors:
      batch:
        # batch metrics before sending to reduce API usage
        send_batch_max_size: 200
        send_batch_size: 200
        timeout: 5s
      memory_limiter:
        # drop metrics if memory usage gets too high
        check_interval: 1s
        limit_percentage: 65
        spike_limit_percentage: 20
      resourcedetection:
        # detect cluster name and location
        detectors: [gcp]
        timeout: 10s
    service:
      pipelines:
        metrics:
          receivers: [otlp]
          processors: [batch, memory_limiter, resourcedetection]
          exporters: [googlecloud]
Let’s break it down into parts.
We create a collector named gcp in the otel-system namespace. The spec has two fields: the mode, set to deployment as mentioned earlier, and the config, which is a plain OpenTelemetry collector configuration file.
In the configuration we define the receivers, processors, and exporters. At the end, we tie these three parts together into a pipeline.
We receive only data sent over the OTLP protocol, using gRPC or HTTP. Each of them has its own endpoint and port.
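When left empty as in our manifest, the receiver listens on the default ports; spelling them out explicitly would look like this (4317 for gRPC and 4318 for HTTP, matching the ports the collector service exposes):

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317  # default OTLP/gRPC port
      http:
        endpoint: 0.0.0.0:4318  # default OTLP/HTTP port
```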
After receiving the data, we process it. First, a batch processor groups individual signals into batches. We set the batch size and max size to 200 (number of metrics) and the timeout to 5s - after this time, the batch is sent to the next pipeline component regardless of its current size.
The next pipeline component is the memory_limiter. Its job is to prevent out-of-memory situations when the collector has to process a lot of data. We use relative values, so it dynamically sets the limits based on the available memory.
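If you prefer fixed limits over percentages, the processor also accepts absolute values in MiB - a sketch (the numbers are arbitrary placeholders):

```yaml
processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 400        # hard limit; absolute alternative to limit_percentage
    spike_limit_mib: 100  # headroom reserved for short spikes
```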
The last processor is resourcedetection - it detects resource information from the host system and injects it into signals in the OpenTelemetry format. Thanks to this, we know the source pod name, its namespace, etc. We use the gcp detector to get information from the Google metadata server and GCP-specific environment variables.
After processing, we export the data to the googlecloud exporter, which basically means we write it as metrics in the GCP Monitoring system.
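By default, application metrics land in GCP with the workload.googleapis.com prefix; if I read the exporter docs correctly, this can be tuned in the exporter config (treat the exact key as an assumption to verify against the googlecloud exporter documentation):

```yaml
exporters:
  googlecloud:
    metric:
      # assumption: "prefix" controls the metric type prefix;
      # the default is workload.googleapis.com
      prefix: workload.googleapis.com
```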
Let’s apply this manifest:
$ kubectl apply -f collector.yaml
After applying the manifest, we should get some output when listing the collectors:
$ kubectl -n otel-system get opentelemetrycollectors
NAME   MODE         VERSION   READY   AGE     IMAGE                                         MANAGEMENT
gcp    deployment   0.85.0    1/1     3d23h   otel/opentelemetry-collector-contrib:0.85.0   managed
What happens when we create a collector?
The OpenTelemetry controller will create a workload (in our case, the deployment) and some services for collecting data and monitoring the collector.
$ kubectl -n otel-system get services
NAME                       TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)             AGE
gcp-collector              ClusterIP   10.200.179.87   <none>        4317/TCP,4318/TCP   3d23h
gcp-collector-headless     ClusterIP   None            <none>        4317/TCP,4318/TCP   3d23h
gcp-collector-monitoring   ClusterIP   10.200.47.54    <none>        8888/TCP            3d23h
Okay, we have a collector up and running - let’s send some data there.
Creating the instrumentation
The next step is to create an OpenTelemetry object called Instrumentation. It configures how autoinstrumentation is enabled in the application.
For JVM applications, we use Java Autoinstrumentation, which works as a Java agent.
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: java-instrumentation
  namespace: default
spec:
  exporter:
    endpoint: http://gcp-collector.otel-system:4317
  java:
    env:
      - name: OTEL_METRICS_EXPORTER
        value: otlp
      - name: OTEL_TRACES_EXPORTER
        value: none
      - name: OTEL_LOGS_EXPORTER
        value: none
In our case we gather only metrics, so we set some environment variables to disable traces and logs.
We also set the address of the collector created in the previous step.
Now, we have the collector, the configuration of auto instrumentation, but still no data. Let’s enable instrumentation on some services.
Enabling the auto instrumentation
To enable the auto instrumentation it’s enough to add the annotation to the pod.
Please note, it’s easy to make a mistake here and annotate the deployment itself. We want to annotate the pods, so we change the pod template:
spec:
  replicas: 2
  template:
    metadata:
      annotations:
        instrumentation.opentelemetry.io/inject-java: "true"
Possible values of the annotation:
- "true" - inject the instrumentation with the default name from the current namespace (for example, if we have only one instrumentation)
- "instrumentation-name" - the name of the instrumentation, if we have multiple instrumentations in the namespace
- "namespace-name/instrumentation-name" - the name of an instrumentation from another namespace
- "false" - exclude this workload from instrumentation.
It is also possible to annotate the namespace so the instrumentation is injected into all pods in this namespace.
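A namespace-level annotation sketch (the namespace name is just an example):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: default
  annotations:
    # inject the Java agent into every pod created in this namespace
    instrumentation.opentelemetry.io/inject-java: "true"
```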
How does instrumentation injection work? The operator creates an init container, which downloads the Java agent, injects some environment variables, and mounts a volume with the agent. It also injects or modifies the JAVA_TOOL_OPTIONS env variable, adding the -javaagent:/otel-auto-instrumentation/javaagent.jar value to it, which effectively attaches the Java agent to the JVM.
Important note - this process works only for newly created pods. So if we change the deployment first and create the instrumentation later, it won’t be injected - we’ll need to restart the pods manually.
Checking the metrics in GCP
Let’s try to find our metrics in GCP.
Thanks to the gcp resource detector, the monitoring system “knows” our JVM metrics are produced by Kubernetes pods, so in the Metrics explorer we’ll find them under the Kubernetes Container resource type.
Metrics are labeled with useful information like the service name, pod name, namespace, and all Kubernetes labels taken from the pod, so it’s quite easy to filter the data in the Metrics explorer.
The first place to check if something is not working correctly is the Metrics management page in the GCP Monitoring service. You can check the error rate and inspect the logs (audit logs need to be enabled).
Also, checking the logs produced by the collector and the operator controller can be useful.
Thanks to OpenTelemetry auto instrumentation, it’s possible to gather a lot of useful metrics without writing a single line of code in the application.
Auto instrumentation is just the beginning; it’s probably a good idea to instrument the application with some custom and/or business metrics.
This article is just a GCP-specific tutorial; I recommend getting familiar with the OpenTelemetry documentation.