Observability part 1 - building blocks overview

Adam Pietrzykowski

17 May 2024 · 5 minutes read


Hello! Chances are that you clicked on this article to learn something about our new Meerkat project. This is the first article in a series about observability and our attempt at giving back to the open source community, so I hope you'll stay tuned for the upcoming parts as well 😁

In case you need clarification on what I'm talking about, a few words of introduction first. Meerkat is an effort by our DevOps team to create an easy-to-adopt observability framework based on the OpenTelemetry Operator. Knowing how much attention this topic gets these days, and how important it is, we decided to develop something that can be used by solo developers of JVM applications and small teams alike. Note that this is still a work in progress, so changes WILL happen. In this article, I'd like to talk about the components we chose to make up the stack as it stands right now.

What is observability?

Let me quote the definition of observability given by OpenTelemetry:

Observability lets us understand a system from the outside by letting us ask questions about that system without knowing its inner workings.

The idea behind it is that we have some object that we want to know more about. This can quickly become quite philosophical, as the act of observation and terms like perception are related to epistemology. This is, however, not a blog about philosophy 😁

Let's just assume that our system is an application that is hosted somewhere, and we would like to know more about it by measuring it, collecting the information it emits, tracking the flow of information between its internal components, etc. You may have encountered the concept of the three main pillars of observability. They are:

  1. Logs
  2. Metrics
  3. Traces

Combined, they allow us to understand an application's state, troubleshoot it more easily in case of bugs or errors, fine-tune infrastructure requirements, and more.

Why should I be interested?

If you're a software developer or a DevOps engineer, you should get familiar with observability, because it can make your job easier, cut costs long-term, enable more sophisticated alerts, predict system utilization trends and so on. This article series will guide you through our journey of Meerkat development and teach you about observability tools and concepts.

I'll now tell you a bit about the blocks we have chosen to build our project with.

OpenTelemetry Operator

The heart of our project, the OpenTelemetry Operator, allows us to easily create Collector and Instrumentation objects on an observed Kubernetes cluster. OpenTelemetry is a de facto standard for creating and managing telemetry data. It's vendor-agnostic, which means you can use any kind of backend to store your metrics, logs, and trace data. It also means that OpenTelemetry itself only sits between your application and the backend where the data ends up. In our case, the OpenTelemetry Operator takes care of everything in between. By using JVM auto-instrumentation, we automatically receive all the data that a Java agent can extract from a running application. On top of that, a developer can write code to create domain-specific business metrics that will also be collected by the operator.

Logs - Grafana Loki

Logs are arguably the most common telemetry data. Applications produce text that describes what is happening inside them. They can tell us about occurring errors and bugs, but also report that the system is working as expected, which is valuable information especially when we can link it to a particular time when it was produced. Loki allows us to store logs in any object storage solution, which makes it easy and cheap to manage. In the case of our try-me environment, we're using MinIO to create a data bucket locally, so there's no need to utilize any public cloud provider. Loki is easily scalable, which makes it a good solution for both small and big environments.

Traces - Grafana Tempo

You've probably heard the quote "It's not the destination, it's the journey". If logs tell us about events, traces describe journeys of requests between system components. This is particularly useful when we're trying to detect or fix errors in a distributed system or one that is made of many services (e.g., microservice architecture). Tempo serves as a backend for our traces. Like Loki, it also uses an object storage bucket to store data and is easily scalable.
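To make the idea of a request's "journey" a bit more tangible, here is a minimal sketch of how trace context can travel between services. It models the `traceparent` HTTP header from the W3C Trace Context specification (version `00`: a 32-hex-digit trace ID, a 16-hex-digit span ID and a flags byte). The `TraceParent` type and its method names are hypothetical and written only for illustration; in a real setup the Java agent handles all of this for you.

```java
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical helper modelling the W3C Trace Context "traceparent" header (version 00 only). */
record TraceParent(String traceId, String spanId, boolean sampled) {

    /** Starts a new trace, as the first service handling a request would. */
    static TraceParent newRoot() {
        ThreadLocalRandom rnd = ThreadLocalRandom.current();
        // Two random longs give the 32 hex digits of the trace ID.
        String traceId = String.format("%016x%016x", rnd.nextLong(), rnd.nextLong());
        return new TraceParent(traceId, newSpanId(), true);
    }

    /** Context for a downstream call: same trace ID, fresh span ID. */
    TraceParent child() {
        return new TraceParent(traceId, newSpanId(), sampled);
    }

    /** Parses a header of the form 00-<32 hex>-<16 hex>-<2 hex>. */
    static TraceParent parse(String header) {
        String[] parts = header.split("-");
        if (parts.length != 4
                || !parts[0].equals("00")
                || !parts[1].matches("[0-9a-f]{32}")
                || !parts[2].matches("[0-9a-f]{16}")
                || !parts[3].matches("[0-9a-f]{2}")) {
            throw new IllegalArgumentException("Malformed traceparent: " + header);
        }
        // The lowest bit of the flags byte marks the trace as sampled.
        boolean sampled = (Integer.parseInt(parts[3], 16) & 1) == 1;
        return new TraceParent(parts[1], parts[2], sampled);
    }

    /** Serializes the context back into the header value. */
    String toHeader() {
        return "00-" + traceId + "-" + spanId + "-" + (sampled ? "01" : "00");
    }

    private static String newSpanId() {
        return String.format("%016x", ThreadLocalRandom.current().nextLong());
    }

    public static void main(String[] args) {
        TraceParent incoming = TraceParent.newRoot();
        TraceParent outgoing = incoming.child();
        System.out.println("received: " + incoming.toHeader());
        System.out.println("sent on:  " + outgoing.toHeader());
    }
}
```

Every hop keeps the trace ID and replaces the span ID, which is what lets a tracing backend stitch the individual spans back into one journey.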

Metrics - Grafana Mimir

Metrics show the current state of the system by providing us with concrete data in various categories, like memory usage, CPU usage, free disk space, number of requests per second, etc. Using metrics, we can determine system performance and usage. One of the most popular metrics backends is Prometheus; however, we decided to give Grafana Mimir a shot. Like other Grafana products, it uses object storage for data and has Grafana visualization support out of the box. It's worth knowing that it's also compatible with Prometheus; in our case, however, we're using the OpenTelemetry (OTel) metrics format.
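A figure like "requests per second" is usually not measured directly. A common approach is to keep a counter that only ever goes up and derive the rate from two readings of it. The sketch below shows that arithmetic in plain Java; `RequestRate` and `Sample` are hypothetical names, and the handling of a counter reset is a simplification rather than a description of what any particular backend does.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical helper: derives a per-second rate from two counter readings. */
final class RequestRate {

    /** One reading of a monotonically increasing counter. */
    record Sample(Instant takenAt, long value) {}

    /** Average increase per second between two readings of the same counter. */
    static double perSecond(Sample earlier, Sample later) {
        long millis = Duration.between(earlier.takenAt(), later.takenAt()).toMillis();
        if (millis <= 0) {
            throw new IllegalArgumentException("samples must be in chronological order");
        }
        long delta = later.value() - earlier.value();
        if (delta < 0) {
            // Simplification: a lower value is treated as a restart from zero,
            // so only the requests counted since the restart are included.
            delta = later.value();
        }
        return delta * 1000.0 / millis;
    }

    public static void main(String[] args) {
        Instant start = Instant.parse("2024-05-17T10:00:00Z");
        Sample first = new Sample(start, 1_000);
        Sample second = new Sample(start.plusSeconds(60), 1_600);
        // 600 additional requests over 60 seconds
        System.out.println(perSecond(first, second) + " requests/s");
    }
}
```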

Visualization - Grafana

When you hear Grafana, you probably think about graphs. That's completely fair. After all, the Grafana visualization platform is the most popular project of the company of the same name. Backends are of little use if we can't read the data they hold. Dashboards allow us to create graphs, tables, histograms, gauges, timelines, etc., to display the gathered data in a human-readable way. Grafana also allows us to move from one type of data to related data of another type. For example, we can click on a data point in a metrics graph to see the logs that were produced at the same time, or look into the traces relevant to the log entry we're reading. This makes Grafana a powerful tool in the hands of a data-driven developer.
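Jumping from a log entry to its trace only works if the log entry carries something that identifies the trace. One way to picture it: a regular expression pulls the trace ID out of the log line, and that value becomes the link target. The sketch below does exactly that in plain Java. The `key=value` log layout, the `trace_id` field name and the `LogTraceLink` class are all assumptions made for the example, not a description of how Meerkat is configured.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: finds the trace ID that a log line refers to. */
final class LogTraceLink {

    // Assumed layout: key=value pairs with a 32-hex-digit ID under "trace_id".
    private static final Pattern TRACE_ID =
            Pattern.compile("\\btrace_id=([0-9a-f]{32})\\b");

    /** Returns the trace ID found in the line, or empty if the line has none. */
    static Optional<String> traceIdOf(String logLine) {
        Matcher m = TRACE_ID.matcher(logLine);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String line = "ts=2024-05-17T10:15:30Z level=ERROR msg=\"payment failed\" "
                + "trace_id=4bf92f3577b34da6a3ce929d0e0e4736";
        System.out.println(traceIdOf(line).orElse("no trace attached"));
    }
}
```

Lines without a trace ID simply yield an empty result, so log entries written outside of any request can still be displayed without a link.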

Why Grafana?

First, all of us were already familiar with Grafana and Loki, and we saw an opportunity to try other Grafana products to create a unified, well-integrated and easy-to-deploy observability system. Second, Grafana products are open-source and community-driven, which makes them great candidates for a project like ours. We don't have to worry about them disappearing from the Internet any time soon, and community support makes development much faster.

Reviewed by Paweł Maszota
