Akka monitoring with Kamon part 2

Andrzej Ludwikowski

07 Jun 2017 · 5 minutes read

At this point, we know the basics of Kamon configuration, covered in the previous post. It’s time to dig deeper and get familiar with the Kamon tracing feature.

Environment

Before all that, we need to prepare an environment. Monitoring a single-instance application is (usually) a piece of cake. However, we didn't choose the Akka stack to run our application on a single node. Usually, we want a cluster (or a multi-cluster) deployment. Monitoring such an infrastructure is a completely different story.

Don't worry, for the setup, all you need to do is the following:

  • prepare about 3GB of free RAM;
  • clone the demo repository and check out the part2 tag;
  • create a Datadog account and export your DATADOG_API_KEY (you can copy it from your Datadog account);
  • run the Vagrant script (from the vagrant directory):
    .../akka-sandbox/vagrant > export DATADOG_API_KEY=123123123123123
    .../akka-sandbox/vagrant > vagrant up

Now you can launch your favourite TV show, relax and wait for the Provisioning done! message. Afterwards, you can enjoy an Akka cluster with two nodes. You will also be able to see two entries in the Datadog infrastructure panel:

Fat JAR with Kamon

It is about time to use Kamon in a more production-like way. Launching the app with sbt run is no longer a valid option. We need to create a fat JAR and provide an AspectJ weaver as the Java agent. Fortunately, we can skip the second part thanks to the kamon-autoweave dependency, which handles weaving automatically for us. Make sure that you use a proper merge strategy for specific files when configuring the sbt-assembly plugin. For a start, you can always copy the configuration from the demo project.
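
A minimal sketch of such a merge strategy (assuming the sbt-assembly plugin is already enabled; the demo project's build is the reference, the entries below are only illustrative):

// build.sbt (sketch): concatenate the default configs shipped inside the Akka/Kamon jars,
// and fall back to the plugin's default strategy for everything else
assemblyMergeStrategy in assembly := {
  case "reference.conf" => MergeStrategy.concat
  case other =>
    val defaultStrategy = (assemblyMergeStrategy in assembly).value
    defaultStrategy(other)
}
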
Now you can simply start the app with:

java -jar service/target/scala-2.11/sandbox-service-assembly-0.0.1-SNAPSHOT.jar
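
If you prefer not to rely on kamon-autoweave, the agent can also be attached explicitly with the standard -javaagent flag (the weaver JAR path and version below are only an example):

java -javaagent:aspectjweaver-1.8.10.jar -jar service/target/scala-2.11/sandbox-service-assembly-0.0.1-SNAPSHOT.jar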

Tracing

Let's modify our request tracing example from the previous post. We would like to distinguish requests from each other. Kamon provides a very convenient way to do this in Akka HTTP by using KamonTraceDirectives:

class UserController(....) extends KamonTraceDirectives {
  ...
  def routes: Route = pathPrefix("user") {
    pathEnd {
      post {
        traceName("user-creation") {
          ...
        }
      }
    } ~
    pathPrefix(Segment) { uuid =>
      pathEnd {
        get {
          traceName("get-user") {
            ...
          }
        }
      }
    }
  }
}

In case you are not using Akka HTTP, or you want to trace a different part of the code, you can achieve this manually. Moreover, Kamon handles propagating the trace context to nested Futures and actors (even remote ones). In my opinion this is a killer feature, especially for remote actors, where you would otherwise need to serialize and deserialize the tracing context together with each message. Of course, nothing comes for free: this approach (bytecode instrumentation) can lead to problems with Kamon's backward/forward compatibility with Akka binaries.
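
For the manual case, a minimal sketch (assuming the Kamon 0.6.x API used throughout this series; the trace name is only an example) could look like this:

import kamon.trace.Tracer

// start a new trace context around an arbitrary block of code;
// with autoFinish = true the context is finished when the block completes
Tracer.withNewContext("manual-user-creation", autoFinish = true) {
  // ... traced business logic ...
}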

It’s time to gather some results. First, as usual - a Gatling simulation:

sbt "perfTest/gatling:testOnly com.softwaremill.sandbox.UserCreationInClusterSimulation"

After 4 minutes and some Datadog dashboard customization (see the last paragraph) you will get a nice chart:

At this point, we can see that both nodes behave very similarly. The 95th percentile of the user-creation trace is about 3 seconds, while the get-user trace is around 0.5 seconds. This is useful, but we still need more details to answer the question: why is user-creation so slow?

Segments

Kamon traces can be more detailed if we split them into segments. Beware: you are about to see pure evil, blocking inside an actor with Thread.sleep()! I hope you will forgive me this moment of weakness; it is only a demo:

override def receiveCommand: Receive = {
  case command: CreateUser =>
    val currentSender = sender()
    validationLogic
    persist(UserCreated(command.name)) { e =>
      userState = Some(UserState(e.name))
      postCreationProcessing
      currentSender ! Status.Success(e.name)
    }
}

// segment covering everything that happens after the event is persisted
private def postCreationProcessing = {
  val segment = Tracer.currentContext.startSegment("post-creation-processing", "business-logic", "xyz")
  Thread.sleep(random.nextInt(500))
  segment.finish()
}

// segment simulating a call to an external validation service
private def validationLogic = {
  val segment = Tracer.currentContext.startSegment("external-validation-service", "validation-logic", "xyz")
  Thread.sleep(random.nextInt(userActorConfig.userCreationLag))
  segment.finish()
}

We just added two segments to the user-creation trace: external-validation-service and post-creation-processing. After running the same Gatling simulation, we can see more information about our suspicious trace:

Post-creation processing takes approximately the same time on both nodes: up to 0.5 seconds. The interesting part is the external-validation-service segment. Clearly, one node (sandbox-host-1) is faster than the other. This could be a bug, an infrastructure problem or, as in this case, an intentional action: the second node has a different user-creation-lag param value.

This demonstrates why good monitoring should be so detailed. Based on the "routes" chart alone, we could assume that all nodes took the same time to process requests. Since we are using an Akka cluster with sharding, the slower node may be handling messages coming from both nodes' HTTP endpoints, which is why we see the illusion of similar processing times.

Trace token

Ok, now we know more about tracing requests across a single cluster. But what should we do if we have a huge IT system with many applications, each deployed as a separate cluster? Let's assume that the applications communicate over HTTP (which is quite common).
Well, Kamon has a solution for this problem as well. Each trace is identified by a trace token, which can be generated automatically or provided manually. In the case of HTTP communication, you can pass it via an X-Trace-Token header (or configure a custom header name with the kamon.akka-http.trace-token-header-name setting). This means that, besides monitoring, you get a distributed correlation id for free. Remember to include this header in each sub-request across the system; a sketch of that follows below. You can also access the token for logging purposes:

log.debug("creating user [token {}]", Tracer.currentContext.token)
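
And here is a minimal sketch of forwarding the token to a downstream service over Akka HTTP (the helper name is illustrative, and the header name assumes the default X-Trace-Token):

import akka.http.scaladsl.model.HttpRequest
import akka.http.scaladsl.model.headers.RawHeader
import kamon.trace.Tracer

// attach the current trace token to an outgoing request so that the downstream
// cluster continues the same logical trace / correlation id
def withTraceToken(request: HttpRequest): HttpRequest =
  request.addHeader(RawHeader("X-Trace-Token", Tracer.currentContext.token))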

Summary

Tracing is definitely one of the most important Kamon features. You can use it very easily, not only with the Akka stack, but also with Play or Spring! This is not the end of our journey. In the third part, I will focus on Kamon's global actor system monitoring features.


This blog post is part 2 of a 3-part series, see part 1 | part 3.

*Datadog dashboards

For those of you who are not familiar with Datadog configuration, here is a simple tutorial for creating the charts mentioned in all parts of this blog post series:

  1. Create a new TimeBoard dashboard:

  2. Drag and drop the time series graph:

  3. Use the JSON graph configuration from ../akka-sandbox/datadog/graphs.js for a specific graph:

  4. Repeat 2 & 3 to add another graph.
