Get to know your system better
I remember times (or projects from the beginning of my career) when nobody talked or cared about system monitoring. Applications were deployed rarely, usually by sending fat JAR/WAR bundles over FTP to clustered WebSphere or early JBoss, and JMX was used only to troubleshoot serious issues by super-high-level folks, if at all. There was no common knowledge of technical metrics, dashboards or alerts - no application insights beyond a simple “now it works and now it doesn’t”. Thankfully, things have changed dramatically since then, and nowadays project teams can benefit from a wide range of tools and approaches that give a virtually unlimited view into system internals.
Although all the tools are available, getting to the point where one can say that a given system is being well monitored/measured is a long journey. Let’s look at a few key points and tricks that may help you get there faster, with less effort and get the most out of it.
Absolute must - basic resource monitoring
Do you know how much memory your applications use? How do these numbers differ when your application is mostly idle vs when it’s busy serving users at peak load? How does GC behave in both cases? How is CPU usage? Is there throttling happening when load increases? If you don’t know these numbers, or at least can’t get this data immediately when asked, you should seriously consider investing some time in making that happen.
In the long run, you’ll benefit from that twofold. First - you’ll have data to support resource adjustments in order to optimize infrastructure costs. This is especially important nowadays, when a great number of projects run in the cloud and you pay proportionally to the resources booked/reserved for your app, but it’s also important when running on-premise. Second - data collected over time can give you insight into how your application works on the lowest level. Does it need more CPU during load spikes to serve users efficiently? Does it need more RAM to better handle traffic? Does GC work as expected, or does it require some tweaking? Is your heap big enough? And, last but not least, do you maybe have a memory leak (you recall your app crashing from time to time for no obvious reason)?
The good news is that most cloud platforms already provide a subset of this data in some form (e.g. per-pod resource usage in k8s), but in order to get the most out of it, you may want to expose more metrics (like GC times) from your application and have a way to aggregate and display them. I’m not gonna suggest any tools here as your mileage may vary, but rest assured, you have tons of options out there.
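To make the idea of comparing “idle” and “busy” numbers concrete, here is a minimal sketch in Python that uses only the standard library. The metric names and the simulated workload are assumptions made for illustration; a JVM application would read the equivalent numbers (heap, GC, CPU) from its own runtime and ship them to a real metrics backend instead of printing them.

```python
import gc
import time
import tracemalloc


def resource_snapshot():
    """Return a dict of basic process-level metrics (names are illustrative)."""
    current, peak = tracemalloc.get_traced_memory()
    gc_stats = gc.get_stats()  # one dict per garbage collector generation
    return {
        "cpu_seconds_total": time.process_time(),
        "traced_memory_bytes": current,
        "traced_memory_peak_bytes": peak,
        "gc_collections_total": sum(g["collections"] for g in gc_stats),
        "gc_collected_objects_total": sum(g["collected"] for g in gc_stats),
    }


tracemalloc.start()  # memory is only traced from this point on
idle = resource_snapshot()
payload = [str(i) * 10 for i in range(100_000)]  # simulated busy period
busy = resource_snapshot()

for name in idle:
    print(f"{name}: idle={idle[name]} busy={busy[name]}")
```

Collected periodically and stored over time, snapshots like these are what lets you answer the questions above with data rather than guesses.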
Make it a no-brainer to collect metrics
Have you ever wished your app had a tiny feature X while you were working on something else but it was so complicated to enable it that you didn’t even consider doing that? What if adding that feature was just a few tweaks and a few lines of code away? The same applies to adding any metric to your codebase while working on a feature. It would be great to have these stats but if setting up all that collecting-related code, building dashboards etc. takes ages and is complicated enough, you’ll give up and in the best case, you’ll create a ticket to take care of it later (in the worst case, it’ll never happen). This is where investing some time in having good tools/helpers pays off. If you can expose a metric with just a few (sometimes even one) lines of code when directly working on a feature, it’s the best you can get - you bake in the right metric (as you perfectly understand the use case) in the right place in almost no time.
In my current project, with many different services, there is a set of helpers to set up and configure new services so that they immediately start exposing some core metrics (CPU, memory, HTTP requests, DB connection pools, etc). Complementary to that, there is a dashboard generic enough that any service configured this way shows up on it and has its metrics visualized virtually for free. It is priceless to have that up and running from the start with almost no effort, and it really pays off later.
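As an illustration of what “one line of code” can look like, below is a minimal sketch of such a helper in Python. The `timed` decorator, the in-process `CALLS` and `SECONDS` registries and the `orders.create` metric name are all hypothetical; this is not the helper set from the project mentioned above, and a real setup would hand the numbers to a metrics library instead of keeping them in dictionaries.

```python
import functools
import time
from collections import defaultdict

# Hypothetical in-process registry, used here only to keep the sketch self-contained.
CALLS = defaultdict(int)
SECONDS = defaultdict(float)


def timed(metric_name):
    """Decorator that counts calls and accumulates wall-clock time under metric_name."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                # Recorded even when the wrapped function raises.
                CALLS[metric_name] += 1
                SECONDS[metric_name] += time.perf_counter() - start
        return wrapper
    return decorator


@timed("orders.create")  # the single line a feature developer has to add
def create_order(items):
    return {"items": items, "total": sum(items.values())}


create_order({"book": 12.5})
create_order({"pen": 1.5, "ink": 4.0})
print(CALLS["orders.create"], round(SECONDS["orders.create"], 6))
```

The point of the design is that all the plumbing lives in the helper, so adding a metric while working on a feature costs one decorator line.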
Know what you collect
If you are able to bake in metrics collection right from the start when working on a feature, it’s the best thing you can do. You know exactly the paths, all the logic, what to measure and where. Yet it happens that metrics get added as an afterthought, when feature development is long closed and shipped. It’s not a big deal, but it requires some more effort to get right.
Also, when reworking some part of the codebase or removing features, make sure the metrics you collect are still valid. A metric that made sense before a rework may start reporting nonsense numbers afterwards. While that’s not harmful for the application logic itself, if it influences business decisions, you’d better keep your metrics correct. Remove metrics collection that is no longer used together with the corresponding empty dashboard charts, tweak existing ones, etc. That’s where having easily editable dashboards can be of huge help.
Get familiar with monitoring UI and query language
Sure, you can have a beautiful UI where you can click through a wizard-form to get you from zero to a pretty-looking, impressive dashboard. And that’s ok; all in all, you have what you wanted - better application insights. But over time metrics change, new insights come in, old metrics stop being useful, and some get removed, leaving graphs empty. How can you deal with that in the long run? We already know the answer from other fields: keep your dashboard definitions in code, version-controlled the same way as your application codebase, and automatically deployed. While it’s a relatively new thing, some solutions already exist in the wild and are worth researching.
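A minimal sketch of the “dashboards as code” idea in Python follows. The schema (`title`, `panels`, `metric`) is a made-up, tool-agnostic format and the metric names are placeholders; real dashboard tools each define their own definition format, so treat this only as an illustration of generating a definition from code that can be reviewed and versioned.

```python
import json


def build_dashboard(service, metrics):
    """Build a dashboard definition with one panel per metric (hypothetical schema)."""
    return {
        "title": f"{service} overview",
        "panels": [
            {"id": index, "title": metric, "metric": metric}
            for index, metric in enumerate(metrics, start=1)
        ],
    }


# Placeholder service and metric names, chosen for illustration only.
dashboard = build_dashboard("orders", ["http_requests_total", "gc_pause_seconds"])
print(json.dumps(dashboard, indent=2, sort_keys=True))
```

Because the definition is plain data produced by code, removing a metric from the list removes its panel as well, and the change shows up in a regular code review.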
It may also happen that you need to view metrics differently, with a custom ad-hoc projection just for one-time use. For that, it’s good to be familiar with at least the basics of your observability infrastructure’s query language and metric types, and to have an overall idea of what you can get from it. Sometimes you don’t even need to introduce a new metric because all the data is already there; it’s just a matter of projecting it differently.
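One common example of projecting existing data differently is turning a cumulative counter into a per-second rate. Query languages usually offer this as a built-in function; the Python sketch below only shows the arithmetic behind it. The sample data and the 15-second scrape interval are assumptions, and counter resets are deliberately not handled.

```python
def per_second_rate(samples):
    """Turn (timestamp_seconds, cumulative_count) samples into per-interval rates."""
    rates = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        # Increase of the counter divided by the elapsed time; resets are ignored here.
        rates.append((t1, (c1 - c0) / (t1 - t0)))
    return rates


# Hypothetical counter scraped every 15 seconds: total requests served so far.
requests_total = [(0, 0), (15, 30), (30, 90), (45, 105)]
print(per_second_rate(requests_total))  # [(15, 2.0), (30, 4.0), (45, 1.0)]
```

No new metric was needed here: the request rate was already contained in the counter, it just had to be derived.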
Know the limitations
This point may vary slightly depending on your observability stack of choice, but the general rule is that your metrics will only show you the overall picture: trends across time. You typically won’t get data collected and projected with millisecond precision. This comes from the time resolution of metrics reporting and from the way you query them.
For example, if there was a burst in the number of incoming requests, but it was short enough, you may not be able to see it on your dashboard. Sometimes all you can see is a trend: whether the number of requests was increasing or decreasing over time.
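The effect is easy to reproduce with plain arithmetic. The Python sketch below uses made-up traffic numbers and an assumed 60-second resolution; it does not model any particular monitoring tool.

```python
# Hypothetical traffic: 10 requests per second for two minutes,
# with a 2-second burst of 500 requests per second at t=70s and t=71s.
per_second = [10] * 120
per_second[70] = 500
per_second[71] = 500


def average_per_window(values, window):
    """Average consecutive, non-overlapping windows of `window` samples."""
    return [sum(values[i:i + window]) / window for i in range(0, len(values), window)]


actual_peak = max(per_second)                     # what really happened
dashboard_view = average_per_window(per_second, 60)  # what a 60-second resolution shows

print(actual_peak)
print(dashboard_view)
```

The real peak is 500 requests per second, yet the second one-minute window averages out to roughly 26, so on the chart the burst looks like a modest bump.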
It may look like a limitation, but if you have your system monitored correctly, with many different metrics, you’re able to quickly conclude what happened and where.
On the other hand, it may be a good thing if you set up your alerting rules based on that. To trigger an alert, you’re probably more interested in general trends than in outliers.
Long story short: metrics are not like tracing, where you see every single trace of what your program did and what the request’s path through your system was - you should know what you are looking at and learn what conclusions you can derive from your metrics.
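One simple way to express “trend, not outlier” in an alerting rule is to require the condition to hold for several consecutive samples. The Python sketch below illustrates that idea; the function name, the threshold and the sample data are assumptions, and real alerting systems have their own ways of configuring this.

```python
def should_alert(values, threshold, window):
    """Alert only if each of the last `window` samples is above the threshold."""
    if len(values) < window:
        return False  # not enough data to call it a trend
    return all(value > threshold for value in values[-window:])


# Hypothetical error-rate samples in percent, one per minute.
single_spike = [1, 1, 1, 40, 1]
sustained = [1, 8, 9, 10, 12]

print(should_alert(single_spike, threshold=5, window=3))  # False
print(should_alert(sustained, threshold=5, window=3))     # True
```

The single spike to 40% does not page anyone, while three minutes above the threshold does.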
So to recap:
- If you don’t have an insight into your application, you’re missing tons of information and exposing yourself to tons of risks, not only technical ones, but business ones as well. What if your service isn’t fast enough on some execution paths and users start leaving it for a competitor?
- Make it easy for developers to hook in metrics collection and have at least basic visualization in place. It really pays off when you need to troubleshoot resource usage, weird bugs, etc. No more guesswork.
- Get familiar with what you can get from your metrics, how you can project them to get the information you want. Sometimes it’s just a few tweaks away to get a really valuable set of data from the metrics you already have.
- Know the limitations of your observability platform, what you look at and how to make decisions based on the numbers you see. Remember, it’s usually not as detailed as you may want it to be at first, but instead of trying to work around that, ask yourself if that (together with other metrics) isn’t enough.