
Developer Experience done right

Michał Ostruszka

05 Jun 2024, 9 minutes read


At SoftwareMill, being a software delivery partner for our clients, we naturally get to work on many projects of various scales, across different business domains. Each of these projects is different in terms of organization, ways of working, technologies and tools, coding practices, as well as the developer experience (DX) it offers. Sometimes we join well-established projects with great developer experience already baked in, other times we either start fresh or join early-stage projects where co-building a good developer experience is one of our objectives. Last but not least, sometimes we’re also explicitly hired to help assess and improve DX aspects of existing projects, leveling up clients’ “platforms”.

In this new series we’d like to describe some of the projects we’ve been involved in recently from the perspective of their developer experience and so-called platform engineering - how it was done, what was great, where it was lacking, and what we thought could be improved and how.

Developer experience, what is it?

Developer experience is a relatively fresh “umbrella term” for everything related to the ease of building and working with software, as well as the productivity and efficiency of doing so. It covers everything from development tools and languages, through documentation, workflows and automations, to deployments and running and maintaining the software in production. It’s often mentioned in the context of platform engineering, as the latter simply builds the foundations for developer experience, focusing on internal development platforms, necessary custom tooling and automations, observability infrastructure, addressing security and compliance at the infrastructure level, etc.

For the first project we’d like to showcase, let’s go through some areas of its developer experience and see how they were addressed and built there.

Infrastructure provisioning and change management

The project has separate production (oh, well), staging and acceptance environments. While staging is meant to be used for in-development tests, feedback and integration, acceptance is used for automated testing of as much of the platform as possible.
There is no need for developers to touch low-level infrastructure parts, as this is all managed and run by a dedicated SRE team, supported by the Platform team. It’s a service-based (NOT microservice-based) architecture, meaning there is no scale of hundreds or thousands of microservices with new ones popping up every day and requiring constant resource and infrastructure provisioning. Because of that, requests to set up and provision new infrastructure aren’t frequent, and they are handled via regular change-management-flavored tickets raised for the respective team.

Custom, shared services, and code

As in every project, there are some shared libraries, models and abstractions unifying usage patterns for tools, usually seen in the form of “commons” or “utils”. While we may joke about these, the truth is that providing a common, standardized way of working with the tools used across services, and making established patterns easier to apply, levels up developer experience a lot. Developers don’t have to spend time setting up, configuring and building their own abstractions on top of existing services and libraries (like Kafka, config etc.); instead they use ones that are maintained by the platform team and constantly augmented with fresh solutions. In the long term, it lowers the cognitive load of the project and allows for standardization across services, so jumping between the development of different services is easier.
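To give a rough feel for what such a shared abstraction can look like, here is a minimal, purely illustrative sketch in Scala. The `KafkaSettings` and `PlatformKafka` names are made up for this example and are not the project’s actual API; the point is that a service only declares what is specific to it, while the platform-owned defaults (serialization, retries, metrics) come for free.

```scala
// Hypothetical "commons"-style wrapper: every service gets the same defaults
// instead of configuring the raw Kafka client by hand.
final case class KafkaSettings(
  bootstrapServers: String,
  clientId: String,
  retries: Int = 3,          // platform-wide default, rarely overridden by services
  enableMetrics: Boolean = true
)

trait PlatformKafka {
  /** Publish a payload; a real implementation would wrap the Kafka producer. */
  def publish(topic: String, key: String, payload: Array[Byte]): Unit
}

object PlatformKafka {
  /** Factory the shared library could provide; this sketch only logs instead of sending. */
  def apply(settings: KafkaSettings): PlatformKafka = new PlatformKafka {
    def publish(topic: String, key: String, payload: Array[Byte]): Unit =
      println(s"[${settings.clientId}] -> $topic (key=$key, ${payload.length} bytes)")
  }
}

object OrderService extends App {
  // The service supplies only what is specific to it; the rest are platform defaults.
  private val kafka = PlatformKafka(KafkaSettings("kafka:9092", clientId = "order-service"))
  kafka.publish("orders.created", key = "order-42", payload = "{}".getBytes("UTF-8"))
}
```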

One thing that could be done better, though, was the inclusion of “shared” models in this common artifact/library. While it significantly speeds up development in the early stages, it may get troublesome after reaching a certain level of complexity. As most of the services share some domain models via this custom library (think: e.g. Money), introducing changes or new variants in these just for a few services may have a high impact on unrelated services. This is something that can obviously be addressed, but each of the possible options has its pros and cons, and they have to be weighed properly.
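To make the coupling concrete, imagine a shared Money type living in the common library (a hypothetical example, not the project’s actual model): a new field or a change to the rounding rules needed by one service forces every other service to pick up the new library version, recompile and redeploy, even if nothing changed for them functionally.

```scala
// Hypothetical shared model from the "commons" artifact. Every service compiles
// against this exact definition, so any change here ripples to all of them.
final case class Money(amountMinor: Long, currency: String) {
  def +(other: Money): Money = {
    require(currency == other.currency, "cannot add amounts in different currencies")
    Money(amountMinor + other.amountMinor, currency)
  }
}
```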

Delivery process

Service-based architectures, supported by dedicated teams together with a high engineering culture and great support from the platform and infrastructure, are the key to delivering value quickly. And I must say it’s a real pleasure to work that way, when one doesn’t have to wait for big-bang releases happening every quarter or so and can just ship changes even several times per day if needed. Continuous integration pipelines that build service images, version them with git tags and notify change authors when done take all the heavy lifting away from developers. Although it’s not “continuous deployment” yet - an engineer still has the last say and decides when to tag a new version and hit the “Deploy” button - the experience is quite smooth thanks to the efforts of the SRE and Platform teams, who provide tools to semi-automate that last step on the road to fully automated deployments.

Obviously, frequent deployments don’t have to mean frequent releases, especially for incomplete functionalities. Feature flags (also known as feature toggles) to the rescue! They’re heavily used to keep features that are still in development invisible to end users, while still allowing teams to deploy them to production chunk by chunk.
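As a minimal sketch of what such a flag check can look like in service code (the FeatureFlags trait and the flag name below are made up for illustration, not the project’s actual mechanism):

```scala
// Hypothetical feature-flag lookup: the code for the new flow is deployed to
// production, but stays dark until the flag is switched on for a given user.
trait FeatureFlags {
  def isEnabled(flag: String, userId: String): Boolean
}

object Checkout {
  def render(flags: FeatureFlags, userId: String): String =
    if (flags.isEnabled("new-checkout-flow", userId))
      "new checkout UI"      // in-development feature, visible only when the flag is on
    else
      "current checkout UI"  // default path for everyone else
}
```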

Another great thing is the widely used progressive rollout - first, open your shiny new feature to internal users, then to a fraction of users outside of the company, and keep rolling it out in batches to keep an eye on the adoption process and service metrics.
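One common way to implement the “fraction of users” step is to bucket users deterministically, e.g. by hashing the user id against a rollout percentage. The sketch below illustrates that general technique and is not the project’s actual rollout mechanism:

```scala
// Illustrative progressive rollout: a user always lands in the same bucket, so
// raising the percentage only adds users and never flips anyone back and forth.
object Rollout {
  def inRollout(userId: String, percentage: Int): Boolean = {
    val bucket = math.abs(userId.hashCode % 100) // stable 0..99 bucket per user
    bucket < percentage
  }
}

// Usage: internal users first, then e.g. 5%, 25%, 100% of external traffic.
// Rollout.inRollout("user-123", percentage = 5)
```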

Service development kickstart

Speaking of metrics, every service has a built-in feature of reporting a basic set of metrics from the JVM and the container itself (CPU, memory, GC, threads etc.). Additionally, the shared library mentioned above integrates metrics and tracing into many of the abstractions it provides, so it’s basically a no-brainer to have all the vital signals and traces from the application available in monitoring dashboards (gRPC endpoint stats, Kafka topics monitoring, Akka cluster state, database pools and calls etc.). With that setup, adding new service-specific metrics is just a matter of a few lines of code.
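To illustrate what “a few lines of code” can mean in practice, here is a hypothetical sketch; the Metrics and Counter traits stand in for whatever the shared library actually exposes:

```scala
// Hypothetical metrics facade from the shared library: JVM, Kafka, gRPC and database
// metrics come for free, the service only registers what is specific to its domain.
trait Counter { def increment(): Unit }

trait Metrics {
  def counter(name: String, tags: (String, String)*): Counter
}

final class PaymentHandler(metrics: Metrics) {
  private val rejected = metrics.counter("payments_rejected_total", "reason" -> "limit")

  def rejectOverLimit(): Unit =
    rejected.increment() // shows up on the standard dashboards next to the built-in signals
}
```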

But how does a new service get ready to be worked on? As said above, due to the more coarse-grained nature of services, it’s not often that a new service is needed, hence there is no automated “provisioning” that does all the heavy lifting. Instead, because the tech stack is mostly unified, there is just a minimal blueprint service repository with all the batteries mentioned above included, which the developer copies and starts building on top of. That also means one needs to fine-tune all the deployment-related artifacts (Helm charts etc.), which isn’t the most pleasant task, but it’s mostly a one-time effort. Sure, it could be fully automated, e.g. via a web form or a CLI tool, but weighing the investment and maintenance costs against the actual needs and usage frequency, it isn’t the most critical point on the roadmap.

Running and maintaining services

When the service is already up and serving users’ requests in production, we want to know when anything bad is happening, be it programming bugs that sneaked through the testing process, uncovered edge cases, configuration or integration issues etc. To support that, each service and team has a dedicated alerting configuration that notifies them via the respective Slack channels. These are both alerts coming from the service code itself, where again the shared library provides a toolset to easily raise alarms programmatically, and alerts coming from the platform itself (exceeded memory limits, restart loops etc.). For troubleshooting, all the important bits are integrated: logs, traces and metrics, which makes getting the data for analysis more straightforward.
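A hypothetical shape of such a programmatic alert helper is sketched below; the Alerts trait and the severity levels are assumptions made for the example, with routing to the owning team’s Slack channel handled by the platform:

```scala
// Sketch of raising an alert from service code; severity and routing to the owning
// team's Slack channel would be handled by the shared library and the platform.
sealed trait Severity
object Severity {
  case object Warning  extends Severity
  case object Critical extends Severity
}

trait Alerts {
  def raise(severity: Severity, summary: String, details: Map[String, String] = Map.empty): Unit
}

final class InvoiceImporter(alerts: Alerts) {
  def importBatch(batchId: String, failedCount: Int): Unit =
    if (failedCount > 0)
      alerts.raise(
        Severity.Warning,
        summary = s"Invoice batch $batchId had $failedCount failures",
        details = Map("batchId" -> batchId) // extra context ends up in the Slack message
      )
}
```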

In addition to automated alerting, there are Slack channels where important issues spotted by employees or customer support requests are dispatched for quick reaction. There is a set of Slack workflows built to support these requests, so that ownership and progress are well communicated.

Knowledge base

When it comes to running in production, there is no dedicated, full-blown service catalog in the form of e.g. Spotify’s Backstage yet, but there is ongoing research in that area. Instead, there is an organized space on the engineering wiki that works as a service catalog. Each service has a dedicated page with the most important information: a short description of what it does, what tools it uses, which team and manager are responsible for it, and how to deal with incidents that may happen but haven’t been properly addressed yet, a.k.a. runbooks. Each runbook contains a clear set of instructions (with the necessary commands, screenshots etc.) that any engineer picking up the issue should be able to follow, and for the worst case it lists escalation channels if the procedure doesn’t work out.

Regarding the high-level picture of the system, there is a place that holds most of the architecture and integrations in the form of C4 architecture diagrams, the de facto standard for describing such structures. It enables quick navigation when looking for more technical details about the services and their landscape.

Unfortunately, both of these (runbooks and diagrams) need to be kept in sync manually after any change, hence the aforementioned research into automation tools. There are many tools and “platforms for building platforms” available on the market, and it may take some time to find one that isn’t overly complicated yet still brings enough improvement to be worth investing in.

SRE and Platform teams

Last but not least, let’s talk briefly about the people who are behind some of these improvements. While a separate team may not be needed from the beginning, over the course of the project, as it gains traction and naturally grows more and more complex, such a team becomes indispensable. Quoting the “Team Topologies” classification:

“Platform team: a grouping of other team types that provide a compelling internal product to accelerate delivery by Stream-aligned teams”

They’re there to make sure the teams delivering so-called “business value” can work at full speed. They take care of infrastructure provisioning and maintenance, provide and build custom tools when needed, keep an eye on core dependency updates across the services, and automate whatever is worth automating, so that less time is spent on routine tasks and there is less room for human error. Thanks to them, most of the tooling is standardized and familiar to engineers in all service teams, instead of being a bunch of team-grown scripts or tools hacked together quickly and left unmaintained in the long run. Having said that, it’s not that regular developers cannot contribute to the tooling and the “platform” itself; in fact, a significant portion of the tooling has emerged from issues the teams first addressed internally. Communication between the platform team and the other teams is crucial here, in order to avoid misunderstandings and building or standardizing the wrong things - in the end, it’s all done to enhance the developer experience.

Summary

Building a good developer experience is not a one-time effort; rather, it’s a continuous process. It’s not a matter of deploying a tool that will solve all the issues for us. It takes time and requires experience, knowledge and, usually, dedicated and skilled people who know what to build and how, so that stream-aligned teams are empowered and well served. In the project described above, the Platform team was born and keeps growing organically, driven by the real needs of the remaining teams and not by the fact that it’s cool to have a Platform team these days because all the big names have one. With such an approach, only the tools and improvements that are really worth doing get delivered. They may not bring immediate gains, but in the long run they address the most pressing concerns of day-to-day development, making delivering features more pleasant and straightforward.
