KafkaPilot Under the Hood: Building an AI-Powered Kafka Diagnostics Tool in Rust
If you have ever gone beyond running Kafka on your local machine and moved to real-world production projects built on Kafka, you probably know the drill: Kafka clusters are complex, and the problems you can encounter in your Kafka infrastructure are practically endless.
Introduction: The problem space
There are multiple aspects to this, starting with your setup (which is complex by itself and requires understanding many things about Kafka to get right and secure) and ending with day-to-day operations, where a new set of problems arises and you need near-PhD-level expertise to understand what is going on and how it can affect your business.
When investigating your Kafka cluster, you often need to gather data from multiple sources, check out the configuration, look into the logs, and check the JMX or Prometheus metrics to get started.
Many of the problems you find on existing Kafka clusters are recurring ones: common mistakes made by DevOps engineers or other people on the development team who set the cluster up at the beginning of its life. It has been running fine for ages, right up until you need to investigate because something is off with your data.
At SoftwareMill, we are trying to gather that tribal knowledge into a standard set of problems that can be automatically discovered on any Kafka cluster. For this exact purpose, we have created a new open-source project called KafkaPilot.
KafkaPilot is a CLI tool that helps you gather all the data and analyze it to identify common problems that can occur in your Kafka cluster. It uses an LLM to execute each analysis task and is easily extensible, allowing us to add new tasks and solve more problems in the future.
KafkaPilot is part of SoftwareMill’s Innovation Hub and is currently in the MVP stage. While functional, it may contain bugs and has significant room for improvement. We welcome feedback and contributions. Treat this project as a base, invent your own tasks, and share them with the world by opening a pull request.
The tool is written in Rust and is a simple CLI that you can run locally or on a bastion server. All you need to provide (at least for now) is SSH access to your brokers (and to your bastion server if you run the tool locally and want to inspect a production cluster).
KafkaPilot scans your brokers for configuration files, logs, and metrics, and stores them for analysis. Once the data is gathered, you can analyze it and produce a report showing the good and bad aspects of your current cluster setup. Currently, the tool ships with 17 built-in checks, but it can easily be extended with new checks (tasks), which are defined in YAML and translated into LLM prompts with context from the data gathered earlier.
This article is for Rust developers interested in building infrastructure tools (especially CLI-based ones), as well as for Kafka administrators and anyone interested in running Kafka in production.
As mentioned earlier, KafkaPilot is at an early stage of development; however, the main architectural decisions should remain in place for a while, and they are described below, along with the inner workings of some of the more interesting components.
Architecture overview
KafkaPilot (at the time of writing) is a simple tool. Its functionality can be divided into several operations: scanning, analysis, and report generation. Both scanning and analysis rely on LLM interactions to find the answers they need. It may be surprising that we use an LLM during scanning, but it's much faster and easier to ask an LLM to derive possible log file locations from configuration data than to handle all possible cases in code.
The code is split into modules that keep concerns separate: collection -> analysis -> reporting.
KafkaPilot doesn’t require you to install anything on target hosts (although it should be possible to run it from a bastion if you want to), and it’s purely SSH-based, agentless software. Assuming you have an OpenAI API key set up and can make an SSH connection to your cluster, you are good to go.
The scan step allows you to store data in a single directory, which you can later share and analyze multiple times without any further interactions with your Kafka cluster.
Module breakdown
Below, you can find a description of key modules and their responsibilities. Although some modules are not listed here and some descriptions do not fully reflect the complexity behind each module, you will gain an idea of how the project is structured and how it works in general.
src/scan/ - SSH orchestration and data collection pipeline
This module allows agentless data collection from remote Kafka clusters (in Zookeeper or KRaft modes). If you are running KafkaPilot directly from your machine through the bastion server and then to the brokers, it also scans some data about your bastion server. As mentioned earlier, it also utilizes LLM API interactions to discover logs on each broker, which is a complex process in itself.
src/snapshot/ - Snapshot data structures and serialization
This module is a collection of models for our cluster state representation, reflecting the snapshot directory structure we create during the scanning step (brokers, cluster, metrics, system, etc.). We use type-safe data structures for config, logs, and metrics, and serialize some data to JSON for persistence and portability.
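To give a feel for what such models can look like, below is a minimal, hypothetical sketch of a per-broker snapshot structure using serde; the actual types in src/snapshot/ are richer and may be shaped differently.

// Hypothetical, trimmed-down shape of a per-broker snapshot model.
use std::collections::HashMap;
use serde::{Deserialize, Serialize};

#[derive(Debug, Serialize, Deserialize)]
pub struct SystemInfo {
    pub cpu_count: u32,
    pub disk_free_bytes: u64,
}

#[derive(Debug, Serialize, Deserialize)]
pub struct BrokerSnapshot {
    pub broker_id: u32,
    pub hostname: String,
    /// Key/value pairs parsed from server.properties
    pub server_properties: HashMap<String, String>,
    /// Raw log excerpts collected during the scan
    pub logs: Vec<String>,
    /// System facts such as CPU count and free disk space
    pub system: SystemInfo,
}

// Persist the snapshot as JSON so it can be shared and re-analysed offline:
// let json = serde_json::to_string_pretty(&snapshot)?;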
src/analysis/ - AI task execution engine
This module is the heart of KafkaPilot, where we orchestrate AI-powered cluster analysis. To make things extensible and easier to understand, we created a YAML-based task definition system: you add new tasks by defining them in YAML, using template-based data injection with placeholders such as {logs}, {config}, and {metrics}, so that you can easily add context to the prompts sent to the LLM to produce a structured response.
src/llm/ - OpenAI integration with fallbacks
The LLM module abstracts communication with the LLM API, adding flexibility. Currently, we support only OpenAI, and the prompts are optimized against GPT-4o. Future releases may work with local models, and more sophisticated models will be incorporated when required. You can test the various models available from OpenAI via corresponding entries in the configuration.
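As a rough illustration of what such an abstraction can look like, here is a hedged sketch of a provider trait; the real LlmService in src/llm/ may have a different shape, and the async_trait crate plus the OpenAiService type are assumed here purely for brevity.

// A possible provider abstraction; the real trait may differ.
use async_trait::async_trait;

#[async_trait]
pub trait LlmService: Send + Sync {
    /// Send a system prompt plus task prompt and return the raw model response.
    async fn complete(&self, system_prompt: &str, user_prompt: &str) -> anyhow::Result<String>;
}

/// Hypothetical OpenAI-backed implementation, selected via configuration.
pub struct OpenAiService {
    pub api_key: String,
    pub model: String, // e.g. "gpt-4o", switchable through config entries
}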
src/report/ - Terminal, Markdown, JSON formatters
The last module is responsible for presenting all the findings discovered during the analysis step. At the time of writing, you can produce reports either directly in the terminal, as a Markdown file, or as JSON for further processing. We use severity classifiers so that you can distinguish between critical, high, medium, low, and informational levels. We try to add context to the issues found and the tasks processed, showing on which broker and exactly where the problem was found, so that you can take action on your real cluster.
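A minimal sketch of how these severity levels could be modeled, mapping keywords found in LLM output to a level with a task-level fallback; the names are illustrative, not KafkaPilot's actual types.

// Illustrative severity model; KafkaPilot's internal types may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Severity {
    Info,
    Low,
    Medium,
    High,
    Critical,
}

impl Severity {
    /// Map a keyword found in an LLM response to a severity level;
    /// callers fall back to a task's default severity when nothing matches.
    pub fn from_keyword(keyword: &str) -> Option<Self> {
        match keyword.to_ascii_lowercase().as_str() {
            "critical" => Some(Self::Critical),
            "high" => Some(Self::High),
            "medium" => Some(Self::Medium),
            "low" => Some(Self::Low),
            "info" | "informational" => Some(Self::Info),
            _ => None,
        }
    }
}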
Technology stack rationale
KafkaPilot has only a few dependencies per module, primarily to parse data, produce reports, enable parallelization, and build an easy-to-use CLI.
The choice of Rust here was a no-brainer; although the benefit may not be immediately apparent from the current, relatively simple implementation, the decision will definitely pay off. We test most of our LLM tasks against a small test cluster with just a few brokers; however, with Rust and tokio, we can easily process brokers concurrently with minimal overhead. Rust is also an excellent choice for parsing files and fast regular-expression matching.
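To illustrate why tokio fits here, the sketch below scans several brokers concurrently; Broker, BrokerData, and collect_broker_data are hypothetical stand-ins, not KafkaPilot's actual API.

use tokio::task::JoinSet;

// Hypothetical stand-ins purely for illustration.
pub struct Broker {
    pub host: String,
}
pub struct BrokerData; // collected configs, logs and metrics would live here

async fn collect_broker_data(broker: &Broker) -> anyhow::Result<BrokerData> {
    // ... open an SSH session to broker.host and gather files ...
    let _ = &broker.host;
    Ok(BrokerData)
}

/// Scan every broker concurrently; each one gets its own tokio task.
async fn scan_all_brokers(brokers: Vec<Broker>) -> Vec<anyhow::Result<BrokerData>> {
    let mut set = JoinSet::new();
    for broker in brokers {
        set.spawn(async move { collect_broker_data(&broker).await });
    }

    let mut results = Vec::new();
    while let Some(joined) = set.join_next().await {
        match joined {
            Ok(res) => results.push(res),
            Err(join_err) => results.push(Err(anyhow::anyhow!(join_err))),
        }
    }
    results
}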
KafkaPilot is a single static binary that requires no virtualenvs, JVM, or npm install on your machine. It can be compiled for multiple architectures and works in restricted environments.
The SSH-based collection system
Traditional monitoring tools require agents and software installation on target hosts, which can be problematic, especially for enterprises with some security approval processes in place. It becomes a challenge to sync versions between our tools and the running Kafka across brokers, adding additional maintenance overhead when updating agents or the Kafka cluster itself.
While building KafkaPilot, the most critical goal in our minds was to make it as universal as possible, so it can be used across various scenarios and Kafka deployments. On the other hand, once we wanted to move beyond the scan functionality and actually execute checks in the analysis step, we couldn’t hope to cover every possible setup.
We have decided that the first version of this tool will support SSH access only (through agent forwarding). This means that you need real SSH access to your bastion and brokers (or, if you run the app directly on the bastion server, only to your brokers).
KafkaPilot requires no software installation on the brokers, works immediately as long as you can SSH to your servers, and performs a one-time collection that can be analysed without any connection to your cluster once the scanning phase is completed.
The scanning step can be broken into the following phases:
- Phase 1: Brokers discovery
Most of the data we collect requires broker discovery so that we know which hosts to SSH into. If brokers are not reachable, we can quit early and report the problem to the user. During this phase we discover broker IDs, hostnames/IPs (for further SSH connections), and whether rack awareness is configured (which affects some of the analyses).
- Phase 2: Test connectivity
In this phase, we test whether we can connect to the discovered brokers at all, again failing early and reporting to the user if we cannot. At this step, we make sure that SSH authentication works, broker hostname resolution is correct, and basic commands can be executed over SSH.
- Phase 3: Collect per-broker data
This phase takes the longest, and its running time depends heavily on the number of brokers in the given cluster. We collect the following data from each broker (the exact list is subject to change):
- Process info (Kafka Java processes, getting PID, memory usage)
- Configs (server.properties, log4j config, JVM options)
- Dynamic log discovery (described later)
- System metrics (disk space, CPU count, etc.)
- Other data directories
- Phase 4: Generate snapshot
Once all the data for each broker is collected, we create a snapshot of all our findings together with a COLLECTION_SUMMARY.md file. The scan data is split into per-broker data and cluster-wide data (topic metadata, consumer groups, and general cluster config).
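Putting the four phases together, a simplified orchestration could look like the sketch below; all function names are illustrative stubs rather than KafkaPilot's real API.

use std::path::Path;

// All phase functions are illustrative stubs.
async fn discover_brokers(_bastion: &str) -> anyhow::Result<Vec<String>> { Ok(vec![]) }
async fn test_connectivity(_brokers: &[String]) -> anyhow::Result<()> { Ok(()) }
async fn collect_all_brokers(_brokers: &[String]) -> anyhow::Result<Vec<String>> { Ok(vec![]) }
fn write_snapshot(_output_dir: &Path, _data: &[String]) -> anyhow::Result<()> { Ok(()) }

async fn run_scan(bastion: &str, output_dir: &Path) -> anyhow::Result<()> {
    let brokers = discover_brokers(bastion).await?;   // Phase 1: broker discovery
    test_connectivity(&brokers).await?;               // Phase 2: fail fast on SSH/DNS issues
    let data = collect_all_brokers(&brokers).await?;  // Phase 3: per-broker collection
    write_snapshot(output_dir, &data)?;               // Phase 4: snapshot + COLLECTION_SUMMARY.md
    Ok(())
}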
Example summary after the scan command completes:
# Kafka Cluster Data Collection Summary
## Collection Details
- **Timestamp**: 2025-09-19 06:39:47.118406 UTC
- **Output Directory**: test-scan
- **Bastion**: kafka-poligon (remote)
- **Total Brokers**: 6
- **Accessible Brokers**: 6
## Data Collected
### Cluster Level (from Bastion)
✅ Broker list and configurations (kafkactl)
✅ Topic list and details
✅ Consumer groups
✅ Prometheus metrics (kafka_exporter)
✅ Bastion system information
### Per Broker (from 6 accessible brokers)
✅ System information (CPU, memory, disk)
✅ Java/JVM information and metrics
✅ Configuration files (server.properties, log4j, etc.)
✅ Log files (server.log, controller.log, journald)
✅ Data directory information and sizes
✅ Network connections and ports
Output structure
The detailed structure looks like the following (note that some of the collected data is available only if KafkaPilot detects additional tools on the bastion server that can be used to collect it; one example is kafkactl, which is not required to run KafkaPilot, but some data can only be retrieved with it). If you don’t specify the output directory name, it defaults to the kafka-scan-TIMESTAMP format.
kafka-scan-2025-10-02T14-30-15Z/
├── scan_metadata.json # Machine-readable metadata
├── COLLECTION_SUMMARY.md # Human-readable summary
│
├── brokers/
│ ├── broker_1/
│ │ ├── configs/
│ │ │ ├── server.properties # Core Kafka config
│ │ │ ├── log4j.properties # Logging config
│ │ │ ├── kafka.env # environment variables
│ │ │ └── kafka.service # service properties
│ │ │
│ │ ├── logs/
│ │ │ ├── enhanced_discovery_metadata.json
│ │ │ └── journald_service_logs
│ │ │
│ │ ├── metrics/
│ │ │ └── jstat_gc.txt # to be implemented
│ │ │
│ │ ├── system/
│ │ │ ├── cpu.txt
│ │ │ ├── disk.txt
│ │ │ ├── lscpu.txt
│ │ │ └── processes.txt
│ │ │
│ │ └── data/
│ │ └── log_dirs.json # All log directories
│ │
│ ├── broker_2/ (same structure)
│ └── broker_N/
│
├── cluster/
├── kafkactl/
│ └── some common kafkactl outputs
└── system/ # Similar to system data found on each broker
Advanced log discovery: The 5-step chain
With traditional Kafka monitoring tools, log paths are usually hardcoded, like /var/log/kafka/*.log or /opt/kafka/logs/server.log, but relying on these conventions breaks easily, as each cluster can be configured completely differently:
- Logs go to non-standard locations
- Multiple Kafka instances run on the same host
- Logs are sent to journalctl (systemd stdout)
- Custom log4j configurations redirect to different paths
- Environment specific naming conventions are used
KafkaPilot splits the log discovery functionality into 5 separate steps, which makes it a bit more universal.
First, we try to get the exact PID of the running Kafka process with ps aux. From the output of the process listing, we can also get the arguments used to start Kafka, such as the log4j.properties location, the server.properties file used, and so on.
Once we have the PID, we can execute another command to get the service name for our Kafka process using the systemctl status … command. The service definition contains environment variables and configuration paths that aren’t visible in the process list.
With another command, once we know the service name, we can extract details such as Kafka’s exact location, the ExecStart command with all its arguments, and more.
An example service file for Kafka looks like the following:
[Service]
EnvironmentFile=/etc/default/kafka
Environment="LOG4J_CONFIG=/etc/kafka/log4j.properties"
ExecStart=/usr/bin/kafka-server-start /etc/kafka/server.properties
Assuming that at this point we are fairly sure which log4j.properties file is the right one to parse, we can extract the log directory locations we are interested in.
It turns out that such files can be quite complex on their own, so it's much easier and more reliable to send them to the LLM and ask for the possible log locations and whether the running instance also writes logs to standard output through one of its appenders.
If this approach to finding logs fails, we have a couple of fallback strategies we can use, such as using common service names to see the details or assuming standard log paths.
This approach allows us to find logs on many different Kafka installations without user input, and it works with both Zookeeper and KRaft setups. By using an LLM to parse the log4j.properties file, we stay more flexible, and there is a better chance that we won’t miss a new way of configuring logging in Kafka.
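A condensed sketch of this chain is shown below, assuming a hypothetical ssh_exec helper that runs a command on a remote host; the real implementation covers far more edge cases and fallbacks.

// `ssh_exec` is a hypothetical helper: run a command on a remote host, return stdout.
async fn ssh_exec(_host: &str, _cmd: &str) -> anyhow::Result<String> {
    Ok(String::new())
}

/// Condensed version of the 5-step discovery chain for a single broker.
async fn discover_log_paths(host: &str) -> anyhow::Result<Vec<String>> {
    // Step 1: locate the Kafka Java process and its command-line arguments
    //         (log4j.properties location, server.properties, ...).
    let _process_list = ssh_exec(host, "ps aux").await?;

    // Step 2: resolve the owning systemd service for the Kafka PID
    //         (systemctl status <pid>).
    // Step 3: read the unit definition: EnvironmentFile, Environment, ExecStart.
    // Step 4: fetch the log4j.properties file referenced there.
    // Step 5: hand that file to the LLM and ask for appender output locations,
    //         including whether logs also go to stdout (and hence journald).

    Ok(vec![])
}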
The scan module is not only about fetching the latest logs; it also copies over the configuration files found on each broker, as well as system information such as the number of CPUs.
AI-powered analysis engine
One of the most important features of KafkaPilot is the ability to easily add new tasks or modify existing ones; these tasks are executed with the LLM against your scanned data. For this exact purpose, a dedicated task system was designed that allows users to add new analysis capabilities without even recompiling the project.
The YAML task system is based on externally defined YAML files under the analysis_tasks directory (with a subdirectory structure for tasks targeting KRaft, Zookeeper, or both kinds of cluster setup).
Each task, created as a YAML file, mirrors the structure defined in src/analysis/task.rs.
use std::collections::HashMap;

pub struct AnalysisTask {
    pub id: String,                                 // Unique identifier
    pub name: String,                               // Human-readable name
    pub description: String,                        // What this task analyses
    pub prompt: String,                             // The actual LLM prompt with placeholders
    pub include_data: Vec<String>,                  // Data types to include (config, logs, metrics)
    pub severity_keywords: HashMap<String, String>, // Maps keywords to severity
    pub default_severity: String,                   // Fallback severity
    pub category: String,                           // Finding category
    pub enabled: bool,                              // Task on/off switch
    pub cluster_type_filter: Vec<String>,           // "kraft", "zookeeper", or both
    pub per_broker_analysis: bool,                  // Split by broker to avoid token limits
    pub max_tokens_per_request: usize,              // Default 100,000 tokens
}
The field names are mostly self-explanatory, but it all becomes clearer when we look at a simple example (taken from analysis_tasks/configuration/general/jvm_heap_size_limit.yaml:1-85).
id: jvm_heap_size_limit
name: JVM Heap Size Limit Check
description: Verifies that JVM heap memory settings are not set above 8GB to prevent performance degradation
category: configuration
prompt: |
analyse Kafka broker JVM heap memory settings to detect excessive memory allocation:
System data: {system}
Configuration data: {config}
IMPORTANT: Extract the EXACT Xms and Xmx values from each broker's Kafka process command line.
CRITICAL: Apply analysis rules CONSISTENTLY to ALL brokers. If brokers have identical configurations, they MUST receive identical evaluations.
Step 1: Extract JVM Heap Settings from Each Broker
1. Look for Kafka java processes in ps aux output or kafka_process.txt files for EACH broker
2. Find -Xms and -Xmx parameters in each java command line
………..
Each task is a rather lengthy prompt that usually takes one or more placeholders, such as {system} or {config}, which are replaced with actual data from the scanned dataset while the analysis step executes.
There are multiple options available to the developer when building such a task, starting with which pieces of data should be included in the final prompt, whether it should run against all brokers’ data together or separately for each broker, and so on.
The tasks are picked up automatically from the tasks folder and executed by the AIExecutor, which filters out tasks not suitable for the detected cluster type:
/// Check if task is compatible with current cluster type
fn is_task_compatible(&self, task: &AnalysisTask, snapshot: &Snapshot) -> bool {
// If no cluster type filter specified, task runs on all cluster types
if task.cluster_type_filter.is_empty() {
return true;
}
// Convert cluster mode to string for comparison
let cluster_mode_str = match snapshot.cluster.mode {
crate::snapshot::format::ClusterMode::Kraft => "kraft",
crate::snapshot::format::ClusterMode::Zookeeper => "zookeeper",
crate::snapshot::format::ClusterMode::Unknown => "unknown",
};
// Check if current cluster mode is in the task's filter list
task.cluster_type_filter.contains(&cluster_mode_str.to_string())
}
After ensuring the task is suitable for the current analysis, the prompt is built by replacing the placeholders with actual data and then sent to the LLM along with a system prompt such as "You are KafkaPilot, an expert Kafka administrator…"
Each task always includes instructions to return valid JSON with a findings array, which keeps the output format consistent and predictable across all tasks.
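Because every task is asked for a findings array, parsing the response can stay uniform. The sketch below shows one way to deserialize it with serde; the field names are illustrative, loosely based on what appears in the generated reports, not KafkaPilot's exact schema.

use serde::Deserialize;

#[derive(Debug, Deserialize)]
struct LlmAnalysisResponse {
    findings: Vec<Finding>,
}

// Illustrative finding shape; the real schema is defined by the task prompts.
#[derive(Debug, Deserialize)]
struct Finding {
    id: String,
    title: String,
    severity: String,
    category: String,
    description: String,
    #[serde(default)]
    remediation_steps: Vec<String>,
}

/// Parse a task's LLM response; invalid JSON becomes a per-task failure
/// instead of aborting the whole analysis run.
fn parse_findings(raw: &str) -> anyhow::Result<Vec<Finding>> {
    let parsed: LlmAnalysisResponse = serde_json::from_str(raw)?;
    Ok(parsed.findings)
}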
Handling token limits
For larger clusters, or when a significant amount of data is included through automatic placeholder replacement, it's advisable to switch to per-broker analysis rather than analyzing all broker data at once. When constructing tasks, we need to be aware of the maximum context window of the LLM we are working with. For GPT-4o it's 128k tokens, while a single broker's data is usually 10-20k tokens.
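A crude way to reason about this is to estimate the token count of the combined payload and fall back to per-broker analysis when it exceeds the per-request budget; the ~4-characters-per-token heuristic below is an assumption for illustration, not KafkaPilot's actual accounting.

/// Decide whether a task should fall back to per-broker analysis.
/// Uses a rough ~4-characters-per-token estimate (an assumption, not
/// KafkaPilot's actual token accounting).
fn should_split_per_broker(broker_payloads: &[String], max_tokens_per_request: usize) -> bool {
    let estimated_tokens: usize = broker_payloads
        .iter()
        .map(|data| data.len() / 4)
        .sum();
    // If all brokers together would exceed the per-request budget,
    // run the task once per broker instead.
    estimated_tokens > max_tokens_per_request
}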
For each analysis task, the prompt is built by replacing the placeholder data (src/analysis/executor.rs:248):
/// Build the prompt with actual data
fn build_prompt(&self, task: &AnalysisTask, snapshot: &Snapshot) -> Result<String> {
let mut prompt = task.prompt.clone();
let data_map = self.prepare_data(task, snapshot)?;
// Replace placeholders with actual data
for (key, value) in data_map {
let placeholder = format!("{{{}}}", key);
prompt = prompt.replace(&placeholder, &value);
}
Ok(prompt)
}
Key implementation insights
It's worth mentioning that all analysis tasks were created as async functions and are designed for concurrent execution. Although the MVP version runs them sequentially, parallel task execution with tokio::spawn() is possible and easy to add as a further enhancement.
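A minimal sketch of what that could look like with tokio::spawn is shown below; execute_task and TaskOutcome are hypothetical stand-ins.

// Hypothetical stand-ins for illustration.
pub struct TaskOutcome {
    pub error: Option<String>,
}
impl TaskOutcome {
    fn failed(msg: &str) -> Self {
        Self { error: Some(msg.to_string()) }
    }
}
async fn execute_task(_task: AnalysisTask) -> TaskOutcome {
    TaskOutcome { error: None }
}

/// Run every analysis task in its own tokio task and collect the outcomes.
async fn run_tasks_concurrently(tasks: Vec<AnalysisTask>) -> Vec<TaskOutcome> {
    let handles: Vec<_> = tasks
        .into_iter()
        .map(|task| tokio::spawn(async move { execute_task(task).await }))
        .collect();

    let mut outcomes = Vec::new();
    for handle in handles {
        match handle.await {
            Ok(outcome) => outcomes.push(outcome),
            // A panicking task must not bring down the rest of the analysis.
            Err(_join_error) => outcomes.push(TaskOutcome::failed("task panicked")),
        }
    }
    outcomes
}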
A single task failure doesn’t stop the analysis: if a task fails (e.g., due to an exceeded token limit), the other tasks are executed regardless. The system was designed with extensibility in mind, and anyone can drop a YAML file into the analysis_tasks folder, add new data to the prepare_data method, or implement a custom LLM provider with a new LlmService implementation.
Last but not least, LLM interactions can be easily debugged with the --llmdbg flag, which records all interactions (the request with the prompt and the corresponding LLM response) into an llmdbg.txt file.
Multi-format reporting
A simple yet readable reporting format was developed to help investigate Kafka clusters and the problems KafkaPilot finds. We group findings by severity, and at the end of the analysis step you see a report with severities, colors, a calculated health score (subject to change), and descriptions of the findings (produced by the LLM itself).
At the time of writing, only 3 report formats are available:
- terminal: colored output with severity indicators
- markdown: GitHub/GitLab-compatible reports
- json: machine-parseable for integration
Example report with a single task at the beginning included:
════════════════════════════════════════════════════════════════════════════════
KAFKAPILOT HEALTH REPORT
════════════════════════════════════════════════════════════════════════════════
📊 Cluster Information
────────────────────────────────────────
Mode: KRaft (modern, Zookeeper-free)
Timestamp: 2025-09-17 06:03:33 UTC
Tool Version: 0.1.0
📈 Analysis Summary
────────────────────────────────────────
Total Findings: 14
🔴 Critical: 3
🟡 Medium: 4
ℹ️ Info: 7
Health Score: 0/100
🔍 Findings
════════════════════════════════════════════════════════════════════════════════
ℹ️ Finding #1: JVM Heap Memory Preallocation Check - Report 1
Severity: Info
Category: Configuration
ID: jvm_heap_preallocation-001
Description:
All 6 brokers have properly preallocated JVM heap memory (Xms=Xmx=4G). This ensures optimal performance by avoiding heap expansion delays.
Impact:
Impact not specified
Remediation Steps:
1. No remediation provided
Risk Level: Medium | Downtime Required: No
────────────────────────────────────────────────────────────────────────────────
For each task executed and included in the final report, you can see the title, severity, category, ID, and description. When playing around with scanned data, or directly with your cluster and then scanning the data again, you can choose to execute only particular analysis tasks if needed; more on that in the “Real-world usage patterns” section.
Markdown reports are similar to terminal ones but are more sophisticated, with a table of contents included and a more structured summary at the top.
Real-world usage patterns
As mentioned earlier, KafkaPilot is a two-stage tool. The first stage is data collection, where we obtain the current cluster state in a non-invasive manner: we don’t make any changes to the cluster, we retrieve the most important pieces of data (especially from the brokers), and we prepare them for offline analysis.
The second stage, analysis, is AI-powered issue detection: it executes multiple YAML-defined checks with the gathered data as additional context. It can be run multiple times and produce reports in different formats.
Depending on how your SSH agent is configured, you can run KafkaPilot either directly from your machine (SSH into your bastion host and use agent forwarding to reach your brokers) or directly from the bastion host itself, discovering and SSHing into your brokers from there.
# 1. Scan a cluster (data collection)
cargo run --bin kafkapilot -- scan --bastion kafka-poligon --broker kafka-broker-1.internal:9092 --output test-scan
The analysis step can then be executed against the collected data with the analyse command:
# 2. analyse the collected data with AI
cargo run --bin kafkapilot -- analyse ./test-scan --report terminal
Please note that for both of these steps, you need to have your OpenAI API key configured in the .env configuration file.
You can find more up-to-date information in the project’s README.md file or in the project documentation on the KafkaPilot GitHub website.
Lessons learned and future directions
Building the first version of KafkaPilot was a valuable experience. It turns out that the scope this tool could potentially cover is vast and complex. The first version is essentially an MVP to explore the possibilities, identify the challenges we would need to overcome, and gauge industry interest in such a utility.
For those very reasons, we decided to make it an SSH-based CLI tool, which provides natural security boundaries, avoids the extra complexity of installable software on your brokers, and is a kind of “zero footprint” utility, enabling non-invasive diagnostic runs on production systems where even observability tools can become sources of instability, or at least of the constant fear of breaking something that already works.
We hope that these features will attract a larger user base and that users will share their challenging scenarios, enabling the creation of sophisticated LLM checks.
We also hope that the YAML-based analysis system will bring the process of creating checks closer to Kafka operators, allowing them to easily add new checks or extend existing ones.
Challenges encountered
The biggest challenge was LLM output variability, even when working with the same model. The prompts had to be tested multiple times and modified along the way to make sure the LLM responded the way we expected, given the state of the cluster in the scanned data directory.
This issue cannot be easily overcome: new models need to be tested, and every new check we introduce requires extensive testing. It helps that the scanning and analysis steps are completely separate, so testing against already scanned data is much easier, but it still adds significant time to the overall cost of KafkaPilot development.
Another LLM-related aspect is the token limit and how quickly we can hit it, especially on larger clusters. Some checks cannot be performed as a single request to the LLM API covering all brokers and need to be executed per broker instead. This adds time to the analysis step, but also gives us more consistent results. KafkaPilot’s token management needs to be extended, together with the prompt/context-building functionality, to make requests smaller and provide only as much data as a particular check really needs.
Last but not least, the choice of SSH for the scanning functionality, while providing universality, adds operational complexity. SSH agent forwarding needs to be set up correctly, which requires Kafka administrators to manage additional SSH keys and can lead to tricky debugging situations, since error messages from nested SSH sessions are notoriously cryptic.
Future enhancements
It's worth mentioning that analysing Kafka clusters, especially once you have seen a few of them, can become quite complex. Every cluster is different, and every Kafka administrator has their favourite tools to manage it. Not all the commands KafkaPilot currently uses will be available on every cluster it is run on, so errors and missing-data scenarios are expected.
The tool is in its early stages of development, and the most crucial factor for it to become usable by a greater number of people is input from the community.
Besides making the tool more flexible and able to run on a larger variety of Kafka clusters, there are features still on our roadmap, mainly supporting other LLM models (including local ones), adding parallelization to the analysis tasks, and enhancing Zookeeper data collection, to name a few.
See the GitHub project for more details. The project is open source under the Apache 2.0 license, and you are more than welcome to contribute and share the scenarios you would like to include in future releases of KafkaPilot.