5 Common Kafka Admin Problems (And How KCPilot Detects Them)


If you have ever been responsible for a Kafka cluster running in production, one that is not trivial and handles thousands or maybe millions of messages daily, you have probably had many sleepless nights wondering how big a mess you would need to untangle the next day at work.

Not to mention the situations where you were simply on call and, thanks to a proper monitoring setup, had to solve those problems straight away, that very night...

At SoftwareMill, we manage multiple Kafka clusters for our clients, often with on-call duties, and, needless to say, it’s not always a pleasurable experience. Kafka is powerful but also really complex.

Simple configuration mistakes can have serious consequences. When we inherit existing setups and try to fix the basics that were done wrong from the beginning, there is a period of time when we need to give 110% of our experience and Developer/DevOps brains to make things work as they should.

Kafka has been around for over 10 years and offers a plethora of tools to make life easier for developers and anyone else setting up clusters. However, even with good monitoring, you still only see symptoms, not root causes.

KCPilot explained: How can our tool help you?

Being firm believers in the motto “work smarter, not harder”, we try to leverage automation to reduce manual effort. For this very purpose, KCPilot was created: a tool that, although still in its early development stage, promises to speed up checking existing Kafka cluster setups by identifying common problems that require your attention.

In short, KCPilot is basically a set of tasks that, with the help of an LLM, analyzes the data associated with a running cluster to find common mistakes people make when setting it up. The tool first scans your cluster configuration, logs, and metrics (some of them also with the help of an LLM) and then executes predefined tasks to search for potential issues.

Retrieving the data to be analyzed is a complex task on its own, mainly because Kafka deployments can differ significantly in their architecture, the way brokers are started, and how logs and metrics are handled. There is no single standard or unified approach, and every production setup tends to do things slightly differently.

On top of that, Kafka itself doesn’t provide any simple command or binary answer that would directly tell you whether the cluster is healthy or not. A good example of this complexity is logs, which require a multi-step process to locate and collect from the right locations (more on that later).


Besides logs, we also try to get the system info, environment variables, and other information from each broker and store that data for use in the analysis step. Gathering this information is what anyone analyzing an existing Kafka cluster would normally do by hand so that it can be evaluated and corrections (if needed) can be made. The tool significantly speeds up this process and continues to improve with each iteration, constantly increasing the level of automation and reducing the amount of manual work required.

You can read a lot more about KCPilot, its scanning process, and its available analysis tasks by checking out the project itself (it’s open source) or the project website.

At the time of writing, KCPilot has 17 different tasks it can execute against collected data to identify potential issues with your Kafka cluster setup. Not all of them are executed against every Kafka cluster, as some apply only to the KRaft or ZooKeeper variants. I have picked the 5 that seem the most common and the easiest to fix for anyone responsible for cluster health.

Of course, whenever you find yourself in a situation where you’re unsure about what to do, there is also a fallback option: write to us, and we will be more than happy to help you resolve your issues.

Now, let’s dive into the nitty-gritty details of my 5 favorites.

Problem number 1: Suboptimal thread configuration

One of the most subtle and technically nuanced configuration issues is the number of threads set for different aspects of running a Kafka cluster. Default values are often inadequate for production workloads, as they don’t reflect the infrastructure the cluster is running on. These settings affect three critical thread pools: network, I/O, and replica fetchers.

For network threads, too few lead to request queuing, whereas too many add unnecessary context-switching overhead.

I/O threads, on the other hand, need to match the storage characteristics of the underlying infrastructure (whether it is HDD, SSD, or NVMe). Having too many threads for slower storage can actually lead to inconsistent Kafka latency, as excessive parallelism may overwhelm the I/O subsystem instead of improving throughput.

Replica fetchers are also an important setting to be aware of, as they must scale with the partition count, or replication lag can occur.

KCPilot tries to detect suboptimal configurations in this space by scanning and analyzing the server.properties file for thread configuration parameters. It extracts partition counts from cluster metadata, detects the storage type from system information, and applies rules, described in a YAML file, based on storage characteristics, like in the example below:

  • HDD: 2-6 I/O threads recommendation
  • SSD: 2-12 I/O threads recommendation
  • NVMe: 2-32 I/O threads recommendation

These ranges apply to the I/O thread configuration. Replica fetchers should be set at roughly 1 per 500 partitions. Brokers with suboptimal configurations are reported during the analysis stage in the final report.
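To make this concrete, here is a hedged sketch of what tuned values might look like for a broker on SSD storage hosting around 3,000 partitions; the numbers are illustrative assumptions, not output from KCPilot:

# Illustrative server.properties values for a broker on SSD storage
# with ~3,000 partitions: I/O threads within the 2-12 SSD range above,
# and roughly 1 replica fetcher per 500 partitions.
num.network.threads=8
num.io.threads=8
num.replica.fetchers=6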

KCPilot output example

WARNING: 6 out of 6 brokers have suboptimal thread configurations

- broker_11: I/O threads=12 with unknown storage (exceeds conservative limit)

- broker_12: I/O threads=12 with unknown storage (exceeds conservative limit)

[... all 6 brokers listed with specific issues...]

Recommendation: Adjust I/O threads based on storage type (HDD: 2-6,

SSD: 2-12, NVMe: 2-32). Monitor CPU usage and request latency after changes.

When the storage type cannot be automatically determined, KCPilot applies conservative defaults to minimize the risk of performance degradation.

Problem number 2: Missing in-transit encryption

Surprisingly common in clusters that evolved from development to production: all traffic gets transmitted in plain text (credentials, message payloads, metadata). The problem is quite easy to spot (relative to other checks we have or could create), but it quietly accumulates as security debt until a compliance audit forces action.

Many compliance requirements mandate encryption in transit; without it, this security hole can be exploited through network-level eavesdropping or man-in-the-middle attacks. Encryption is essential for systems dealing with financial transactions or other business-critical events, not to mention that security incidents erode customer confidence regardless of the type of business you run.

KCPilot parses listener.security.protocol.map from server.properties, checks for SSL, SASL_SSL, or TLS protocol configuration, examines the TLS version used, and flags PLAINTEXT-only communication with CRITICAL severity.

KCPilot output example


CRITICAL: 6 out of 6 brokers have dangerous configuration without

in-transit encryption

Found PLAINTEXT-only communication. All Kafka traffic is transmitted

unencrypted, vulnerable to eavesdropping, data interception, and

man-in-the-middle attacks.

Brokers analyzed:

- broker_0: security_protocols=["PLAINTEXT"], ssl_protocol=null

- broker_1: security_protocols=["PLAINTEXT"], ssl_protocol=null

[... all brokers listed...]

This is a very simple but important check to perform on every cluster. It’s only a first line of defense, not a comprehensive security assessment, but it’s worth running on every cluster you manage.
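If the check fires, the remedy is standard Kafka TLS configuration. A minimal sketch, assuming placeholder paths, hostnames, and passwords; a real rollout usually keeps the old listener running in parallel while clients migrate:

# Sketch of a TLS listener setup; paths, host, and passwords are placeholders.
listeners=SSL://broker-0.internal:9093
security.inter.broker.protocol=SSL
ssl.keystore.location=/etc/kafka/ssl/broker.keystore.jks
ssl.keystore.password=<keystore-password>
ssl.truststore.location=/etc/kafka/ssl/broker.truststore.jks
ssl.truststore.password=<truststore-password>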

Problem number 3: Insufficient replication margin

Replication margin is a term every Kafka cluster administrator should know and understand, but for some reason, when it comes to real-life cluster setups, the reality can surprise us.

An insufficient replication margin is a hidden availability time bomb and should be addressed in every Kafka cluster; after all, Kafka was most probably the technology of choice precisely to provide high availability. Watch out for the default.replication.factor and min.insync.replicas settings. Without proper values in this space, a single broker failure can push a cluster into read-only mode or worse.

The margin is calculated by subtracting the value of min.insync.replicas from the replication factor:

Margin = replication factor - min.insync.replicas

As high availability is Kafka’s core value proposition, production clusters need a margin >= 2 for safety. A margin of 1 means that only one broker can fail before writes stop, and a margin of 0 or negative means that you are in a very dangerous state.
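A quick worked example with illustrative values: with a replication factor of 3 and min.insync.replicas=2, the margin is 3 - 2 = 1, so after a single broker failure, producers using acks=all are one more failure away from an outage; a replication factor of 4 with the same min.insync.replicas restores a margin of 2.

# Illustrative server.properties values and the resulting margins:
default.replication.factor=3
min.insync.replicas=2    # margin = 3 - 2 = 1: one broker failure from a write stop
# default.replication.factor=4 with min.insync.replicas=2 gives a margin of 2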

Problem number 4: JVM heap size misconfiguration

I have included heap size misconfiguration because it’s a configuration option many people get wrong in an interesting way. It turns out that more memory isn’t always better for Kafka: allocating 16GB, 32GB, or even more heap to brokers can degrade the performance of your Kafka cluster.

Of course, heap memory is important and should be set deliberately, but Kafka’s architecture relies heavily on the OS page cache rather than the heap.

Large heaps typically mean longer garbage collection pauses, which in turn block broker threads and can cause request timeouts on your cluster. All the memory you allocate to the heap is no longer available for the page cache, so divide your memory wisely. Whenever you experience periodic latency spikes every few minutes, it could be a sign that your memory settings are not defined correctly.
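For illustration, a minimal sketch assuming a broker machine with 32GB of RAM (the exact split is an assumption, not a KCPilot recommendation): a fixed 6GB heap set through KAFKA_HEAP_OPTS, the environment variable read by Kafka’s standard start scripts, leaves roughly 26GB for the OS page cache:

# In kafka.env (or a systemd Environment= directive): a fixed 6GB heap
# on a 32GB machine leaves ~26GB for the OS page cache.
KAFKA_HEAP_OPTS="-Xms6G -Xmx6G"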

KCPilot scans process information with ps aux, extracts the -Xms and -Xmx JVM parameters, and parses the memory values applied when the Kafka process was started. Brokers with heap over 8GB are flagged with WARNING severity.
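That extraction step is simple in principle; here is a hypothetical Rust helper showing the idea, under the assumption that we already hold the broker’s command line (this is not KCPilot’s actual code):

// Hypothetical: pull the raw -Xmx value (e.g. "16G") from a java command line.
fn extract_xmx(command_line: &str) -> Option<&str> {
    command_line
        .split_whitespace()
        .find_map(|arg| arg.strip_prefix("-Xmx"))
}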

KCPilot output example


WARNING: 2 out of 6 brokers have excessive JVM heap memory allocation

above 8GB limit

Brokers with issues:

- broker_1: Xms=12G, Xmx=16G

- broker_3: Xms=10G, Xmx=10G

Excessive heap sizes can cause performance degradation due to longer

GC pauses.

Problem number 5: Hidden production errors in logs

This task is just one example of how logs can be utilized to find a plethora of problems on a Kafka cluster.

Logs can back many different checks we define as tasks. For now, we simply find error logs and report them to the user whenever any appear. The problem with logs on a Kafka cluster is that they can be scattered across multiple locations and formats, and manual log review is time-consuming and error-prone. If you are looking for specific errors, you can easily add new tasks to the set and detect them during each scan-and-analyze cycle.

Log discovery, as implemented in KCPilot, uses multiple steps to make sure we get the right logs from the cluster for further analysis.

Log discovery process:

  1. Get the PID of the running Kafka process using the commonly available ps aux
  2. Extract .properties and .conf files from the command arguments if possible
  3. Extract the -Dlog4j.configuration= or -Dlog4j2.configurationFile= JVM arguments (see the sketch after this list)

Example:

KafkaProcessInfo {
pid: 1234,
command_line: "java -Xms1G -Xmx1G -Dlog4j.configuration=/opt/kafka/config/log4j.properties kafka.Kafka /opt/kafka/config/server.properties",
config_paths: ["/opt/kafka/config/server.properties"],
log4j_path: Some("/opt/kafka/config/log4j.properties")
}
  4. Get the systemd service name managing this Kafka process with systemctl status

Example:

SystemdServiceInfo {
service_name: "kafka.service"
}
  5. Extract additional data with systemctl cat kafka.service
  6. Get the /etc/kafka/kafka.env or Environment= directives

Example:

SystemdServiceInfo {
service_name: "kafka.service",
environment_file: Some("/etc/kafka/kafka.env"),
environment_vars: {
"KAFKA_HEAP_OPTS": "-Xmx1G -Xms1G",
"KAFKA_LOG_DIR": "/var/log/kafka",
"LOG_DIR": "/data/kafka-logs"
}
}
  7. Use an LLM to extract log information from the log4j.properties file
  8. Fetch the logs for further analysis
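For a flavor of what step 3 involves, here is a minimal Rust sketch that pulls the log4j configuration path out of a broker’s command line. The helper is hypothetical and only mirrors the KafkaProcessInfo example above; it is not KCPilot’s actual implementation:

// Hypothetical helper for step 3: extract the log4j config path from the
// broker's command line (not KCPilot's actual code).
fn extract_log4j_path(command_line: &str) -> Option<String> {
    command_line.split_whitespace().find_map(|arg| {
        arg.strip_prefix("-Dlog4j.configuration=")
            .or_else(|| arg.strip_prefix("-Dlog4j2.configurationFile="))
            // log4j URIs often carry a file: prefix; strip it to keep a plain path
            .map(|value| value.trim_start_matches("file:").to_string())
    })
}

Run against the command_line from the first example above, it returns Some("/opt/kafka/config/log4j.properties").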

Different logging destinations (files, stdout) complicate monitoring, but early error detection prevents outages, and some of these errors may indicate broker instability before a crash.

KCPilot output example

WARNING: This broker shows error conditions in logs from the last 24 hours

Error summary:

- 5 ERROR messages

- 1 FATAL message

- 3 WARN messages

Categories: network_connectivity, replication

Sample errors:

- ERROR [ReplicaFetcher replicaId=1] Error for partition topic-1-0

- FATAL [KafkaServer id=1] Fatal error during startup

- WARN [NetworkClient] Connection to node 11 could not be established

How KCPilot works (architecture overview)

KCPilot is a CLI tool that lets you gather and analyze your cluster data. For this purpose, you have two main commands available: one for scanning the cluster and the other for analyzing it.


The scanning process is SSH-based data collection: at the time of this writing, you need SSH connectivity to your brokers and/or a bastion host of the cluster. The data is collected into structured, timestamped directories, and the process should work with most Kafka installations as long as you have SSH access.

The analysis step is AI-powered problem detection: it loads the collected data from the scan directory and executes tasks defined as YAML files in the analysis_tasks folder (currently 17+ built-in tasks).

Each task is a YAML file with a structured prompt for LLM analysis, additional context in the form of injection points from the scanned data, and mappings of problems to the severities that should be reported in the output. The task is sent to the LLM together with specific pieces of data from the scan directory, and the responses are gathered to produce the final report.
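To give a rough idea of the shape, here is a hypothetical task sketch; the field names are illustrative assumptions, and the real schema lives in the project’s analysis_tasks folder:

# Hypothetical analysis task; field names are illustrative, not KCPilot's schema.
name: jvm-heap-size
prompt: |
  Review the JVM heap settings below and flag any broker whose -Xmx
  exceeds 8G, since oversized heaps lengthen GC pauses.
context:
  - process_info    # injection point filled from the scan directory
severity:
  heap_over_8g: WARNING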

Users can specify the format in which the report should be generated: a color-coded, human-readable summary displayed directly in the terminal, a detailed markdown report, or plain JSON for automations or other dashboards if needed.

What is important to note is that KCPilot does not modify the cluster in any way; its data collection is read-only. Once the data is gathered during the scan process, it can be analyzed multiple times.

KCPilot allows easy addition of new tasks as YAML files, so everyone with their own experience of managing Kafka clusters can add new checks easily and not forget them when analyzing new clusters in the future.

For the exact details of how KCPilot works under the hood, you can read an earlier article on this topic, visit the project’s GitHub Pages, or check out the project itself.

Conclusion

We have covered five common Kafka admin problems that KCPilot detects automatically, but the project (at the time of writing) has 17 of them.

Most of the problems covered here, and those found by KCPilot’s other tasks, share common characteristics: they are easy to misconfigure, sometimes hard to diagnose manually, can have a serious production impact, and are preventable with proactive detection and just a small bit of work.

Instead of manually SSHing into each broker, looking for configuration files, and parsing them one by one, risking overlooked subtle issues and hours spent digging through hard-to-read formats, you can just execute two KCPilot commands and wait for the tool to do the job for you.

We are working hard to make it work with any Kafka deployment, and hopefully, with some help from the community, we will grow the number of tasks we can execute against any Kafka cluster out there.

Beyond the 5 problems above, we cover others like:

  • ZooKeeper and KRaft configurations
  • High availability checks
  • JVM memory optimizations
  • Storage configuration
  • Performance tuning
  • Security hardening

Use KCPilot to automate the work you usually do on Kafka clusters: try it, read about it on the project website, and contribute your own analysis tasks as easy-to-write, easy-to-extend YAML files.
