What I learned about Kafka through Confluent Certified Developer for Apache Kafka (CCDAK) certification
The CCDAK certification focuses on developing Kafka applications and is aimed at developers with basic hands-on experience with Apache Kafka. In this blog post, I want to share the insights I gained while preparing for the Confluent Certified Developer for Apache Kafka certification.
Source: certificates.confluent.io
When to consider certification
The CCDAK certification is a good choice if you would like to confirm your advanced general Kafka knowledge or dive deeper into the details to fully understand its inner workings and components. One of the official prerequisites for the exam is "a minimum of 6-12 months hands-on experience using Confluent products", so it is good to have at least basic knowledge of Kafka.
Practical insights gained from preparing for the CCDAK exam
In my opinion, the CCDAK exam puts the greatest emphasis on understanding Kafka's infrastructure, its use cases, and the concept of stream processing in general. It also checks the knowledge and ability to use, in real-world scenarios, various Kafka-related products like Kafka Connect or Kafka Streams, as well as Confluent products such as the Confluent Schema Registry.
From my perspective, first of all, I deepened my knowledge of the internals and configuration of Kafka producers, brokers, and consumers. I had the opportunity to expand on topics I already knew, such as general Kafka infrastructure or Kafka Connect, but I also learned a lot about things that were barely known to me before - like developing ksqlDB applications and processing Kafka events with SQL-like queries - or concepts that were entirely new to me, such as MirrorMaker.
What I liked is the fact that to pass the CCDAK exam, you need to gain some practical knowledge about optimizing Kafka for various performance requirements, such as throughput, latency, and reliability.
My main sources of knowledge were Kafka: The Definitive Guide (available for free), the Kafka documentation, and the self-paced Confluent trainings. All of them provided a well-organized learning experience, guiding me through Kafka's concepts in a structured manner. The way the material was broken down into specific topics and components made it easy to follow.
If you are looking for some more verified resources for learning, check out this listing proposed by Michał on the SoftwareMill blog. I also recommend a great SoftwareMill Kafka Visualization tool to experiment on your own and simulate how data flows through a replicated Kafka topic.
The most interesting concepts from my CCDAK preparation
Here is a brief summary of some of the concepts that I subjectively found most interesting. It assumes that you already have a basic knowledge of Kafka. If you are not familiar with Kafka and stream processing concepts in general, you can refer to this article about Kafka use cases by Maria on the SoftwareMill blog.
What makes Kafka fast?
Kafka is optimized for high throughput. This is one of the first terms you may come across when reading about Apache Kafka. In other words, high throughput can be described as moving a large amount of data in a short time, in a near real-time manner. There are famous examples of Uber or LinkedIn handling trillions of messages per day. But what makes it so fast?
The main reason is Kafka's distributed architecture and its partitioning mechanism, which relies on splitting topics into partitions. The ability to distribute the work across multiple nodes, as each partition can be assigned to a specific broker, makes Kafka highly scalable and is a core concept behind its high throughput.
However, where I deepened my knowledge during preparation for the CCDAK certification was in understanding Kafka's efficient resource management, which is also key to its speed - especially two concepts: sequential I/O access and the zero-copy principle:
Messages from Kafka topics are divided into partitions and stored directly in the physical storage of the Kafka broker. Due to the immutability and append-only log nature of messages in Kafka, it is possible to leverage a sequential I/O access pattern that allows reading and writing data in a contiguous, ordered manner. Such operations are highly efficient and offer better performance than a random I/O access pattern.
Source: wikipedia.org
In traditional systems, data read from disk is copied into application buffers (and then into socket buffers) before being sent to clients. Thanks to the same message format being used across producer, broker, and consumer, Kafka can leverage the zero-copy optimization. It improves the efficiency of transferring data by avoiding unnecessary memory copies: the data is sent directly from the filesystem (page cache) to the socket buffer, from where it can be transmitted to the consumer clients.
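On the JVM, this optimization corresponds to the `FileChannel.transferTo` call (backed by `sendfile` on Linux), which is the mechanism Kafka relies on when serving fetch requests. Below is a minimal, Kafka-independent sketch of the idea; the file path and socket address are made-up assumptions:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.FileChannel;
import java.nio.channels.SocketChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ZeroCopySketch {
    public static void main(String[] args) throws IOException {
        // Hypothetical log segment file and client address, just to show the API shape.
        try (FileChannel log = FileChannel.open(Path.of("/tmp/segment.log"), StandardOpenOption.READ);
             SocketChannel client = SocketChannel.open(new InetSocketAddress("localhost", 9999))) {

            long position = 0;
            long remaining = log.size();
            while (remaining > 0) {
                // transferTo hands the bytes from the page cache straight to the socket,
                // skipping the copy into a user-space buffer (sendfile on Linux).
                long sent = log.transferTo(position, remaining, client);
                position += sent;
                remaining -= sent;
            }
        }
    }
}
```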
Why is replication critical in Kafka's architecture?
One of the core features of Kafka is its replication mechanism. Each Kafka topic partition should have replicas - exact copies - which are distributed among different brokers to ensure resilience. One of the replicas is called the leader, and it handles all writes and reads. The other replicas' only job is to stay up to date and replicate all messages by pulling them from the leader - this is called being in-sync with the leader.
Although partition leader election and the in-sync replicas mechanism are handled behind the scenes on the server side, they are crucial topics for deeply understanding Kafka, as they are the heart of its fault tolerance. I think it is one of the areas where I advanced my knowledge most during my CCDAK preparation, with emphasis on different perspectives:
Broker perspective
When a broker hosting the leader replica goes down, Kafka can automatically elect a new leader from all the in-sync replicas. Interestingly, Kafka includes a preferred leader election mechanism (the preferred leader is the first replica that was the leader when the topic was originally created). Once our broker is back online, Kafka can automatically reassign partition leadership back to the original leader.
What happens if there is no in-sync replica during the new leader election? First of all, we can protect ourselves from such a situation by setting the min.insync.replicas config parameter. However, we risk availability that way, as Kafka won't accept new messages when there are not enough in-sync replicas.
Without specifying min.insync.replicas, we have 2 options: first, wait for the original leader to come back online; second, allow one of the out-of-sync replicas to become the new leader, although this leads to losing all messages that were written to the old leader while that replica was out of sync. This behavior can be controlled by the unclean.leader.election.enable parameter.
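Both settings can be applied per topic. Here is a minimal sketch using the Java AdminClient; the topic name, partition/replica counts, and broker address are just illustrative assumptions:

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.concurrent.ExecutionException;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class DurableTopicSketch {
    public static void main(String[] args) throws ExecutionException, InterruptedException {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // 3 replicas; writes (with acks=all) require at least 2 in-sync replicas,
            // and out-of-sync replicas are never allowed to become the leader.
            NewTopic topic = new NewTopic("payments", 6, (short) 3)
                    .configs(Map.of(
                            "min.insync.replicas", "2",
                            "unclean.leader.election.enable", "false"));
            admin.createTopics(List.of(topic)).all().get();
        }
    }
}
```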
Producer perspective
We can specify, in the producer configuration, how many brokers (hosting partition replicas) need to acknowledge receiving a message before the producer considers it successfully written - this is the acks parameter. Using acks=all ensures that all in-sync replicas have acknowledged the message, which results in strong durability but some performance trade-offs. Alternatively, acks=0 means the producer works in fire-and-forget mode. For a balance between those two, use acks=1, which means only the leader acknowledges the message. However, the leader can still crash before the replicas copy the message, so there is also a potential for data loss.
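As a small illustration, here is a producer configured with acks=all; the topic name, key/value types, and broker address are made-up examples:

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class AcksAllProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // "all": wait until every in-sync replica has the record (strongest durability);
        // "1" would wait only for the leader, "0" would not wait at all.
        props.put(ProducerConfig.ACKS_CONFIG, "all");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("payments", "order-42", "created"),
                    (metadata, exception) -> {
                        if (exception != null) {
                            exception.printStackTrace();   // e.g. not enough in-sync replicas, timeouts
                        } else {
                            System.out.printf("written to partition %d at offset %d%n",
                                    metadata.partition(), metadata.offset());
                        }
                    });
        }
    }
}
```

With acks=all, the broker-side min.insync.replicas setting mentioned earlier decides how many in-sync replicas must confirm the write before it is accepted.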
Consumer perspective
The consumers can read data only from the leader replica, and only the messages below the high-water mark - the offset of the last message that has been replicated to all in-sync replicas. That effectively means that consumers cannot read data that is not yet safely replicated.
Source: Kafka The Definitive Guide
What makes Kafka scalable?
Kafka is designed to be scalable. The CCDAK certification puts special emphasis on understanding the main reason behind this - the topic partitioning mechanism: despite being part of a single topic, each topic partition can be hosted on a different broker, and partitions can be evenly distributed among the brokers. We can add or remove a broker (scale out or in) to, e.g., increase capacity, and Kafka can redistribute partitions among all brokers in the cluster. This allows balancing the load across the entire cluster.
The more partitions a topic has, the more consumers can read data from it in parallel. Kafka consumers are organized into consumer groups, which consume data independently from each other. Each partition can be consumed by only one consumer within a consumer group. The main way we scale the consumption of data from Kafka topics is by adding or removing consumers from a consumer group. When the number of consumers in the group changes, Kafka triggers a rebalance by moving ownership of partitions from one consumer to another, ensuring an even load and continuous processing.
Source: Kafka The Definitive Guide
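In code, scaling consumption boils down to starting more instances of the same consumer with a shared group.id. A minimal sketch (the topic, group name, and broker address are assumptions):

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class GroupConsumerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        // Every instance started with the same group.id joins the same consumer group;
        // Kafka assigns each instance a disjoint subset of the topic's partitions.
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "payments-processors");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("payments"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}
```

Starting a second copy of this program with the same group.id makes Kafka rebalance the topic's partitions between the two instances.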
How does Kafka Streams work under the hood?
Kafka provides a stream-processing library for building applications whose input and output data are stored in Kafka clusters. Such an app is a separate JVM instance, so it doesn't run inside a broker. It interacts with Kafka by consuming data and writing the results back to the broker, but the data is processed locally on the client side. The broker also handles rebalancing and scaling when a new Kafka Streams app instance joins or leaves.
During my preparation for the CCDAK certification, I learned how Kafka Streams handles stream processing under the hood by breaking the stream down into smaller units of execution called stream tasks, which are created based on the number of partitions in the input topic(s). Each task is responsible for processing data from one or more Kafka topic partitions and can be executed on a separate thread. So the maximum number of active tasks maps to the number of partitions of the input topic - this property effectively defines the maximum parallelism of a Kafka Streams application.
What I find particularly interesting is how Kafka Streams handles failures in the case of operations like aggregations. These operations are stateful, as they need to maintain and update a state based on processed messages. Data is processed by tasks locally, so Kafka Streams keeps a local state store with the help of an embedded RocksDB. Each state store is backed up on the Kafka brokers in a compacted changelog topic. Thanks to that, in case of failure, Kafka Streams can recover its state and continue operations on a different node by simply reading the changelog from the beginning. A full restore can take time, so Kafka Streams also offers standby tasks, which keep the state in sync with the active tasks, offering an almost immediate restore.
Source: developer.confluent.io/courses/kafka-streams
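To tie the above together, here is a minimal Kafka Streams sketch with a stateful count backed by a named state store, plus one standby replica per store for faster failover; the topic names and application id are made up:

```java
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.Produced;

public class OrderCountStreamsSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "order-counts");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        // Keep one warm copy of each state store on another instance for fast failover.
        props.put(StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG, 1);

        StreamsBuilder builder = new StreamsBuilder();
        builder.stream("orders", Consumed.with(Serdes.String(), Serdes.String()))
               // Stateful step: counts live in a local RocksDB store named "order-counts-store",
               // backed by a compacted changelog topic on the brokers.
               .groupByKey(Grouped.with(Serdes.String(), Serdes.String()))
               .count(Materialized.as("order-counts-store"))
               .toStream()
               .to("order-counts", Produced.with(Serdes.String(), Serdes.Long()));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```

Running a second instance of this application with the same application.id splits the stream tasks between the instances, while the standby replica keeps a warm copy of each store on another instance.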
Summarizing
During my preparation for the CCDAK, I had a bit more experience than the recommended 6-12 months of hands-on work with Kafka. Even so, I discovered many areas where my understanding deepened - some concepts were explained more clearly, others were entirely new to me, and I also picked up important details I had previously overlooked in familiar topics.
Overall, I'm satisfied with the structured learning process, which helped me obtain the certification, and I recommend CCDAK as a nice way to either expand or validate practical knowledge of Apache Kafka.
SoftwareMill offers Kafka architecture consulting, development, deployment, and operation services in various environments (from bare metal through K8S to the cloud). If you would like to use Kafka, are already using Kafka but would like to scale your deployment, or are wondering how to best architect the data flow in your system, it would be great to have you as our client - write to us!
Reviewed by Michał Matłoka, Krzysztof Ciesielski