Understanding in-sync replicas and replication factor in Confluent Stretched Cluster 2.5

In multi-datacenter Kafka deployments, ensuring data durability and availability under failure conditions requires a careful balance between consistency and performance.
The Confluent Stretched Cluster 2.5 architecture introduces mechanisms such as observers and automatic observer promotion (AOP) to provide cross-datacenter resilience while minimizing operational complexity.
This post explains the roles of the replication factor (RF) and min.insync.replicas (min.ISR) in a Confluent Platform stretched cluster, describing how they affect availability and how Confluent’s implementation differs from open-source Apache Kafka.
What Is a Stretched Cluster 2.5?
A stretched cluster 2.5 is a Kafka deployment where brokers are installed in two fully operational data centers (DCs), while a third, lightweight site hosts only metadata services – historically ZooKeeper or, in modern Confluent Platform versions, a KRaft controller quorum.
A broker hosts partition replicas; each replica assumes one of three roles:
- Leader – handles client requests (producers/consumers) for a given partition.
- Follower – maintains a synchronous replica of the partition.
- Observer – maintains an asynchronous replica, which does not participate in the ISR by default (Confluent Blog).
The ISR (In-Sync Replicas) list includes brokers fully caught up with the leader.
The configuration min.insync.replicas defines how many ISR members must acknowledge a write (with acks=all) for it to be considered committed.
To use observers in a stretched cluster, you specify confluent.placement.constraints when creating a topic; the placement policy determines how many replicas are created, where each one is placed, and whether it acts as a synchronous replica or an observer.
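A minimal sketch of what this looks like in practice, assuming a Confluent Platform installation whose kafka-topics tool supports the --replica-placement option (the topic name, broker address, partition count, and file path are placeholders):

# Create a topic governed by a replica placement policy saved to a JSON file
# (for example, one of the policies shown in the scenarios below).
kafka-topics --bootstrap-server broker-dc1:9092 \
  --create --topic orders \
  --partitions 6 \
  --replica-placement /tmp/placement.json \
  --config min.insync.replicas=2

# Produce with acks=all so a write is committed only after min.ISR members acknowledge it.
kafka-console-producer --bootstrap-server broker-dc1:9092 \
  --topic orders \
  --producer-property acks=all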
Configuration scenarios
Scenario 1 – RF = 4, min.ISR = 2
Let’s assume we want four replicas of every partition (RF = 4), two per DC. With min.ISR = 2, a write is considered successful once two ISR members acknowledge it, even if both acknowledgments come from brokers in the same DC.
To ensure data is acknowledged across both DCs, the topic can be configured as follows:
- DC 1: Leader + Observer
- DC 2: Follower + Observer
The replica placement policy looks as follows:
{
  "version": 2,
  "replicas": [
    {
      "count": 1,
      "constraints": {
        "rack": "dc1"
      }
    },
    {
      "count": 1,
      "constraints": {
        "rack": "dc2"
      }
    }
  ],
  "observers": [
    {
      "count": 1,
      "constraints": {
        "rack": "dc1"
      }
    },
    {
      "count": 1,
      "constraints": {
        "rack": "dc2"
      }
    }
  ],
  "observerPromotionPolicy": "under-min-isr"
}

If the DC hosting the leader fails, Automatic Observer Promotion (AOP) can promote a caught-up observer to the ISR and elect it as a new leader. Promotion timing depends on replication lag and network health; it is not bound to a fixed duration. During this transition, producers with acks=all cannot send new messages.
This configuration is also sensitive to overlapping failures (for example, one DC outage combined with a broker restart in the surviving DC).
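As a hedged operational aside, the standard kafka-topics tool can surface partitions in this degraded state, which is useful while a DC is down or an observer promotion is in progress (the broker address is a placeholder):

# Partitions whose ISR size is below min.insync.replicas; acks=all producers are blocked here.
kafka-topics --bootstrap-server broker-dc2:9092 --describe --under-min-isr-partitions

# Partitions that have lost a replica but still satisfy min.insync.replicas.
kafka-topics --bootstrap-server broker-dc2:9092 --describe --under-replicated-partitions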
Scenario 2 – RF = 4, min.ISR = 3
This scenario assumes the same replication factor as Scenario 1, but with min.ISR = 3. In this configuration all four replicas are synchronous; no observers are used.
The replica placement policy looks as follows:
{
  "version": 2,
  "replicas": [
    {
      "count": 2,
      "constraints": {
        "rack": "dc1"
      }
    },
    {
      "count": 2,
      "constraints": {
        "rack": "dc2"
      }
    }
  ]
}

This means:
- Writes require acknowledgement from three brokers.
- As long as at least three brokers are healthy, writes proceed normally.
- If an entire DC fails, only two ISR members remain, and producers with acks=all will block until either the failed DC returns or min.ISR is temporarily lowered to 2 (an example of this follows below).
This setup prioritizes consistency but reduces availability during full DC failures.
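If availability must be restored before the failed DC returns, min.ISR can be lowered temporarily and raised again once replication has caught up. A minimal sketch with kafka-configs (topic name and broker address are placeholders):

# During the outage: allow writes once two in-sync replicas have acknowledged them.
kafka-configs --bootstrap-server broker-dc1:9092 --alter \
  --entity-type topics --entity-name orders \
  --add-config min.insync.replicas=2

# After the failed DC is back and its replicas have caught up: restore the stricter setting.
kafka-configs --bootstrap-server broker-dc1:9092 --alter \
  --entity-type topics --entity-name orders \
  --add-config min.insync.replicas=3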
Scenario 3 – RF = 6, min.ISR = 3
With RF=6, each DC hosts two synchronous replicas and one observer:
- DC 1: Leader, Follower, Observer
- DC 2: Two Followers, Observer
The replica placement policy looks as follows:
{
  "version": 2,
  "replicas": [
    {
      "count": 2,
      "constraints": {
        "rack": "dc1"
      }
    },
    {
      "count": 2,
      "constraints": {
        "rack": "dc2"
      }
    }
  ],
  "observers": [
    {
      "count": 1,
      "constraints": {
        "rack": "dc1"
      }
    },
    {
      "count": 1,
      "constraints": {
        "rack": "dc2"
      }
    }
  ],
  "observerPromotionPolicy": "under-min-isr"
}

This design provides strong durability and higher resilience:
- It tolerates single-broker restarts or upgrades with no downtime.
- On full DC failure, AOP promotes a caught-up observer from the surviving DC, restoring availability.
Therefore, 6 replicas with min.ISR = 3 is the most robust and operationally stable configuration for stretched clusters.
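To confirm that replicas and observers landed where the policy intended, describe the topic; on Confluent Server the output typically includes an Observers field alongside Replicas and Isr (topic name, broker address, and the sample output are illustrative):

kafka-topics --bootstrap-server broker-dc1:9092 --describe --topic orders
# Illustrative output for one partition (broker IDs depend on your cluster):
#   Topic: orders  Partition: 0  Leader: 1  Replicas: 1,2,4,5,3,6  Isr: 1,2,4,5  Observers: 3,6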
Confluent Platform vs. Open-Source Kafka
Confluent’s platform extends Kafka’s durability model with features such as asynchronous observers, stretched clusters (2.5-DC architectures), and replica placement constraints. These capabilities, fully available since Confluent Platform 6.1, enable automatic failover and stronger data consistency across regions. The Open Source Kafka model includes only leaders and synchronous followers, with a traditional replication factor.
Need assistance choosing an architecture and setting up Confluent Platform? As a certified Premium Partner, we are here to help.
Conclusion
When deploying a Confluent Stretched Cluster 2.5, the configuration of 6 replicas and min.ISR = 3 with a proper placement policy provides the best balance of availability and durability, ensuring writes are safely replicated across both DCs while maintaining business continuity during outages.
Simpler configurations like RF = 4, min.ISR = 2 might work under ideal conditions but can lead to data loss or downtime during combined broker and DC failures.
Confluent’s unique features – Observers with Stretched Cluster 2.5 – offer capabilities unavailable in open-source Kafka, delivering safer failover and stronger data consistency for mission-critical environments.
See also: KCPilot – our new tool that proactively detects and resolves common issues in Apache Kafka before they become production headaches. It covers 17 diagnostic scenarios so far – from tricky configuration mistakes to availability and performance bottlenecks. Read more.
Reviewed by: Jakub Papuga and Krzysztof Ciesielski

