Circuit Breaker Pattern with Kafka: A Complete Guide
Part 3: Challenges, Edge Cases, and Alternatives
Series Navigation:
- Part 1: Understanding the Fundamentals
- Part 2: Implementation and Real-World Scenarios
- Part 3: Challenges, Edge Cases, and Alternatives (You are here)
- Part 4: Configuration, Testing, and Best Practices
The Challenges You'll Face
While the circuit breaker + Kafka pause pattern is powerful, it comes with real challenges. Understanding these upfront will save you from production surprises.
Challenge 1: Consumer Group Rebalancing
When you pause a Kafka consumer, you're interacting with Kafka's consumer group protocol. This can cause unexpected behavior.
The Problem
Timeline:
──────────────────────────────────────────────────────────▶
T0: Consumer A owns partitions 0,1,2
    Consumer B owns partitions 3,4,5
    Circuit is CLOSED
T1: Downstream fails, circuit OPENS
    Consumer A calls pause()
    Consumer B calls pause()
T2: Consumer A's pause takes longer
    Kafka thinks Consumer A is dead
    Initiates rebalance
T3: Partitions 0,1,2 reassigned to Consumer B
    Consumer B has OPEN circuit too
    But it's a NEW consumer for these partitions
T4: Confusion about which offsets to resume from
Why This Happens
- pause() stops fetching, but the consumer must keep calling poll() and sending heartbeats
- If processing is blocked and the poll loop stalls for too long, those liveness checks fail (see the sketch after this list)
- Kafka coordinator thinks consumer is dead
- Rebalance triggers
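To see why pausing by itself doesn't break group membership: a paused consumer still calls poll(), it just gets empty batches. A minimal sketch with the plain Kafka client (names like circuitBreaker, running, and process are illustrative, not from the series' earlier code):

// Keep polling even while paused: poll() on paused partitions returns no records,
// but it keeps heartbeats and the poll-interval check satisfied.
while (running) {
    ConsumerRecords<String, Order> records = consumer.poll(Duration.ofMillis(100));

    if (circuitBreaker.getState() == State.OPEN) {
        consumer.pause(consumer.assignment());   // stop fetching, stay in the group
    } else if (!consumer.paused().isEmpty()) {
        consumer.resume(consumer.paused());      // circuit recovered, fetch again
    }

    // Hand any fetched records to the processing logic (covered in Part 2)
    records.forEach(this::process);
}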
Solutions
Solution 1: Static Membership (Kafka 2.3+)
Static membership preserves consumer identity across restarts:
spring:
  kafka:
    consumer:
      properties:                                # raw Kafka client properties
        group.instance.id: consumer-instance-1   # Unique per instance
        session.timeout.ms: 300000               # 5 minutes
Benefits:
- Consumer can pause for extended periods
- No rebalance during pause
- Identity preserved
Solution 2: Cooperative Sticky Assignor (Kafka 2.4+)
Reduces rebalancing impact:
spring:
  kafka:
    consumer:
      properties:
        partition.assignment.strategy: org.apache.kafka.clients.consumer.CooperativeStickyAssignor
Benefits:
- Incremental rebalancing
- Only affected partitions move
- Less disruption
Solution 3: Increase Session Timeout
Give more time before Kafka considers consumer dead:
spring:
  kafka:
    consumer:
      properties:
        session.timeout.ms: 60000       # 60 seconds (default is 10s)
        heartbeat.interval.ms: 20000    # Should be about 1/3 of the session timeout
Challenge 2: Distributed Circuit Breaker State
In production, you have multiple consumer instances. Each has its own circuit breaker. They can have different states.
The Problem
┌──────────────────────────────────────────────────────────┐
│                      CONSUMER GROUP                       │
├──────────────────────────────────────────────────────────┤
│                                                            │
│   Instance 1          Instance 2          Instance 3      │
│   ┌──────────┐        ┌──────────┐        ┌──────────┐    │
│   │ Circuit: │        │ Circuit: │        │ Circuit: │    │
│   │   OPEN   │        │  CLOSED  │        │ HALF-OPEN│    │
│   │          │        │          │        │          │    │
│   │  Paused  │        │  Running │        │  Testing │    │
│   └──────────┘        └──────────┘        └──────────┘    │
│                                                            │
└──────────────────────────────────────────────────────────┘
Problem: Inconsistent behavior across instances
Why This Happens
- Each instance experiences failures independently
- Network conditions vary
- Some instances may hit the failing endpoint first
- Circuit breakers don't share state by default
Solutions
Option 1: Accept Eventual Consistency
This is often the pragmatic choice:
- Each instance protects itself
- Eventually all circuits open
- Slight inconsistency is acceptable
- Simplest to implement
Option 2: Shared State with Redis
Store circuit state in Redis:
public class DistributedCircuitBreaker {

    private final RedisTemplate<String, String> redis;
    private final String circuitKey = "circuit:external-api";

    public DistributedCircuitBreaker(RedisTemplate<String, String> redis) {
        this.redis = redis;
    }

    public boolean isOpen() {
        String state = redis.opsForValue().get(circuitKey);
        return "OPEN".equals(state);
    }

    public void open() {
        // Auto-expire so a crashed instance can't leave the circuit stuck open
        redis.opsForValue().set(circuitKey, "OPEN", Duration.ofSeconds(30));
    }

    public void close() {
        redis.delete(circuitKey);
    }
}
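Wiring the shared state into the consumer is then a small guard before processing, a sketch that reuses the pause logic from earlier (distributedBreaker is assumed to be the class above):

if (distributedBreaker.isOpen()) {
    // Another instance already tripped the shared circuit -- pause locally too
    consumer.pause(consumer.assignment());
    return;
}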
Trade-offs:
- Adds Redis dependency
- Adds latency to every check
- Redis itself can fail
Option 3: Leader-Based Coordination
One instance is the "leader" that makes circuit decisions:
// Using Curator for ZooKeeper-based leader election
LeaderLatch leaderLatch = new LeaderLatch(client, "/circuit-leader");
leaderLatch.start();

if (leaderLatch.hasLeadership()) {
    // This instance decides circuit state
    broadcastCircuitState();
}
Trade-offs:
- Complex setup
- Leader election overhead
- Single point of decision
Option 4: Gossip Protocol
Instances share circuit state peer-to-peer:
- Eventually consistent
- No central coordinator
- More complex implementation
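A full gossip implementation is beyond this series, but a rough Kafka-native approximation (broadcast over a compacted control topic rather than true peer-to-peer gossip; the topic and property names are assumptions) gives the flavor:

public class CircuitStateBroadcaster {

    private final KafkaTemplate<String, String> template;
    private final Map<String, String> peerStates = new ConcurrentHashMap<>();

    public CircuitStateBroadcaster(KafkaTemplate<String, String> template) {
        this.template = template;
    }

    public void publish(String instanceId, String state) {
        // Key by instance id so a compacted topic keeps only the latest state per instance
        template.send("circuit-state", instanceId, state);
    }

    // Each instance needs its own consumer group id here so every instance sees every update
    @KafkaListener(topics = "circuit-state", groupId = "${circuit.state.group-id}")
    public void onPeerState(ConsumerRecord<String, String> record) {
        peerStates.put(record.key(), record.value());
    }

    public boolean majorityOpen() {
        long open = peerStates.values().stream().filter("OPEN"::equals).count();
        return open * 2 > peerStates.size();
    }
}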
Challenge 3: The Polled Messages Problem
We touched on this in Part 2, but it deserves deeper exploration.
The Detailed Problem
Kafka Consumer Poll Cycle:
──────────────────────────────────────────────────────────▶
1. consumer.poll(Duration.ofMillis(100))
   └── Returns batch of 500 messages
2. Start processing message 1
   ├── Call external API
   ├── API fails
   └── Circuit breaker records failure
3. After 5 failures, circuit OPENS
   └── Call consumer.pause()
4. BUT: Messages 6-500 are still in memory!
   ├── They continue to be passed to your handler
   ├── Each one fails
   └── You can't "un-poll" them
Comprehensive Solution
@KafkaListener(id = "myListener", topics = "orders")
public void consume(
List<ConsumerRecord<String, Order>> records,
Acknowledgment ack,
Consumer<?, ?> consumer) {
int successCount = 0;
for (ConsumerRecord<String, Order> record : records) {
// Check circuit BEFORE each message
if (circuitBreaker.getState() == State.OPEN) {
log.info("Circuit open, rejecting remaining {} messages",
records.size() - successCount);
// Seek back to first unprocessed message
consumer.seek(
new TopicPartition(record.topic(), record.partition()),
record.offset());
// Pause immediately
consumer.pause(consumer.assignment());
// Don't acknowledge anything
break;
}
try {
processOrder(record.value());
successCount++;
} catch (Exception e) {
// Seek back and stop
consumer.seek(
new TopicPartition(record.topic(), record.partition()),
record.offset());
break;
}
}
// Only acknowledge what we successfully processed
if (successCount == records.size()) {
ack.acknowledge();
}
}
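Note that the batch signature above only works when the listener container is configured for batch delivery with manual acknowledgments. A sketch of that factory configuration (the bean and generic types are assumptions, not carried over from earlier parts):

@Bean
public ConcurrentKafkaListenerContainerFactory<String, Order> kafkaListenerContainerFactory(
        ConsumerFactory<String, Order> consumerFactory) {
    ConcurrentKafkaListenerContainerFactory<String, Order> factory =
            new ConcurrentKafkaListenerContainerFactory<>();
    factory.setConsumerFactory(consumerFactory);
    factory.setBatchListener(true);  // deliver a List<ConsumerRecord> per poll
    factory.getContainerProperties().setAckMode(ContainerProperties.AckMode.MANUAL);
    return factory;
}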
Edge Case 1: Flapping Circuit
The circuit rapidly opens and closes, causing instability.
The Scenario
T0: Service at 51% error rate
T1: Circuit OPENS (threshold: 50%)
T2: Wait duration expires, HALF-OPEN
T3: Test request succeeds (service temporarily OK)
T4: Circuit CLOSES
T5: Resume processing
T6: Error rate back to 51%
T7: Circuit OPENS
... repeat forever
Solutions
Increase Success Threshold for Closing
resilience4j:
  circuitbreaker:
    instances:
      myBreaker:
        permitted-number-of-calls-in-half-open-state: 10
        # Require more successes before closing
Implement Exponential Backoff for Wait Duration
public class AdaptiveCircuitBreaker {

    private int consecutiveOpens = 0;
    private Duration baseWaitDuration = Duration.ofSeconds(30);

    public Duration getWaitDuration() {
        // Double wait time for each consecutive open
        return baseWaitDuration.multipliedBy(
                (long) Math.pow(2, consecutiveOpens));
    }

    public void onOpen() {
        consecutiveOpens++;
    }

    public void onClose() {
        consecutiveOpens = 0; // Reset on successful close
    }
}
Add Hysteresis
Different thresholds for opening vs closing:
Open threshold: 50% failure rate
Close threshold: 20% failure rate (must be MUCH better to close)
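Resilience4j's standard configuration uses a single failure-rate threshold, so hysteresis like this is custom logic. A minimal sketch of the policy, using the illustrative thresholds above:

class HysteresisPolicy {

    private static final double OPEN_THRESHOLD = 0.50;   // trip at >= 50% failures
    private static final double CLOSE_THRESHOLD = 0.20;  // only recover below 20%

    private boolean open = false;

    /** Returns true while the circuit should stay open for the given failure rate. */
    boolean evaluate(double failureRate) {
        if (!open && failureRate >= OPEN_THRESHOLD) {
            open = true;
        } else if (open && failureRate < CLOSE_THRESHOLD) {
            open = false;
        }
        return open;
    }
}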
Edge Case 2: Long-Running Message Processing
A message takes 30 minutes to process. Meanwhile, the circuit opens.
The Scenario
T0: Start processing large message (expected: 30 min)
T1: Other messages fail, circuit OPENS
T2: Consumer pauses for new messages
T30: Long message finishes... but fails at the end
T31: Circuit is OPEN, can't retry immediately
T32: Message offset not committed
T33: When circuit closes, message reprocesses
T34: Another 30-minute processing begins
Solutions
Implement Idempotency
Ensure reprocessing doesn't cause duplicate effects:
public void processLargeMessage(Message msg) {
    String messageId = msg.getId();

    // Check if already processed
    if (processedMessageStore.exists(messageId)) {
        log.info("Message {} already processed, skipping", messageId);
        return;
    }

    // Process
    Result result = doExpensiveProcessing(msg);

    // Mark as processed (atomically with result)
    processedMessageStore.markProcessed(messageId, result);
}
Checkpoint Progress
For very long operations, checkpoint intermediate progress:
public void processLargeMessage(Message msg) {
    Checkpoint checkpoint = checkpointStore.get(msg.getId());

    if (checkpoint != null) {
        // Resume from checkpoint
        continueFrom(checkpoint);
    } else {
        // Start fresh
        startProcessing(msg);
    }
}
Edge Case 3: Partial Topic Pause
Consumer subscribes to multiple topics, but only one topic's downstream fails.
The Scenario
Consumer subscribes to: orders, inventory, shipping
Only "orders" downstream (Payment API) is failing.
Single circuit breaker β ALL topics pause
Even healthy inventory and shipping processing stops
Solutions
Separate Consumers per Topic
@KafkaListener(id = "ordersListener", topics = "orders")
public void consumeOrders(Order order) {
// Has its own circuit breaker
ordersCircuitBreaker.execute(() -> processOrder(order));
}
@KafkaListener(id = "inventoryListener", topics = "inventory")
public void consumeInventory(InventoryUpdate update) {
// Has its own circuit breaker
inventoryCircuitBreaker.execute(() -> processInventory(update));
}
Topic-Specific Pause
Kafka allows pausing specific topic-partitions:
// Only pause orders topic
consumer.pause(
    consumer.assignment().stream()
        .filter(tp -> tp.topic().equals("orders"))
        .collect(Collectors.toList())
);
// inventory and shipping continue
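The counterpart when the orders circuit recovers is just as targeted: resume only the partitions that were paused (a sketch, assuming you paused by topic as above):

// Resume only the previously paused orders partitions
consumer.resume(
    consumer.paused().stream()
        .filter(tp -> tp.topic().equals("orders"))
        .collect(Collectors.toList())
);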
Edge Case 4: Exactly-Once Semantics Interaction
Using Kafka transactions with circuit breakers requires careful handling.
The Scenario
T0: Begin transaction
T1: Read message
T2: Process and produce to output topic
T3: About to commit
T4: Circuit opens mid-commit
T5: Transaction state unclear
Solution: Transaction-Aware Circuit Breaker
public void processWithTransaction(ConsumerRecord<K, V> record) {
    // Check circuit BEFORE starting transaction
    if (circuitBreaker.getState() == State.OPEN) {
        throw new CircuitOpenException();
    }

    try {
        producer.beginTransaction();

        Result result = circuitBreaker.executeSupplier(
                () -> callExternalApi(record.value()));

        producer.send(new ProducerRecord<>("output", result));

        // Send offsets within transaction
        producer.sendOffsetsToTransaction(
                Map.of(
                        new TopicPartition(record.topic(), record.partition()),
                        new OffsetAndMetadata(record.offset() + 1)
                ),
                consumerGroupId
        );

        producer.commitTransaction();
    } catch (Exception e) {
        producer.abortTransaction();
        throw e;
    }
}
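Exactly-once also depends on the clients being configured for it. A sketch of the relevant settings (the config keys are standard Kafka client configs; the values are illustrative assumptions):

Properties producerProps = new Properties();
producerProps.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "orders-processor-1"); // unique per producer instance
producerProps.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);

Properties consumerProps = new Properties();
consumerProps.put(ConsumerConfig.ISOLATION_LEVEL_CONFIG, "read_committed");      // skip records from aborted transactions
consumerProps.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, false);              // offsets are committed via the transaction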
Alternative Approaches
The circuit breaker + pause pattern isn't always the right choice. Here are alternatives.
Alternative 1: Dead Letter Queue (DLQ)
Main Topic ──▶ Consumer ──▶ Process ──▶ Success
                               │
                               ▼ Failure (after retries)
                           DLQ Topic
When to Use DLQ:
- Poison pill messages (malformed data)
- Non-retryable errors (validation failures)
- Need to continue processing other messages
- Message order is not critical
When NOT to Use DLQ:
- System-wide outages (DLQ fills up quickly)
- All messages would fail (need to replay entire DLQ)
Comparison:
| Aspect | Circuit Breaker | DLQ |
|------------------------|-----------------|------------------------|
| Message order | Preserved | Broken |
| System outage handling | Excellent | Poor |
| Bad data handling | Poor | Excellent |
| Operational complexity | Lower | Higher (replay needed) |
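If you go the DLQ route with Spring Kafka, the built-in DeadLetterPublishingRecoverer covers the common case. A minimal sketch (assuming an existing KafkaTemplate bean; with Spring Boot's auto-configured factory, a CommonErrorHandler bean like this is picked up automatically):

@Bean
public DefaultErrorHandler errorHandler(KafkaTemplate<Object, Object> kafkaTemplate) {
    // After the retries are exhausted, publish the failed record to <topic>.DLT
    DeadLetterPublishingRecoverer recoverer = new DeadLetterPublishingRecoverer(kafkaTemplate);
    // Retry up to 3 times, 1 second apart, before recovering to the DLT
    return new DefaultErrorHandler(recoverer, new FixedBackOff(1000L, 3));
}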
Alternative 2: Retry Topics (Non-Blocking)
main-topic
    │
    ├──▶ Success ──▶ Done
    │
    └──▶ Failure ──▶ retry-topic-1 (1 min delay)
             │
             ├──▶ Success ──▶ Done
             │
             └──▶ Failure ──▶ retry-topic-2 (10 min delay)
                      │
                      └──▶ Failure ──▶ DLQ
When to Use:
- Mixed transient and permanent failures
- Can't pause all processing
- Acceptable to process messages out of order
Implementation:
@RetryableTopic(
    attempts = "3",
    backoff = @Backoff(delay = 60000, multiplier = 2),
    dltTopicSuffix = ".DLQ"
)
@KafkaListener(topics = "orders")
public void consume(Order order) {
    orderProcessor.process(order); // Throws on failure
}
Alternative 3: Bulkhead Pattern
Isolate failures to specific resource pools:
┌────────────────────────────────────────────────────────┐
│                        Service                          │
│                                                         │
│   ┌────────────────┐          ┌────────────────┐        │
│   │ Thread Pool A  │          │ Thread Pool B  │        │
│   │                │          │                │        │
│   │  Payment API   │          │  Shipping API  │        │
│   │  Threads: 10   │          │  Threads: 10   │        │
│   │                │          │                │        │
│   │  ✗ Exhausted   │          │  ✓ Healthy     │        │
│   └────────────────┘          └────────────────┘        │
│                                                         │
│   Payment failures don't affect Shipping                │
└────────────────────────────────────────────────────────┘
Best Used With: Circuit breaker (combine both patterns)
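A minimal sketch of the idea with plain executors, mirroring the diagram (paymentClient, shippingClient, and the result types are hypothetical):

// One bounded pool per downstream dependency: if the Payment API hangs and
// exhausts its 10 threads, shipping calls still have their own pool.
ExecutorService paymentPool = Executors.newFixedThreadPool(10);
ExecutorService shippingPool = Executors.newFixedThreadPool(10);

CompletableFuture<PaymentResult> payment =
        CompletableFuture.supplyAsync(() -> paymentClient.charge(order), paymentPool);
CompletableFuture<ShippingResult> shipping =
        CompletableFuture.supplyAsync(() -> shippingClient.ship(order), shippingPool);

Resilience4j's ThreadPoolBulkhead offers the same isolation with configuration and metrics support if you're already using that library.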
Alternative 4: Rate Limiting / Backpressure
Instead of stopping completely, slow down:
RateLimiter rateLimiter = RateLimiter.create(100); // 100/sec

public void consume(Order order) {
    rateLimiter.acquire(); // Blocks if rate exceeded
    orderProcessor.process(order);
}
When to Use:
- API rate limits (429 responses)
- Gradual degradation preferred
- Downstream can handle reduced rate
Decision Matrix: Which Pattern to Use
| Scenario | Circuit Breaker + Pause | DLQ | Retry Topics | Rate Limiter |
|-------------------------|-------------------------|------|--------------|--------------|
| System-wide outage | Best | Poor | Poor | Poor |
| Individual bad messages | Poor | Best | Good | Poor |
| Rate limiting (429s) | Good | Poor | Poor | Best |
| Mixed failures | Good | Good | Best | Poor |
| Order critical | Best | Poor | Poor | Good |
| Latency critical | Poor | Good | Good | Good |
The Honest Trade-offs
What You Gain
- Cascading failure prevention - Protect your system
- Message preservation - Nothing is lost
- Order preservation - Process in sequence
- Automatic recovery - No manual intervention
What You Pay
- Processing delays - Backlog builds during pause
- Complexity - More moving parts
- Potential rebalancing issues - Kafka consumer coordination
- All-or-nothing - Can't process some messages while paused
When This Pattern Shines
- Downstream service has infrequent but extended outages
- Message order matters
- You can accept processing delays
- You have few downstream dependencies
When to Choose Alternatives
- Frequent individual message failures β DLQ
- Rate limiting issues β Rate limiter
- Must continue processing other messages β Retry topics
- Mixed failure patterns β Combine approaches
What's Next
In Part 4, we'll cover:
- Configuration and tuning (with real numbers)
- Testing strategies and chaos engineering
- Best practices from production experience
- Anti-patterns to avoid
- Complete decision framework