Circuit Breaker Pattern with Kafka: A Complete Guide
Part 4: Configuration, Testing, and Best Practices
Series Navigation:
- Part 1: Understanding the Fundamentals
- Part 2: Implementation and Real-World Scenarios
- Part 3: Challenges, Edge Cases, and Alternatives
- Part 4: Configuration, Testing, and Best Practices (You are here)
Configuration: Getting the Numbers Right
A lesson from Shopify Engineering: "If you've never thought about circuit breaker configuration and aren't heavily over-provisioned, I can almost guarantee you that your circuit breaker is configured wrong."
Configuration isn't guesswork. It requires understanding your system's behavior.
Essential Configuration Parameters
The Complete Parameter Set
| Parameter | What It Controls | Impact |
|-----------------------------------------|--------------------------------|-------------------------------------------------------|
| `failureRateThreshold` | % failures to trip circuit | Too low = false positives; Too high = slow protection |
| `slowCallRateThreshold` | % slow calls to trip | Catches degradation before failures |
| `slowCallDurationThreshold` | What counts as "slow" | Must be based on actual latency data |
| `slidingWindowType` | COUNT_BASED or TIME_BASED | COUNT for low volume; TIME for high volume |
| `slidingWindowSize` | Window for calculating rates | Larger = more stable; Smaller = more responsive |
| `minimumNumberOfCalls` | Calls before evaluation starts | Prevents tripping on first few failures |
| `waitDurationInOpenState` | Time before testing recovery | Too short = hammering; Too long = slow recovery |
| `permittedNumberOfCallsInHalfOpenState` | Test calls allowed | More = higher confidence; Fewer = faster decision |
Recommended Starting Configuration
resilience4j:
circuitbreaker:
instances:
externalApi:
# Failure detection
failure-rate-threshold: 50
slow-call-rate-threshold: 80
slow-call-duration-threshold: 2000ms
# Sliding window
sliding-window-type: COUNT_BASED
sliding-window-size: 20
minimum-number-of-calls: 10
# Recovery
wait-duration-in-open-state: 30s
permitted-number-of-calls-in-half-open-state: 5
automatic-transition-from-open-to-half-open-enabled: true
# What counts as failure
record-exceptions:
- java.io.IOException
- java.util.concurrent.TimeoutException
- org.springframework.web.client.HttpServerErrorException
ignore-exceptions:
- com.example.ValidationException
The Tuning Process
Step 1: Measure Production Behavior
Before configuring, collect data:
Metrics to collect:
├── Latency percentiles (p50, p90, p95, p99)
├── Error rate under normal conditions
├── Traffic patterns (requests/second over time)
├── Downstream service SLAs
└── Historical incident data
Example findings:
- p50 latency: 50ms
- p99 latency: 200ms
- Normal error rate: 0.5%
- Traffic: 100 requests/second average
Step 2: Calculate Thresholds
Based on the data:
Slow call threshold:
Normal p99 = 200ms
Add 25% buffer = 250ms
slowCallDurationThreshold = 250ms
Failure rate threshold:
Normal error rate = 0.5%
Significant degradation = 10x normal = 5%
But 5% is too sensitive for most systems.
Use 50% as a balanced starting point.
Adjust based on criticality.
Sliding window size:
At 100 req/sec, 20 calls = 200ms of data
This is responsive but may be noisy.
For more stability: 50-100 calls
For faster response: 10-20 calls
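The same numbers translate directly into Resilience4j's programmatic builder if you prefer it over YAML. A minimal sketch using the example measurements above (the instance name is illustrative):

import java.time.Duration;
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import io.github.resilience4j.circuitbreaker.CircuitBreakerRegistry;

// Thresholds derived from the example measurements above
CircuitBreakerConfig config = CircuitBreakerConfig.custom()
    .failureRateThreshold(50)                                   // balanced starting point
    .slowCallRateThreshold(80)
    .slowCallDurationThreshold(Duration.ofMillis(250))          // p99 (200ms) + 25% buffer
    .slidingWindowType(CircuitBreakerConfig.SlidingWindowType.COUNT_BASED)
    .slidingWindowSize(20)                                      // ~200ms of data at 100 req/sec
    .minimumNumberOfCalls(10)
    .waitDurationInOpenState(Duration.ofSeconds(30))
    .permittedNumberOfCallsInHalfOpenState(5)
    .automaticTransitionFromOpenToHalfOpenEnabled(true)
    .build();

CircuitBreaker breaker = CircuitBreakerRegistry.of(config).circuitBreaker("externalApi");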
Step 3: Test in Staging
Create test scenarios:
Scenario 1: Gradual degradation
- Slowly increase downstream latency
- Verify circuit opens at correct threshold
Scenario 2: Sudden failure
- Kill downstream service
- Verify circuit opens quickly
Scenario 3: Recovery
- Bring downstream back
- Verify circuit closes correctly
Scenario 4: Flapping
- Service intermittently fails
- Verify no rapid open/close cycles
Step 4: Monitor Production
Deploy with comprehensive monitoring:
Alerts to configure:
├── Circuit state changed to OPEN
├── Circuit in OPEN state > 5 minutes
├── High rejection rate (requests rejected by open circuit)
├── Consumer lag increasing (sign of pause)
└── Recovery failed (HALF-OPEN → OPEN)
Step 5: Iterate
Review and adjust:
Weekly review:
├── False positive rate (circuit opened unnecessarily)
├── False negative rate (cascade happened despite circuit)
├── Time to recovery
├── Impact on end users
└── Comparison with incident data
Kafka-Specific Configuration
Consumer Configuration for Circuit Breaker Compatibility
spring:
kafka:
consumer:
# Disable auto-commit for manual control
enable-auto-commit: false
# Increase session timeout to handle pause
session-timeout-ms: 60000
heartbeat-interval-ms: 20000
# Reduce batch size to minimize stranded messages
max-poll-records: 50
# Increase poll interval for long processing
max-poll-interval-ms: 300000
# Static membership to reduce rebalancing
group-instance-id: ${HOSTNAME}
# Cooperative rebalancing
partition-assignment-strategy:
org.apache.kafka.clients.consumer.CooperativeStickyAssignor
listener:
# Manual acknowledgment
ack-mode: MANUAL_IMMEDIATE
# Concurrent consumers
concurrency: 3
Why Each Setting Matters
| Setting | Purpose | Circuit Breaker Relevance |
|-------------------------------|-----------------------------|------------------------------|
| `enable-auto-commit: false` | Manual offset control | Don't commit failed messages |
| `session-timeout-ms: 60000` | Time before considered dead | Allow extended pause |
| `max-poll-records: 50` | Batch size | Fewer stranded messages |
| `max-poll-interval-ms: 300000`| Processing time limit | Handle slow recovery |
| `group-instance-id` | Static membership | No rebalance on pause |
| `ack-mode: MANUAL_IMMEDIATE` | When to commit | Commit only on success |
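Putting `enable-auto-commit: false` and `ack-mode: MANUAL_IMMEDIATE` to work, the listener itself controls when an offset is committed. A minimal Spring Kafka sketch (the topic name, group id, and OrderProcessor are illustrative):

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.kafka.support.Acknowledgment;
import org.springframework.stereotype.Component;

@Component
public class OrderListener {

    private final OrderProcessor orderProcessor; // illustrative downstream-facing service

    public OrderListener(OrderProcessor orderProcessor) {
        this.orderProcessor = orderProcessor;
    }

    @KafkaListener(topics = "orders", groupId = "order-service")
    public void onMessage(ConsumerRecord<String, String> record, Acknowledgment ack) {
        orderProcessor.process(record.value()); // may throw: the offset is then NOT committed
        ack.acknowledge();                      // MANUAL_IMMEDIATE: commit only after success
    }
}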
Testing Strategies
Unit Tests
Test circuit breaker state transitions:
@Test
void shouldOpenCircuitAfterThresholdFailures() {
CircuitBreaker breaker = createBreaker(50, 10); // 50% failure threshold, 10-call window (helper sketched after these tests)
// 5 successes
for (int i = 0; i < 5; i++) {
breaker.executeSupplier(() -> "success");
}
assertThat(breaker.getState()).isEqualTo(CLOSED);
// 5 failures (now at 50% failure rate)
for (int i = 0; i < 5; i++) {
assertThrows(Exception.class,
() -> breaker.executeSupplier(() -> { throw new RuntimeException(); }));
}
assertThat(breaker.getState()).isEqualTo(OPEN);
}
@Test
void shouldPauseConsumerWhenCircuitOpens() {
// Given
CircuitBreaker breaker = createBreaker();
KafkaConsumer mockConsumer = mock(KafkaConsumer.class);
CircuitBreakerSupervisor supervisor = new CircuitBreakerSupervisor(breaker, mockConsumer);
// When
breaker.transitionToOpenState();
// Then
verify(mockConsumer).pause(any());
}
@Test
void shouldResumeConsumerWhenCircuitCloses() {
// Given
CircuitBreaker breaker = createBreaker();
KafkaConsumer mockConsumer = mock(KafkaConsumer.class);
CircuitBreakerSupervisor supervisor = new CircuitBreakerSupervisor(breaker, mockConsumer);
// When
breaker.transitionToOpenState();
breaker.transitionToHalfOpenState();
breaker.transitionToClosedState();
// Then
verify(mockConsumer).resume(any());
}
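The createBreaker factory assumed by these tests isn't shown above; here is a minimal sketch consistent with them (the instance name and the minimum-calls setting are assumptions):

// Helper assumed by the tests above: evaluates the failure rate as soon as the window is full.
private CircuitBreaker createBreaker(int failureRateThreshold, int windowSize) {
    CircuitBreakerConfig config = CircuitBreakerConfig.custom()
        .failureRateThreshold(failureRateThreshold)
        .slidingWindowSize(windowSize)
        .minimumNumberOfCalls(windowSize) // evaluate only once the window is full
        .build();
    return CircuitBreaker.of("test", config);
}

private CircuitBreaker createBreaker() {
    return createBreaker(50, 10); // defaults used by the pause/resume tests
}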
Integration Tests
Test end-to-end with embedded Kafka:
@EmbeddedKafka
@SpringBootTest
class CircuitBreakerKafkaIntegrationTest {
@Autowired
private KafkaTemplate<String, String> producer;
@Autowired
private CircuitBreaker circuitBreaker;
@MockBean
private ExternalApiClient apiClient;
@Test
void shouldPauseConsumptionDuringOutage() throws Exception {
// Given: API starts failing
when(apiClient.call(any()))
.thenThrow(new RuntimeException("Service unavailable"));
// When: Send messages that will fail
for (int i = 0; i < 10; i++) {
producer.send("orders", "order-" + i);
}
// Then: Circuit should open and consumer should pause
await().atMost(Duration.ofSeconds(30))
.until(() -> circuitBreaker.getState() == OPEN);
// Verify consumer is paused (no new messages processed)
int processedBefore = getProcessedCount();
Thread.sleep(5000);
int processedAfter = getProcessedCount();
assertThat(processedAfter).isEqualTo(processedBefore);
}
@Test
void shouldResumeAndProcessBacklogAfterRecovery() {
// Given: API fails then recovers
AtomicInteger callCount = new AtomicInteger(0);
when(apiClient.call(any())).thenAnswer(inv -> {
if (callCount.incrementAndGet() < 10) {
throw new RuntimeException("Failing");
}
return "success"; // Recover after 10 calls
});
// When: Send messages
for (int i = 0; i < 20; i++) {
producer.send("orders", "order-" + i);
}
// Then: Eventually all messages should be processed
await().atMost(Duration.ofMinutes(2))
.until(() -> getProcessedCount() == 20);
}
}
Chaos Engineering Tests
Use tools to inject real failures:
With Toxiproxy:
@Test
void shouldHandleNetworkPartition() throws Exception {
// Given: Toxiproxy controlling downstream connection
Proxy proxy = toxiproxy.createProxy("downstream", "0.0.0.0:8666", "downstream:8080");
// When: Create network partition
proxy.disable();
// Send messages
for (int i = 0; i < 10; i++) {
producer.send("orders", "order-" + i);
}
// Then: Circuit should open
await().until(() -> circuitBreaker.getState() == OPEN);
// When: Restore network
proxy.enable();
// Then: Circuit should close and process backlog
await().until(() -> circuitBreaker.getState() == CLOSED);
await().until(() -> getProcessedCount() == 10);
}
With Gremlin (in staging):
# Attack: Add 5 second latency to downstream API
gremlin attack network latency \
--target-container downstream-api \
--latency 5000 \
--length 60
# Expected: Circuit opens due to slow calls
# Kafka consumer pauses
# After attack ends, recovery within 1-2 minutes
Chaos Experiment Checklist:
| Experiment | Inject | Verify |
|-----------------|-----------------------|--------------------------------------|
| Latency | 5s delay | Circuit opens on slow call threshold |
| Errors | 500 responses | Circuit opens on failure threshold |
| Partition | Block traffic | Circuit opens, consumer pauses |
| Recovery | Restore service | Circuit closes, backlog processes |
| Thundering herd | Restore after backlog | No secondary failure |
Observability Requirements
Essential Metrics
Circuit Breaker Metrics:
├── circuit_breaker_state{name="externalApi"} = 0|1|2
│      (0=closed, 1=open, 2=half-open)
├── circuit_breaker_calls_total{name, outcome}
│      (outcome: success, failure, ignored, not_permitted)
├── circuit_breaker_failure_rate{name}
├── circuit_breaker_slow_call_rate{name}
└── circuit_breaker_state_transitions_total{name, from, to}
Kafka Consumer Metrics:
├── kafka_consumer_lag{topic, partition}
├── kafka_consumer_paused{topic, partition}
├── kafka_consumer_records_consumed_total
└── kafka_consumer_fetch_latency_avg
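If you are on Resilience4j with Micrometer, the resilience4j-micrometer binder publishes the circuit breaker metrics above for every breaker in the registry (its metric names carry a resilience4j_* prefix, so treat the names in the tree as a naming convention rather than exact values). A minimal wiring sketch:

import io.github.resilience4j.circuitbreaker.CircuitBreakerRegistry;
import io.github.resilience4j.micrometer.tagged.TaggedCircuitBreakerMetrics;
import io.micrometer.core.instrument.MeterRegistry;

// Registers state, call-count, failure-rate and slow-call-rate meters
// for every circuit breaker in the registry.
public void bindCircuitBreakerMetrics(CircuitBreakerRegistry breakerRegistry,
                                      MeterRegistry meterRegistry) {
    TaggedCircuitBreakerMetrics
        .ofCircuitBreakerRegistry(breakerRegistry)
        .bindTo(meterRegistry);
}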
Grafana Dashboard Panels
Row 1: Circuit Breaker State
├── Current state (gauge)
├── State transition timeline
└── Time in each state (pie chart)
Row 2: Failure Rates
├── Failure rate over time
├── Slow call rate over time
└── Rejection rate (calls blocked by open circuit)
Row 3: Kafka Consumer Health
├── Consumer lag per partition
├── Pause/resume events
├── Messages processed rate
└── Processing latency
Row 4: Downstream Service
├── Response time percentiles
├── Error rate
├── Request volume
└── Availability
Alerting Rules
groups:
- name: circuit-breaker-alerts
rules:
- alert: CircuitBreakerOpen
expr: circuit_breaker_state == 1
for: 1m
labels:
severity: warning
annotations:
summary: "Circuit breaker {{ $labels.name }} is OPEN"
- alert: CircuitBreakerOpenExtended
expr: circuit_breaker_state == 1
for: 10m
labels:
severity: critical
annotations:
summary: "Circuit breaker {{ $labels.name }} OPEN for >10 minutes"
- alert: KafkaConsumerLagHigh
expr: kafka_consumer_lag > 10000
for: 5m
labels:
severity: warning
annotations:
summary: "Kafka consumer lag is {{ $value }} messages"
- alert: CircuitBreakerFlapping
expr: increase(circuit_breaker_state_transitions_total[10m]) > 4
labels:
severity: warning
annotations:
summary: "Circuit breaker {{ $labels.name }} is flapping"
Best Practices Summary
Architecture
- Prefer asynchronous patterns - Reduce need for circuit breakers
- Isolate downstream calls - Separate circuits per service
- Combine patterns - Circuit breaker + bulkhead + retry
- Design for failure - Assume downstream will fail
Implementation
- Use established libraries - Resilience4j, Polly, opossum
- Handle polled messages - Check circuit state before processing
- Implement idempotency - Safe reprocessing
- Log state transitions - Visibility into behavior
Configuration
- Base on production data - Not guesswork
- Use percentage thresholds - More stable than counts
- Set appropriate windows - Balance responsiveness and stability
- Enable automatic HALF-OPEN - Faster recovery
Kafka-Specific
- Disable auto-commit - Manual offset control
- Use static membership - Reduce rebalancing
- Cooperative assignor - Minimal disruption
- Increase session timeout - Handle pause duration
Operations
- Alert on state changes - Know when circuits trip
- Monitor consumer lag - Detect paused consumers
- Practice chaos engineering - Test failure scenarios
- Review and iterate - Tune based on experience
Anti-Patterns to Avoid
1. The "Copy-Paste Configuration" Anti-Pattern
Problem: Using the same configuration everywhere without understanding the implications.
Example:
# DON'T: Same config for all services
all-services:
failure-rate-threshold: 50
wait-duration-in-open-state: 30s
Solution: Tune each circuit breaker based on that specific service's behavior.
2. The "Retry Forever" Anti-Pattern
Problem: Infinite retries without circuit breaker protection.
// DON'T
while (true) {
try {
return apiClient.call();
} catch (Exception e) {
Thread.sleep(1000); // Retry forever
}
}
Solution: Limit retries, then let circuit breaker protect.
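A minimal sketch of the bounded alternative, combining Resilience4j's Retry with the breaker (the retry budget, instance names, and the apiClient/circuitBreaker references are illustrative; Decorators comes with the resilience4j-all module):

import java.time.Duration;
import io.github.resilience4j.decorators.Decorators;
import io.github.resilience4j.retry.Retry;
import io.github.resilience4j.retry.RetryConfig;

// DO: a bounded retry, wrapped in the circuit breaker
Retry retry = Retry.of("externalApi", RetryConfig.custom()
    .maxAttempts(3)                         // give up after 3 attempts
    .waitDuration(Duration.ofSeconds(1))
    .build());

return Decorators.ofSupplier(() -> apiClient.call())
    .withCircuitBreaker(circuitBreaker)     // every attempt is recorded by the breaker
    .withRetry(retry)                       // retry is the outer layer
    .get();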
3. The "Commit Anyway" Anti-Pattern
Problem: Committing Kafka offsets even when processing fails.
// DON'T
public void consume(Message msg, Acknowledgment ack) {
try {
process(msg);
} catch (Exception e) {
log.error("Failed", e);
}
ack.acknowledge(); // Commits even on failure!
}
Solution: Only acknowledge on success.
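For example, the corrected version of the same listener commits only after process succeeds and rethrows so the container's error handler can decide what happens next (retry, DLQ, or pause):

// DO: acknowledge only on success
public void consume(Message msg, Acknowledgment ack) {
    try {
        process(msg);
        ack.acknowledge();   // offset committed only after successful processing
    } catch (Exception e) {
        log.error("Failed", e);
        throw e;             // let the error handler / circuit breaker react
    }
}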
4. The "No Fallback" Anti-Pattern
Problem: Just failing without any graceful degradation.
Solution: Implement meaningful fallbacks:
// Execute through the breaker; fall back to the cached value on any failure,
// including CallNotPermittedException when the circuit is open
try {
    return circuitBreaker.executeSupplier(() -> primaryService.getPrice());
} catch (Exception e) {
    return cachedPriceService.getLastKnownPrice(); // Fallback: last known good value
}
5. The "Ignore Rebalancing" Anti-Pattern
Problem: Not accounting for Kafka consumer rebalancing during pause.
Solution: Use static membership and cooperative assignor.
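For a plain (non-Spring) consumer, the equivalent of the earlier YAML looks roughly like this (the environment variable is illustrative):

import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.CooperativeStickyAssignor;

Properties props = new Properties();
// Static membership: restarts within session.timeout.ms do not trigger a rebalance
props.put(ConsumerConfig.GROUP_INSTANCE_ID_CONFIG, System.getenv("HOSTNAME"));
// Cooperative rebalancing: only reassigned partitions are revoked
props.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG,
        CooperativeStickyAssignor.class.getName());
// Longer session timeout pairs with static membership to ride out brief restarts and pauses
props.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, 60000);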
6. The "Circuit Breaker for Everything" Anti-Pattern
Problem: Using circuit breakers where they don't make sense.
When circuit breaker is wrong:
- Bad data (use DLQ)
- Rate limiting (use rate limiter)
- Validation errors (handle explicitly)
7. The "No Monitoring" Anti-Pattern
Problem: Circuit breakers without visibility.
Solution: Metrics, dashboards, alerts for every circuit.
Complete Decision Framework
Should You Use This Pattern?
START
  │
  ▼
Does your Kafka consumer call external services?
  │
  ├── NO ──▶ Circuit breaker not needed
  │
 YES
  │
  ▼
Can those services have extended outages (>30 seconds)?
  │
  ├── NO ──▶ Retry with exponential backoff may suffice
  │
 YES
  │
  ▼
Is message processing order important?
  │
  ├── NO ──▶ Consider DLQ + Retry Topics instead
  │
 YES
  │
  ▼
Can you tolerate processing delays during outages?
  │
  ├── NO ──▶ Need fallback mechanism + circuit breaker
  │
 YES
  │
  ▼
Do you have few downstream dependencies (1-3)?
  │
  ├── NO ──▶ Consider separating into multiple consumers
  │
 YES
  │
  ▼
USE CIRCUIT BREAKER + KAFKA PAUSE PATTERN
  │
  ├── Implement with Resilience4j (Java) or equivalent
  ├── Configure based on production metrics
  ├── Add comprehensive monitoring
  ├── Test with chaos engineering
  └── Review and iterate
Final Thoughts
The circuit breaker pattern with Kafka consumer pause is a powerful tool for building resilient event-driven systems. It prevents cascading failures, preserves messages, and enables automatic recovery.
But it's not a silver bullet. Success requires:
- Understanding your system - Know your latencies, error rates, traffic patterns
- Proper configuration - Based on data, not guesses
- Comprehensive testing - Including chaos engineering
- Continuous monitoring - Visibility into circuit state
- Willingness to iterate - Tune based on production experience
When implemented correctly, this pattern can mean the difference between a minor downstream hiccup and a full system outage.
Quick Reference Card
Circuit Breaker States
| State     | Kafka Consumer   | Requests |
|-----------|------------------|----------|
| CLOSED | Running | Allowed |
| OPEN | Paused | Rejected |
| HALF-OPEN | Paused (testing) | Limited |
Key Configuration
| Parameter | Starting Value |
|------------------------------|----------------|
| failure-rate-threshold | 50% |
| slow-call-duration-threshold | p99 + 25% |
| sliding-window-size | 10-20 |
| wait-duration-in-open-state | 30s |
Kafka Settings
| Setting | Value |
|---------------------|---------------------|
| enable-auto-commit | false |
| ack-mode | MANUAL_IMMEDIATE |
| session-timeout-ms | 60000 |
| group-instance-id | unique-per-instance |
Further Reading
- Release It! by Michael Nygard - The original circuit breaker description
- Resilience4j Documentation - Comprehensive library docs
- Shopify Engineering: Your Circuit Breaker is Misconfigured - Real-world tuning lessons
- Martin Fowler: Circuit Breaker - Pattern description
- Confluent: Kafka Consumer Configuration - Official Kafka docs
This concludes the Circuit Breaker Pattern with Kafka series.
Series Index:
- Part 1: Understanding the Fundamentals
- Part 2: Implementation and Real-World Scenarios
- Part 3: Challenges, Edge Cases, and Alternatives
- Part 4: Configuration, Testing, and Best Practices