Circuit Breaker Pattern with Kafka: A Complete Guide
Part 4: Configuration, Testing, and Best Practices
Series Navigation:
- Part 1: Understanding the Fundamentals
- Part 2: Implementation and Real-World Scenarios
- Part 3: Challenges, Edge Cases, and Alternatives
- Part 4: Configuration, Testing, and Best Practices (You are here)
Configuration: Getting the Numbers Right
A lesson from Shopify Engineering: "If you've never thought about circuit breaker configuration and aren't heavily over-provisioned, I can almost guarantee you that your circuit breaker is configured wrong."
Configuration isn't guesswork. It requires understanding your system's behavior.
Essential Configuration Parameters
The Complete Parameter Set
| Parameter | What It Controls | Impact |
|-----------------------------------------|--------------------------------|-------------------------------------------------------|
| `failureRateThreshold` | % failures to trip circuit | Too low = false positives; Too high = slow protection |
| `slowCallRateThreshold` | % slow calls to trip | Catches degradation before failures |
| `slowCallDurationThreshold` | What counts as "slow" | Must be based on actual latency data |
| `slidingWindowType` | COUNT_BASED or TIME_BASED | COUNT for low volume; TIME for high volume |
| `slidingWindowSize` | Window for calculating rates | Larger = more stable; Smaller = more responsive |
| `minimumNumberOfCalls` | Calls before evaluation starts | Prevents tripping on first few failures |
| `waitDurationInOpenState` | Time before testing recovery | Too short = hammering; Too long = slow recovery |
| `permittedNumberOfCallsInHalfOpenState` | Test calls allowed | More = higher confidence; Fewer = faster decision |
Recommended Starting Configuration
resilience4j:
circuitbreaker:
instances:
externalApi:
# Failure detection
failure-rate-threshold: 50
slow-call-rate-threshold: 80
slow-call-duration-threshold: 2000ms
# Sliding window
sliding-window-type: COUNT_BASED
sliding-window-size: 20
minimum-number-of-calls: 10
# Recovery
wait-duration-in-open-state: 30s
permitted-number-of-calls-in-half-open-state: 5
automatic-transition-from-open-to-half-open-enabled: true
# What counts as failure
record-exceptions:
- java.io.IOException
- java.util.concurrent.TimeoutException
- org.springframework.web.client.HttpServerErrorException
ignore-exceptions:
- com.example.ValidationException
The Tuning Process
Step 1: Measure Production Behavior
Before configuring, collect data:
Metrics to collect:
├── Latency percentiles (p50, p90, p95, p99)
├── Error rate under normal conditions
├── Traffic patterns (requests/second over time)
├── Downstream service SLAs
└── Historical incident data
Example findings:
- p50 latency: 50ms
- p99 latency: 200ms
- Normal error rate: 0.5%
- Traffic: 100 requests/second average
Step 2: Calculate Thresholds
Based on the data:
Slow call threshold:
Normal p99 = 200ms
Add 25% buffer = 250ms
slowCallDurationThreshold = 250ms
Failure rate threshold:
Normal error rate = 0.5%
Significant degradation = 10x normal = 5%
But 5% is too sensitive for most systems.
Use 50% as a balanced starting point.
Adjust based on criticality.
Sliding window size:
At 100 req/sec, 20 calls = 200ms of data
This is responsive but may be noisy.
For more stability: 50-100 calls
For faster response: 10-20 calls
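The same numbers translate directly into Resilience4j's programmatic builder if you prefer it over YAML. A minimal sketch using the example measurements above (the instance name is illustrative):

import java.time.Duration;
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import io.github.resilience4j.circuitbreaker.CircuitBreakerRegistry;

// Thresholds derived from the example measurements above
CircuitBreakerConfig config = CircuitBreakerConfig.custom()
    .failureRateThreshold(50)                                   // balanced starting point
    .slowCallRateThreshold(80)
    .slowCallDurationThreshold(Duration.ofMillis(250))          // p99 (200ms) + 25% buffer
    .slidingWindowType(CircuitBreakerConfig.SlidingWindowType.COUNT_BASED)
    .slidingWindowSize(20)                                      // ~200ms of data at 100 req/sec
    .minimumNumberOfCalls(10)
    .waitDurationInOpenState(Duration.ofSeconds(30))
    .permittedNumberOfCallsInHalfOpenState(5)
    .automaticTransitionFromOpenToHalfOpenEnabled(true)
    .build();

CircuitBreaker breaker = CircuitBreakerRegistry.of(config).circuitBreaker("externalApi");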
Step 3: Test in Staging
Create test scenarios:
Scenario 1: Gradual degradation
- Slowly increase downstream latency
- Verify circuit opens at correct threshold
Scenario 2: Sudden failure
- Kill downstream service
- Verify circuit opens quickly
Scenario 3: Recovery
- Bring downstream back
- Verify circuit closes correctly
Scenario 4: Flapping
- Service intermittently fails
- Verify no rapid open/close cycles
Step 4: Monitor Production
Deploy with comprehensive monitoring:
Alerts to configure:
├── Circuit state changed to OPEN
├── Circuit in OPEN state > 5 minutes
├── High rejection rate (requests rejected by open circuit)
├── Consumer lag increasing (sign of pause)
└── Recovery failed (HALF-OPEN → OPEN)
Step 5: Iterate
Review and adjust:
Weekly review:
├── False positive rate (circuit opened unnecessarily)
├── False negative rate (cascade happened despite circuit)
├── Time to recovery
├── Impact on end users
└── Comparison with incident data
Kafka-Specific Configuration
Consumer Configuration for Circuit Breaker Compatibility
spring:
kafka:
consumer:
# Disable auto-commit for manual control
enable-auto-commit: false
# Increase session timeout to handle pause
session-timeout-ms: 60000
heartbeat-interval-ms: 20000
# Reduce batch size to minimize stranded messages
max-poll-records: 50
# Increase poll interval for long processing
max-poll-interval-ms: 300000
# Static membership to reduce rebalancing
group-instance-id: ${HOSTNAME}
# Cooperative rebalancing
partition-assignment-strategy:
org.apache.kafka.clients.consumer.CooperativeStickyAssignor
listener:
# Manual acknowledgment
ack-mode: MANUAL_IMMEDIATE
# Concurrent consumers
concurrency: 3
Why Each Setting Matters
| Setting | Purpose | Circuit Breaker Relevance |
|-------------------------------|-----------------------------|------------------------------|
| `enable-auto-commit: false` | Manual offset control | Don't commit failed messages |
| `session-timeout-ms: 60000` | Time before considered dead | Allow extended pause |
| `max-poll-records: 50` | Batch size | Fewer stranded messages |
| `max-poll-interval-ms: 300000`| Processing time limit | Handle slow recovery |
| `group-instance-id` | Static membership | No rebalance on pause |
| `ack-mode: MANUAL_IMMEDIATE` | When to commit | Commit only on success |
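Putting `enable-auto-commit: false` and `ack-mode: MANUAL_IMMEDIATE` to work, the listener itself controls when an offset is committed. A minimal Spring Kafka sketch (the topic name, group id, and OrderProcessor are illustrative):

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.kafka.support.Acknowledgment;
import org.springframework.stereotype.Component;

@Component
public class OrderListener {

    private final OrderProcessor orderProcessor; // illustrative downstream-facing service

    public OrderListener(OrderProcessor orderProcessor) {
        this.orderProcessor = orderProcessor;
    }

    @KafkaListener(topics = "orders", groupId = "order-service")
    public void onMessage(ConsumerRecord<String, String> record, Acknowledgment ack) {
        orderProcessor.process(record.value()); // may throw: the offset is then NOT committed
        ack.acknowledge();                      // MANUAL_IMMEDIATE: commit only after success
    }
}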
Testing Strategies
Unit Tests
Test circuit breaker state transitions:
@Test
void shouldOpenCircuitAfterThresholdFailures() {
CircuitBreaker breaker = createBreaker(50, 10); // 50% failure threshold, 10-call window (helper sketched after these tests)
// 5 successes
for (int i = 0; i < 5; i++) {
breaker.executeSupplier(() -> "success");
}
assertThat(breaker.getState()).isEqualTo(CLOSED);
// 5 failures (now at 50% failure rate)
for (int i = 0; i < 5; i++) {
assertThrows(Exception.class,
() -> breaker.executeSupplier(() -> { throw new RuntimeException(); }));
}
assertThat(breaker.getState()).isEqualTo(OPEN);
}
@Test
void shouldPauseConsumerWhenCircuitOpens() {
// Given
CircuitBreaker breaker = createBreaker();
KafkaConsumer mockConsumer = mock(KafkaConsumer.class);
CircuitBreakerSupervisor supervisor = new CircuitBreakerSupervisor(breaker, mockConsumer);
// When
breaker.transitionToOpenState();
// Then
verify(mockConsumer).pause(any());
}
@Test
void shouldResumeConsumerWhenCircuitCloses() {
// Given
CircuitBreaker breaker = createBreaker();
KafkaConsumer mockConsumer = mock(KafkaConsumer.class);
CircuitBreakerSupervisor supervisor = new CircuitBreakerSupervisor(breaker, mockConsumer);
// When
breaker.transitionToOpenState();
breaker.transitionToHalfOpenState();
breaker.transitionToClosedState();
// Then
verify(mockConsumer).resume(any());
}
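The createBreaker factory assumed by these tests isn't shown above; here is a minimal sketch consistent with them (the instance name and the minimum-calls setting are assumptions):

// Helper assumed by the tests above: evaluates the failure rate as soon as the window is full.
private CircuitBreaker createBreaker(int failureRateThreshold, int windowSize) {
    CircuitBreakerConfig config = CircuitBreakerConfig.custom()
        .failureRateThreshold(failureRateThreshold)
        .slidingWindowSize(windowSize)
        .minimumNumberOfCalls(windowSize) // evaluate only once the window is full
        .build();
    return CircuitBreaker.of("test", config);
}

private CircuitBreaker createBreaker() {
    return createBreaker(50, 10); // defaults used by the pause/resume tests
}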
Integration Tests
Test end-to-end with embedded Kafka:
@EmbeddedKafka
@SpringBootTest
class CircuitBreakerKafkaIntegrationTest {
@Autowired
private KafkaTemplate<String, String> producer;
@Autowired
private CircuitBreaker circuitBreaker;
@MockBean
private ExternalApiClient apiClient;
@Test
void shouldPauseConsumptionDuringOutage() throws Exception {
// Given: API starts failing
when(apiClient.call(any()))
.thenThrow(new RuntimeException("Service unavailable"));
// When: Send messages that will fail
for (int i = 0; i < 10; i++) {
producer.send("orders", "order-" + i);
}
// Then: Circuit should open and consumer should pause
await().atMost(Duration.ofSeconds(30))
.until(() -> circuitBreaker.getState() == OPEN);
// Verify consumer is paused (no new messages processed)
int processedBefore = getProcessedCount();
Thread.sleep(5000);
int processedAfter = getProcessedCount();
assertThat(processedAfter).isEqualTo(processedBefore);
}
@Test
void shouldResumeAndProcessBacklogAfterRecovery() {
// Given: API fails then recovers
AtomicInteger callCount = new AtomicInteger(0);
when(apiClient.call(any())).thenAnswer(inv -> {
if (callCount.incrementAndGet() < 10) {
throw new RuntimeException("Failing");
}
return "success"; // Recover after 10 calls
});
// When: Send messages
for (int i = 0; i < 20; i++) {
producer.send("orders", "order-" + i);
}
// Then: Eventually all messages should be processed
await().atMost(Duration.ofMinutes(2))
.until(() -> getProcessedCount() == 20);
}
}
Chaos Engineering Tests
Use tools to inject real failures:
With Toxiproxy:
@Test
void shouldHandleNetworkPartition() throws Exception {
// Given: Toxiproxy controlling downstream connection
Proxy proxy = toxiproxy.createProxy("downstream", "0.0.0.0:8666", "downstream:8080");
// When: Create network partition
proxy.disable();
// Send messages
for (int i = 0; i < 10; i++) {
producer.send("orders", "order-" + i);
}
// Then: Circuit should open
await().until(() -> circuitBreaker.getState() == OPEN);
// When: Restore network
proxy.enable();
// Then: Circuit should close and process backlog
await().until(() -> circuitBreaker.getState() == CLOSED);
await().until(() -> getProcessedCount() == 10);
}
With Gremlin (in staging):
# Attack: Add 5 second latency to downstream API
gremlin attack network latency \
--target-container downstream-api \
--latency 5000 \
--length 60
# Expected: Circuit opens due to slow calls
# Kafka consumer pauses
# After attack ends, recovery within 1-2 minutes
Chaos Experiment Checklist:
| Experiment | Inject | Verify |
|-----------------|-----------------------|--------------------------------------|
| Latency | 5s delay | Circuit opens on slow call threshold |
| Errors | 500 responses | Circuit opens on failure threshold |
| Partition | Block traffic | Circuit opens, consumer pauses |
| Recovery | Restore service | Circuit closes, backlog processes |
| Thundering herd | Restore after backlog | No secondary failure |
Observability Requirements
Essential Metrics
Circuit Breaker Metrics:
├── circuit_breaker_state{name="externalApi"} = 0|1|2
│      (0=closed, 1=open, 2=half-open)
├── circuit_breaker_calls_total{name, outcome}
│      (outcome: success, failure, ignored, not_permitted)
├── circuit_breaker_failure_rate{name}
├── circuit_breaker_slow_call_rate{name}
└── circuit_breaker_state_transitions_total{name, from, to}
Kafka Consumer Metrics:
├── kafka_consumer_lag{topic, partition}
├── kafka_consumer_paused{topic, partition}
├── kafka_consumer_records_consumed_total
└── kafka_consumer_fetch_latency_avg
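If you are on Resilience4j with Micrometer, the resilience4j-micrometer binder publishes the circuit breaker metrics above for every breaker in the registry (its metric names carry a resilience4j_* prefix, so treat the names in the tree as a naming convention rather than exact values). A minimal wiring sketch:

import io.github.resilience4j.circuitbreaker.CircuitBreakerRegistry;
import io.github.resilience4j.micrometer.tagged.TaggedCircuitBreakerMetrics;
import io.micrometer.core.instrument.MeterRegistry;

// Registers state, call-count, failure-rate and slow-call-rate meters
// for every circuit breaker in the registry.
public void bindCircuitBreakerMetrics(CircuitBreakerRegistry breakerRegistry,
                                      MeterRegistry meterRegistry) {
    TaggedCircuitBreakerMetrics
        .ofCircuitBreakerRegistry(breakerRegistry)
        .bindTo(meterRegistry);
}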
Grafana Dashboard Panels
Row 1: Circuit Breaker State
├── Current state (gauge)
├── State transition timeline
└── Time in each state (pie chart)
Row 2: Failure Rates
├── Failure rate over time
├── Slow call rate over time
└── Rejection rate (calls blocked by open circuit)
Row 3: Kafka Consumer Health
├── Consumer lag per partition
├── Pause/resume events
├── Messages processed rate
└── Processing latency
Row 4: Downstream Service
├── Response time percentiles
├── Error rate
├── Request volume
└── Availability
Alerting Rules
groups:
- name: circuit-breaker-alerts
rules:
- alert: CircuitBreakerOpen
expr: circuit_breaker_state == 1
for: 1m
labels:
severity: warning
annotations:
summary: "Circuit breaker {{ $labels.name }} is OPEN"
- alert: CircuitBreakerOpenExtended
expr: circuit_breaker_state == 1
for: 10m
labels:
severity: critical
annotations:
summary: "Circuit breaker {{ $labels.name }} OPEN for >10 minutes"
- alert: KafkaConsumerLagHigh
expr: kafka_consumer_lag > 10000
for: 5m
labels:
severity: warning
annotations:
summary: "Kafka consumer lag is {{ $value }} messages"
- alert: CircuitBreakerFlapping
expr: increase(circuit_breaker_state_transitions_total[10m]) > 4
labels:
severity: warning
annotations:
summary: "Circuit breaker {{ $labels.name }} is flapping"
Best Practices Summary
Architecture
- Prefer asynchronous patterns - Reduce need for circuit breakers
- Isolate downstream calls - Separate circuits per service
- Combine patterns - Circuit breaker + bulkhead + retry
- Design for failure - Assume downstream will fail
Implementation
- Use established libraries - Resilience4j, Polly, opossum
- Handle polled messages - Check circuit state before processing
- Implement idempotency - Safe reprocessing
- Log state transitions - Visibility into behavior
Configuration
- Base on production data - Not guesswork
- Use percentage thresholds - More stable than counts
- Set appropriate windows - Balance responsiveness and stability
- Enable automatic HALF-OPEN - Faster recovery
Kafka-Specific
- Disable auto-commit - Manual offset control
- Use static membership - Reduce rebalancing
- Cooperative assignor - Minimal disruption
- Increase session timeout - Handle pause duration
Operations
- Alert on state changes - Know when circuits trip
- Monitor consumer lag - Detect paused consumers
- Practice chaos engineering - Test failure scenarios
- Review and iterate - Tune based on experience
Anti-Patterns to Avoid
1. The "Copy-Paste Configuration" Anti-Pattern
Problem: Using the same configuration everywhere without understanding the implications.
Example:
# DON'T: Same config for all services
all-services:
failure-rate-threshold: 50
wait-duration-in-open-state: 30s
Solution: Tune each circuit breaker based on that specific service's behavior.
2. The "Retry Forever" Anti-Pattern
Problem: Infinite retries without circuit breaker protection.
// DON'T
while (true) {
try {
return apiClient.call();
} catch (Exception e) {
Thread.sleep(1000); // Retry forever
}
}
Solution: Limit retries, then let circuit breaker protect.
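A minimal sketch of the bounded alternative, combining Resilience4j's Retry with the breaker (the retry budget, instance names, and the apiClient/circuitBreaker references are illustrative; Decorators comes with the resilience4j-all module):

import java.time.Duration;
import io.github.resilience4j.decorators.Decorators;
import io.github.resilience4j.retry.Retry;
import io.github.resilience4j.retry.RetryConfig;

// DO: a bounded retry, wrapped in the circuit breaker
Retry retry = Retry.of("externalApi", RetryConfig.custom()
    .maxAttempts(3)                         // give up after 3 attempts
    .waitDuration(Duration.ofSeconds(1))
    .build());

return Decorators.ofSupplier(() -> apiClient.call())
    .withCircuitBreaker(circuitBreaker)     // every attempt is recorded by the breaker
    .withRetry(retry)                       // retry is the outer layer
    .get();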
3. The "Commit Anyway" Anti-Pattern
Problem: Committing Kafka offsets even when processing fails.
// DON'T
public void consume(Message msg, Acknowledgment ack) {
try {
process(msg);
} catch (Exception e) {
log.error("Failed", e);
}
ack.acknowledge(); // Commits even on failure!
}
Solution: Only acknowledge on success.
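For example, the corrected version of the same listener commits only after process succeeds and rethrows so the container's error handler can decide what happens next (retry, DLQ, or pause):

// DO: acknowledge only on success
public void consume(Message msg, Acknowledgment ack) {
    try {
        process(msg);
        ack.acknowledge();   // offset committed only after successful processing
    } catch (Exception e) {
        log.error("Failed", e);
        throw e;             // let the error handler / circuit breaker react
    }
}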
4. The "No Fallback" Anti-Pattern
Problem: Just failing without any graceful degradation.
Solution: Implement meaningful fallbacks:
// Execute through the breaker; fall back to the cached value on any failure,
// including CallNotPermittedException when the circuit is open
try {
    return circuitBreaker.executeSupplier(() -> primaryService.getPrice());
} catch (Exception e) {
    return cachedPriceService.getLastKnownPrice(); // Fallback: last known good value
}
5. The "Ignore Rebalancing" Anti-Pattern
Problem: Not accounting for Kafka consumer rebalancing during pause.
Solution: Use static membership and cooperative assignor.
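For a plain (non-Spring) consumer, the equivalent of the earlier YAML looks roughly like this (the environment variable is illustrative):

import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.CooperativeStickyAssignor;

Properties props = new Properties();
// Static membership: restarts within session.timeout.ms do not trigger a rebalance
props.put(ConsumerConfig.GROUP_INSTANCE_ID_CONFIG, System.getenv("HOSTNAME"));
// Cooperative rebalancing: only reassigned partitions are revoked
props.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG,
        CooperativeStickyAssignor.class.getName());
// Longer session timeout pairs with static membership to ride out brief restarts and pauses
props.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, 60000);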
6. The "Circuit Breaker for Everything" Anti-Pattern
Problem: Using circuit breakers where they don't make sense.
When circuit breaker is wrong:
- Bad data (use DLQ)
- Rate limiting (use rate limiter)
- Validation errors (handle explicitly)
7. The "No Monitoring" Anti-Pattern
Problem: Circuit breakers without visibility.
Solution: Metrics, dashboards, alerts for every circuit.
Complete Decision Framework
Should You Use This Pattern?
START
  │
  ▼
Does your Kafka consumer call external services?
  │
  ├── NO ──▶ Circuit breaker not needed
  │
 YES
  │
  ▼
Can those services have extended outages (>30 seconds)?
  │
  ├── NO ──▶ Retry with exponential backoff may suffice
  │
 YES
  │
  ▼
Is message processing order important?
  │
  ├── NO ──▶ Consider DLQ + Retry Topics instead
  │
 YES
  │
  ▼
Can you tolerate processing delays during outages?
  │
  ├── NO ──▶ Need fallback mechanism + circuit breaker
  │
 YES
  │
  ▼
Do you have few downstream dependencies (1-3)?
  │
  ├── NO ──▶ Consider separating into multiple consumers
  │
 YES
  │
  ▼
USE CIRCUIT BREAKER + KAFKA PAUSE PATTERN
  │
  ├── Implement with Resilience4j (Java) or equivalent
  ├── Configure based on production metrics
  ├── Add comprehensive monitoring
  ├── Test with chaos engineering
  └── Review and iterate
Final Thoughts
The circuit breaker pattern with Kafka consumer pause is a powerful tool for building resilient event-driven systems. It prevents cascading failures, preserves messages, and enables automatic recovery.
But it's not a silver bullet. Success requires:
- Understanding your system - Know your latencies, error rates, traffic patterns
- Proper configuration - Based on data, not guesses
- Comprehensive testing - Including chaos engineering
- Continuous monitoring - Visibility into circuit state
- Willingness to iterate - Tune based on production experience
When implemented correctly, this pattern can mean the difference between a minor downstream hiccup and a full system outage.
Quick Reference Card
Circuit Breaker States
| State     | Kafka Consumer   | Requests |
|-----------|------------------|----------|
| CLOSED | Running | Allowed |
| OPEN | Paused | Rejected |
| HALF-OPEN | Paused (testing) | Limited |
Key Configuration
| Parameter | Starting Value |
|------------------------------|----------------|
| failure-rate-threshold | 50% |
| slow-call-duration-threshold | p99 + 25% |
| sliding-window-size | 10-20 |
| wait-duration-in-open-state | 30s |
Kafka Settings
| Setting | Value |
|---------------------|---------------------|
| enable-auto-commit | false |
| ack-mode | MANUAL_IMMEDIATE |
| session-timeout-ms | 60000 |
| group-instance-id | unique-per-instance |
Further Reading
- Release It! by Michael Nygard - The original circuit breaker description
- Resilience4j Documentation - Comprehensive library docs
- Shopify Engineering: Your Circuit Breaker is Misconfigured - Real-world tuning lessons
- Martin Fowler: Circuit Breaker - Pattern description
- Confluent: Kafka Consumer Configuration - Official Kafka docs
This concludes the Circuit Breaker Pattern with Kafka series.
Series Index:
- Part 1: Understanding the Fundamentals
- Part 2: Implementation and Real-World Scenarios
- Part 3: Challenges, Edge Cases, and Alternatives
- Part 4: Configuration, Testing, and Best Practices