
SSE/WebSocket connections stall waiting for all Kafka partitions to become ready #1628

@nikhil-cli

Description

Summary

SSE and WebSocket client connections backed by Kafka experience delays or apparent "hangs" when Kafka topic configurations are aggressive (e.g., retention.ms=1h, segment.ms=10m, min.cleanable.dirty.ratio=0.1). This is because Zilla gates the client-facing reply stream (BEGIN) on all Kafka partitions becoming ready, which can be slow or unstable during frequent segment rolling, retention deletes, and compaction.

Problem Description

When a client connects to an SSE or WebSocket endpoint backed by Kafka:

  1. Zilla creates a merged Kafka stream
  2. Discovers topic metadata and partition leaders
  3. Creates a fetch stream for each partition
  4. Waits for ALL partitions to report ready before sending BEGIN to the client

Key Gating Code

In KafkaMergedFactory.java (line ~2154-2172):

private void onFetchPartitionLeaderReady(
    long traceId,
    long partitionId,
    long stableOffset,
    long latestOffset)
{
    nextOffsetsById.putIfAbsent(partitionId, defaultOffset);
    latestOffsetByPartitionId.put(partitionId, latestOffset);
    stableOffsetByPartitionId.put(partitionId, stableOffset);

    // ALL partitions must be ready before client gets BEGIN
    if (nextOffsetsById.size() == fetchStreams.size() &&
        latestOffsetByPartitionId.size() == fetchStreams.size())
    {
        doMergedReplyBeginIfNecessary(traceId);
    }
}

The same pattern exists in KafkaCacheBootstrapFactory.java (line ~780-793).

Impact

  • If any single partition is slow (leader election, offset resolution), the entire client connection stalls
  • OFFSET_OUT_OF_RANGE errors (common with short retention) trigger re-seek cycles
  • NOT_LEADER_FOR_PARTITION errors trigger metadata refresh loops
  • Reconnect delays use exponential backoff (50ms → 5s default)

Scenarios That Trigger Delays

| Kafka Config | Why It Causes Issues |
| --- | --- |
| retention.ms=3600000 (1h) | Frequent log deletions cause OFFSET_OUT_OF_RANGE |
| segment.ms=600000 (10m) | Frequent segment rolls trigger leader changes |
| min.cleanable.dirty.ratio=0.1 | Aggressive compaction causes metadata instability |
| High partition count | More partitions mean more chances for one to be slow |

Suggested Optimizations

1. Progressive Reply (High Priority)

Instead of waiting for ALL partitions to be ready, send BEGIN to the client after the first partition is ready. Stream data as partitions become available.

Current behavior: Client waits for slowest partition
Proposed behavior: Client gets data from ready partitions while others catch up
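The gating change could look roughly like the following sketch (class and method names are hypothetical, not Zilla's actual API): the first ready partition opens the reply, and later partitions simply start streaming as they catch up.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of progressive readiness: the FIRST ready partition
// triggers the client-facing BEGIN; later partitions just start streaming.
class ProgressiveReadiness
{
    private final int partitionCount;
    private final Set<Integer> readyPartitions = new HashSet<>();
    private boolean replyBegun;

    ProgressiveReadiness(int partitionCount)
    {
        this.partitionCount = partitionCount;
    }

    // Returns true exactly once: when this readiness event should send BEGIN.
    boolean onPartitionReady(int partitionId)
    {
        readyPartitions.add(partitionId);
        if (!replyBegun)
        {
            replyBegun = true;     // first ready partition opens the reply
            return true;
        }
        return false;              // later partitions just begin fetching
    }

    boolean allReady()
    {
        return readyPartitions.size() == partitionCount;
    }
}
```

With this shape, the client's time-to-first-byte is bounded by the fastest partition instead of the slowest; message ordering guarantees per partition are unaffected.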

2. Configurable Partition Readiness Timeout

Add a configuration option to proceed after a timeout if some partitions are not ready:

options:
  topics:
    - name: my-topic
      partitionReadyTimeout: 5s  # Proceed after 5s even if some partitions not ready
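The intended semantics can be sketched as follows. Zilla's engine is non-blocking, so a real change would schedule a timer signal rather than block on a latch; this latch-based sketch (all names hypothetical) only illustrates the behavior: proceed once all partitions are ready or the deadline elapses, whichever comes first.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of partitionReadyTimeout semantics: wait for all
// partitions, but never longer than the configured deadline.
class PartitionReadyTimeout
{
    private final CountDownLatch allReady;

    PartitionReadyTimeout(int partitionCount)
    {
        this.allReady = new CountDownLatch(partitionCount);
    }

    void onPartitionReady()
    {
        allReady.countDown();
    }

    // Returns true if every partition became ready within the timeout;
    // either way the caller proceeds to send BEGIN afterward.
    boolean awaitOrProceed(long timeoutMillis)
    {
        try
        {
            return allReady.await(timeoutMillis, TimeUnit.MILLISECONDS);
        }
        catch (InterruptedException e)
        {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```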

3. Smarter OFFSET_OUT_OF_RANGE Handling

Current code (line ~2975 in KafkaClientFetchFactory.java):

case ERROR_OFFSET_OUT_OF_RANGE:
    // TODO: recover at EARLIEST or LATEST ?
    nextOffset = OFFSET_HISTORICAL;
    client.encoder = client.encodeOffsetsRequest;

The // TODO comment indicates this is a known issue. Consider:

  • Making the recovery behavior configurable (EARLIEST vs LATEST)
  • Avoiding repeated ListOffsets cycles
  • Faster fallback to LATEST for live consumers
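A configurable recovery could be as simple as the sketch below (class and method names are hypothetical; the constants follow Kafka's ListOffsets convention of -2 for earliest and -1 for latest):

```java
// Hypothetical sketch: choose the OFFSET_OUT_OF_RANGE recovery target from
// configuration instead of always re-seeking to the earliest retained offset.
class OffsetRecovery
{
    enum Policy { EARLIEST, LATEST }

    static final long OFFSET_HISTORICAL = -2L; // earliest retained offset
    static final long OFFSET_LIVE = -1L;       // latest (live head)

    static long recoverOffset(Policy policy)
    {
        // Live consumers recover faster at LATEST; replay use cases need EARLIEST.
        return policy == Policy.LATEST ? OFFSET_LIVE : OFFSET_HISTORICAL;
    }
}
```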

4. Reduce Default Reconnect Delay

The current default, cache.server.reconnect=5 (seconds), is too long for real-time streaming use cases.

Suggestion: Reduce to 1 second or make it adaptive based on error type.
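An adaptive variant might classify the failure and scale the delay accordingly. The sketch below is illustrative (the error classification and values are not Zilla's actual code): cheap, self-healing conditions retry quickly, while genuine network faults keep exponential backoff.

```java
// Hypothetical sketch: pick the reconnect delay from the error type rather
// than one fixed default. Classification and values are illustrative.
class AdaptiveReconnect
{
    enum ErrorKind { OFFSET_OUT_OF_RANGE, LEADER_CHANGE, NETWORK }

    static long reconnectDelayMillis(ErrorKind kind, int attempt)
    {
        switch (kind)
        {
        case OFFSET_OUT_OF_RANGE:
            return 50L;                                // re-seek is cheap, retry fast
        case LEADER_CHANGE:
            return 250L;                               // give metadata time to settle
        default:
            long delay = 50L << Math.min(attempt, 6);  // 50ms doubling up to 3.2s
            return Math.min(delay, 5_000L);            // never exceed 5s
        }
    }
}
```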

5. Document Bootstrap Behavior

The cache.server.bootstrap=true default means cache must warm up before serving. This should be clearly documented, and users should be advised to set defaultOffset: live for real-time use cases.
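In zilla.yaml, the live-offset advice would sit in the topic options of the cache_server binding; a sketch (the binding name is illustrative):

```yaml
kafka_cache_server:            # illustrative binding name
  type: kafka
  kind: cache_server
  options:
    topics:
      - name: my-topic
        defaultOffset: live    # serve from the live head, skip historical replay
```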

Workarounds (Current)

Users can mitigate with:

# Reduce reconnect delay
zilla.binding.kafka.cache.server.reconnect=1

# Use live offset to skip historical replay
# (in zilla.yaml topic config)
defaultOffset: live

# Optionally disable bootstrap
zilla.binding.kafka.cache.server.bootstrap=false

Debugging

Enable debug logging with:

-Dzilla.binding.kafka.debug=true

Look for:

  • FETCH reconnect in Xs - indicates reconnect delays
  • FETCH disconnect - partition stream failures
  • Gaps in partition ready messages
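A quick way to isolate these markers from captured output (the sample lines and log path below are placeholders for your real capture):

```shell
# Illustrative: filter captured Zilla debug output for the markers above.
printf '%s\n' \
  'FETCH reconnect in 5s' \
  'unrelated debug line' \
  'FETCH disconnect' > zilla.log

grep -E 'FETCH (reconnect in|disconnect)' zilla.log
```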

Related Files

  • runtime/binding-kafka/src/main/java/io/aklivity/zilla/runtime/binding/kafka/internal/stream/KafkaMergedFactory.java
  • runtime/binding-kafka/src/main/java/io/aklivity/zilla/runtime/binding/kafka/internal/stream/KafkaCacheBootstrapFactory.java
  • runtime/binding-kafka/src/main/java/io/aklivity/zilla/runtime/binding/kafka/internal/stream/KafkaClientFetchFactory.java
  • runtime/binding-kafka/src/main/java/io/aklivity/zilla/runtime/binding/kafka/internal/stream/KafkaCacheServerFetchFactory.java

Environment

  • Zilla version: Latest (develop branch)
  • Kafka: Any version with aggressive topic configs
  • Bindings: SSE-Kafka, WebSocket (via HTTP-Kafka)

Labels

enhancement kafka sse performance
