## Summary
SSE and WebSocket client connections backed by Kafka experience delays or apparent "hangs" when Kafka topic configurations are aggressive (e.g., `retention.ms=1h`, `segment.ms=10m`, `min.cleanable.dirty.ratio=0.1`). This is because Zilla gates the client-facing reply stream (BEGIN) on all Kafka partitions becoming ready, which can be slow or unstable during frequent segment rolling, retention deletes, and compaction.
## Problem Description
When a client connects to an SSE or WebSocket endpoint backed by Kafka:
- Zilla creates a merged Kafka stream
- Discovers topic metadata and partition leaders
- Creates a fetch stream for each partition
- Waits for ALL partitions to report ready before sending BEGIN to the client
## Key Gating Code

In `KafkaMergedFactory.java` (around lines 2154-2172):
```java
private void onFetchPartitionLeaderReady(
    long traceId,
    long partitionId,
    long stableOffset,
    long latestOffset)
{
    nextOffsetsById.putIfAbsent(partitionId, defaultOffset);
    latestOffsetByPartitionId.put(partitionId, latestOffset);
    stableOffsetByPartitionId.put(partitionId, stableOffset);

    // ALL partitions must be ready before client gets BEGIN
    if (nextOffsetsById.size() == fetchStreams.size() &&
        latestOffsetByPartitionId.size() == fetchStreams.size())
    {
        doMergedReplyBeginIfNecessary(traceId);
    }
}
```

The same pattern exists in `KafkaCacheBootstrapFactory.java` (around lines 780-793).
## Impact

- If any single partition is slow (leader election, offset resolution), the entire client connection stalls
- `OFFSET_OUT_OF_RANGE` errors (common with short retention) trigger re-seek cycles
- `NOT_LEADER_FOR_PARTITION` errors trigger metadata refresh loops
- Reconnect delays use exponential backoff (50ms → 5s default)
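The backoff schedule above compounds the problem: a few failed reconnects quickly reach the 5s cap. Here is a small sketch of that growth; the 50ms floor and 5s cap come from the defaults mentioned above, while the exact doubling policy is an assumption for illustration:

```java
// Sketch of the exponential reconnect backoff described above.
// The 50 ms floor and 5 s cap are from the issue; doubling is assumed.
public class ReconnectBackoff
{
    private static final long INITIAL_DELAY_MS = 50;
    private static final long MAX_DELAY_MS = 5_000;

    // delay = min(initial * 2^attempt, max)
    public static long delayForAttempt(int attempt)
    {
        long delay = INITIAL_DELAY_MS << Math.min(attempt, 30); // clamp shift to avoid overflow
        return Math.min(delay, MAX_DELAY_MS);
    }
}
```

Under this policy the client hits the 5s ceiling by roughly the seventh retry, so a handful of `OFFSET_OUT_OF_RANGE` cycles is enough to make the stream feel hung.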
## Scenarios That Trigger Delays
| Kafka Config | Why It Causes Issues |
|---|---|
| `retention.ms=3600000` (1h) | Frequent log deletions cause `OFFSET_OUT_OF_RANGE` |
| `segment.ms=600000` (10m) | Frequent segment rolls trigger leader changes |
| `min.cleanable.dirty.ratio=0.1` | Aggressive compaction causes metadata instability |
| High partition count | More partitions = more chances for one to be slow |
## Suggested Optimizations

### 1. Progressive Reply (High Priority)
Instead of waiting for ALL partitions to be ready, send BEGIN to the client after the first partition is ready. Stream data as partitions become available.
- **Current behavior:** client waits for the slowest partition
- **Proposed behavior:** client gets data from ready partitions while others catch up
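The difference between the two gating strategies can be modeled with a minimal, self-contained sketch (class and method names here are illustrative, not Zilla's API):

```java
import java.util.HashSet;
import java.util.Set;

// Minimal model of the partition-readiness gate before BEGIN is sent.
public class ProgressiveGate
{
    private final int totalPartitions;
    private final Set<Integer> ready = new HashSet<>();
    private boolean beginSent;

    public ProgressiveGate(int totalPartitions)
    {
        this.totalPartitions = totalPartitions;
    }

    // Current behavior: BEGIN only when every partition is ready.
    public boolean onPartitionReadyAllGate(int partitionId)
    {
        ready.add(partitionId);
        if (!beginSent && ready.size() == totalPartitions)
        {
            beginSent = true;
        }
        return beginSent;
    }

    // Proposed behavior: BEGIN as soon as the first partition is ready;
    // the remaining partitions are merged into the stream as they catch up.
    public boolean onPartitionReadyFirstGate(int partitionId)
    {
        ready.add(partitionId);
        if (!beginSent)
        {
            beginSent = true;
        }
        return beginSent;
    }
}
```

With the first-partition gate, a single slow partition delays only its own data, not the client's connection establishment.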
### 2. Configurable Partition Readiness Timeout
Add a configuration option to proceed after a timeout if some partitions are not ready:
```yaml
options:
  topics:
    - name: my-topic
      partitionReadyTimeout: 5s   # proceed after 5s even if some partitions are not ready
```

### 3. Smarter OFFSET_OUT_OF_RANGE Handling
Current code (line ~2975 in KafkaClientFetchFactory.java):
case ERROR_OFFSET_OUT_OF_RANGE:
// TODO: recover at EARLIEST or LATEST ?
nextOffset = OFFSET_HISTORICAL;
client.encoder = client.encodeOffsetsRequest;The // TODO comment indicates this is a known issue. Consider:
- Making the recovery behavior configurable (EARLIEST vs LATEST)
- Avoiding repeated ListOffsets cycles
- Faster fallback to LATEST for live consumers
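A configurable recovery could be as simple as mapping a policy setting to the re-seek target. A sketch, with illustrative names (the `-1`/`-2` values are Kafka's standard LATEST/EARLIEST timestamp sentinels, assumed here to match Zilla's `OFFSET_LIVE`/`OFFSET_HISTORICAL`):

```java
// Hypothetical configurable recovery for OFFSET_OUT_OF_RANGE; the cited
// Zilla code hard-codes the historical (earliest-like) offset.
public class OffsetRecovery
{
    public enum Policy { EARLIEST, LATEST }

    public static final long OFFSET_HISTORICAL = -2L; // earliest (Kafka ListOffsets sentinel)
    public static final long OFFSET_LIVE = -1L;       // latest (Kafka ListOffsets sentinel)

    // Choose the re-seek target from configuration instead of always
    // falling back to the historical offset.
    public static long recoveryOffset(Policy policy)
    {
        return policy == Policy.LATEST ? OFFSET_LIVE : OFFSET_HISTORICAL;
    }
}
```

For live SSE/WebSocket consumers, a `LATEST` policy would skip the re-replay of records that short retention is about to delete anyway.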
### 4. Reduce Default Reconnect Delay
The current default of `cache.server.reconnect=5` (seconds) is too long for real-time streaming.
Suggestion: Reduce to 1 second or make it adaptive based on error type.
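An adaptive variant might key the delay off the error that caused the disconnect. A hypothetical sketch (error names and delay values are illustrative, not Zilla's actual constants):

```java
// Hypothetical adaptive reconnect delay: cheap, transient errors retry
// quickly; heavier metadata-level failures wait longer.
public class AdaptiveReconnect
{
    public enum ErrorType { OFFSET_OUT_OF_RANGE, NOT_LEADER_FOR_PARTITION, NETWORK }

    public static long reconnectDelayMillis(ErrorType error)
    {
        switch (error)
        {
        case OFFSET_OUT_OF_RANGE:
            return 100;   // offset re-seek is cheap; retry almost immediately
        case NOT_LEADER_FOR_PARTITION:
            return 500;   // allow a moment for leader election to settle
        default:
            return 1_000; // suggested reduced default (down from 5s)
        }
    }
}
```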
### 5. Document Bootstrap Behavior
The `cache.server.bootstrap=true` default means the cache must warm up before serving. This should be clearly documented, and users should be advised to set `defaultOffset: live` for real-time use cases.
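For example, a topic configured for live-only consumption might look like this in `zilla.yaml` (a sketch; the binding and topic names are placeholders):

```yaml
bindings:
  kafka_cache_server:        # placeholder binding name
    type: kafka
    kind: cache_server
    options:
      topics:
        - name: my-topic     # placeholder topic
          defaultOffset: live   # serve from the live head instead of replaying history
```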
## Workarounds (Current)
Users can mitigate with:
```
# Reduce reconnect delay
zilla.binding.kafka.cache.server.reconnect=1

# Use live offset to skip historical replay
# (in zilla.yaml topic config)
defaultOffset: live

# Optionally disable bootstrap
zilla.binding.kafka.cache.server.bootstrap=false
```

## Debugging
Enable debug logging with:

```
-Dzilla.binding.kafka.debug=true
```

Look for:

- `FETCH reconnect in Xs` - indicates reconnect delays
- `FETCH disconnect` - partition stream failures
- Gaps in partition ready messages
## Related Files

- `runtime/binding-kafka/src/main/java/io/aklivity/zilla/runtime/binding/kafka/internal/stream/KafkaMergedFactory.java`
- `runtime/binding-kafka/src/main/java/io/aklivity/zilla/runtime/binding/kafka/internal/stream/KafkaCacheBootstrapFactory.java`
- `runtime/binding-kafka/src/main/java/io/aklivity/zilla/runtime/binding/kafka/internal/stream/KafkaClientFetchFactory.java`
- `runtime/binding-kafka/src/main/java/io/aklivity/zilla/runtime/binding/kafka/internal/stream/KafkaCacheServerFetchFactory.java`
## Environment
- Zilla version: Latest (develop branch)
- Kafka: Any version with aggressive topic configs
- Bindings: SSE-Kafka, WebSocket (via HTTP-Kafka)
## Labels

`enhancement` `kafka` `sse` `performance`