
SSE/WebSocket connections stall waiting for all Kafka partitions to become ready #1628

@nikhil-cli

Description

Summary

SSE and WebSocket client connections backed by Kafka experience delays or apparent "hangs" when Kafka topic configurations are aggressive (e.g., retention.ms=1h, segment.ms=10m, min.cleanable.dirty.ratio=0.1). This is because Zilla gates the client-facing reply stream (BEGIN) on all Kafka partitions becoming ready, which can be slow or unstable during frequent segment rolling, retention deletes, and compaction.

Problem Description

When a client connects to an SSE or WebSocket endpoint backed by Kafka:

  1. Zilla creates a merged Kafka stream
  2. Discovers topic metadata and partition leaders
  3. Creates a fetch stream for each partition
  4. Waits for ALL partitions to report ready before sending BEGIN to the client

Key Gating Code

In KafkaMergedFactory.java (line ~2154-2172):

private void onFetchPartitionLeaderReady(
    long traceId,
    long partitionId,
    long stableOffset,
    long latestOffset)
{
    nextOffsetsById.putIfAbsent(partitionId, defaultOffset);
    latestOffsetByPartitionId.put(partitionId, latestOffset);
    stableOffsetByPartitionId.put(partitionId, stableOffset);

    // ALL partitions must be ready before client gets BEGIN
    if (nextOffsetsById.size() == fetchStreams.size() &&
        latestOffsetByPartitionId.size() == fetchStreams.size())
    {
        doMergedReplyBeginIfNecessary(traceId);
    }
}

The same pattern exists in KafkaCacheBootstrapFactory.java (line ~780-793).

Impact

  • If any single partition is slow (leader election, offset resolution), the entire client connection stalls
  • OFFSET_OUT_OF_RANGE errors (common with short retention) trigger re-seek cycles
  • NOT_LEADER_FOR_PARTITION errors trigger metadata refresh loops
  • Reconnect delays use exponential backoff (50ms → 5s default)

Scenarios That Trigger Delays

| Kafka Config | Why It Causes Issues |
| --- | --- |
| retention.ms=3600000 (1h) | Frequent log deletions cause OFFSET_OUT_OF_RANGE |
| segment.ms=600000 (10m) | Frequent segment rolls trigger leader changes |
| min.cleanable.dirty.ratio=0.1 | Aggressive compaction causes metadata instability |
| High partition count | More partitions mean more chances for one to be slow |

Suggested Optimizations

1. Progressive Reply (High Priority)

Instead of waiting for ALL partitions to be ready, send BEGIN to the client after the first partition is ready. Stream data as partitions become available.

Current behavior: Client waits for slowest partition
Proposed behavior: Client gets data from ready partitions while others catch up
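The gating change could look roughly like the following sketch (class and method names are hypothetical, not Zilla's actual API): the first ready partition opens the reply, and later partitions simply start streaming as they catch up.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of progressive readiness: the FIRST ready partition
// triggers the client-facing BEGIN; later partitions just start streaming.
class ProgressiveReadiness
{
    private final int partitionCount;
    private final Set<Integer> readyPartitions = new HashSet<>();
    private boolean replyBegun;

    ProgressiveReadiness(int partitionCount)
    {
        this.partitionCount = partitionCount;
    }

    // Returns true exactly once: when this readiness event should send BEGIN.
    boolean onPartitionReady(int partitionId)
    {
        readyPartitions.add(partitionId);
        if (!replyBegun)
        {
            replyBegun = true;     // first ready partition opens the reply
            return true;
        }
        return false;              // later partitions just begin fetching
    }

    boolean allReady()
    {
        return readyPartitions.size() == partitionCount;
    }
}
```

With this shape, the client's time-to-first-byte is bounded by the fastest partition instead of the slowest; message ordering guarantees per partition are unaffected.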

2. Configurable Partition Readiness Timeout

Add a configuration option to proceed after a timeout if some partitions are not ready:

options:
  topics:
    - name: my-topic
      partitionReadyTimeout: 5s  # Proceed after 5s even if some partitions not ready
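The intended semantics can be sketched as follows. Zilla's engine is non-blocking, so a real change would schedule a timer signal rather than block on a latch; this latch-based sketch (all names hypothetical) only illustrates the behavior: proceed once all partitions are ready or the deadline elapses, whichever comes first.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of partitionReadyTimeout semantics: wait for all
// partitions, but never longer than the configured deadline.
class PartitionReadyTimeout
{
    private final CountDownLatch allReady;

    PartitionReadyTimeout(int partitionCount)
    {
        this.allReady = new CountDownLatch(partitionCount);
    }

    void onPartitionReady()
    {
        allReady.countDown();
    }

    // Returns true if every partition became ready within the timeout;
    // either way the caller proceeds to send BEGIN afterward.
    boolean awaitOrProceed(long timeoutMillis)
    {
        try
        {
            return allReady.await(timeoutMillis, TimeUnit.MILLISECONDS);
        }
        catch (InterruptedException e)
        {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```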

3. Smarter OFFSET_OUT_OF_RANGE Handling

Current code (line ~2975 in KafkaClientFetchFactory.java):

case ERROR_OFFSET_OUT_OF_RANGE:
    // TODO: recover at EARLIEST or LATEST ?
    nextOffset = OFFSET_HISTORICAL;
    client.encoder = client.encodeOffsetsRequest;

The // TODO comment indicates this is a known issue. Consider:

  • Making the recovery behavior configurable (EARLIEST vs LATEST)
  • Avoiding repeated ListOffsets cycles
  • Faster fallback to LATEST for live consumers
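A configurable recovery could be as simple as the sketch below (class and method names are hypothetical; the constants follow Kafka's ListOffsets convention of -2 for earliest and -1 for latest):

```java
// Hypothetical sketch: choose the OFFSET_OUT_OF_RANGE recovery target from
// configuration instead of always re-seeking to the earliest retained offset.
class OffsetRecovery
{
    enum Policy { EARLIEST, LATEST }

    static final long OFFSET_HISTORICAL = -2L; // earliest retained offset
    static final long OFFSET_LIVE = -1L;       // latest (live head)

    static long recoverOffset(Policy policy)
    {
        // Live consumers recover faster at LATEST; replay use cases need EARLIEST.
        return policy == Policy.LATEST ? OFFSET_LIVE : OFFSET_HISTORICAL;
    }
}
```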

4. Reduce Default Reconnect Delay

The current default, cache.server.reconnect=5 (seconds), is too long for real-time streaming use cases.

Suggestion: Reduce to 1 second or make it adaptive based on error type.
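An adaptive variant might classify the failure and scale the delay accordingly. The sketch below is illustrative (the error classification and values are not Zilla's actual code): cheap, self-healing conditions retry quickly, while genuine network faults keep exponential backoff.

```java
// Hypothetical sketch: pick the reconnect delay from the error type rather
// than one fixed default. Classification and values are illustrative.
class AdaptiveReconnect
{
    enum ErrorKind { OFFSET_OUT_OF_RANGE, LEADER_CHANGE, NETWORK }

    static long reconnectDelayMillis(ErrorKind kind, int attempt)
    {
        switch (kind)
        {
        case OFFSET_OUT_OF_RANGE:
            return 50L;                                // re-seek is cheap, retry fast
        case LEADER_CHANGE:
            return 250L;                               // give metadata time to settle
        default:
            long delay = 50L << Math.min(attempt, 6);  // 50ms doubling up to 3.2s
            return Math.min(delay, 5_000L);            // never exceed 5s
        }
    }
}
```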

5. Document Bootstrap Behavior

The cache.server.bootstrap=true default means cache must warm up before serving. This should be clearly documented, and users should be advised to set defaultOffset: live for real-time use cases.
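In zilla.yaml, the live-offset advice would sit in the topic options of the cache_server binding; a sketch (the binding name is illustrative):

```yaml
kafka_cache_server:            # illustrative binding name
  type: kafka
  kind: cache_server
  options:
    topics:
      - name: my-topic
        defaultOffset: live    # serve from the live head, skip historical replay
```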

Workarounds (Current)

Users can mitigate with:

# Reduce reconnect delay
zilla.binding.kafka.cache.server.reconnect=1

# Use live offset to skip historical replay
# (in zilla.yaml topic config)
defaultOffset: live

# Optionally disable bootstrap
zilla.binding.kafka.cache.server.bootstrap=false

Debugging

Enable debug logging with:

-Dzilla.binding.kafka.debug=true

Look for:

  • FETCH reconnect in Xs - indicates reconnect delays
  • FETCH disconnect - partition stream failures
  • Gaps in partition ready messages
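A quick way to isolate these markers from captured output (the sample lines and log path below are placeholders for your real capture):

```shell
# Illustrative: filter captured Zilla debug output for the markers above.
printf '%s\n' \
  'FETCH reconnect in 5s' \
  'unrelated debug line' \
  'FETCH disconnect' > zilla.log

grep -E 'FETCH (reconnect in|disconnect)' zilla.log
```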

Related Files

  • runtime/binding-kafka/src/main/java/io/aklivity/zilla/runtime/binding/kafka/internal/stream/KafkaMergedFactory.java
  • runtime/binding-kafka/src/main/java/io/aklivity/zilla/runtime/binding/kafka/internal/stream/KafkaCacheBootstrapFactory.java
  • runtime/binding-kafka/src/main/java/io/aklivity/zilla/runtime/binding/kafka/internal/stream/KafkaClientFetchFactory.java
  • runtime/binding-kafka/src/main/java/io/aklivity/zilla/runtime/binding/kafka/internal/stream/KafkaCacheServerFetchFactory.java

Environment

  • Zilla version: Latest (develop branch)
  • Kafka: Any version with aggressive topic configs
  • Bindings: SSE-Kafka, WebSocket (via HTTP-Kafka)

Labels

enhancement kafka sse performance
