Redistribute request max bytes across remote fetch partitions in DelayedShareFetch by adixitconfluent · Pull Request #22740 · apache/kafka

adixitconfluent · 2026-07-02T15:15:18Z

About

When a share fetch request spans a mix of local-log and remote (tiered) partitions, the
byte caps for the remote reads are computed while the budget is still divided across
all acquired partitions, and are never revised after the non-remote partitions drop out.

The sequence in DelayedShareFetch.tryComplete():

1. maybeReadFromLog computes per-partition max bytes via PartitionMaxBytesStrategy
   as requestMaxBytes / N, where N is the number of all acquired partitions
   (deliberately, so partitions fetched later don't starve).
2. For a tiered partition, ReplicaManager bakes that value into
   RemoteStorageFetchInfo.fetchMaxBytes (and the nested PartitionData.maxBytes).
3. maybeProcessRemoteFetch then releases the N−R non-remote partitions without
   fetching them, so they consume none of the budget in this request, but the R
   remote reads proceed with their stale requestMaxBytes / N caps unchanged.

The leftover budget is only partially recovered by the compensating local read in
completeRemoteStorageShareFetchRequest (maxBytes − readableBytes). If the released
partitions cannot be re-acquired at completion time (e.g. grabbed by a concurrent share
fetch during the remote read — a wide window) or have no data, the response goes out
well below the request budget.

Example: maxBytes = 1 MB, 10 partitions acquired, 1 remote. The remote read is capped
at ~100 KB; if the 9 local partitions yield nothing at completion, the response carries
~100 KB of a 1 MB budget. For tiered-storage-heavy share groups this means more round
trips for the same data.

Change

In processRemoteFetchOrException, once the remote-only partition set is known,
recompute the per-partition budget with
partitionMaxBytesStrategy.maxBytes(requestMaxBytes, remotePartitions, remotePartitions.size())
and rebuild each RemoteStorageFetchInfo with the raised cap before scheduling the
RemoteLogManager.asyncRead.
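
A minimal sketch of the recomputation, using the numbers from the example above. The class RemoteBudgetSketch, the helper uniformMaxBytes, and the partition name "topic-0" are hypothetical stand-ins for illustration; the helper only assumes that a uniform strategy divides the request budget evenly, and it is not the real PartitionMaxBytesStrategy API.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class RemoteBudgetSketch {

    // Hypothetical stand-in for a uniform strategy: every listed partition
    // receives requestMaxBytes / divisor (integer division).
    static Map<String, Integer> uniformMaxBytes(int requestMaxBytes,
                                                List<String> partitions,
                                                int divisor) {
        if (divisor <= 0) {
            throw new IllegalArgumentException("divisor must be positive");
        }
        Map<String, Integer> caps = new LinkedHashMap<>();
        for (String partition : partitions) {
            caps.put(partition, requestMaxBytes / divisor);
        }
        return caps;
    }

    public static void main(String[] args) {
        int requestMaxBytes = 1024 * 1024;                   // 1 MB request budget
        int acquiredCount = 10;                              // N: all acquired partitions
        List<String> remotePartitions = List.of("topic-0");  // R = 1 remote partition

        // Before: the remote cap is computed while the budget is split N ways.
        Map<String, Integer> stale =
            uniformMaxBytes(requestMaxBytes, remotePartitions, acquiredCount);

        // After: the budget is split across the remote partitions only.
        Map<String, Integer> resized =
            uniformMaxBytes(requestMaxBytes, remotePartitions, remotePartitions.size());

        System.out.println("stale cap:   " + stale.get("topic-0"));   // 104857
        System.out.println("resized cap: " + resized.get("topic-0")); // 1048576
    }
}
```

With R remote partitions, the resized caps sum to at most requestMaxBytes because each one is requestMaxBytes / R rounded down.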

Notes on the implementation:

- Both caps are replaced. RemoteLogManager.read clamps the read at
  Math.min(fetchMaxBytes, fetchInfo.maxBytes), so raising only the top-level
  fetchMaxBytes would be silently clamped by the nested partition-level value. The
  nested PartitionData here is broker-synthesized (share fetch has no client-set
  per-partition max bytes), so raising it does not affect any client contract.
- Caps are never lowered. If the recomputed value is not larger than the existing
  cap, the original RemoteStorageFetchInfo is returned untouched, so the
  redistribution can only grant a remote read more budget than before.
- The response-size invariant is preserved. The resized remote caps sum to at most
  requestMaxBytes, and the follow-up local read in
  completeRemoteStorageShareFetchRequest already sizes itself from the actual remote
  bytes returned (maxBytes − readableBytes), so the total response cannot exceed the
  request budget.
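
The first two notes can be sketched as follows. FetchCaps and resize are hypothetical, simplified stand-ins (not the real RemoteStorageFetchInfo or the PR's helper). The sketch also assumes that "existing cap" means the effective cap, that is, the smaller of the two values.

```java
public class RemoteCapResizeSketch {

    // Hypothetical holder for the two caps named in the notes: the top-level
    // fetchMaxBytes and the nested partition-level maxBytes.
    static final class FetchCaps {
        final int fetchMaxBytes;
        final int partitionMaxBytes;

        FetchCaps(int fetchMaxBytes, int partitionMaxBytes) {
            this.fetchMaxBytes = fetchMaxBytes;
            this.partitionMaxBytes = partitionMaxBytes;
        }

        // The read is clamped by the smaller of the two caps.
        int effectiveCap() {
            return Math.min(fetchMaxBytes, partitionMaxBytes);
        }
    }

    // Raises both caps to newMaxBytes, or returns the original instance
    // untouched when that would not raise the effective cap (assumption:
    // the comparison is made against the effective cap).
    static FetchCaps resize(FetchCaps original, int newMaxBytes) {
        if (newMaxBytes <= original.effectiveCap()) {
            return original;
        }
        return new FetchCaps(newMaxBytes, newMaxBytes);
    }

    public static void main(String[] args) {
        FetchCaps stale = new FetchCaps(104857, 104857);

        // Raising only the top-level cap leaves the read clamped by the nested one.
        FetchCaps onlyTopRaised = new FetchCaps(1048576, stale.partitionMaxBytes);
        System.out.println(onlyTopRaised.effectiveCap()); // 104857

        // Raising both caps lifts the effective cap to the full budget.
        System.out.println(resize(stale, 1048576).effectiveCap()); // 1048576

        // A smaller recomputed value never lowers the cap.
        System.out.println(resize(stale, 50000) == stale); // true
    }
}
```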

Trade-off: RemoteLogManager.read eagerly allocates a buffer of the effective cap, so
mixed local/remote requests now allocate larger transient buffers (up to the full
request max bytes instead of a 1/N share). This does not raise the existing per-read
ceiling — a request whose only acquired partition is remote already received the full
budget — and concurrency remains bounded by the remote reader thread pool.

Testing

New test testRemoteStorageFetchMaxBytesResizedToRemoteFetchPartitions: 3 acquired
partitions (2 local, 1 remote) with the real UNIFORM strategy, asserting the remote
read is scheduled with both byte caps raised to the full request budget and all other
fetch-info fields carried over unchanged.

Commit d6ec955: Supply maximum possible partition max bytes during remote storage fetch in share groups

github-actions bot added the triage (PRs from the community), core (Kafka Broker), and KIP-932 (Queues for Kafka) labels Jul 2, 2026

sjhajharia added the ci-approved label Jul 3, 2026

adixitconfluent wants to merge 1 commit into apache:trunk from adixitconfluent:remote_fetch_partition_max_bytes
