Redistribute request max bytes across remote fetch partitions in DelayedShareFetch#22740
Open
adixitconfluent wants to merge 1 commit into
Open
Redistribute request max bytes across remote fetch partitions in DelayedShareFetch#22740adixitconfluent wants to merge 1 commit into
adixitconfluent wants to merge 1 commit into
Conversation
…ch in share groups
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
About
When a share fetch request spans a mix of local-log and remote (tiered) partitions, the
byte caps for the remote reads are computed while the budget is still divided across
all acquired partitions, and are never revised after the non-remote partitions drop out.
The sequence in
DelayedShareFetch.tryComplete():maybeReadFromLogcomputes per-partition max bytes viaPartitionMaxBytesStrategyas
requestMaxBytes / N, where N = all acquired partitions (deliberately, sopartitions fetched later don't starve).
ReplicaManagerbakes that value intoRemoteStorageFetchInfo.fetchMaxBytes(and the nestedPartitionData.maxBytes).maybeProcessRemoteFetchthen releases the N−R non-remote partitions withoutfetching them — they consume none of the budget in this request — but the R remote
reads proceed with their stale
requestMaxBytes / Ncaps unchanged.The leftover budget is only partially recovered by the compensating local read in
completeRemoteStorageShareFetchRequest(maxBytes − readableBytes). If the releasedpartitions cannot be re-acquired at completion time (e.g. grabbed by a concurrent share
fetch during the remote read — a wide window) or have no data, the response goes out
well below the request budget.
Example:
maxBytes= 1 MB, 10 partitions acquired, 1 remote. The remote read is cappedat ~100 KB; if the 9 local partitions yield nothing at completion, the response carries
~100 KB of a 1 MB budget. For tiered-storage-heavy share groups this means more round
trips for the same data.
Change
In
processRemoteFetchOrException, once the remote-only partition set is known,recompute the per-partition budget with
partitionMaxBytesStrategy.maxBytes(requestMaxBytes, remotePartitions, remotePartitions.size())and rebuild each
RemoteStorageFetchInfowith the raised cap before scheduling theRemoteLogManager.asyncRead.Notes on the implementation:
RemoteLogManager.readclamps the read atMath.min(fetchMaxBytes, fetchInfo.maxBytes), so raising only the top-levelfetchMaxByteswould be silently clamped by the nested partition-level value. Thenested
PartitionDatahere is broker-synthesized (share fetch has no client-setper-partition max bytes), so raising it does not affect any client contract.
cap, the original
RemoteStorageFetchInfois returned untouched, so theredistribution can only grant a remote read more budget than before.
requestMaxBytes, and the follow-up local read incompleteRemoteStorageShareFetchRequestalready sizes itself from the actual remotebytes returned (
maxBytes − readableBytes), so the total response cannot exceed therequest budget.
Trade-off:
RemoteLogManager.readeagerly allocates a buffer of the effective cap, somixed local/remote requests now allocate larger transient buffers (up to the full
request max bytes instead of a 1/N share). This does not raise the existing per-read
ceiling — a request whose only acquired partition is remote already received the full
budget — and concurrency remains bounded by the remote reader thread pool.
Testing
testRemoteStorageFetchMaxBytesResizedToRemoteFetchPartitions: 3 acquiredpartitions (2 local, 1 remote) with the real UNIFORM strategy, asserting the remote
read is scheduled with both byte caps raised to the full request budget and all other
fetch-info fields carried over unchanged.