
[v25.3.x] cluster: catch exceptions in sync_kafka_start_offset_override() & fetch path#30112

Open
vbotbuildovich wants to merge 4 commits into redpanda-data:v25.3.x from vbotbuildovich:backport-pr-30097-v25.3.x-248

@vbotbuildovich

Collaborator

Backport of PR #30097

Calls made to `log_eviction_stm` and `archival_metadata_stm` within
this function can throw various exceptions (`timed_out`, shutdown
exceptions). These propagate as exceptional futures through
`validate_fetch_offset()` into the fetch handler's
`max_concurrent_for_each()` loop, causing an entire multi-partition
fetch request to fail when only one partition has an error.

Wrap the coroutine body in a try-catch so that exceptions are
returned as result errors instead. Shutdown exceptions map to
`errc::shutting_down` and all other exceptions map to
`errc::timeout`, ensuring that a failure in the fetch path for a single
partition will _only_ affect the single partition's response.
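The mapping described above can be sketched in plain C++. This is a simplified, synchronous illustration only: the actual change wraps a Seastar coroutine body, and every name below (`errc` as defined here, `shutdown_error`, `timed_out_error`, `offset_result`, `sync_start_offset`, `fetch_all`) is a hypothetical stand-in rather than Redpanda's real type or signature.

```cpp
#include <cassert>
#include <functional>
#include <stdexcept>
#include <vector>

// Hypothetical stand-ins for illustration; not Redpanda's actual types.
enum class errc { success, timeout, shutting_down };

struct shutdown_error : std::runtime_error {
    using std::runtime_error::runtime_error;
};
struct timed_out_error : std::runtime_error {
    using std::runtime_error::runtime_error;
};

struct offset_result {
    errc error;
    long offset; // meaningful only when error == errc::success
};

// Simulates a call into a state machine that may throw.
using stm_call = std::function<long()>;

// Wrap the whole body in try-catch so that exceptions become result errors:
// shutdown maps to errc::shutting_down, everything else to errc::timeout.
offset_result sync_start_offset(const stm_call& call) {
    try {
        return {errc::success, call()};
    } catch (const shutdown_error&) {
        return {errc::shutting_down, -1};
    } catch (...) {
        return {errc::timeout, -1};
    }
}

// Because each partition yields a result instead of throwing, one failing
// partition no longer aborts the loop over the remaining partitions.
std::vector<offset_result> fetch_all(const std::vector<stm_call>& partitions) {
    std::vector<offset_result> out;
    out.reserve(partitions.size());
    for (const auto& p : partitions) {
        out.push_back(sync_start_offset(p));
    }
    return out;
}
```

In this sketch, a request over three partitions where the second times out and the third is shutting down still produces three per-partition results, and only the first carries a valid offset.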

(cherry picked from commit e0d881c)
Anything is better than holding client connections open and never
returning a response here.

Also remove a now-redundant exception handler around the call to
`read_from_partition()`.

(cherry picked from commit bb12c1a)
@vbotbuildovich added this to the v25.3.x-next milestone on Apr 9, 2026
@vbotbuildovich added the kind/backport label (PRs targeting a stable branch) on Apr 9, 2026
To make this version of the function call more similar to future versions,
enabling the previous test to fail.
@vbotbuildovich

Collaborator Author

CI test results

test results on build#83088
| test_status | test_class | test_method | test_arguments | test_kind | job_url | passed | reason | test_history |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FLAKY(PASS) | CompactionRecoveryUpgradeTest | test_index_recovery_after_upgrade | null | integration | https://buildkite.com/redpanda/redpanda/builds/83088#019d8786-5e13-435a-9e94-8acd9ff8f5c8 | 10/11 | Test PASSES after retries. No significant increase in flaky rate (baseline=0.0000, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=CompactionRecoveryUpgradeTest&test_method=test_index_recovery_after_upgrade |

Labels

area/build area/redpanda kind/backport PRs targeting a stable branch
