[v25.3.x] Decom can cancel simple node add raft0 reconfigurations#30661

Open
vbotbuildovich wants to merge 3 commits into redpanda-data:v25.3.x from vbotbuildovich:ai-backport-pr-30377-v25.3.x-1780336611

Conversation

@vbotbuildovich

Collaborator

Backport of PR #30377

  • Command: git cherry-pick -x 1f884b8 6ad1fe9 8edf586 3c6dbbb
  • Commits backported: 4
  • Conflicts resolved: 2
  • Commits skipped (already on target): 0
  • Backport branch: ai-backport-pr-30377-v25.3.x-1780336611

Conflict details

  • 1f884b8 (src/v/cluster/members_backend.cc): v25.3.x still uses the legacy operator<<(std::ostream&, partition_reallocation&) formatter while the source branch had migrated to a format_to member; placed the new cancel_raft0_add definition before the existing operator<< and kept v25.3.x's formatter style.
  • 6ad1fe9 (src/v/cluster/controller.cc): adjacent include lines differed — v25.3.x has #include "security/acl.h" directly after raft/fwd.h, the source commit inserts #include "raft/types.h"; kept both in alphabetical order.

@vbotbuildovich vbotbuildovich requested a review from a team as a code owner June 1, 2026 17:58
@vbotbuildovich vbotbuildovich added this to the v25.3.x-next milestone Jun 1, 2026
@vbotbuildovich vbotbuildovich added the kind/backport PRs targeting a stable branch label Jun 1, 2026
@vbotbuildovich

vbotbuildovich commented Jun 1, 2026

Collaborator Author

Retry command for Build#85191

please wait until all jobs are finished before running the slash command

/ci-repeat 1
skip-redpanda-build
skip-units
skip-rebase
tests/rptest/tests/throttled_raft0_test.py::StuckRaft0LearnerTest.test_decommission_cancels_in_flight_raft0_add

@vbotbuildovich

vbotbuildovich commented Jun 1, 2026

Collaborator Author

CI test results

test results on build#85191

| test_status | test_class | test_method | test_arguments | test_kind | job_url | passed | reason | test_history |
|---|---|---|---|---|---|---|---|---|
| FAIL | StuckRaft0LearnerTest | test_decommission_cancels_in_flight_raft0_add | null | integration | https://buildkite.com/redpanda/redpanda/builds/85191#019e846f-f118-4d3c-b473-edd98ad4fad7 | 0/11 | The test was found to be new, and no failures are allowed | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=StuckRaft0LearnerTest&test_method=test_decommission_cancels_in_flight_raft0_add |
| FAIL | StuckRaft0LearnerTest | test_decommission_cancels_in_flight_raft0_add | null | integration | https://buildkite.com/redpanda/redpanda/builds/85191#019e8475-ef56-40ee-a0e3-c4ff430fbda6 | 0/11 | The test was found to be new, and no failures are allowed | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=StuckRaft0LearnerTest&test_method=test_decommission_cancels_in_flight_raft0_add |

test results on build#86182

| test_status | test_class | test_method | test_arguments | test_kind | job_url | passed | reason | test_history |
|---|---|---|---|---|---|---|---|---|
| FAIL | ClusterConfigLegacyDefaultTest | test_removal_of_legacy_default_defaulted | {"wipe_cache": true} | integration | https://buildkite.com/redpanda/redpanda/builds/86182#019ef5d9-2ff1-49bb-8e57-e0d8c42a02aa | 0/1 | | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ClusterConfigLegacyDefaultTest&test_method=test_removal_of_legacy_default_defaulted |
| FAIL | CompactionRecoveryUpgradeTest | test_index_recovery_after_upgrade | null | integration | https://buildkite.com/redpanda/redpanda/builds/86182#019ef5d9-2ff1-49bb-8e57-e0d8c42a02aa | 0/1 | | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=CompactionRecoveryUpgradeTest&test_method=test_index_recovery_after_upgrade |
| FLAKY(FAIL) | DataTransformsLoggingTest | test_log_topic_integrity | null | integration | https://buildkite.com/redpanda/redpanda/builds/86182#019ef5dd-efaf-4ad9-9d3c-e3dee544b68c | 9/11 | Test FAILS after retries. Significant increase in flaky rate (baseline=0.0000, p0=0.0000, reject_threshold=0.0100) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=DataTransformsLoggingTest&test_method=test_log_topic_integrity |
| FAIL | SchemaScaleTest | schema_scale_test | {"catalog_type": "nessie", "cloud_storage_type": 1, "query_engine": "trino", "use_partition_spec": false} | integration | https://buildkite.com/redpanda/redpanda/builds/86182#019ef5d9-2ff1-49bb-8e57-e0d8c42a02aa | 0/1 | | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=SchemaScaleTest&test_method=schema_scale_test |
| FLAKY(PASS) | NodesDecommissioningTest | test_decommission_status | null | integration | https://buildkite.com/redpanda/redpanda/builds/86182#019ef5dd-efa9-4f0e-9be2-5a4bffcc23e9 | 9/11 | Test PASSES after retries. No significant increase in flaky rate (baseline=0.0625, p0=0.4755, reject_threshold=0.0100. adj_baseline=0.1760, p1=0.4524, trust_threshold=0.5000) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=NodesDecommissioningTest&test_method=test_decommission_status |
| FLAKY(PASS) | PartitionMovementUpgradeTest | test_basic_upgrade | null | integration | https://buildkite.com/redpanda/redpanda/builds/86182#019ef5d9-2ff3-48b7-a199-1f99202d7eae | 10/11 | Test PASSES after retries. No significant increase in flaky rate (baseline=0.0000, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=PartitionMovementUpgradeTest&test_method=test_basic_upgrade |
| FAIL | StuckRaft0LearnerTest | test_decommission_cancels_in_flight_raft0_add | null | integration | https://buildkite.com/redpanda/redpanda/builds/86182#019ef5d9-2ff3-48b7-a199-1f99202d7eae | 0/11 | The test was found to be new, and no failures are allowed | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=StuckRaft0LearnerTest&test_method=test_decommission_cancels_in_flight_raft0_add |
| FAIL | StuckRaft0LearnerTest | test_decommission_cancels_in_flight_raft0_add | null | integration | https://buildkite.com/redpanda/redpanda/builds/86182#019ef5dd-efaf-4ad9-9d3c-e3dee544b68c | 0/11 | The test was found to be new, and no failures are allowed | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=StuckRaft0LearnerTest&test_method=test_decommission_cancels_in_flight_raft0_add |

test results on build#86237

| test_status | test_class | test_method | test_arguments | test_kind | job_url | passed | reason | test_history |
|---|---|---|---|---|---|---|---|---|
| FLAKY(PASS) | EndToEndShadowIndexingTestWithDisruptions | test_write_with_node_failures | {"cloud_storage_type": 2} | integration | https://buildkite.com/redpanda/redpanda/builds/86237#019efa5a-3921-40d2-9d2e-4d9cab489758 | 10/11 | Test PASSES after retries. No significant increase in flaky rate (baseline=0.0000, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=EndToEndShadowIndexingTestWithDisruptions&test_method=test_write_with_node_failures |
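
The p0 and p1 values in the reason column are consistent with binomial tail probabilities over 10 retries. This reading is inferred from the reported numbers alone, not from any documentation of the CI tooling, and the function names below are hypothetical. The sketch also assumes that the 11 runs are one initial failing run plus 10 retries, so a 9/11 result means one failure among the retries.

```python
from math import comb


def binom_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k failures in n independent runs at failure rate p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def upper_tail(f: int, n: int, p: float) -> float:
    """P(X >= f): chance of at least f failures at failure rate p."""
    return sum(binom_pmf(k, n, p) for k in range(f, n + 1))


def lower_tail(f: int, n: int, p: float) -> float:
    """P(X <= f): chance of at most f failures at failure rate p."""
    return sum(binom_pmf(k, n, p) for k in range(0, f + 1))


RETRIES = 10  # assumption: 11 runs = 1 initial failing run + 10 retries

# NodesDecommissioningTest, 9/11 passed: 1 failure among the retries (assumed).
p0 = upper_tail(1, RETRIES, 0.0625)  # table reports p0=0.4755
p1 = lower_tail(1, RETRIES, 0.1760)  # table reports p1=0.4524; adj_baseline is a rounded input

# PartitionMovementUpgradeTest, 10/11 passed: 0 failures among the retries.
p1_clean = lower_tail(0, RETRIES, 0.1000)  # table reports p1=0.3487

print(round(p0, 4), round(p1, 4), round(p1_clean, 4))
```

Under the same reading, the DataTransformsLoggingTest row gets p0 = 0, because a baseline of 0.0000 cannot produce any failure. That falls below reject_threshold=0.0100 and matches its FLAKY(FAIL) status. How adj_baseline is derived from baseline is not recoverable from the table.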

Allows a decommission / node removal request to cancel the addition of a
node to raft0.

Previously, the decommission waited for the node to transition from
learner to voter before it could succeed. This could deadlock if the
learner died before finishing recovery.

(cherry picked from commit 1f884b8)
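
The behavior change in this commit can be pictured as a small decision function. This is a toy model for illustration only, not the code in members_backend.cc: the types, the `allow_cancel` switch, and the function name are hypothetical, and only the `cancel_raft0_add` name is taken from the conflict notes in this PR.

```python
from dataclasses import dataclass
from enum import Enum


class Raft0Role(Enum):
    LEARNER = "learner"  # added to raft0, still recovering
    VOTER = "voter"      # finished recovery and was promoted


class Action(Enum):
    CANCEL_ADD = "cancel_raft0_add"
    REMOVE_VOTER = "remove_from_raft0"
    WAIT_FOR_PROMOTION = "wait_for_promotion"


@dataclass
class NodeState:
    node_id: int
    role: Raft0Role


def decommission_action(node: NodeState, allow_cancel: bool = True) -> Action:
    """Pick the raft0 step for a decommission request (hypothetical model)."""
    if node.role is Raft0Role.VOTER:
        return Action.REMOVE_VOTER
    # The node is still a learner, so its add reconfiguration is in flight.
    if allow_cancel:
        return Action.CANCEL_ADD  # behavior described by this change
    # Prior behavior: wait for promotion, which never happens if the learner is dead.
    return Action.WAIT_FOR_PROMOTION
```

With `allow_cancel=False` the model returns `WAIT_FOR_PROMOTION` for a learner, which is the state that could deadlock when the learner never recovers.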
Adds a configuration option that determines whether raft0 recovery
respects the learner recovery rate.

This is ill-advised for production but extremely helpful in testing,
where it widens race-condition windows on controller operations.

Used in a subsequent commit.

(cherry picked from commit 6ad1fe9)
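
A rough way to see why throttling widens the race window: the time a new node spends as a raft0 learner grows as the allowed recovery rate shrinks. The sketch below is back-of-the-envelope arithmetic under stated assumptions. The function name and all sizes and rates are hypothetical, and unthrottled recovery is treated as instantaneous for simplicity.

```python
from typing import Optional


def learner_window_seconds(log_bytes: int, rate_bytes_per_s: Optional[float]) -> float:
    """Estimate how long a new node stays a learner while it catches up.

    rate_bytes_per_s=None models unthrottled recovery, simplified here
    to zero time.
    """
    if rate_bytes_per_s is None:
        return 0.0
    return log_bytes / rate_bytes_per_s


MIB = 1024 * 1024

# Hypothetical 64 MiB controller log.
unthrottled = learner_window_seconds(64 * MIB, None)
throttled = learner_window_seconds(64 * MIB, 1 * MIB)  # hypothetical 1 MiB/s limit

print(unthrottled, throttled)  # prints "0.0 64.0"
```

In this model a test has about a minute in which the node is reliably still a learner, which is long enough to kill it or issue a decommission before promotion.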
@joe-redpanda joe-redpanda force-pushed the ai-backport-pr-30377-v25.3.x-1780336611 branch from 819098a to 8c9abbb Compare June 23, 2026 18:28
@vbotbuildovich

vbotbuildovich commented Jun 23, 2026

Collaborator Author

Retry command for Build#86182

please wait until all jobs are finished before running the slash command

/ci-repeat 1
skip-redpanda-build
skip-units
skip-rebase
tests/rptest/tests/throttled_raft0_test.py::StuckRaft0LearnerTest.test_decommission_cancels_in_flight_raft0_add
tests/rptest/tests/data_transforms_test.py::DataTransformsLoggingTest.test_log_topic_integrity

... test

Adds a regression test for a deadlocked members_backend.

When a node is added to the cluster, joins raft0 as a learner, and dies
before it can transition to a voter, raft0 reconfigurations are blocked
until this learner recovers. Previously, unblocking raft0
reconfiguration required either the dead learner to recover or a node
UUID override.

This test validates the fix: a decommission of a node that has not
finished recovering (is still a learner) cancels the raft0
reconfiguration as its 'node removal' step.

This allows a decommission to serve as the escape hatch when a raft0
learner has been irrevocably lost.

(cherry picked from commit 8edf586)
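
The scenario the regression test targets can be walked through with a toy model. This is not the test in throttled_raft0_test.py: `ToyRaft0` and `scenario` are hypothetical names, and the model assumes only that raft0 handles one reconfiguration at a time and that a learner must catch up before promotion.

```python
class ToyRaft0:
    """Toy model: one reconfiguration at a time; learners must catch up to be promoted."""

    def __init__(self, voters):
        self.voters = set(voters)
        self.pending_learner = None  # node id of an in-flight add, if any

    def add_node(self, node_id):
        if self.pending_learner is not None:
            raise RuntimeError("raft0 reconfiguration already in progress")
        self.pending_learner = node_id

    def learner_caught_up(self):
        # Promotion completes the in-flight add.
        self.voters.add(self.pending_learner)
        self.pending_learner = None

    def decommission(self, node_id):
        if node_id == self.pending_learner:
            # The fix: cancel the in-flight add instead of waiting for promotion.
            self.pending_learner = None
        elif node_id in self.voters:
            if self.pending_learner is not None:
                raise RuntimeError("raft0 reconfiguration already in progress")
            self.voters.remove(node_id)
        else:
            raise KeyError(node_id)


def scenario():
    raft0 = ToyRaft0(voters={1, 2, 3})
    raft0.add_node(4)  # node 4 joins as a learner, then dies before promotion
    blocked = False
    try:
        raft0.add_node(5)  # any further reconfiguration is stuck behind node 4
    except RuntimeError:
        blocked = True
    raft0.decommission(4)  # escape hatch: cancels the stuck add
    raft0.add_node(5)  # reconfigurations proceed again
    raft0.learner_caught_up()
    return blocked, sorted(raft0.voters)


print(scenario())  # prints "(True, [1, 2, 3, 5])"
```

The dead learner never calls `learner_caught_up`, so without the cancel path in `decommission` the model has no way to clear `pending_learner`, which mirrors the deadlock described above.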
@joe-redpanda joe-redpanda force-pushed the ai-backport-pr-30377-v25.3.x-1780336611 branch from 8c9abbb to adbfe60 Compare June 24, 2026 15:40

Labels

area/redpanda kind/backport PRs targeting a stable branch


2 participants