tests: scale coverage for shadow-link role sync #30978

Open
nguyen-andrew wants to merge 3 commits into redpanda-data:dev from nguyen-andrew:sl-role-sync-scale

nguyen-andrew (Member) commented Jul 1, 2026

Part of ENG-898 (shadowing roles). Adds scale coverage for the role-sync
migrator (CORE-16456), exercising sync against a role set far larger than
current fleet usage (which tops out around 200 roles per cluster).

  • tests: ShadowLinkRoleSyncScaleTest drives the migrator through a
    full create/update/delete lifecycle at 5000 total members across three
    cost axes -- many roles (5000x1), wide membership (50x100), and a large
    per-role member set (5x1000) -- asserting the destination mirrors the
    source at each phase and the task settles ACTIVE.
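
The three shapes can be laid out as a small parameter table. The sketch below is illustrative only: the dictionary keys, `build_roles`, and the `role-N`/`user-N-M` naming are assumptions, not the names used in ShadowLinkRoleSyncScaleTest. It shows that each shape holds the total at 5000 members while varying a single axis.

```python
# Hypothetical parameter table mirroring the three shapes named above.
SCALE_SHAPES = [
    {"name": "many_roles", "num_roles": 5000, "members_per_role": 1},
    {"name": "wide_membership", "num_roles": 50, "members_per_role": 100},
    {"name": "large_member_set", "num_roles": 5, "members_per_role": 1000},
]


def total_members(shape):
    """Total membership entries for one shape."""
    return shape["num_roles"] * shape["members_per_role"]


def build_roles(shape):
    """Return {role_name: set(member_names)}; all names are made up."""
    return {
        f"role-{r}": {f"user-{r}-{m}" for m in range(shape["members_per_role"])}
        for r in range(shape["num_roles"])
    }


for shape in SCALE_SHAPES:
    roles = build_roles(shape)
    # Every shape carries the same total load; only its distribution differs.
    assert sum(len(ms) for ms in roles.values()) == total_members(shape) == 5000
```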

Drive-by fix surfaced by the scale test: security_manager::fill_snapshot
previously had a per-role get() loop that rescanned the member store
on every call (quadratic) and could stall the controller reactor under large
role counts. Added role_store::all_roles_with_members(), which lets it
snapshot the roles in a single pass over the member store while yielding
periodically. This is preventative hardening rather than a reported bug:
the stall only bites well above today's fleet usage.
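
To make the quadratic-versus-linear difference concrete, here is a minimal Python model. It assumes the member store is a flat sequence of (role, member) entries, which may not match the real role_store layout, and it leaves out the periodic yield. Both function names are hypothetical.

```python
# Assumed layout: member_store is a flat list of (role, member) entries.

def snapshot_per_role(role_names, member_store):
    """Old shape: one full scan of the member store for every role."""
    visits = 0
    out = {}
    for role in role_names:
        members = set()
        for entry_role, member in member_store:
            visits += 1
            if entry_role == role:
                members.add(member)
        out[role] = members
    return out, visits


def snapshot_single_pass(role_names, member_store):
    """New shape: group every entry under its role in one scan."""
    visits = 0
    out = {role: set() for role in role_names}
    for entry_role, member in member_store:
        visits += 1
        if entry_role in out:
            out[entry_role].add(member)
    return out, visits
```

With 50 roles of 10 members each, the first function visits 50 * 500 entries and the second visits 500, while both return the same mapping.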

Fixes CORE-16768

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v26.1.x
  • v25.3.x
  • v25.2.x

Release Notes

  • none

Copilot AI review requested due to automatic review settings July 1, 2026 02:10

Copilot AI (Contributor) left a comment

Pull request overview

This PR hardens controller snapshotting of RBAC role data to avoid reactor stalls and quadratic behavior when role membership is large, and adds an end-to-end scale test to exercise the shadow-link roles migrator across multiple high-cardinality shapes.

Changes:

  • Add security::role_store::all_roles_with_members() to build a complete roles+members view in one pass over the member store, with periodic cooperative yields.
  • Update cluster::security_manager::fill_snapshot() to snapshot roles via the new single-pass/yielding path.
  • Add ShadowLinkRoleSyncScaleTest ducktape coverage that drives create/update/delete lifecycle at scale across three parametrized role/member distributions.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

File Description
tests/rptest/tests/shadow_link_role_sync_test.py Adds a parametrized scale characterization test for shadow-link role sync across multiple cardinality profiles.
src/v/security/role.cc Implements role_store::all_roles_with_members() as a yielding, single-pass roles+members enumerator.
src/v/security/role_store.h Declares the new async/yielding enumeration API with usage constraints documented.
src/v/cluster/security_manager.cc Switches snapshot role enumeration to all_roles_with_members() and yields during snapshot construction.

Comment on lines +149 to +160
// IMPORTANT: intended solely for security_manager::fill_snapshot. This
// suspends mid-iteration, which is safe ONLY while the caller holds
// mux_state_machine's _apply_mtx: that mutex serializes against command
// application, so _roles/_members_store can't mutate across a yield. A
// caller without that lock risks iterating a container that changes
// underfoot.
//
// Like roles_with_members, but enumerates every role with its members in a
// single pass and yields periodically, so snapshotting a very large role
// store doesn't stall the reactor.
ss::future<chunked_vector<role_with_members>>
all_roles_with_members() const;

nguyen-andrew (Member, Author) commented

@dotnwat picking up your question from a comment on #30946 about what keeps the loops in all_roles_with_members safe against concurrent modification now that they can suspend.

The only intended caller is security_manager::fill_snapshot, which should only run while the controller apply mutex is held. Looks like applying commands takes that same mutex, so command application and snapshotting should be mutually exclusive and the role store shouldn't mutate between yields.

It's a little loose imo, but this seems to be the existing convention for fill_snapshot. The other calls next to it also iterate the live stores for credentials and ACLs and yield the same way - and it looks like they're subtly relying on this apply mutex as well. I've tried to shape the comment here to spell out the apply-mutex precondition and note that it's meant solely for fill_snapshot, but not sure if that's solid enough or if fill_snapshot as a whole should be revisited.
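
The precondition described above can be modelled in Python with asyncio. This is an analogy under stated assumptions, not Seastar semantics: an `asyncio.Lock` stands in for the apply mutex, `await asyncio.sleep(0)` stands in for `ss::coroutine::maybe_yield()`, and `fill_snapshot`, `apply_command`, and `demo` are hypothetical stand-ins for the real code paths.

```python
import asyncio


async def fill_snapshot(store, apply_lock, yield_every=2):
    """Copy the store while holding the lock, yielding between entries."""
    snapshot = {}
    async with apply_lock:
        for i, (role, members) in enumerate(store.items()):
            snapshot[role] = set(members)
            if i % yield_every == 0:
                await asyncio.sleep(0)  # cooperative yield mid-iteration
    return snapshot


async def apply_command(store, apply_lock, role, member):
    """Mutate the store only under the same lock the snapshot holds."""
    async with apply_lock:
        store.setdefault(role, set()).add(member)


async def demo():
    store = {f"role-{i}": {f"user-{i}"} for i in range(10)}
    expected = {role: set(members) for role, members in store.items()}
    lock = asyncio.Lock()
    snap_task = asyncio.create_task(fill_snapshot(store, lock))
    await asyncio.sleep(0)  # let the snapshot take the lock first
    cmd_task = asyncio.create_task(
        apply_command(store, lock, "role-new", "user-new"))
    snapshot = await snap_task
    await cmd_task
    return snapshot, expected, store
```

Because the command waits on the lock, the dictionary never changes size while the snapshot loop is suspended. Dropping the lock from `apply_command` would let the insert land mid-iteration, which Python reports as a RuntimeError.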

Comment on lines 200 to 212
// Ephemeral credentials must not be stored in the snapshot.
auto creds = _credentials.local().range(
  security::credential_store::is_not_ephemeral);
for (const auto& cred : creds) {
    ss::visit(cred.second, [&](security::scram_credential scram) {
        snapshot.user_credentials.push_back(
          user_and_credential{
            security::credential_user{cred.first}, std::move(scram)});
    });
    co_await ss::coroutine::maybe_yield();
}

snapshot.acls = co_await _authorizer.local().all_bindings();

nguyen-andrew (Member, Author) commented

@dotnwat these are what I was referring to in my other comment.

nguyen-andrew self-assigned this Jul 1, 2026
nguyen-andrew changed the title from "security: avoid reactor stalls snapshotting large role stores" to "tests: scale coverage for shadow-link role sync" Jul 1, 2026
nguyen-andrew requested review from dotnwat and pgellert July 1, 2026 02:39
vbotbuildovich (Collaborator) commented

Retry command for Build#86563

please wait until all jobs are finished before running the slash command

/ci-repeat 1
skip-redpanda-build
skip-units
skip-rebase
tests/rptest/tests/shadow_linking_rnot_test.py::ShadowLinkingRandomOpsTest.test_node_operations@{"failures":true,"workload_set":"cloud_combos"}

vbotbuildovich (Collaborator) commented Jul 1, 2026

CI test results

test results on build#86563
| test_status | test_class | test_method | test_arguments | test_kind | job_url | passed | reason | test_history |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FLAKY(PASS) | ShadowLinkingReplicationTests | test_auto_prefix_trimming | {"source_cluster_spec": {"cluster_type": "redpanda"}, "storage_mode": "cloud", "with_failures": true} | integration | https://buildkite.com/redpanda/redpanda/builds/86563#019f1b88-ff29-48b8-ad0c-26acefeff3bc | 10/11 | Test PASSES after retries. No significant increase in flaky rate(baseline=0.0391, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1127, p1=0.3025, trust_threshold=0.5000) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShadowLinkingReplicationTests&test_method=test_auto_prefix_trimming |
| FLAKY(PASS) | TestVirtualConnections | test_no_head_of_line_blocking | {"different_clusters": false, "different_connections": true} | integration | https://buildkite.com/redpanda/redpanda/builds/86563#019f1b85-e3bc-4a0e-b229-cba0a9041414 | 10/11 | Test PASSES after retries. No significant increase in flaky rate(baseline=0.0000, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=TestVirtualConnections&test_method=test_no_head_of_line_blocking |
| FLAKY(PASS) | DataMigrationsApiTest | test_creating_and_listing_migrations | null | integration | https://buildkite.com/redpanda/redpanda/builds/86563#019f1b85-e3bc-4a0e-b229-cba0a9041414 | 10/11 | Test PASSES after retries. No significant increase in flaky rate(baseline=0.0000, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=DataMigrationsApiTest&test_method=test_creating_and_listing_migrations |
| FLAKY(PASS) | EndToEndSpilloverTest | test_spillover | {"cloud_storage_type": 1} | integration | https://buildkite.com/redpanda/redpanda/builds/86563#019f1b85-e3bc-4a0e-b229-cba0a9041414 | 10/11 | Test PASSES after retries. No significant increase in flaky rate(baseline=0.0000, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=EndToEndSpilloverTest&test_method=test_spillover |
| FLAKY(PASS) | FollowerFetchingTest | test_follower_fetching_with_maintenance_mode | {"fetch_from": "fetch-from-local"} | integration | https://buildkite.com/redpanda/redpanda/builds/86563#019f1b85-e3bc-4a0e-b229-cba0a9041414 | 10/11 | Test PASSES after retries. No significant increase in flaky rate(baseline=0.0000, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=FollowerFetchingTest&test_method=test_follower_fetching_with_maintenance_mode |
| FLAKY(PASS) | LimitsTest | test_kafka_request_max_size | null | integration | https://buildkite.com/redpanda/redpanda/builds/86563#019f1b85-e3bc-4a0e-b229-cba0a9041414 | 10/11 | Test PASSES after retries. No significant increase in flaky rate(baseline=0.0000, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=LimitsTest&test_method=test_kafka_request_max_size |
| FLAKY(PASS) | OffsetRetentionTest | test_offset_expiration | null | integration | https://buildkite.com/redpanda/redpanda/builds/86563#019f1b85-e3bc-4a0e-b229-cba0a9041414 | 10/11 | Test PASSES after retries. No significant increase in flaky rate(baseline=0.0000, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=OffsetRetentionTest&test_method=test_offset_expiration |
| FLAKY(PASS) | PandaProxyTest | test_produce_topic_validation | null | integration | https://buildkite.com/redpanda/redpanda/builds/86563#019f1b85-e3bc-4a0e-b229-cba0a9041414 | 10/11 | Test PASSES after retries. No significant increase in flaky rate(baseline=0.0000, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=PandaProxyTest&test_method=test_produce_topic_validation |
| FLAKY(PASS) | RpkConfigTest | test_config_set_json | null | integration | https://buildkite.com/redpanda/redpanda/builds/86563#019f1b85-e3bc-4a0e-b229-cba0a9041414 | 10/11 | Test PASSES after retries. No significant increase in flaky rate(baseline=0.0000, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=RpkConfigTest&test_method=test_config_set_json |
| FLAKY(PASS) | SchemaRegistryConfluentClient | test_references | null | integration | https://buildkite.com/redpanda/redpanda/builds/86563#019f1b85-e3bc-4a0e-b229-cba0a9041414 | 10/11 | Test PASSES after retries. No significant increase in flaky rate(baseline=0.0000, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=SchemaRegistryConfluentClient&test_method=test_references |
| FLAKY(PASS) | SchemaRegistryRpcTransportTest | test_schema_id_validation | {"client_type": 1, "compression_type": "zstd", "payload_class": "com.redpanda.Payload", "protocol": 3, "subject_name_strategy": "io.confluent.kafka.serializers.subject.TopicRecordNameStrategy", "validate_schema_id": true} | integration | https://buildkite.com/redpanda/redpanda/builds/86563#019f1b85-e3bc-4a0e-b229-cba0a9041414 | 10/11 | Test PASSES after retries. No significant increase in flaky rate(baseline=0.0000, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=SchemaRegistryRpcTransportTest&test_method=test_schema_id_validation |
| FLAKY(PASS) | SchemaRegistryTest | test_normalize | {"dataset_type": 1} | integration | https://buildkite.com/redpanda/redpanda/builds/86563#019f1b85-e3bc-4a0e-b229-cba0a9041414 | 10/11 | Test PASSES after retries. No significant increase in flaky rate(baseline=0.0000, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=SchemaRegistryTest&test_method=test_normalize |
| FLAKY(PASS) | SchemaRegistryTest | test_schema_id_validation | {"client_type": 1, "compression_type": "zstd", "payload_class": "com.redpanda.A.B.C.D.NestedPayload", "protocol": 2, "subject_name_strategy": "io.confluent.kafka.serializers.subject.TopicNameStrategy", "validate_schema_id": true} | integration | https://buildkite.com/redpanda/redpanda/builds/86563#019f1b85-e3bc-4a0e-b229-cba0a9041414 | 10/11 | Test PASSES after retries. No significant increase in flaky rate(baseline=0.0000, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=SchemaRegistryTest&test_method=test_schema_id_validation |
| FLAKY(FAIL) | ShadowLinkingRandomOpsTest | test_node_operations | {"failures": true, "workload_set": "cloud_combos"} | integration | https://buildkite.com/redpanda/redpanda/builds/86563#019f1b88-ff2c-4ecf-b9a3-7b929f813386 | 35/41 | Test FAILS after retries. Significant increase in flaky rate(baseline=0.0188, p0=0.0009, reject_threshold=0.0100) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShadowLinkingRandomOpsTest&test_method=test_node_operations |
| FLAKY(PASS) | CloudTopicsTimeQueryTest | test_timequery | null | integration | https://buildkite.com/redpanda/redpanda/builds/86563#019f1b85-e3bc-4a0e-b229-cba0a9041414 | 10/11 | Test PASSES after retries. No significant increase in flaky rate(baseline=0.0000, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=CloudTopicsTimeQueryTest&test_method=test_timequery |
| FLAKY(PASS) | StreamVerifierTest | test_simple_produce_consume_txn_with_add_node | null | integration | https://buildkite.com/redpanda/redpanda/builds/86563#019f1b85-e3bc-4a0e-b229-cba0a9041414 | 10/11 | Test PASSES after retries. No significant increase in flaky rate(baseline=0.0000, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=StreamVerifierTest&test_method=test_simple_produce_consume_txn_with_add_node |
| FLAKY(PASS) | TxAdminTest | test_simple_get_transaction | null | integration | https://buildkite.com/redpanda/redpanda/builds/86563#019f1b85-e3bc-4a0e-b229-cba0a9041414 | 10/11 | Test PASSES after retries. No significant increase in flaky rate(baseline=0.0000, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=TxAdminTest&test_method=test_simple_get_transaction |
test results on build#86619
| test_status | test_class | test_method | test_arguments | test_kind | job_url | passed | reason | test_history |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FLAKY(PASS) | ShadowLinkCustomStartOffsetSelectionTests | test_starting_offset | {"failures": true, "source_cluster_spec": {"cluster_type": "redpanda"}, "starting_offset": "timestamp", "storage_mode": "cloud"} | integration | https://buildkite.com/redpanda/redpanda/builds/86619#019f1f32-f5cd-43ba-b3cd-59973563a4a0 | 10/11 | Test PASSES after retries. No significant increase in flaky rate(baseline=0.0000, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShadowLinkCustomStartOffsetSelectionTests&test_method=test_starting_offset |
| FLAKY(FAIL) | ShutdownTest | test_timely_shutdown_with_failures | null | integration | https://buildkite.com/redpanda/redpanda/builds/86619#019f1f32-f5d1-4db0-987c-a581fc5df094 | 8/11 | Test FAILS after retries. Significant increase in flaky rate(baseline=0.0027, p0=0.0003, reject_threshold=0.0100) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShutdownTest&test_method=test_timely_shutdown_with_failures |
test results on build#86633
| test_status | test_class | test_method | test_arguments | test_kind | job_url | passed | reason | test_history |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FLAKY(PASS) | ShadowLinkingReplicationTests | test_auto_prefix_trimming | {"source_cluster_spec": {"cluster_type": "redpanda"}, "storage_mode": "tiered_cloud", "with_failures": true} | integration | https://buildkite.com/redpanda/redpanda/builds/86633#019f2080-0fcc-4878-8e9b-5534815b6994 | 19/21 | Test PASSES after retries. No significant increase in flaky rate(baseline=0.0428, p0=0.5835, reject_threshold=0.0100. adj_baseline=0.1231, p1=0.2751, trust_threshold=0.5000) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShadowLinkingReplicationTests&test_method=test_auto_prefix_trimming |

Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

Comment on lines +149 to +160
// IMPORTANT: intended solely for security_manager::fill_snapshot. This
// suspends mid-iteration, which is safe ONLY while the caller holds
// mux_state_machine's _apply_mtx: that mutex serializes against command
// application, so _roles/_members_store can't mutate across a yield. A
// caller without that lock risks iterating a container that changes
// underfoot.
//
// Like roles_with_members, but enumerates every role with its members in a
// single pass and yields periodically, so snapshotting a very large role
// store doesn't stall the reactor.
ss::future<chunked_vector<role_with_members>>
all_roles_with_members() const;

vbotbuildovich (Collaborator) commented

Retry command for Build#86619

please wait until all jobs are finished before running the slash command

/ci-repeat 1
skip-redpanda-build
skip-units
skip-rebase
tests/rptest/tests/timely_shutdown_test.py::ShutdownTest.test_timely_shutdown_with_failures

fill_snapshot enumerated roles by calling role_store::get() once per
role, and get() rescans the entire member store on every call, so
building the snapshot was O(roles * members). At thousands of roles
this monopolizes the controller reactor long enough to starve Raft
heartbeats and destabilize controller leadership.
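
As rough arithmetic for the O(roles * members) claim, assume one get() costs a visit to every entry in the member store. Under that assumption the three shapes exercised by the scale test compare as follows. The function names are hypothetical and the counts are abstract entry visits, not measured timings.

```python
def per_role_scan_visits(num_roles, members_per_role):
    """One full member-store scan per role: roles * total entries."""
    total_entries = num_roles * members_per_role
    return num_roles * total_entries


def single_pass_visits(num_roles, members_per_role):
    """One scan of the member store in total."""
    return num_roles * members_per_role


SHAPES = {"5000 x 1": (5000, 1), "50 x 100": (50, 100), "5 x 1000": (5, 1000)}

for label, (roles, members) in SHAPES.items():
    print(label,
          per_role_scan_visits(roles, members),
          single_pass_visits(roles, members))
```

Under this model the many-roles shape is the expensive one: 25,000,000 visits against 5,000 for a single pass, while the 5 x 1000 shape costs only 25,000.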

Switch it to all_roles_with_members(), a new async sibling of
roles_with_members() that builds every role's members in one pass over
the member store and yields periodically. roles_with_members() stays
synchronous for the read paths, which run without the controller apply
mutex; yielding mid-iteration is safe only in the snapshot path, which
holds that mutex, so no command mutates the store across the yields.

The stall surfaced under the role-sync scale test added in the next
commit, which drives thousands of roles through a full create, update,
and delete lifecycle.

Add ShadowLinkRoleSyncScaleTest, which drives the roles migrator
through a full create, update, and delete lifecycle at volume. Three
parametrized points each stress one cost axis at a constant 5000 total
members: many roles (5000 x 1), wide membership (50 x 100), and a
large single member set (5 x 1000). Each phase asserts the destination
mirrors the source, and the task settles ACTIVE at the end.

Source mutations retry the transient not_leader/timeout errors that
thousands of rapid role writes can provoke, and the controller-log
guard is raised to admit the ~3*num_roles records the lifecycle writes.
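
A retry wrapper of the kind described might look like the sketch below. `TransientError`, `retry_transient`, and `controller_log_budget` are hypothetical names; the test's real helper and exception types are not shown in this PR text. The budget function only restates the ~3*num_roles estimate.

```python
import time


class TransientError(Exception):
    """Stand-in for not_leader/timeout errors (hypothetical type)."""


def retry_transient(op, attempts=5, backoff_s=0.0):
    """Call op(), retrying only on TransientError, up to `attempts` tries."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return op()
        except TransientError:
            if attempt == attempts - 1:
                raise  # out of retries: surface the last transient error
            time.sleep(backoff_s * (attempt + 1))  # linear backoff


def controller_log_budget(num_roles, records_per_role=3):
    """Approximate records written by a create/update/delete lifecycle."""
    return records_per_role * num_roles
```

Only the transient error type is retried, so a genuine failure such as a validation error still fails the test on the first attempt.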

Seed a role on the source, enable role_sync_options via rpk, and assert
the role and its member land on the shadow cluster. Depends on the rpk
role-sync config support cherry-picked above.