tests: scale coverage for shadow-link role sync by nguyen-andrew · Pull Request #30978 · redpanda-data/redpanda

nguyen-andrew · 2026-07-01T02:10:28Z

Part of ENG-898 (shadowing roles). Adds scale coverage for the role-sync
migrator (CORE-16456), exercising sync against a role set far larger than
current fleet usage (which tops out around 200 roles per cluster).

tests: ShadowLinkRoleSyncScaleTest drives the migrator through a
full create/update/delete lifecycle at 5000 total members across three
cost axes -- many roles (5000x1), wide membership (50x100), and a large
per-role member set (5x1000) -- asserting the destination mirrors the
source at each phase and the task settles ACTIVE.
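
As a rough illustration of the three shapes described above, the sketch below builds each role set and checks that every shape lands on the same 5000 total members. The `Shape` type, the shape names, and `build_roles` are hypothetical helpers for this example only; the actual parametrization in `shadow_link_role_sync_test.py` may look different.

```python
from dataclasses import dataclass

TOTAL_MEMBERS = 5000  # held constant across all three shapes


@dataclass(frozen=True)
class Shape:
    # Hypothetical helper type, not taken from the test itself.
    name: str
    num_roles: int
    members_per_role: int

    @property
    def total_members(self) -> int:
        return self.num_roles * self.members_per_role


SHAPES = [
    Shape("many_roles", 5000, 1),        # stresses role count
    Shape("wide_membership", 50, 100),   # stresses roles x members together
    Shape("large_member_set", 5, 1000),  # stresses per-role member count
]


def build_roles(shape: Shape) -> dict[str, set[str]]:
    """Build a role -> members mapping with unique member names per role."""
    return {
        f"role-{r}": {f"user-{r}-{m}" for m in range(shape.members_per_role)}
        for r in range(shape.num_roles)
    }
```

Holding the total constant means a difference in runtime between shapes can be attributed to the axis being stressed rather than to the overall volume of data.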

Drive-by fix surfaced by the scale test: security_manager::fill_snapshot
previously looped over the roles calling get() for each one, and every get()
rescanned the member store (quadratic), which could stall the controller
reactor under large role counts. Added role_store::all_roles_with_members(),
which lets fill_snapshot snapshot the roles in a single pass over the member
store while yielding periodically. This is preventative hardening rather than
a reported bug: the stall only bites well above today's fleet usage.
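
To make the cost difference concrete, here is a small Python model of the two enumeration strategies. It assumes the member store can be treated as a flat sequence of (role, member) pairs, which is a simplification for illustration and not a description of the real `role_store` layout; the function names are hypothetical.

```python
from collections import defaultdict


def snapshot_per_role(role_names, member_store):
    """Quadratic shape: rescan the whole member store once per role."""
    snapshot = {}
    scans = 0
    for role in role_names:
        members = set()
        for owner, member in member_store:
            scans += 1
            if owner == role:
                members.add(member)
        snapshot[role] = members
    return snapshot, scans


def snapshot_single_pass(role_names, member_store):
    """Linear shape: one pass over the member store, grouping by role."""
    grouped = defaultdict(set)
    scans = 0
    for owner, member in member_store:
        scans += 1
        grouped[owner].add(member)
    # Roles without members must still appear in the snapshot.
    snapshot = {role: grouped.get(role, set()) for role in role_names}
    return snapshot, scans
```

Both functions return the same mapping; only the number of member-store entries visited differs (roles times members versus members).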

Fixes CORE-16768

Backports Required

Release Notes

none

Copilot

Pull request overview

This PR hardens controller snapshotting of RBAC role data to avoid reactor stalls and quadratic behavior when role membership is large, and adds an end-to-end scale test to exercise the shadow-link roles migrator across multiple high-cardinality shapes.

Changes:

Add security::role_store::all_roles_with_members() to build a complete roles+members view in one pass over the member store, with periodic cooperative yields.
Update cluster::security_manager::fill_snapshot() to snapshot roles via the new single-pass/yielding path.
Add ShadowLinkRoleSyncScaleTest ducktape coverage that drives create/update/delete lifecycle at scale across three parametrized role/member distributions.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

File	Description
tests/rptest/tests/shadow_link_role_sync_test.py	Adds a parametrized scale characterization test for shadow-link role sync across multiple cardinality profiles.
src/v/security/role.cc	Implements `role_store::all_roles_with_members()` as a yielding, single-pass roles+members enumerator.
src/v/security/role_store.h	Declares the new async/yielding enumeration API with usage constraints documented.
src/v/cluster/security_manager.cc	Switches snapshot role enumeration to `all_roles_with_members()` and yields during snapshot construction.

nguyen-andrew · 2026-07-01T02:16:37Z

+    // IMPORTANT: intended solely for security_manager::fill_snapshot. This
+    // suspends mid-iteration, which is safe ONLY while the caller holds
+    // mux_state_machine's _apply_mtx: that mutex serializes against command
+    // application, so _roles/_members_store can't mutate across a yield. A
+    // caller without that lock risks iterating a container that changes
+    // underfoot.
+    //
+    // Like roles_with_members, but enumerates every role with its members in a
+    // single pass and yields periodically, so snapshotting a very large role
+    // store doesn't stall the reactor.
+    ss::future<chunked_vector<role_with_members>>
+    all_roles_with_members() const;


@dotnwat picking up your question from #30946 (comment) about what keeps the loops in all_roles_with_members safe against concurrent modification now that they can suspend.

The only intended caller is security_manager::fill_snapshot, which should only run while the controller apply mutex is held. Looks like applying commands takes that same mutex, so command application and snapshotting should be mutually exclusive and the role store shouldn't mutate between yields.

It's a little loose imo, but this seems to be the existing convention for fill_snapshot. The other calls next to it also iterate the live stores for credentials and ACLs and yield the same way - and it looks like they're subtly relying on this apply mutex as well. I've tried to shape the comment here to spell out the apply-mutex precondition and note that it's meant solely for fill_snapshot, but not sure if that's solid enough or if fill_snapshot as a whole should be revisited.
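
The precondition discussed above can be illustrated with a minimal Python asyncio analogy. This is not the Seastar code from the PR: `asyncio.Lock` stands in for the apply mutex, `asyncio.sleep(0)` stands in for the cooperative yield, and `snapshot_roles`, `apply_command`, `demo`, and `YIELD_EVERY` are hypothetical names chosen for the sketch.

```python
import asyncio

YIELD_EVERY = 64  # hypothetical batch size between yields


async def snapshot_roles(store: dict) -> list:
    """Iterate the live store, yielding periodically.

    Safe only while the caller holds the lock that writers also take.
    """
    out = []
    for i, (role, members) in enumerate(store.items()):
        out.append((role, frozenset(members)))
        if i % YIELD_EVERY == 0:
            await asyncio.sleep(0)  # cooperative yield to other tasks
    return out


async def apply_command(lock: asyncio.Lock, store: dict, role: str, member: str):
    """A writer: mutates the store only under the shared lock."""
    async with lock:
        store.setdefault(role, set()).add(member)


async def demo() -> tuple:
    lock = asyncio.Lock()
    store = {f"role-{i}": {f"user-{i}"} for i in range(1000)}
    async with lock:
        # The writer is scheduled while the snapshot is in flight, but it
        # blocks on the lock until the snapshot has finished iterating.
        writer = asyncio.create_task(
            apply_command(lock, store, "late-role", "late-user")
        )
        snap = await snapshot_roles(store)
    await writer
    return len(snap), len(store)
```

If `snapshot_roles` were called without holding the lock, the writer could run during one of the yields and change the dictionary mid-iteration, which is the hazard the header comment warns about.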

nguyen-andrew · 2026-07-01T02:17:34Z

    // Ephemeral credentials must not be stored in the snapshot.
    auto creds = _credentials.local().range(
      security::credential_store::is_not_ephemeral);
    for (const auto& cred : creds) {
        ss::visit(cred.second, [&](security::scram_credential scram) {
            snapshot.user_credentials.push_back(
              user_and_credential{
                security::credential_user{cred.first}, std::move(scram)});
        });
        co_await ss::coroutine::maybe_yield();
    }

    snapshot.acls = co_await _authorizer.local().all_bindings();


@dotnwat these are what I was referring to in my other comment.

vbotbuildovich · 2026-07-01T04:32:33Z

Retry command for Build#86563

please wait until all jobs are finished before running the slash command

/ci-repeat 1
skip-redpanda-build
skip-units
skip-rebase
tests/rptest/tests/shadow_linking_rnot_test.py::ShadowLinkingRandomOpsTest.test_node_operations@{"failures":true,"workload_set":"cloud_combos"}

vbotbuildovich · 2026-07-01T04:42:49Z

CI test results

test results on build#86563

test_status	test_class	test_method	test_arguments	test_kind	job_url	passed	reason	test_history
FLAKY(PASS)	ShadowLinkingReplicationTests	test_auto_prefix_trimming	{"source_cluster_spec": {"cluster_type": "redpanda"}, "storage_mode": "cloud", "with_failures": true}	integration	https://buildkite.com/redpanda/redpanda/builds/86563#019f1b88-ff29-48b8-ad0c-26acefeff3bc	10/11	Test PASSES after retries.No significant increase in flaky rate(baseline=0.0391, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1127, p1=0.3025, trust_threshold=0.5000)	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShadowLinkingReplicationTests&test_method=test_auto_prefix_trimming
FLAKY(PASS)	TestVirtualConnections	test_no_head_of_line_blocking	{"different_clusters": false, "different_connections": true}	integration	https://buildkite.com/redpanda/redpanda/builds/86563#019f1b85-e3bc-4a0e-b229-cba0a9041414	10/11	Test PASSES after retries.No significant increase in flaky rate(baseline=0.0000, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000)	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=TestVirtualConnections&test_method=test_no_head_of_line_blocking
FLAKY(PASS)	DataMigrationsApiTest	test_creating_and_listing_migrations	null	integration	https://buildkite.com/redpanda/redpanda/builds/86563#019f1b85-e3bc-4a0e-b229-cba0a9041414	10/11	Test PASSES after retries.No significant increase in flaky rate(baseline=0.0000, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000)	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=DataMigrationsApiTest&test_method=test_creating_and_listing_migrations
FLAKY(PASS)	EndToEndSpilloverTest	test_spillover	{"cloud_storage_type": 1}	integration	https://buildkite.com/redpanda/redpanda/builds/86563#019f1b85-e3bc-4a0e-b229-cba0a9041414	10/11	Test PASSES after retries.No significant increase in flaky rate(baseline=0.0000, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000)	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=EndToEndSpilloverTest&test_method=test_spillover
FLAKY(PASS)	FollowerFetchingTest	test_follower_fetching_with_maintenance_mode	{"fetch_from": "fetch-from-local"}	integration	https://buildkite.com/redpanda/redpanda/builds/86563#019f1b85-e3bc-4a0e-b229-cba0a9041414	10/11	Test PASSES after retries.No significant increase in flaky rate(baseline=0.0000, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000)	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=FollowerFetchingTest&test_method=test_follower_fetching_with_maintenance_mode
FLAKY(PASS)	LimitsTest	test_kafka_request_max_size	null	integration	https://buildkite.com/redpanda/redpanda/builds/86563#019f1b85-e3bc-4a0e-b229-cba0a9041414	10/11	Test PASSES after retries.No significant increase in flaky rate(baseline=0.0000, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000)	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=LimitsTest&test_method=test_kafka_request_max_size
FLAKY(PASS)	OffsetRetentionTest	test_offset_expiration	null	integration	https://buildkite.com/redpanda/redpanda/builds/86563#019f1b85-e3bc-4a0e-b229-cba0a9041414	10/11	Test PASSES after retries.No significant increase in flaky rate(baseline=0.0000, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000)	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=OffsetRetentionTest&test_method=test_offset_expiration
FLAKY(PASS)	PandaProxyTest	test_produce_topic_validation	null	integration	https://buildkite.com/redpanda/redpanda/builds/86563#019f1b85-e3bc-4a0e-b229-cba0a9041414	10/11	Test PASSES after retries.No significant increase in flaky rate(baseline=0.0000, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000)	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=PandaProxyTest&test_method=test_produce_topic_validation
FLAKY(PASS)	RpkConfigTest	test_config_set_json	null	integration	https://buildkite.com/redpanda/redpanda/builds/86563#019f1b85-e3bc-4a0e-b229-cba0a9041414	10/11	Test PASSES after retries.No significant increase in flaky rate(baseline=0.0000, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000)	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=RpkConfigTest&test_method=test_config_set_json
FLAKY(PASS)	SchemaRegistryConfluentClient	test_references	null	integration	https://buildkite.com/redpanda/redpanda/builds/86563#019f1b85-e3bc-4a0e-b229-cba0a9041414	10/11	Test PASSES after retries.No significant increase in flaky rate(baseline=0.0000, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000)	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=SchemaRegistryConfluentClient&test_method=test_references
FLAKY(PASS)	SchemaRegistryRpcTransportTest	test_schema_id_validation	{"client_type": 1, "compression_type": "zstd", "payload_class": "com.redpanda.Payload", "protocol": 3, "subject_name_strategy": "io.confluent.kafka.serializers.subject.TopicRecordNameStrategy", "validate_schema_id": true}	integration	https://buildkite.com/redpanda/redpanda/builds/86563#019f1b85-e3bc-4a0e-b229-cba0a9041414	10/11	Test PASSES after retries.No significant increase in flaky rate(baseline=0.0000, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000)	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=SchemaRegistryRpcTransportTest&test_method=test_schema_id_validation
FLAKY(PASS)	SchemaRegistryTest	test_normalize	{"dataset_type": 1}	integration	https://buildkite.com/redpanda/redpanda/builds/86563#019f1b85-e3bc-4a0e-b229-cba0a9041414	10/11	Test PASSES after retries.No significant increase in flaky rate(baseline=0.0000, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000)	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=SchemaRegistryTest&test_method=test_normalize
FLAKY(PASS)	SchemaRegistryTest	test_schema_id_validation	{"client_type": 1, "compression_type": "zstd", "payload_class": "com.redpanda.A.B.C.D.NestedPayload", "protocol": 2, "subject_name_strategy": "io.confluent.kafka.serializers.subject.TopicNameStrategy", "validate_schema_id": true}	integration	https://buildkite.com/redpanda/redpanda/builds/86563#019f1b85-e3bc-4a0e-b229-cba0a9041414	10/11	Test PASSES after retries.No significant increase in flaky rate(baseline=0.0000, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000)	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=SchemaRegistryTest&test_method=test_schema_id_validation
FLAKY(FAIL)	ShadowLinkingRandomOpsTest	test_node_operations	{"failures": true, "workload_set": "cloud_combos"}	integration	https://buildkite.com/redpanda/redpanda/builds/86563#019f1b88-ff2c-4ecf-b9a3-7b929f813386	35/41	Test FAILS after retries.Significant increase in flaky rate(baseline=0.0188, p0=0.0009, reject_threshold=0.0100)	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShadowLinkingRandomOpsTest&test_method=test_node_operations
FLAKY(PASS)	CloudTopicsTimeQueryTest	test_timequery	null	integration	https://buildkite.com/redpanda/redpanda/builds/86563#019f1b85-e3bc-4a0e-b229-cba0a9041414	10/11	Test PASSES after retries.No significant increase in flaky rate(baseline=0.0000, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000)	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=CloudTopicsTimeQueryTest&test_method=test_timequery
FLAKY(PASS)	StreamVerifierTest	test_simple_produce_consume_txn_with_add_node	null	integration	https://buildkite.com/redpanda/redpanda/builds/86563#019f1b85-e3bc-4a0e-b229-cba0a9041414	10/11	Test PASSES after retries.No significant increase in flaky rate(baseline=0.0000, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000)	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=StreamVerifierTest&test_method=test_simple_produce_consume_txn_with_add_node
FLAKY(PASS)	TxAdminTest	test_simple_get_transaction	null	integration	https://buildkite.com/redpanda/redpanda/builds/86563#019f1b85-e3bc-4a0e-b229-cba0a9041414	10/11	Test PASSES after retries.No significant increase in flaky rate(baseline=0.0000, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000)	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=TxAdminTest&test_method=test_simple_get_transaction

test results on build#86619

test_status	test_class	test_method	test_arguments	test_kind	job_url	passed	reason	test_history
FLAKY(PASS)	ShadowLinkCustomStartOffsetSelectionTests	test_starting_offset	{"failures": true, "source_cluster_spec": {"cluster_type": "redpanda"}, "starting_offset": "timestamp", "storage_mode": "cloud"}	integration	https://buildkite.com/redpanda/redpanda/builds/86619#019f1f32-f5cd-43ba-b3cd-59973563a4a0	10/11	Test PASSES after retries.No significant increase in flaky rate(baseline=0.0000, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000)	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShadowLinkCustomStartOffsetSelectionTests&test_method=test_starting_offset
FLAKY(FAIL)	ShutdownTest	test_timely_shutdown_with_failures	null	integration	https://buildkite.com/redpanda/redpanda/builds/86619#019f1f32-f5d1-4db0-987c-a581fc5df094	8/11	Test FAILS after retries.Significant increase in flaky rate(baseline=0.0027, p0=0.0003, reject_threshold=0.0100)	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShutdownTest&test_method=test_timely_shutdown_with_failures

test results on build#86633

test_status	test_class	test_method	test_arguments	test_kind	job_url	passed	reason	test_history
FLAKY(PASS)	ShadowLinkingReplicationTests	test_auto_prefix_trimming	{"source_cluster_spec": {"cluster_type": "redpanda"}, "storage_mode": "tiered_cloud", "with_failures": true}	integration	https://buildkite.com/redpanda/redpanda/builds/86633#019f2080-0fcc-4878-8e9b-5534815b6994	19/21	Test PASSES after retries.No significant increase in flaky rate(baseline=0.0428, p0=0.5835, reject_threshold=0.0100. adj_baseline=0.1231, p1=0.2751, trust_threshold=0.5000)	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShadowLinkingReplicationTests&test_method=test_auto_prefix_trimming

Copilot

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

+    // IMPORTANT: intended solely for security_manager::fill_snapshot. This
+    // suspends mid-iteration, which is safe ONLY while the caller holds
+    // mux_state_machine's _apply_mtx: that mutex serializes against command
+    // application, so _roles/_members_store can't mutate across a yield. A
+    // caller without that lock risks iterating a container that changes
+    // underfoot.
+    //
+    // Like roles_with_members, but enumerates every role with its members in a
+    // single pass and yields periodically, so snapshotting a very large role
+    // store doesn't stall the reactor.
+    ss::future<chunked_vector<role_with_members>>
+    all_roles_with_members() const;


vbotbuildovich · 2026-07-01T20:35:10Z

Retry command for Build#86619

please wait until all jobs are finished before running the slash command

/ci-repeat 1
skip-redpanda-build
skip-units
skip-rebase
tests/rptest/tests/timely_shutdown_test.py::ShutdownTest.test_timely_shutdown_with_failures

fill_snapshot enumerated roles by calling role_store::get() once per role, and get() rescans the entire member store on every call, so building the snapshot was O(roles * members). At thousands of roles this monopolizes the controller reactor long enough to starve Raft heartbeats and destabilize controller leadership.

Switch it to all_roles_with_members(), a new async sibling of roles_with_members() that builds every role's members in one pass over the member store and yields periodically. roles_with_members() stays synchronous for the read paths, which run without the controller apply mutex; yielding mid-iteration is safe only in the snapshot path, which holds that mutex, so no command mutates the store across the yields.

The stall surfaced under the role-sync scale test added in the next commit, which drives thousands of roles through a full create, update, and delete lifecycle.

Add ShadowLinkRoleSyncScaleTest, which drives the roles migrator through a full create, update, and delete lifecycle at volume. Three parametrized points each stress one cost axis at a constant 5000 total members: many roles (5000 x 1), wide membership (50 x 100), and a large single member set (5 x 1000). Each phase asserts the destination mirrors the source, and the task settles ACTIVE at the end. Source mutations retry the transient not_leader/timeout errors that thousands of rapid role writes can provoke, and the controller-log guard is raised to admit the ~3*num_roles records the lifecycle writes.
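
A minimal sketch of the retry behaviour described above, assuming the transient not_leader/timeout failures surface as an exception type. `TransientError`, `retry_transient`, and `expected_controller_records` are hypothetical names for illustration, and the real test's retry policy and guard value may differ. The record estimate reads "~3*num_roles" as roughly one record per role for each of the create, update, and delete phases, which is an interpretation rather than something the commit message spells out.

```python
import time


class TransientError(Exception):
    """Hypothetical stand-in for a not_leader or timeout response."""


def retry_transient(op, *, attempts=5, backoff_s=0.0, sleep=time.sleep):
    """Call op(), retrying on TransientError with linear backoff."""
    last_error = None
    for attempt in range(attempts):
        try:
            return op()
        except TransientError as err:
            last_error = err
            sleep(backoff_s * (attempt + 1))
    # Every attempt failed: surface the last transient error.
    raise last_error


def expected_controller_records(num_roles: int) -> int:
    """Approximate records written by one create/update/delete lifecycle."""
    return 3 * num_roles
```

Non-transient exceptions are deliberately not caught, so a genuine failure in a source mutation still fails the test immediately.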

Seed a role on the source, enable role_sync_options via rpk, and assert the role and its member land on the shadow cluster. Depends on the rpk role-sync config support cherry-picked above.

nguyen-andrew · 2026-07-02T01:10:04Z

Force pushes:

Copilot AI review requested due to automatic review settings July 1, 2026 02:10

github-actions Bot added the area/redpanda label Jul 1, 2026

Copilot started reviewing on behalf of nguyen-andrew July 1, 2026 02:10

Copilot AI reviewed Jul 1, 2026

nguyen-andrew commented Jul 1, 2026

nguyen-andrew self-assigned this Jul 1, 2026

nguyen-andrew changed the title ~~security: avoid reactor stalls snapshotting large role stores~~ tests: scale coverage for shadow-link role sync Jul 1, 2026

nguyen-andrew requested review from dotnwat and pgellert July 1, 2026 02:39

nguyen-andrew force-pushed the sl-role-sync-scale branch from beca13a to 6326d14 Compare July 1, 2026 19:15

github-actions Bot added the area/build label Jul 1, 2026

nguyen-andrew requested a review from Copilot July 1, 2026 20:05

Copilot started reviewing on behalf of nguyen-andrew July 1, 2026 20:06

Copilot AI reviewed Jul 1, 2026

nguyen-andrew added 3 commits July 1, 2026 23:14

tests: cover role shadowing in rpk shadow-link e2e

f8b7d82

nguyen-andrew force-pushed the sl-role-sync-scale branch from 6326d14 to 58ec891 Compare July 2, 2026 01:06

nguyen-andrew wants to merge 3 commits into redpanda-data:dev from nguyen-andrew:sl-role-sync-scale