
cloud_topics: gate tiered_cloud on full upgrade and add tiered_v1/tiered_v2 storage modes #30966

Open
Lazin wants to merge 7 commits into redpanda-data:dev from Lazin:ct/potato-flag


Lazin (Contributor) commented Jun 30, 2026

This PR prepares the tiered_cloud storage mode (Cloud Topics based tiered
storage) for the 26.2 release: it makes the mode safe during rolling upgrades
and introduces its user-facing tiered_v1/tiered_v2 vocabulary.

Upgrade gating. The tiered_cloud_topics feature flag switches from
explicit_only to always, so it activates exactly when every broker runs
v26.2. While the cluster is only partially upgraded, creating a tiered_cloud
topic or converting a cloud topic to tiered_cloud is rejected (the
CreateTopics, AlterConfigs and IncrementalAlterConfigs gates already existed);
once the upgrade completes both operations work with no admin action. The same
flag now gates the L0 notifier: the post-compaction
set_min_allowed_local_threshold stm command is new in 26.2 and older
replicas cannot apply it, so until the flag activates every notification
attempt is a no-op that reports success. This is safe because no tiered_cloud
topic — the only reader of local data below the floor — can exist while the
flag is inactive, so the compaction sink can commit normally.
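
The gating described above can be modelled with a small Python sketch. This is a toy model, not the C++ implementation: the `Cluster` class, its fields and its method names are hypothetical, and the only behaviour it encodes is what this description states (the flag activates once every broker runs v26.2, tiered_cloud creation is rejected before that, and the notifier reports success without replicating).

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    """Toy model of a cluster during a rolling upgrade (hypothetical names)."""

    broker_versions: list[tuple[int, int]]
    notifications_replicated: int = 0

    def tiered_cloud_topics_active(self) -> bool:
        # Assumption: an "always" feature activates once every broker
        # runs v26.2 or newer, with no admin action.
        return all(v >= (26, 2) for v in self.broker_versions)

    def create_topic(self, storage_mode: str) -> str:
        # Create/alter gate: tiered_cloud is rejected until the flag is active.
        if storage_mode == "tiered_cloud" and not self.tiered_cloud_topics_active():
            return "rejected"
        return "created"

    def notify_min_allowed_local_threshold(self) -> bool:
        # L0 notifier gate: while the flag is inactive, skip replication and
        # report success so the compaction sink can still commit.
        if not self.tiered_cloud_topics_active():
            return True
        self.notifications_replicated += 1
        return True
```

In this model a cluster with versions `[(26, 1), (26, 2), (26, 2)]` rejects tiered_cloud creation and replicates nothing, while the same cluster with every broker at `(26, 2)` accepts the topic and replicates the notification.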

tiered_v1 / tiered_v2 vocabulary. A new cluster config
cloud_storage_default_mode (tiered_v1 | tiered_v2) selects which storage
mode the plain tiered value of the redpanda.storage.mode topic property
refers to: tiered_v1 is the classic Tiered Storage architecture, tiered_v2
is the Cloud Topics based one (internally still the tiered_cloud enum value,
which is not renamed). The explicit spellings always pick their variant, and
the internal tiered_cloud spelling is no longer accepted as input. On
describe, the variant matching the cluster config displays as tiered while
the other displays under its real name, computed at describe time. New
clusters default to tiered_v2; clusters upgraded from pre-26.2 keep
tiered_v1 through a legacy default, so an upgrade never changes what
tiered means.
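
As a rough illustration of the vocabulary rules above, here is a minimal Python sketch of alias resolution, describe-time display and the legacy default. The function names are hypothetical and are not the C++ helpers in this PR; the sketch also assumes that the non-default variant is displayed under its tiered_v1/tiered_v2 spelling.

```python
TIERED_V1 = "tiered_v1"  # classic Tiered Storage
TIERED_V2 = "tiered_v2"  # Cloud Topics based (internally still tiered_cloud)


def default_mode_for(original_version: tuple[int, int]) -> str:
    """Legacy default: clusters that started before v26.2 keep tiered_v1."""
    return TIERED_V1 if original_version < (26, 2) else TIERED_V2


def resolve_input(value: str, default_mode: str) -> str:
    """Map a user-supplied redpanda.storage.mode value to a stored variant."""
    if value == "tiered":
        # The plain alias follows cloud_storage_default_mode.
        return default_mode
    if value == "tiered_cloud":
        # The internal spelling is no longer accepted as input.
        raise ValueError("storage mode 'tiered_cloud' is not accepted")
    # Explicit tiered_v1/tiered_v2 (and unrelated modes) pass through.
    return value


def display_name(stored: str, default_mode: str) -> str:
    """Name shown on describe, computed from the current cluster config."""
    return "tiered" if stored == default_mode else stored
```

Because the display name is computed at describe time, a topic stored as tiered_v2 is shown as tiered while the config is tiered_v2 and as tiered_v2 after the config flips to tiered_v1; the stored variant itself does not change.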

Ducktape coverage: a mixed-version upgrade test (tiered_cloud creation and
cloud→tiered_cloud conversion rejected mid-upgrade, allowed after full
upgrade; upgraded clusters keep the tiered_v1 default) and a full matrix
test of both cloud_storage_default_mode values over every
redpanda.storage.mode input, including display re-labeling when the config
flips and alias resolution through the alter path.

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v26.1.x
  • v25.3.x
  • v25.2.x

Release Notes

Features

  • New cloud_storage_default_mode cluster config (tiered_v1 | tiered_v2)
    selects whether the tiered value of the redpanda.storage.mode topic
    property refers to classic Tiered Storage (tiered_v1) or Cloud Topics
    based tiered storage (tiered_v2). Freshly deployed clusters default to
    tiered_v2; upgraded clusters keep tiered_v1. The explicit
    tiered_v1/tiered_v2 values select a variant directly, and topics using
    tiered_v2 (or conversions from cloud) can only be created once the
    whole cluster runs v26.2.

Copilot AI review requested due to automatic review settings June 30, 2026 17:52

Copilot AI (Contributor) left a comment

Pull request overview

This PR introduces a new cluster feature gate to prevent a 26.2+ leader from replicating the new cloud-topics compaction set_min_allowed_local_threshold (MASH floor) command to pre-26.2 replicas during rolling upgrades, avoiding “unknown command key” failures on older nodes.

Changes:

  • Adds the ctp_min_allowed_local_threshold cluster feature (gated on the 26.2 logical version) and wires it into level_zero_notifier to short-circuit replication until the cluster is fully upgraded.
  • Introduces ctp_stm_api_errc::feature_disabled and threads it through relevant call sites (notifier, frontend, compaction sink) with appropriate retry/logging behavior.
  • Adds/updates unit and rptest coverage and Bazel dependencies to validate the feature gate behavior.

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| tests/rptest/tests/cluster_features_test.py | Excludes the new feature from an upgrade-status test that doesn’t exercise feature-specific setups. |
| src/v/features/feature_table.h | Registers the new ctp_min_allowed_local_threshold feature and schema entry. |
| src/v/features/feature_table.cc | Adds string conversion for the new feature. |
| src/v/cloud_topics/reconciler/reconciliation_source.cc | Handles the new feature_disabled error code in a switch. |
| src/v/cloud_topics/level_zero/stm/ctp_stm_api.h | Adds ctp_stm_api_errc::feature_disabled and formatting. |
| src/v/cloud_topics/level_zero/notifier/tests/level_zero_notifier_test.cc | Adds a unit test validating notifier gating when the feature is inactive. |
| src/v/cloud_topics/level_zero/notifier/tests/BUILD | Adds Bazel dependency on //src/v/features for the new test. |
| src/v/cloud_topics/level_zero/notifier/level_zero_notifier.h | Extends notifier wiring to accept a feature_table pointer and adds a gate helper. |
| src/v/cloud_topics/level_zero/notifier/level_zero_notifier.cc | Implements feature gate checks and returns feature_disabled when inactive. |
| src/v/cloud_topics/level_zero/notifier/BUILD | Adds Bazel dependency on //src/v/features for notifier feature checks. |
| src/v/cloud_topics/level_one/maintenance/compaction/compaction_sink.cc | Downgrades log severity for expected feature_disabled during upgrades. |
| src/v/cloud_topics/frontend/frontend.cc | Treats feature_disabled like other transient errors in error mapping. |
| src/v/cloud_topics/app.cc | Wires controller feature table into the level-zero notifier construction. |

Comment on lines 106 to 108
case ctp_stm_api_errc::failure:
case ctp_stm_api_errc::feature_disabled:
co_return std::unexpected(errc::failure);

Member

^^^ makes sense to me, but I'm not sure how reconciler treats failure vs timeout differently.

Contributor Author

the reconciler never replicates the command that can trigger this error

Lazin requested review from dotnwat and oleiman June 30, 2026 19:17

vbotbuildovich (Collaborator) commented Jun 30, 2026

CI test results

test results on build#86520
| test_status | test_class | test_method | test_arguments | test_kind | job_url | passed | reason | test_history |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FLAKY(PASS) | PrefixTruncateRecoveryTest | test_prefix_truncate_recovery | {"acks": -1, "start_empty": false} | integration | https://buildkite.com/redpanda/redpanda/builds/86520#019f19c3-a306-411a-8f8e-2206b33e13fd | 10/11 | Test PASSES after retries. No significant increase in flaky rate(baseline=0.0000, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=PrefixTruncateRecoveryTest&test_method=test_prefix_truncate_recovery |
| FLAKY(PASS) | TxAtomicProduceConsumeTest | test_basic_tx_consumer_transform_produce | {"with_failures": true} | integration | https://buildkite.com/redpanda/redpanda/builds/86520#019f19c3-a306-411a-8f8e-2206b33e13fd | 10/11 | Test PASSES after retries. No significant increase in flaky rate(baseline=0.0029, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=TxAtomicProduceConsumeTest&test_method=test_basic_tx_consumer_transform_produce |
test results on build#86549
| test_status | test_class | test_method | test_arguments | test_kind | job_url | passed | reason | test_history |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FLAKY(PASS) | ShadowLinkTopicFailoverTests | test_producer_ids_failover | {"storage_mode": "cloud"} | integration | https://buildkite.com/redpanda/redpanda/builds/86549#019f1a8d-0294-40b2-b8d2-84defc4bad58 | 19/21 | Test PASSES after retries. No significant increase in flaky rate(baseline=0.0069, p0=0.1293, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3917, trust_threshold=0.5000) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShadowLinkTopicFailoverTests&test_method=test_producer_ids_failover |
| FLAKY(PASS) | ShadowLinkingReplicationTests | test_auto_prefix_trimming | {"source_cluster_spec": {"cluster_type": "redpanda"}, "storage_mode": "cloud", "with_failures": true} | integration | https://buildkite.com/redpanda/redpanda/builds/86549#019f1a8f-e62d-476a-8a2f-43499345668a | 36/41 | Test PASSES after retries. No significant increase in flaky rate(baseline=0.0406, p0=0.0779, reject_threshold=0.0100. adj_baseline=0.1168, p1=0.4919, trust_threshold=0.5000) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShadowLinkingReplicationTests&test_method=test_auto_prefix_trimming |
| FLAKY(INCONCLUSIVE) | NodeWiseRecoveryTest | test_node_wise_recovery | {"dead_node_count": 2} | integration | https://buildkite.com/redpanda/redpanda/builds/86549#019f1a8d-0298-411f-be02-8b543622c80f | 17/20 | Test is INCONCLUSIVE after retries. Inconclusive result before max retries(baseline=0.0212, p0=0.1926, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.7361, trust_threshold=0.5000) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=NodeWiseRecoveryTest&test_method=test_node_wise_recovery |
test results on build#86734
| test_status | test_class | test_method | test_arguments | test_kind | job_url | passed | reason | test_history |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FAIL | TSWithAlreadyCompactedTopic | test_initial_upload | null | integration | https://buildkite.com/redpanda/redpanda/builds/86734#019f2a13-7f33-49e5-bf1f-6b8f86819069 | 0/11 | Test FAILS after retries. Significant increase in flaky rate(baseline=0.0000, p0=0.0000, reject_threshold=0.0100) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=TSWithAlreadyCompactedTopic&test_method=test_initial_upload |
| FAIL | ShadowLinkingRandomOpsTest | test_node_operations | {"failures": false, "workload_set": "flipping"} | integration | https://buildkite.com/redpanda/redpanda/builds/86734#019f2a13-7f35-45c3-a77a-1e763eee50c4 | 0/1 | | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShadowLinkingRandomOpsTest&test_method=test_node_operations |
| FAIL | ShadowLinkingRandomOpsTest | test_node_operations | {"failures": true, "workload_set": "flipping"} | integration | https://buildkite.com/redpanda/redpanda/builds/86734#019f2a13-7f37-4b1e-bdc6-ad776bb2c3a5 | 0/1 | | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShadowLinkingRandomOpsTest&test_method=test_node_operations |
| FAIL | src/v/security/tests/acl_store_fuzz | src/v/security/tests/acl_store_fuzz | | unit | https://buildkite.com/redpanda/redpanda/builds/86734#019f29f5-f8cf-427a-a03e-d02273094b89 | 0/1 | | |

Lazin (Contributor Author) commented Jun 30, 2026

force push: rebase with dev

Lazin requested a review from WillemKauf July 1, 2026 11:43

oleiman (Member) left a comment

lgtm modulo open question about how the reconciler should treat the error


Comment thread src/v/features/feature_table.h Outdated
shadow_link_sr_api_sync = 1ULL << 16U,
iceberg_extended_mode_config = 1ULL << 17U,
fetch_controller_snapshot_rpc = 1ULL << 18U,
ctp_min_allowed_local_threshold = 1ULL << 21U,

Member

nitpick: that's a weird place to put it

Lazin requested a review from oleiman July 2, 2026 15:07

# Gates replication of the cloud-topics compaction min-allowed-local-
# threshold command; exercising it needs cloud-topics compaction setup
# orthogonal to the finalization behavior under test.
"ctp_min_allowed_local_threshold",

Member

Need to either (1) introduce testing for this feature in the context of unfinalized upgrade or (2) explain here why it isn't needed.

Lazin added 3 commits July 3, 2026 13:59

Switch the flag from explicit_only to always so it activates as soon
as every node runs v26.2, with no admin action. The existing create
and alter-config gates keep rejecting tiered_cloud topics (and
cloud -> tiered_cloud conversions) while the cluster is only partially
upgraded, and open automatically once the upgrade completes.

set_feature_active now tolerates always-policy features jumping
straight from unavailable to active without an observable available
state.

Signed-off-by: Evgeny Lazin <4lazin@gmail.com>

The set_min_allowed_local_threshold stm command is new in v26.2, so
the compaction sink must not replicate it while pre-v26.2 replicas may
still be in the raft group. Gate the notifier on the
tiered_cloud_topics feature flag: until it activates (the cluster is
fully upgraded to v26.2) every notification attempt reports success
without replicating. This is safe because no tiered_cloud topic -- the
only reader of local data below the floor -- can exist while the flag
is inactive.

Signed-off-by: Evgeny Lazin <4lazin@gmail.com>

Start on the latest v26.1 release, upgrade a single node, and verify
that creating a tiered_cloud topic or converting a cloud topic to
tiered_cloud is rejected while the cluster is partially upgraded, then
succeeds once every node runs v26.2 and the flag auto-activates.

Signed-off-by: Evgeny Lazin <4lazin@gmail.com>

Lazin closed this Jul 3, 2026
Lazin force-pushed the ct/potato-flag branch from c59c254 to 96ee14f July 3, 2026 18:55

Lazin (Contributor Author) commented Jul 3, 2026

How come it's closed?

Lazin (Contributor Author) commented Jul 3, 2026

Reopening

Lazin added 4 commits July 3, 2026 17:39

Introduce the user-facing vocabulary for the tiered_cloud rename
without touching the redpanda_storage_mode enum. A new
cloud_storage_default_mode enum (tiered_v1 | tiered_v2) selects which
variant the plain 'tiered' alias refers to.

redpanda_storage_mode_from_user_string resolves the alias and rejects
the internal 'tiered_cloud' spelling; redpanda_storage_mode_user_name
displays the variant matching the default mode as 'tiered' and the
other one under its real name. The context-free parser additionally
accepts tiered_v1/tiered_v2 as static aliases.

Signed-off-by: Evgeny Lazin <4lazin@gmail.com>

Selects the meaning and display name of the 'tiered' storage mode
alias. Defaults to tiered_v2 (the Cloud Topics architecture) on new
clusters; a legacy default keeps tiered_v1 on any cluster whose
original version predates v26.2, so upgrades never change what
'tiered' means.

Signed-off-by: Evgeny Lazin <4lazin@gmail.com>

CreateTopics, AlterConfigs and IncrementalAlterConfigs parse
redpanda.storage.mode through the user vocabulary: 'tiered' resolves
via cloud_storage_default_mode, tiered_v1/tiered_v2 pick their variant
explicitly, and the internal 'tiered_cloud' spelling is rejected.
DescribeConfigs renders the variant matching the cluster config as
'tiered' and the other one under its real name, computed at describe
time. Error messages use the new names.

Signed-off-by: Evgeny Lazin <4lazin@gmail.com>

StorageModeAliasMatrixTest checks every cloud_storage_default_mode
value against every redpanda.storage.mode input: create outcome and
displayed mode, re-labeling of existing topics when the config flips
(proving which variant the alias stored), and alias resolution through
the alter path.

TopicSpec.STORAGE_MODE_TIERED_CLOUD now aliases the tiered_v2 spelling
since redpanda no longer accepts 'tiered_cloud'. Suites that rely on
the classic meaning and display name of 'tiered' pin
cloud_storage_default_mode=tiered_v1, and the tiered_cloud upgrade
test asserts that an upgraded cluster keeps the tiered_v1 legacy
default.

Signed-off-by: Evgeny Lazin <4lazin@gmail.com>

Lazin reopened this Jul 3, 2026
Lazin requested a review from a team as a code owner July 3, 2026 21:49
Lazin changed the title from "cloud_topics/l0: gate min allowed local threshold replication" to "cloud_topics: gate tiered_cloud on full upgrade and add tiered_v1/tiered_v2 storage modes" Jul 3, 2026

vbotbuildovich (Collaborator) commented Jul 3, 2026

Retry command for Build#86734

please wait until all jobs are finished before running the slash command

/ci-repeat 1
skip-redpanda-build
skip-units
skip-rebase
tests/rptest/tests/shadow_linking_rnot_test.py::ShadowLinkingRandomOpsTest.test_node_operations@{"failures":true,"workload_set":"flipping"}
tests/rptest/tests/shadow_linking_rnot_test.py::ShadowLinkingRandomOpsTest.test_node_operations@{"failures":false,"workload_set":"flipping"}
tests/rptest/tests/shadow_indexing_compacted_topic_test.py::TSWithAlreadyCompactedTopic.test_initial_upload
