cloud_topics: gate tiered_cloud on full upgrade and add tiered_v1/tiered_v2 storage modes #30966
Lazin wants to merge 7 commits into
Pull request overview
This PR introduces a new cluster feature gate to prevent a 26.2+ leader from replicating the new cloud-topics compaction set_min_allowed_local_threshold (MASH floor) command to pre-26.2 replicas during rolling upgrades, avoiding “unknown command key” failures on older nodes.
Changes:
- Adds the `ctp_min_allowed_local_threshold` cluster feature (gated on the 26.2 logical version) and wires it into `level_zero_notifier` to short-circuit replication until the cluster is fully upgraded.
- Introduces `ctp_stm_api_errc::feature_disabled` and threads it through relevant call sites (notifier, frontend, compaction sink) with appropriate retry/logging behavior.
- Adds/updates unit and rptest coverage and Bazel dependencies to validate the feature gate behavior.
Reviewed changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| tests/rptest/tests/cluster_features_test.py | Excludes the new feature from an upgrade-status test that doesn’t exercise feature-specific setups. |
| src/v/features/feature_table.h | Registers the new ctp_min_allowed_local_threshold feature and schema entry. |
| src/v/features/feature_table.cc | Adds string conversion for the new feature. |
| src/v/cloud_topics/reconciler/reconciliation_source.cc | Handles the new feature_disabled error code in a switch. |
| src/v/cloud_topics/level_zero/stm/ctp_stm_api.h | Adds ctp_stm_api_errc::feature_disabled and formatting. |
| src/v/cloud_topics/level_zero/notifier/tests/level_zero_notifier_test.cc | Adds a unit test validating notifier gating when the feature is inactive. |
| src/v/cloud_topics/level_zero/notifier/tests/BUILD | Adds Bazel dependency on //src/v/features for the new test. |
| src/v/cloud_topics/level_zero/notifier/level_zero_notifier.h | Extends notifier wiring to accept a feature_table pointer and adds a gate helper. |
| src/v/cloud_topics/level_zero/notifier/level_zero_notifier.cc | Implements feature gate checks and returns feature_disabled when inactive. |
| src/v/cloud_topics/level_zero/notifier/BUILD | Adds Bazel dependency on //src/v/features for notifier feature checks. |
| src/v/cloud_topics/level_one/maintenance/compaction/compaction_sink.cc | Downgrades log severity for expected feature_disabled during upgrades. |
| src/v/cloud_topics/frontend/frontend.cc | Treats feature_disabled like other transient errors in error mapping. |
| src/v/cloud_topics/app.cc | Wires controller feature table into the level-zero notifier construction. |
    case ctp_stm_api_errc::failure:
    case ctp_stm_api_errc::feature_disabled:
        co_return std::unexpected(errc::failure);
^^^ makes sense to me, but I'm not sure how reconciler treats failure vs timeout differently.
the reconciler never replicates the command that can trigger this error
force push: rebase with dev
oleiman
left a comment
lgtm modulo open question about how the reconciler should treat the error
    shadow_link_sr_api_sync = 1ULL << 16U,
    iceberg_extended_mode_config = 1ULL << 17U,
    fetch_controller_snapshot_rpc = 1ULL << 18U,
    ctp_min_allowed_local_threshold = 1ULL << 21U,
nitpick: that's a weird place to put it
    # Gates replication of the cloud-topics compaction min-allowed-local-
    # threshold command; exercising it needs cloud-topics compaction setup
    # orthogonal to the finalization behavior under test.
    "ctp_min_allowed_local_threshold",
Need to either (1) introduce testing for this feature in the context of unfinalized upgrade or (2) explain here why it isn't needed.
Switch the flag from explicit_only to always so it activates as soon as every node runs v26.2, with no admin action. The existing create and alter-config gates keep rejecting tiered_cloud topics (and cloud -> tiered_cloud conversions) while the cluster is only partially upgraded, and open automatically once the upgrade completes. set_feature_active now tolerates always-policy features jumping straight from unavailable to active without an observable available state. Signed-off-by: Evgeny Lazin <4lazin@gmail.com>
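The effect of the policy switch can be pictured with a toy model. This is only a sketch, not Redpanda's feature table: `Policy`, `FeatureModel`, `admin_activated` and the integer version ordinals are hypothetical names chosen for illustration, and the model assumes a feature becomes eligible once the lowest version among all nodes reaches the version that introduced it.

```python
from dataclasses import dataclass
from enum import Enum


class Policy(Enum):
    ALWAYS = "always"
    EXPLICIT_ONLY = "explicit_only"


@dataclass
class FeatureModel:
    """Toy model of a version-gated cluster feature (hypothetical, illustration only)."""

    required_version: int
    policy: Policy
    admin_activated: bool = False

    def is_active(self, node_versions: list[int]) -> bool:
        # Eligible only once every node has reached the required version.
        eligible = bool(node_versions) and min(node_versions) >= self.required_version
        if not eligible:
            return False
        # 'always' activates with no admin action; 'explicit_only' waits for one.
        if self.policy is Policy.ALWAYS:
            return True
        return self.admin_activated


# Placeholder ordinals standing in for the v26.1 and v26.2 logical versions.
V26_1, V26_2 = 1, 2

flag = FeatureModel(required_version=V26_2, policy=Policy.ALWAYS)
mixed = flag.is_active([V26_1, V26_2, V26_2])     # one broker still on v26.1
upgraded = flag.is_active([V26_2, V26_2, V26_2])  # every broker on v26.2
```

In this model the `always` flag is inactive for the mixed cluster and active for the fully upgraded one, while an `explicit_only` flag would stay inactive in both cases until `admin_activated` is set.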
The set_min_allowed_local_threshold stm command is new in v26.2, so the compaction sink must not replicate it while pre-v26.2 replicas may still be in the raft group. Gate the notifier on the tiered_cloud_topics feature flag: until it activates (the cluster is fully upgraded to v26.2) every notification attempt reports success without replicating. This is safe because no tiered_cloud topic -- the only reader of local data below the floor -- can exist while the flag is inactive. Signed-off-by: Evgeny Lazin <4lazin@gmail.com>
Start on the latest v26.1 release, upgrade a single node, and verify that creating a tiered_cloud topic or converting a cloud topic to tiered_cloud is rejected while the cluster is partially upgraded, then succeeds once every node runs v26.2 and the flag auto-activates. Signed-off-by: Evgeny Lazin <4lazin@gmail.com>
How come that it's closed?
Reopening
Introduce the user-facing vocabulary for the tiered_cloud rename without touching the redpanda_storage_mode enum. A new cloud_storage_default_mode enum (tiered_v1 | tiered_v2) selects which variant the plain 'tiered' alias refers to. redpanda_storage_mode_from_user_string resolves the alias and rejects the internal 'tiered_cloud' spelling; redpanda_storage_mode_user_name displays the variant matching the default mode as 'tiered' and the other one under its real name. The context-free parser additionally accepts tiered_v1/tiered_v2 as static aliases. Signed-off-by: Evgeny Lazin <4lazin@gmail.com>
Selects the meaning and display name of the 'tiered' storage mode alias. Defaults to tiered_v2 (the Cloud Topics architecture) on new clusters; a legacy default keeps tiered_v1 on any cluster whose original version predates v26.2, so upgrades never change what 'tiered' means. Signed-off-by: Evgeny Lazin <4lazin@gmail.com>
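The legacy-default rule amounts to a comparison against the version the cluster was first deployed on. The sketch below assumes that version is available as a `(major, minor)` tuple; `default_mode_for` and `INTRODUCED_IN` are hypothetical names, and how the real code tracks the original version is not shown in this PR text.

```python
# Version in which tiered_v2 becomes the default for new clusters.
INTRODUCED_IN = (26, 2)


def default_mode_for(original_version: tuple[int, int]) -> str:
    """Pick the default mode from the version the cluster was first deployed on.

    Hypothetical helper: clusters that started before v26.2 keep tiered_v1,
    so an upgrade never changes what 'tiered' means.
    """
    if original_version < INTRODUCED_IN:
        return "tiered_v1"
    return "tiered_v2"
```

For example, a cluster first deployed on v26.1 and later upgraded would resolve to `tiered_v1`, while a cluster created directly on v26.2 would resolve to `tiered_v2`.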
CreateTopics, AlterConfigs and IncrementalAlterConfigs parse redpanda.storage.mode through the user vocabulary: 'tiered' resolves via cloud_storage_default_mode, tiered_v1/tiered_v2 pick their variant explicitly, and the internal 'tiered_cloud' spelling is rejected. DescribeConfigs renders the variant matching the cluster config as 'tiered' and the other one under its real name, computed at describe time. Error messages use the new names. Signed-off-by: Evgeny Lazin <4lazin@gmail.com>
StorageModeAliasMatrixTest checks every cloud_storage_default_mode value against every redpanda.storage.mode input: create outcome and displayed mode, re-labeling of existing topics when the config flips (proving which variant the alias stored), and alias resolution through the alter path. TopicSpec.STORAGE_MODE_TIERED_CLOUD now aliases the tiered_v2 spelling since redpanda no longer accepts 'tiered_cloud'. Suites that rely on the classic meaning and display name of 'tiered' pin cloud_storage_default_mode=tiered_v1, and the tiered_cloud upgrade test asserts that an upgraded cluster keeps the tiered_v1 legacy default. Signed-off-by: Evgeny Lazin <4lazin@gmail.com>
Retry command for Build#86734: please wait until all jobs are finished before running the slash command
This PR prepares the `tiered_cloud` storage mode (Cloud Topics based tiered storage) for the 26.2 release: it makes the mode safe during rolling upgrades and introduces its user-facing `tiered_v1`/`tiered_v2` vocabulary.

Upgrade gating. The `tiered_cloud_topics` feature flag switches from `explicit_only` to `always`, so it activates exactly when every broker runs v26.2. While the cluster is only partially upgraded, creating a `tiered_cloud` topic or converting a `cloud` topic to `tiered_cloud` is rejected (the CreateTopics, AlterConfigs and IncrementalAlterConfigs gates already existed); once the upgrade completes both operations work with no admin action. The same flag now gates the L0 notifier: the post-compaction `set_min_allowed_local_threshold` stm command is new in 26.2 and older replicas cannot apply it, so until the flag activates every notification attempt is a no-op that reports success. This is safe because no `tiered_cloud` topic (the only reader of local data below the floor) can exist while the flag is inactive, so the compaction sink can commit normally.
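The notifier gate described above can be sketched as a wrapper that checks the flag before replicating. This is a minimal illustration, not the C++ implementation: `GatedNotifier`, `feature_active`, `replicate` and `fake_replicate` are hypothetical names, and the real notifier is asynchronous.

```python
from typing import Callable


class GatedNotifier:
    """Illustrative stand-in for the L0 notifier gate (hypothetical names)."""

    def __init__(
        self,
        feature_active: Callable[[], bool],
        replicate: Callable[[int], bool],
    ) -> None:
        self._feature_active = feature_active
        self._replicate = replicate

    def notify(self, threshold: int) -> bool:
        if not self._feature_active():
            # Flag inactive: older replicas may still be in the raft group,
            # so skip replication and report success to the caller.
            return True
        return self._replicate(threshold)


# Record what would have been replicated.
sent: list[int] = []


def fake_replicate(threshold: int) -> bool:
    sent.append(threshold)
    return True


active = {"value": False}
notifier = GatedNotifier(lambda: active["value"], fake_replicate)

before = notifier.notify(100)  # partially upgraded: nothing is replicated
active["value"] = True         # flag activates once the upgrade completes
after = notifier.notify(200)   # now the command is replicated
```

Both calls report success, but only the second reaches `fake_replicate`, which mirrors why the compaction sink can commit normally during the upgrade.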
tiered_v1 / tiered_v2 vocabulary. A new cluster config `cloud_storage_default_mode` (`tiered_v1` | `tiered_v2`) selects which storage mode the plain `tiered` value of the `redpanda.storage.mode` topic property refers to: `tiered_v1` is the classic Tiered Storage architecture, `tiered_v2` is the Cloud Topics based one (internally still the `tiered_cloud` enum value, which is not renamed). The explicit spellings always pick their variant, and the internal `tiered_cloud` spelling is no longer accepted as input. On describe, the variant matching the cluster config displays as `tiered` while the other displays under its real name, computed at describe time. New clusters default to `tiered_v2`; clusters upgraded from pre-26.2 keep `tiered_v1` through a legacy default, so an upgrade never changes what `tiered` means.

Ducktape coverage: a mixed-version upgrade test (tiered_cloud creation and cloud→tiered_cloud conversion rejected mid-upgrade, allowed after full upgrade; upgraded clusters keep the `tiered_v1` default) and a full matrix test of both `cloud_storage_default_mode` values over every `redpanda.storage.mode` input, including display re-labeling when the config flips and alias resolution through the alter path.
Backports Required
Release Notes
Features
- New `cloud_storage_default_mode` cluster config (`tiered_v1` | `tiered_v2`) selects whether the `tiered` value of the `redpanda.storage.mode` topic property refers to classic Tiered Storage (`tiered_v1`) or Cloud Topics based tiered storage (`tiered_v2`). Freshly deployed clusters default to `tiered_v2`; upgraded clusters keep `tiered_v1`. The explicit `tiered_v1`/`tiered_v2` values select a variant directly, and topics using `tiered_v2` (or conversions from `cloud`) can only be created once the whole cluster runs v26.2.