Description
A CockroachDB node can become permanently stuck in the DECOMMISSIONING membership state if the user scales up the cluster (cockroach_cr.nodes) while a previous downscale operation is still in progress.
The operator logic for ReconcileDecommssion relies on the condition currentReplicas > cr.nodes. If a user upscales (making cr.nodes > currentReplicas) while a node is in the intermediate DECOMMISSIONING state, the operator exits the reconcile loop for decommissioning. Consequently, the node is neither fully decommissioned nor returned to ACTIVE status.
Root Cause Analysis
The issue lies in the entry conditions for the ReconcileDecommssion action. The operator triggers this action only if:
1. The cluster is initialized.
2. stsStatus.replicas == stsStatus.currentReplicas
3. stsStatus.currentReplicas > cockroach_cr.nodes (intent to downscale).
The Failure Scenario:
- User initiates downscale (e.g., 5 -> 3). Condition (3) is met.
- Operator calls Decommission(node). The node enters the DECOMMISSIONING state.
- Decommission(node) returns an error (e.g., network timeout, data moving too slowly). The operator returns early to retry later.
- User initiates upscale (e.g., 3 -> 5).
- On the next reconcile, Condition (3) evaluates to false because currentReplicas is no longer greater than cr.nodes.
- The operator skips the ReconcileDecommssion block entirely. The node remains DECOMMISSIONING indefinitely.
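The gating problem can be illustrated with a minimal Go sketch. This is not the operator's actual code: the clusterState struct and the shouldReconcileDecommission function are hypothetical names that only model the three entry conditions described above.

```go
package main

import "fmt"

// clusterState is a hypothetical, simplified view of what the reconciler
// inspects. Field names are illustrative, not the operator's real types.
type clusterState struct {
	initialized     bool
	stsReplicas     int32 // stsStatus.replicas
	currentReplicas int32 // stsStatus.currentReplicas
	crNodes         int32 // cockroach_cr.nodes
}

// shouldReconcileDecommission models the three entry conditions.
// Note that it never looks at node membership, only at replica counts.
func shouldReconcileDecommission(s clusterState) bool {
	return s.initialized &&
		s.stsReplicas == s.currentReplicas &&
		s.currentReplicas > s.crNodes
}

func main() {
	// Downscale 5 -> 3 requested: the decommission path is entered.
	s := clusterState{initialized: true, stsReplicas: 5, currentReplicas: 5, crNodes: 3}
	fmt.Println("downscale in progress:", shouldReconcileDecommission(s))

	// Decommission(node) failed, so the StatefulSet still has 5 replicas.
	// The user sets nodes back to 5: the gate closes even though one node
	// is still DECOMMISSIONING.
	s.crNodes = 5
	fmt.Println("after upscale:", shouldReconcileDecommission(s))
}
```

Because the predicate depends only on replica counts, nothing in it can notice a node left in the intermediate membership state once the counts match again.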
Steps to Reproduce
Prerequisites:
- A running CockroachDB cluster managed by the operator (5 nodes).
- (Optional) ChaosMesh installed to inject network faults.
- Initialize Workload:
  Run the movr workload to generate sufficient data:
  cockroach workload init movr --num-histories 1000000 --num-rides 100000 --num-users 100000 --num-vehicles 100000
- Inject Fault:
  Use ChaosMesh to limit the Pod bandwidth to 1 kbps. This ensures the decommissioning process stalls or errors out due to slow data replication.
- Trigger Downscale:
  Update the CockroachDB CR to reduce the replica count (e.g., 5 -> 3).
- Verify State:
  Wait until the target node enters the DECOMMISSIONING state:
  cockroach node status --insecure --decommission
- Trigger Upscale:
  Immediately update the CockroachDB CR to increase the replica count (e.g., back to 5 or higher).
Observed Behavior
The node previously targeted for removal remains in the DECOMMISSIONING state while new nodes are added. It does not revert to ACTIVE.
Log Output:
bash-5.1$ cockroach node status --insecure --decommission
id | address | sql_address | build | started_at | updated_at | locality | attrs | is_available | is_live | gossiped_replicas | is_decommissioning | membership | is_draining
-----+-----------------------------------------------------------+-----------------------------------------------------------+---------+--------------------------------------+--------------------------------------+----------+-------+--------------+---------+-------------------+--------------------+-----------------+--------------
1 | cockroachdb-0.cockroachdb.cockroach-operator-system:26258 | cockroachdb-0.cockroachdb.cockroach-operator-system:26257 | v25.4.2 | 2026-01-12 02:54:10.174444 +0000 UTC | 2026-01-12 03:07:17.954084 +0000 UTC | | [] | true | true | 104 | false | active | false
2 | cockroachdb-1.cockroachdb.cockroach-operator-system:26258 | cockroachdb-1.cockroachdb.cockroach-operator-system:26257 | v25.4.2 | 2026-01-12 02:55:32.643908 +0000 UTC | 2026-01-12 03:07:17.797383 +0000 UTC | | [] | true | true | 98 | false | active | false
3 | cockroachdb-2.cockroachdb.cockroach-operator-system:26258 | cockroachdb-2.cockroachdb.cockroach-operator-system:26257 | v25.4.2 | 2026-01-12 02:55:34.215002 +0000 UTC | 2026-01-12 03:07:16.41831 +0000 UTC | | [] | true | true | 100 | false | active | false
4 | cockroachdb-4.cockroachdb.cockroach-operator-system:26258 | cockroachdb-4.cockroachdb.cockroach-operator-system:26257 | v25.4.2 | 2026-01-12 02:39:19.128573 +0000 UTC | 2026-01-12 03:07:16.193822 +0000 UTC | | [] | true | true | 40 | true | decommissioning | false
5 | cockroachdb-3.cockroachdb.cockroach-operator-system:26258 | cockroachdb-3.cockroachdb.cockroach-operator-system:26257 | v25.4.2 | 2026-01-12 02:56:23.412081 +0000 UTC | 2026-01-12 03:07:18.188583 +0000 UTC | | [] | true | true | 104 | false | active | false
(5 rows)
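For an automated reproduction, the stuck node can be detected by scanning this table for rows whose membership column reads decommissioning. The following Go sketch is one way to do that. The decommissioningNodes helper is a hypothetical name, and the sample input is an abbreviated three-column version of the output above. Columns are located by header name, so the full table should parse the same way.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// decommissioningNodes is a hypothetical helper. It scans the text printed by
// "cockroach node status --decommission" and returns "id address" for every
// row whose membership column is "decommissioning".
func decommissioningNodes(output string) []string {
	var stuck []string
	idCol, addrCol, memCol := -1, -1, -1
	sc := bufio.NewScanner(strings.NewReader(output))
	for sc.Scan() {
		fields := strings.Split(sc.Text(), "|")
		for i := range fields {
			fields[i] = strings.TrimSpace(fields[i])
		}
		// Until the header row has been seen, only look for column names.
		if idCol < 0 || addrCol < 0 || memCol < 0 {
			for i, f := range fields {
				switch f {
				case "id":
					idCol = i
				case "address":
					addrCol = i
				case "membership":
					memCol = i
				}
			}
			continue
		}
		// Skip the separator line, the "(N rows)" footer, and short lines.
		if len(fields) <= idCol || len(fields) <= addrCol || len(fields) <= memCol {
			continue
		}
		if fields[memCol] == "decommissioning" {
			stuck = append(stuck, fields[idCol]+" "+fields[addrCol])
		}
	}
	return stuck
}

func main() {
	// Abbreviated sample: only three of the real columns are kept.
	sample := `id | address | membership
-----+-----------------------------------------------------------+-----------------
1 | cockroachdb-0.cockroachdb.cockroach-operator-system:26258 | active
4 | cockroachdb-4.cockroachdb.cockroach-operator-system:26258 | decommissioning
5 | cockroachdb-3.cockroachdb.cockroach-operator-system:26258 | active
(3 rows)`
	for _, n := range decommissioningNodes(sample) {
		fmt.Println("still decommissioning:", n)
	}
}
```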
Expected Behavior
If the operator detects that the desired replica count has increased (upscale) while a node is currently DECOMMISSIONING:
- The operator should detect the intermediate state.
- It should explicitly recommission the node (cancel the decommission) to return it to ACTIVE status before proceeding with the upscale.
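One possible shape for this check is sketched below in Go. It is only an illustration under stated assumptions: the node struct and the nodesToRecommission function are hypothetical, and the sketch assumes the operator can map each CockroachDB node ID to its StatefulSet pod ordinal. It selects nodes that are DECOMMISSIONING even though their pod is within the desired node count. Those are the nodes the operator would then recommission, which is what `cockroach node recommission <id>` does from the CLI.

```go
package main

import "fmt"

// node is a hypothetical record combining CockroachDB membership with the
// StatefulSet pod ordinal (e.g. ordinal 4 for pod cockroachdb-4).
type node struct {
	id         int    // CockroachDB node ID
	ordinal    int32  // StatefulSet pod ordinal
	membership string // "active", "decommissioning", or "decommissioned"
}

// nodesToRecommission returns the IDs of nodes that are still decommissioning
// although their pod ordinal is below the desired node count, meaning the
// pod is supposed to stay in the cluster.
func nodesToRecommission(crNodes int32, nodes []node) []int {
	var ids []int
	for _, n := range nodes {
		if n.membership == "decommissioning" && n.ordinal < crNodes {
			ids = append(ids, n.id)
		}
	}
	return ids
}

func main() {
	// Node IDs, ordinals, and membership taken from the log output in this report.
	nodes := []node{
		{id: 1, ordinal: 0, membership: "active"},
		{id: 2, ordinal: 1, membership: "active"},
		{id: 3, ordinal: 2, membership: "active"},
		{id: 4, ordinal: 4, membership: "decommissioning"},
		{id: 5, ordinal: 3, membership: "active"},
	}
	fmt.Println("nodes=5:", nodesToRecommission(5, nodes)) // node 4 should be recommissioned
	fmt.Println("nodes=3:", nodesToRecommission(3, nodes)) // downscale still intended, nothing to undo
}
```

Running such a check before the replica-count comparison, rather than inside the downscale branch, would keep it reachable after an upscale.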
Severity
Major
(But this is not a production failure)