Node stuck in DECOMMISSIONING state when upscale interrupts an ongoing downscale #1138

@kos-team

Description

A CockroachDB node can become permanently stuck in the DECOMMISSIONING membership state if the user scales up the cluster (cockroach_cr.nodes) while a previous downscale operation is still in progress.

The operator runs its ReconcileDecommssion logic only while currentReplicas > cockroach_cr.nodes. If the user upscales (so that cockroach_cr.nodes >= currentReplicas) while a node is in the intermediate DECOMMISSIONING state, that condition becomes false and the operator stops running the decommission logic. Consequently, the node is neither fully decommissioned nor returned to ACTIVE status.

Root Cause Analysis

The issue lies in the entry conditions for the ReconcileDecommssion action. The operator triggers this action only if:

  1. The Cluster is initialized.
  2. stsStatus.replicas == stsStatus.currentReplicas
  3. stsStatus.currentReplicas > cockroach_cr.nodes (Intent to downscale).

The Failure Scenario:

  1. User initiates downscale (e.g., 5 -> 3). Condition (3) is met.
  2. Operator calls Decommission(node). The node enters the DECOMMISSIONING state.
  3. Decommission(node) returns an error (e.g., network timeout, data moving too slowly). The operator returns early to retry later.
  4. User initiates upscale (e.g., 3 -> 5).
  5. On the next reconcile, Condition (3) evaluates to false because stsStatus.currentReplicas is no longer greater than cockroach_cr.nodes.
  6. The operator skips the ReconcileDecommssion block entirely. The node remains DECOMMISSIONING indefinitely.
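
The scenario above can be replayed with a small model of the entry conditions. This is a minimal sketch, not the operator's code: `StsStatus`, `should_reconcile_decommission`, and the field names are hypothetical stand-ins that mirror conditions (1) to (3), and it assumes the StatefulSet still reports 5 replicas because the decommission never completed.

```python
from dataclasses import dataclass


@dataclass
class StsStatus:
    """Hypothetical stand-in for the StatefulSet status fields named above."""
    replicas: int
    current_replicas: int


def should_reconcile_decommission(initialized: bool, sts: StsStatus, cr_nodes: int) -> bool:
    """Model of entry conditions (1)-(3); not the operator's actual code."""
    return (
        initialized
        and sts.replicas == sts.current_replicas
        and sts.current_replicas > cr_nodes
    )


# Assumption: the StatefulSet stays at 5 replicas while the decommission is retrying.
sts = StsStatus(replicas=5, current_replicas=5)

during_downscale = should_reconcile_decommission(True, sts, cr_nodes=3)  # step 1: 5 -> 3
after_upscale = should_reconcile_decommission(True, sts, cr_nodes=5)     # step 4: 3 -> 5

print(during_downscale, after_upscale)
```

Nothing in this predicate looks at node membership, so a node left in DECOMMISSIONING by step 3 becomes invisible to the operator once the replica counts match again.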

Steps to Reproduce

Prerequisites:

  • A running CockroachDB cluster managed by the operator (5 nodes).
  • (Optional) ChaosMesh installed to inject network faults.

Steps:

  1. Initialize Workload:
    Run the movr workload to generate sufficient data:
      cockroach workload init movr --num-histories 1000000 --num-rides 100000 --num-users 100000 --num-vehicles 100000
  2. Inject Fault:
    Use ChaosMesh to limit the Pod bandwidth to 1kbps. This ensures the decommissioning process stalls or errors out due to slow data replication.
  3. Trigger Downscale:
    Update the CockroachDB CR to reduce the replica count (e.g., 5 -> 3).
  4. Verify State:
    Wait until the target node enters the DECOMMISSIONING state:
      cockroach node status --insecure --decommission
  5. Trigger Upscale:
    Immediately update the CockroachDB CR to increase the replica count (e.g., back to 5 or higher).

Observed Behavior

The node previously targeted for removal remains in the DECOMMISSIONING state while new nodes are added. It does not revert to ACTIVE.

Log Output:

bash-5.1$ cockroach node status --insecure --decommission
  id |                          address                          |                        sql_address                        |  build  |              started_at              |              updated_at              | locality | attrs | is_available | is_live | gossiped_replicas | is_decommissioning |   membership    | is_draining
-----+-----------------------------------------------------------+-----------------------------------------------------------+---------+--------------------------------------+--------------------------------------+----------+-------+--------------+---------+-------------------+--------------------+-----------------+--------------
   1 | cockroachdb-0.cockroachdb.cockroach-operator-system:26258 | cockroachdb-0.cockroachdb.cockroach-operator-system:26257 | v25.4.2 | 2026-01-12 02:54:10.174444 +0000 UTC | 2026-01-12 03:07:17.954084 +0000 UTC |          | []    | true         | true    |               104 | false              | active          | false
   2 | cockroachdb-1.cockroachdb.cockroach-operator-system:26258 | cockroachdb-1.cockroachdb.cockroach-operator-system:26257 | v25.4.2 | 2026-01-12 02:55:32.643908 +0000 UTC | 2026-01-12 03:07:17.797383 +0000 UTC |          | []    | true         | true    |                98 | false              | active          | false
   3 | cockroachdb-2.cockroachdb.cockroach-operator-system:26258 | cockroachdb-2.cockroachdb.cockroach-operator-system:26257 | v25.4.2 | 2026-01-12 02:55:34.215002 +0000 UTC | 2026-01-12 03:07:16.41831 +0000 UTC  |          | []    | true         | true    |               100 | false              | active          | false
   4 | cockroachdb-4.cockroachdb.cockroach-operator-system:26258 | cockroachdb-4.cockroachdb.cockroach-operator-system:26257 | v25.4.2 | 2026-01-12 02:39:19.128573 +0000 UTC | 2026-01-12 03:07:16.193822 +0000 UTC |          | []    | true         | true    |                40 | true               | decommissioning | false
   5 | cockroachdb-3.cockroachdb.cockroach-operator-system:26258 | cockroachdb-3.cockroachdb.cockroach-operator-system:26257 | v25.4.2 | 2026-01-12 02:56:23.412081 +0000 UTC | 2026-01-12 03:07:18.188583 +0000 UTC |          | []    | true         | true    |               104 | false              | active          | false
(5 rows)
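
To check for this condition without reading the table by eye, the output can be filtered on the membership column. The following is a minimal sketch that assumes the pipe-delimited table layout shown above; `decommissioning_nodes` is a hypothetical helper name, and the sample text is an abbreviated copy of the output (only the id, address, and membership columns).

```python
def decommissioning_nodes(table_text: str) -> list[dict[str, str]]:
    """Return the rows whose membership column reads 'decommissioning'.

    Assumes the pipe-delimited table printed by `cockroach node status`;
    the separator line and the '(N rows)' footer contain no '|' and are skipped.
    """
    lines = [ln for ln in table_text.splitlines() if "|" in ln]
    if not lines:
        return []
    header = [cell.strip() for cell in lines[0].split("|")]
    stuck = []
    for ln in lines[1:]:
        cells = [cell.strip() for cell in ln.split("|")]
        if len(cells) != len(header):
            continue  # skip anything that does not match the header width
        row = dict(zip(header, cells))
        if row.get("membership") == "decommissioning":
            stuck.append(row)
    return stuck


# Abbreviated sample based on the output above (most columns dropped).
SAMPLE = """\
  id |                          address                          |   membership
-----+-----------------------------------------------------------+-----------------
   1 | cockroachdb-0.cockroachdb.cockroach-operator-system:26258 | active
   4 | cockroachdb-4.cockroachdb.cockroach-operator-system:26258 | decommissioning
(2 rows)
"""

stuck = decommissioning_nodes(SAMPLE)
print([row["id"] for row in stuck])
```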

Expected Behavior

If the operator detects that the desired replica count has increased (upscale) while a node is currently DECOMMISSIONING:

  1. The operator should detect the intermediate state.
  2. It should explicitly recommission the node (cancel the decommission) to return it to ACTIVE status before proceeding with the upscale.
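
One way to express this expected behavior is a decision step that looks at node membership in addition to the replica counts. The sketch below is an illustration under assumptions, not a proposed patch: `plan_action` and the `Action` names are hypothetical, and the actual recommission call (for example via `cockroach node recommission`) is left out.

```python
from enum import Enum


class Action(Enum):
    DECOMMISSION = "decommission"   # continue or start the downscale path
    RECOMMISSION = "recommission"   # cancel a leftover decommission
    NONE = "none"                   # nothing to clean up


def plan_action(current_replicas: int, desired_nodes: int,
                decommissioning_ids: list[int]) -> Action:
    """Hypothetical decision step: check membership even when not downscaling."""
    if current_replicas > desired_nodes:
        # A downscale is still requested, so the existing path applies.
        return Action.DECOMMISSION
    if decommissioning_ids:
        # No downscale is requested, yet some node is still DECOMMISSIONING:
        # return it to ACTIVE before proceeding with any upscale.
        return Action.RECOMMISSION
    return Action.NONE


print(plan_action(5, 3, []))     # downscale in progress
print(plan_action(5, 5, [4]))    # upscale interrupted a downscale
print(plan_action(5, 5, []))     # steady state
```

The point of the ordering is that the membership check still runs when the replica comparison no longer indicates a downscale, which is exactly the case the current entry conditions skip.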

Severity

Major
(But this is not a production failure)
