diff --git a/src/current/_includes/v25.2/essential-alerts.md b/src/current/_includes/v25.2/essential-alerts.md index dd311c760c6..dbc9dddc8fe 100644 --- a/src/current/_includes/v25.2/essential-alerts.md +++ b/src/current/_includes/v25.2/essential-alerts.md @@ -318,9 +318,9 @@ Send an alert when the number of ranges with replication below the replication f - Refer to [Replication issues]({% link {{ page.version.version }}/cluster-setup-troubleshooting.md %}#replication-issues). -### Requests stuck in raft +### Requests stuck in Raft -Send an alert when requests are taking a very long time in replication. An (evaluated) request has to pass through the replication layer, notably the quota pool and raft. If it fails to do so within a highly permissive duration, the gauge is incremented (and decremented again once the request is either applied or returns an error). A nonzero value indicates range or replica unavailability, and should be investigated. +Send an alert when requests are taking a very long time in replication. An (evaluated) request has to pass through the replication layer, notably the quota pool and Raft. If it fails to do so within a highly permissive duration, the gauge is incremented (and decremented again once the request is either applied or returns an error). A nonzero value indicates range or replica unavailability, and should be investigated. This can also be a symptom of a [leader-leaseholder split]({% link {{ page.version.version }}/architecture/replication-layer.md %}#leader-leaseholder-splits). **Metric**
`requests.slow.raft` diff --git a/src/current/_includes/v25.2/leader-leases-node-heartbeat-use-cases.md b/src/current/_includes/v25.2/leader-leases-node-heartbeat-use-cases.md index 481c9220a35..878f4c2a3c2 100644 --- a/src/current/_includes/v25.2/leader-leases-node-heartbeat-use-cases.md +++ b/src/current/_includes/v25.2/leader-leases-node-heartbeat-use-cases.md @@ -1,5 +1,7 @@ {% include_cached new-in.html version="v25.2" %} For the purposes of [Raft replication]({% link {{ page.version.version }}/architecture/replication-layer.md %}#raft) and determining the [leaseholder]({% link {{ page.version.version }}/architecture/overview.md %}#architecture-leaseholder) of a [range]({% link {{ page.version.version }}/architecture/overview.md %}#architecture-range), node health is no longer determined by heartbeating a single "liveness range"; instead it is determined using [Leader leases]({% link {{ page.version.version }}/architecture/replication-layer.md %}#leader-leases). + + However, node heartbeats of a single range are still used to determine: - Whether a node is still a member of a cluster (this is used by [`cockroach node decommission`]({% link {{ page.version.version }}/cockroach-node.md %}#node-decommission)). diff --git a/src/current/_includes/v25.3/essential-alerts.md b/src/current/_includes/v25.3/essential-alerts.md index dd311c760c6..dbc9dddc8fe 100644 --- a/src/current/_includes/v25.3/essential-alerts.md +++ b/src/current/_includes/v25.3/essential-alerts.md @@ -318,9 +318,9 @@ Send an alert when the number of ranges with replication below the replication f - Refer to [Replication issues]({% link {{ page.version.version }}/cluster-setup-troubleshooting.md %}#replication-issues). -### Requests stuck in raft +### Requests stuck in Raft -Send an alert when requests are taking a very long time in replication. An (evaluated) request has to pass through the replication layer, notably the quota pool and raft. 
If it fails to do so within a highly permissive duration, the gauge is incremented (and decremented again once the request is either applied or returns an error). A nonzero value indicates range or replica unavailability, and should be investigated. +Send an alert when requests are taking a very long time in replication. An (evaluated) request has to pass through the replication layer, notably the quota pool and Raft. If it fails to do so within a highly permissive duration, the gauge is incremented (and decremented again once the request is either applied or returns an error). A nonzero value indicates range or replica unavailability, and should be investigated. This can also be a symptom of a [leader-leaseholder split]({% link {{ page.version.version }}/architecture/replication-layer.md %}#leader-leaseholder-splits). **Metric**
`requests.slow.raft` diff --git a/src/current/_includes/v25.3/leader-leases-node-heartbeat-use-cases.md b/src/current/_includes/v25.3/leader-leases-node-heartbeat-use-cases.md index f6b248da5be..65434afa994 100644 --- a/src/current/_includes/v25.3/leader-leases-node-heartbeat-use-cases.md +++ b/src/current/_includes/v25.3/leader-leases-node-heartbeat-use-cases.md @@ -1,5 +1,7 @@ For the purposes of [Raft replication]({% link {{ page.version.version }}/architecture/replication-layer.md %}#raft) and determining the [leaseholder]({% link {{ page.version.version }}/architecture/overview.md %}#architecture-leaseholder) of a [range]({% link {{ page.version.version }}/architecture/overview.md %}#architecture-range), node health is no longer determined by heartbeating a single "liveness range"; instead it is determined using [Leader leases]({% link {{ page.version.version }}/architecture/replication-layer.md %}#leader-leases). + + However, node heartbeats of a single range are still used to determine: - Whether a node is still a member of a cluster (this is used by [`cockroach node decommission`]({% link {{ page.version.version }}/cockroach-node.md %}#node-decommission)). diff --git a/src/current/v25.2/architecture/replication-layer.md b/src/current/v25.2/architecture/replication-layer.md index 911f20d56c5..a92e5590c83 100644 --- a/src/current/v25.2/architecture/replication-layer.md +++ b/src/current/v25.2/architecture/replication-layer.md @@ -160,6 +160,15 @@ Unlike table data, system ranges use expiration-based leases; expiration-based l Expiration-based leases are also used temporarily during operations like lease transfers, until the new Raft leader can be fortified based on store liveness, as described in [Leader leases](#leader-leases). +#### Leader-leaseholder splits + +[Epoch-based leases](#epoch-based-leases) (unlike [Leader leases](#leader-leases)) are vulnerable to _leader-leaseholder splits_. 
These can occur when a leaseholder's Raft log has fallen behind other replicas in its group and it cannot acquire Raft leadership. Coupled with a [network partition]({% link {{ page.version.version }}/cluster-setup-troubleshooting.md %}#network-partition), this split can cause permanent unavailability of the range if (1) the stale leaseholder continues heartbeating the [liveness range](#liveness-range) to hold its lease but (2) cannot reach the leader to propose writes. + +Symptoms of leader-leaseholder splits include a [stalled Raft log]({% link {{ page.version.version }}/monitoring-and-alerting.md %}#requests-stuck-in-raft) on the leaseholder and [increased disk usage]({% link {{ page.version.version }}/cluster-setup-troubleshooting.md %}#disks-filling-up) on follower replicas buffering pending Raft entries. Remediations include: + +- Restarting the affected nodes. +- Fixing the network partition (or slow networking) between nodes. + #### Leader leases {% include_cached new-in.html version="v25.2" %} {% include {{ page.version.version }}/leader-leases-intro.md %} diff --git a/src/current/v25.2/cluster-setup-troubleshooting.md b/src/current/v25.2/cluster-setup-troubleshooting.md index 68e1a47844a..ffe0f9cb20c 100644 --- a/src/current/v25.2/cluster-setup-troubleshooting.md +++ b/src/current/v25.2/cluster-setup-troubleshooting.md @@ -372,6 +372,8 @@ Like any database system, if you run out of disk space the system will no longer - [Why is disk usage increasing despite lack of writes?]({% link {{ page.version.version }}/operational-faqs.md %}#why-is-disk-usage-increasing-despite-lack-of-writes) - [Can I reduce or disable the storage of timeseries data?]({% link {{ page.version.version }}/operational-faqs.md %}#can-i-reduce-or-disable-the-storage-of-time-series-data) +In rare cases, disk usage can increase on nodes with [Raft followers]({% link {{ page.version.version }}/architecture/replication-layer.md %}#raft) due to a [leader-leaseholder split]({% link {{ 
page.version.version }}/architecture/replication-layer.md %}#leader-leaseholder-splits). + ###### Automatic ballast files CockroachDB automatically creates an emergency ballast file at [node startup]({% link {{ page.version.version }}/cockroach-start.md %}). This feature is **on** by default. Note that the [`cockroach debug ballast`]({% link {{ page.version.version }}/cockroach-debug-ballast.md %}) command is still available but deprecated. diff --git a/src/current/v25.2/monitoring-and-alerting.md b/src/current/v25.2/monitoring-and-alerting.md index 6fee772d1a7..565f8ebf0ed 100644 --- a/src/current/v25.2/monitoring-and-alerting.md +++ b/src/current/v25.2/monitoring-and-alerting.md @@ -1206,7 +1206,7 @@ Currently, not all events listed have corresponding alert rule definitions avail #### Requests stuck in Raft -- **Rule:** Send an alert when requests are taking a very long time in replication. +- **Rule:** Send an alert when requests are taking a very long time in replication. This can be a symptom of a [leader-leaseholder split]({% link {{ page.version.version }}/architecture/replication-layer.md %}#leader-leaseholder-splits). - **How to detect:** Calculate this using the `requests_slow_raft` metric in the node's `_status/vars` output. diff --git a/src/current/v25.2/ui-slow-requests-dashboard.md b/src/current/v25.2/ui-slow-requests-dashboard.md index e262c6a0c49..c1875279b42 100644 --- a/src/current/v25.2/ui-slow-requests-dashboard.md +++ b/src/current/v25.2/ui-slow-requests-dashboard.md @@ -29,7 +29,7 @@ Hovering over the graph displays values for the following metrics: Metric | Description --------|---- -Slow Raft Proposals | The number of requests that have been stuck for longer than usual in [Raft]({% link {{ page.version.version }}/architecture/replication-layer.md %}#raft), as tracked by the `requests.slow.raft` metric. 
+Slow Raft Proposals | The number of requests that have been stuck for longer than usual in [Raft]({% link {{ page.version.version }}/architecture/replication-layer.md %}#raft), as tracked by the `requests.slow.raft` metric. This can be a symptom of a [leader-leaseholder split]({% link {{ page.version.version }}/architecture/replication-layer.md %}#leader-leaseholder-splits). ## Slow DistSender RPCs diff --git a/src/current/v25.3/architecture/replication-layer.md b/src/current/v25.3/architecture/replication-layer.md index 48289dbb35e..7c035825cdd 100644 --- a/src/current/v25.3/architecture/replication-layer.md +++ b/src/current/v25.3/architecture/replication-layer.md @@ -160,6 +160,15 @@ Unlike table data, system ranges use expiration-based leases; expiration-based l Expiration-based leases are also used temporarily during operations like lease transfers, until the new Raft leader can be fortified based on store liveness, as described in [Leader leases](#leader-leases). +#### Leader-leaseholder splits + +[Epoch-based leases](#epoch-based-leases) (unlike [Leader leases](#leader-leases)) are vulnerable to _leader-leaseholder splits_. These can occur when a leaseholder's Raft log has fallen behind other replicas in its group and it cannot acquire Raft leadership. Coupled with a [network partition]({% link {{ page.version.version }}/cluster-setup-troubleshooting.md %}#network-partition), this split can cause permanent unavailability of the range if (1) the stale leaseholder continues heartbeating the [liveness range](#liveness-range) to hold its lease but (2) cannot reach the leader to propose writes. + +Symptoms of leader-leaseholder splits include a [stalled Raft log]({% link {{ page.version.version }}/monitoring-and-alerting.md %}#requests-stuck-in-raft) on the leaseholder and [increased disk usage]({% link {{ page.version.version }}/cluster-setup-troubleshooting.md %}#disks-filling-up) on follower replicas buffering pending Raft entries. 
Remediations include: + +- Restarting the affected nodes. +- Fixing the network partition (or slow networking) between nodes. + #### Leader leases {% include {{ page.version.version }}/leader-leases-intro.md %} diff --git a/src/current/v25.3/cluster-setup-troubleshooting.md b/src/current/v25.3/cluster-setup-troubleshooting.md index 70058c4bd7a..8368afb1eb4 100644 --- a/src/current/v25.3/cluster-setup-troubleshooting.md +++ b/src/current/v25.3/cluster-setup-troubleshooting.md @@ -372,6 +372,8 @@ Like any database system, if you run out of disk space the system will no longer - [Why is disk usage increasing despite lack of writes?]({% link {{ page.version.version }}/operational-faqs.md %}#why-is-disk-usage-increasing-despite-lack-of-writes) - [Can I reduce or disable the storage of timeseries data?]({% link {{ page.version.version }}/operational-faqs.md %}#can-i-reduce-or-disable-the-storage-of-time-series-data) +In rare cases, disk usage can increase on nodes with [Raft followers]({% link {{ page.version.version }}/architecture/replication-layer.md %}#raft) due to a [leader-leaseholder split]({% link {{ page.version.version }}/architecture/replication-layer.md %}#leader-leaseholder-splits). + ###### Automatic ballast files CockroachDB automatically creates an emergency ballast file at [node startup]({% link {{ page.version.version }}/cockroach-start.md %}). This feature is **on** by default. Note that the [`cockroach debug ballast`]({% link {{ page.version.version }}/cockroach-debug-ballast.md %}) command is still available but deprecated. 
diff --git a/src/current/v25.3/monitoring-and-alerting.md b/src/current/v25.3/monitoring-and-alerting.md index 6fee772d1a7..565f8ebf0ed 100644 --- a/src/current/v25.3/monitoring-and-alerting.md +++ b/src/current/v25.3/monitoring-and-alerting.md @@ -1206,7 +1206,7 @@ Currently, not all events listed have corresponding alert rule definitions avail #### Requests stuck in Raft -- **Rule:** Send an alert when requests are taking a very long time in replication. +- **Rule:** Send an alert when requests are taking a very long time in replication. This can be a symptom of a [leader-leaseholder split]({% link {{ page.version.version }}/architecture/replication-layer.md %}#leader-leaseholder-splits). - **How to detect:** Calculate this using the `requests_slow_raft` metric in the node's `_status/vars` output. diff --git a/src/current/v25.3/ui-slow-requests-dashboard.md b/src/current/v25.3/ui-slow-requests-dashboard.md index e262c6a0c49..c1875279b42 100644 --- a/src/current/v25.3/ui-slow-requests-dashboard.md +++ b/src/current/v25.3/ui-slow-requests-dashboard.md @@ -29,7 +29,7 @@ Hovering over the graph displays values for the following metrics: Metric | Description --------|---- -Slow Raft Proposals | The number of requests that have been stuck for longer than usual in [Raft]({% link {{ page.version.version }}/architecture/replication-layer.md %}#raft), as tracked by the `requests.slow.raft` metric. +Slow Raft Proposals | The number of requests that have been stuck for longer than usual in [Raft]({% link {{ page.version.version }}/architecture/replication-layer.md %}#raft), as tracked by the `requests.slow.raft` metric. This can be a symptom of a [leader-leaseholder split]({% link {{ page.version.version }}/architecture/replication-layer.md %}#leader-leaseholder-splits). ## Slow DistSender RPCs
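
The "How to detect" step for the `requests_slow_raft` rule touched in this diff can be illustrated with a minimal sketch. It is not part of the patch and is not CockroachDB code: the function names, the `SAMPLE` text, and the `store` label are all hypothetical, and it assumes the `_status/vars` output is Prometheus-style text with one `name{labels} value` sample per line and no trailing timestamp. It reports every series whose gauge is nonzero, which the docs above describe as a sign of range or replica unavailability.

```python
def slow_raft_by_series(vars_text, metric="requests_slow_raft"):
    """Return {series: value} for each sample of `metric` in Prometheus-style text."""
    samples = {}
    for line in vars_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # Assumption: the value is the last space-separated field (no timestamp).
        series, _, value = line.rpartition(" ")
        name = series.split("{", 1)[0]
        if name == metric:
            samples[series] = float(value)
    return samples


def stuck_series(vars_text):
    """List the series whose slow-Raft gauge is currently above zero."""
    return sorted(s for s, v in slow_raft_by_series(vars_text).items() if v > 0)


# Hypothetical sample; real output has many more metrics and labels.
SAMPLE = """\
# TYPE requests_slow_raft gauge
requests_slow_raft{store="1"} 0
requests_slow_raft{store="2"} 3
requests_slow_latch{store="1"} 5
"""

if __name__ == "__main__":
    for series in stuck_series(SAMPLE):
        print("investigate:", series)
```

In practice the text would come from each node's `_status/vars` endpoint rather than a literal string. A nonzero result is only a prompt to investigate, since a leader-leaseholder split is one possible cause among several.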