server: fault tolerance metrics #148759

MattWhelan · 2025-06-24T19:50:27Z

For each element of the locality tree, generate a gauge metric indicating the number of additional nodes that can fail, if that locality were to fail completely. The raw values for these metrics are not meaningful. They must be aggregated across all nodes within a failure domain to indicate the actual fault tolerance margin.

Negative values indicate that a failure in this domain will cause at least one range to become unavailable. 0 indicates that this domain can fail without causing unavailability. Positive values indicate the worst-case number of additional replicas that need to become unavailable to cause a range to become unavailable.
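To make the sign convention concrete, here is a minimal sketch of one way a per-range margin could be computed. It is not the code in this PR: the `replica` type, the `marginIfLocalityFails` function, and the formula (surviving live replicas minus quorum size) are assumptions for illustration, and localities are compared as flat strings rather than as elements of a locality tree.

```go
package main

import "fmt"

// replica is a hypothetical, simplified view of one replica of a range.
type replica struct {
	nodeID   int
	locality string // flat label such as "region=a"; the real locality is a tree
	live     bool
}

// marginIfLocalityFails returns (surviving live replicas - quorum) for one
// range, assuming every replica in the failed locality is lost.
// This formula is an assumption made for the sketch.
func marginIfLocalityFails(replicas []replica, failed string) int {
	quorum := len(replicas)/2 + 1
	surviving := 0
	for _, r := range replicas {
		if r.live && r.locality != failed {
			surviving++
		}
	}
	return surviving - quorum
}

func main() {
	// Three replicas spread over three regions: losing one region leaves
	// exactly a quorum, so no further node can fail.
	spread := []replica{
		{1, "region=a", true},
		{2, "region=b", true},
		{3, "region=c", true},
	}
	// Two of three replicas in one region: losing that region loses quorum.
	skewed := []replica{
		{1, "region=a", true},
		{2, "region=a", true},
		{3, "region=b", true},
	}
	fmt.Println(marginIfLocalityFails(spread, "region=a"))
	fmt.Println(marginIfLocalityFails(skewed, "region=a"))
}
```

Under this assumed formula, a negative result means the range loses quorum when the locality fails, and 0 means the range survives with no room for further failures.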

Epic: none
Fixes: https://cockroachlabs.atlassian.net/browse/TREQ-1099

Release note (ops change): the new `fault_tolerance.nodes` metric provides a view into the fault tolerance state of the cluster. The metric is produced for each locality. By taking the `min` of the value within a locality, you can determine the number of additional nodes that can fail if that locality fails, before any unavailability. This is the "fault tolerance margin" for that locality. This metric is responsive to node liveness changes and changes in range allocation.
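The `min` aggregation described in the release note can be sketched as follows. This is an illustration only: the `sample` type, the `marginByLocality` function, and the sample values are hypothetical, and in practice the aggregation would more likely be done in a monitoring system's query language than in Go.

```go
package main

import "fmt"

// sample is a hypothetical scraped value of fault_tolerance.nodes:
// one value reported by one node for one locality.
type sample struct {
	node     string
	locality string
	value    int64
}

// marginByLocality takes the minimum reported value per locality, which the
// release note calls the "fault tolerance margin" for that locality.
func marginByLocality(samples []sample) map[string]int64 {
	out := make(map[string]int64)
	for _, s := range samples {
		if cur, ok := out[s.locality]; !ok || s.value < cur {
			out[s.locality] = s.value
		}
	}
	return out
}

func main() {
	// Made-up values for illustration; not taken from a real cluster.
	samples := []sample{
		{"n1", "region=a", 2},
		{"n2", "region=a", 0},
		{"n3", "region=b", -1},
	}
	for loc, m := range marginByLocality(samples) {
		switch {
		case m < 0:
			fmt.Printf("%s: margin %d, losing this locality makes a range unavailable\n", loc, m)
		case m == 0:
			fmt.Printf("%s: margin 0, this locality can fail but nothing more\n", loc)
		default:
			fmt.Printf("%s: margin %d additional node(s)\n", loc, m)
		}
	}
}
```

Taking the minimum rather than an average matters here, because a single node reporting a low value indicates that at least one range is at risk.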

cockroach-teamcity · 2025-06-24T19:50:36Z

This change is


MattWhelan force-pushed the faultToleranceMonitoring branch 3 times, most recently from bf3c32e to d2a1727 Compare June 24, 2025 23:17

MattWhelan force-pushed the faultToleranceMonitoring branch from d2a1727 to c60341e Compare June 25, 2025 22:10
