server: fault tolerance metrics #148759

Draft: wants to merge 1 commit into master


MattWhelan
Contributor

For each element of the locality tree, generate a gauge metric indicating the number of additional nodes that can fail if that locality were to fail completely. The raw per-node values of these metrics are not meaningful on their own. They must be aggregated across all nodes within a failure domain to indicate the actual fault tolerance margin.

Negative values indicate that a failure of this domain will cause at least one range to become unavailable. A value of 0 indicates that this domain can fail without causing unavailability. Positive values indicate, in the worst case, how many additional replicas can become unavailable before a range becomes unavailable.
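
To make the sign convention concrete, here is a minimal sketch of one way such a margin could be derived. It is not the code in this PR: the `rangeReplicas` type, the `marginIfLocalityFails` helper, the flat (single-tier) locality strings, and the majority-quorum formula are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"math"
)

// rangeReplicas is a hypothetical, simplified view of one range.
// It is not a CockroachDB type.
type rangeReplicas struct {
	replicationFactor     int      // configured number of replicas
	liveReplicaLocalities []string // locality of each currently live replica
}

// marginIfLocalityFails returns the worst-case margin across all ranges,
// assuming every replica in `locality` becomes unavailable. For each range
// the margin is (surviving live replicas) - (majority quorum). It returns
// math.MaxInt when there are no ranges.
func marginIfLocalityFails(ranges []rangeReplicas, locality string) int {
	margin := math.MaxInt
	for _, r := range ranges {
		surviving := 0
		for _, loc := range r.liveReplicaLocalities {
			if loc != locality {
				surviving++
			}
		}
		quorum := r.replicationFactor/2 + 1 // assumed majority quorum
		if m := surviving - quorum; m < margin {
			margin = m
		}
	}
	return margin
}

func main() {
	spread := []rangeReplicas{{3, []string{"a", "b", "c"}}}
	stacked := []rangeReplicas{{3, []string{"a", "a", "b"}}}
	wide := []rangeReplicas{{5, []string{"a", "b", "c", "d", "e"}}}

	fmt.Println(marginIfLocalityFails(spread, "a"))  // 0: "a" can fail, nothing more
	fmt.Println(marginIfLocalityFails(stacked, "a")) // -1: losing "a" loses quorum
	fmt.Println(marginIfLocalityFails(wide, "a"))    // 1: one more replica can fail
}
```

Under these assumptions, a negative result means quorum is already lost when the locality fails, 0 means exactly a quorum survives, and a positive result counts the replicas that can still be lost.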

Epic: none
Fixes: https://cockroachlabs.atlassian.net/browse/TREQ-1099

Release note (ops change): the new `fault_tolerance.nodes` metric provides a view into the fault tolerance state of the cluster. The metric is produced for each locality. By taking the `min` of the value within a locality, you can determine the number of additional nodes that can fail, if that locality fails, before any unavailability occurs. This is the "fault tolerance margin" for that locality. The metric responds to node liveness changes and to changes in range allocation.
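
The `min` aggregation described in the release note can be sketched as follows. This assumes each node reports one value per locality; the `sample` type, its fields, and `minByLocality` are hypothetical names for illustration and do not reflect how the metric is actually labeled or scraped.

```go
package main

import "fmt"

// sample is a hypothetical scraped gauge value: one node's report of
// fault_tolerance.nodes for one locality.
type sample struct {
	nodeID   int
	locality string
	value    int
}

// minByLocality groups samples by locality and keeps the smallest value
// seen for each one. The result is the fault tolerance margin per locality.
func minByLocality(samples []sample) map[string]int {
	margins := make(map[string]int)
	for _, s := range samples {
		if cur, ok := margins[s.locality]; !ok || s.value < cur {
			margins[s.locality] = s.value
		}
	}
	return margins
}

func main() {
	samples := []sample{
		{nodeID: 1, locality: "region=us-east", value: 2},
		{nodeID: 2, locality: "region=us-east", value: 0},
		{nodeID: 3, locality: "region=us-west", value: 1},
	}
	for loc, margin := range minByLocality(samples) {
		fmt.Printf("%s: margin %d\n", loc, margin)
	}
}
```

Taking the minimum rather than the mean or sum matters here: a single node reporting a low value means at least one range is that close to unavailability, no matter what the other nodes report.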

@cockroach-teamcity
Member

This change is Reviewable

@MattWhelan MattWhelan force-pushed the faultToleranceMonitoring branch 3 times, most recently from bf3c32e to d2a1727 Compare June 24, 2025 23:17
@MattWhelan MattWhelan force-pushed the faultToleranceMonitoring branch from d2a1727 to c60341e Compare June 25, 2025 22:10