For each element of the locality tree, generate a gauge metric indicating the number of additional nodes that can fail if that locality were to fail completely. The raw per-node values of these metrics are not meaningful on their own. They must be aggregated across all nodes within a failure domain, by taking the minimum, to indicate the actual fault tolerance margin.
Negative values indicate that a failure of this domain will cause at least one range to become unavailable. A value of 0 indicates that this domain can fail without causing unavailability. Positive values indicate the worst-case number of additional replicas that can become unavailable, after this domain fails, before any range becomes unavailable.
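To make the sign convention concrete, here is a minimal sketch, assuming a simple majority-quorum model in which each range needs more than half of its replicas live. The `replica` type, `marginIfLocalityFails`, and `worstCaseMargin` are hypothetical illustrations, not the CockroachDB implementation: the margin for one range is the number of live replicas outside the failed locality minus the quorum size, and the worst case is the minimum over all ranges.

```go
package main

import "fmt"

// replica is a hypothetical, simplified view of one replica of a range:
// the locality its node belongs to and whether that node is currently live.
type replica struct {
	locality string
	live     bool
}

// marginIfLocalityFails returns how many additional replicas of the range
// could be lost before it loses quorum, assuming every replica in the failed
// locality is gone. A negative result means the range would be unavailable.
func marginIfLocalityFails(replicas []replica, failed string) int {
	quorum := len(replicas)/2 + 1 // majority of the replica set
	survivors := 0
	for _, r := range replicas {
		if r.live && r.locality != failed {
			survivors++
		}
	}
	return survivors - quorum
}

// worstCaseMargin returns the minimum margin over all ranges; ok is false
// when there are no ranges to evaluate.
func worstCaseMargin(ranges [][]replica, failed string) (margin int, ok bool) {
	for i, rs := range ranges {
		m := marginIfLocalityFails(rs, failed)
		if i == 0 || m < margin {
			margin = m
		}
	}
	return margin, len(ranges) > 0
}

func main() {
	ranges := [][]replica{
		// Three replicas, one per region: losing us-east leaves exactly a quorum.
		{{"us-east", true}, {"us-west", true}, {"eu-west", true}},
		// Five replicas: losing us-east leaves one replica to spare.
		{{"us-east", true}, {"us-west", true}, {"eu-west", true}, {"ap-south", true}, {"us-central", true}},
	}
	if m, ok := worstCaseMargin(ranges, "us-east"); ok {
		fmt.Println("margin if us-east fails:", m) // min(0, 1) = 0
	}
}
```

A range whose replicas are concentrated in the failing locality, or whose other replicas sit on non-live nodes, drives the worst case negative, which matches the "at least one range becomes unavailable" case.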
Epic: none
Fixes: https://cockroachlabs.atlassian.net/browse/TREQ-1099
Release note (ops change): The new fault_tolerance.nodes metric provides a view into the fault tolerance state of the cluster. The metric is produced for each locality. By taking the minimum of the values reported for a locality across all nodes, you can determine the number of additional nodes that can fail, if that locality fails, before any unavailability occurs. This is the "fault tolerance margin" for that locality. The metric responds to node liveness changes and to changes in range allocation.
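The aggregation step the release note describes can be sketched as follows. This assumes the gauge has already been scraped from every node into a flat list of samples; the `sample` type, the `region=...` locality strings, and the `marginByLocality` and `describe` helpers are hypothetical and only illustrate taking the minimum per locality and reading its sign.

```go
package main

import (
	"fmt"
	"sort"
)

// sample is a hypothetical scraped value of the per-locality gauge from one node.
type sample struct {
	nodeID   int
	locality string
	value    int64
}

// marginByLocality aggregates raw per-node values by keeping the minimum
// value seen for each locality.
func marginByLocality(samples []sample) map[string]int64 {
	out := make(map[string]int64)
	for _, s := range samples {
		if cur, ok := out[s.locality]; !ok || s.value < cur {
			out[s.locality] = s.value
		}
	}
	return out
}

// describe interprets an aggregated margin: negative, zero, or positive.
func describe(margin int64) string {
	switch {
	case margin < 0:
		return "failure would make at least one range unavailable"
	case margin == 0:
		return "can fail without unavailability, with nothing to spare"
	default:
		return fmt.Sprintf("can fail, then %d more node(s) could fail before unavailability", margin)
	}
}

func main() {
	samples := []sample{
		{1, "region=us-east1", 2}, {2, "region=us-east1", 1}, {3, "region=us-east1", 3},
		{1, "region=us-west1", 0}, {2, "region=us-west1", 1}, {3, "region=us-west1", 0},
	}
	margins := marginByLocality(samples)

	// Print localities in a stable order.
	keys := make([]string, 0, len(margins))
	for k := range margins {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		fmt.Printf("%s: %d (%s)\n", k, margins[k], describe(margins[k]))
	}
}
```

Taking the minimum rather than, say, the average reflects that the margin is a worst-case bound: a single node reporting a lower value means some range is closer to losing quorum.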