Description
The Clock Offset graph in DB console displays the mean offset from one node to the other nodes. The offsets are signed, so it is possible to tell whether a node's clock is mostly behind or mostly ahead of the other nodes.
The mean offset is not necessarily the best metric for analysis, for two reasons:
- positive and negative offsets cancel each other out
- one skewed node distorts every other node's offset graph, which makes it harder to identify the outlier
We should have more comprehensive metrics.
- For example, in addition to the mean offset, we could report a histogram, or at least a set of summary values: min offset, max offset, and median (50th percentile) offset.
- Also, the number of nodes participating in the computation can change dynamically. We could plot this count as well.
- A node can terminate itself if its clock offset exceeds the threshold relative to at least half of the other nodes. We should expose metrics that indicate this event is approaching, so that alerting can catch the situation before the node kills itself. This is why something like a 50th-percentile offset graph is a better indicator. Another indicator could be the number or percentage of nodes whose offset is above the threshold.
Jira issue: CRDB-33459