server: improve clock offset monitoring #114321

Open
@pav-kv

Description

The Clock Offset graph in DB console displays, for each node, the mean clock offset between that node and the other nodes. The offsets are signed, so it's possible to tell whether a node's clock is mostly behind or mostly ahead of the other nodes. Example:

[Screenshot of the Clock Offset graph, taken 2023-10-02 at 17:05:38]

The mean offset is not necessarily the best metric for analysis, for two reasons:

  • positive and negative offsets cancel each other out
  • one skewed node messes up all nodes' offset graphs, which makes it harder to identify the outlier
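
Both problems can be reproduced with a few made-up numbers. The sketch below is illustrative only: the node names, the offset values, and the sign convention (peer clock minus local clock, in milliseconds) are assumptions, not taken from the actual implementation.

```python
from statistics import mean, median

# 1. Cancellation: hypothetical offsets (ms) one node sees against four peers.
#    Every peer is roughly 190 ms away, yet the signed mean reports zero.
mixed_offsets_ms = [-200.0, -180.0, 190.0, 190.0]
signed_mean = mean(mixed_offsets_ms)
mean_magnitude = mean(abs(o) for o in mixed_offsets_ms)

# 2. Outlier contamination: hypothetical clock errors (ms); only "d" is skewed.
clock_error_ms = {"a": 0.0, "b": 0.0, "c": 0.0, "d": 300.0}


def offsets_seen_by(node):
    """Signed offsets of every other node's clock relative to `node`."""
    return [err - clock_error_ms[node]
            for peer, err in clock_error_ms.items() if peer != node]


mean_by_node = {n: mean(offsets_seen_by(n)) for n in clock_error_ms}
median_by_node = {n: median(offsets_seen_by(n)) for n in clock_error_ms}

print("signed mean:", signed_mean, "mean magnitude:", mean_magnitude)
print("mean per node:  ", mean_by_node)
print("median per node:", median_by_node)
```

With these numbers, the healthy nodes a, b, and c each show a non-zero mean because of node d alone, while their medians stay at zero and only d stands out.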

We should have more comprehensive metrics.

  1. For example, in addition to the mean offset, we could report a histogram, or at least a set of: min offset, max offset, and median (50th percentile) offset.

  2. Also, the number of nodes participating in the computation can change dynamically. We could plot this figure as well.

  3. A node can terminate itself if its clock offset exceeds the allowed threshold relative to at least half (50%+) of the other nodes. We should add metrics that signal this event ahead of time, so that alerting can catch the situation before the node kills itself. This is why something like a 50th-percentile offset graph is a better indicator than the mean. Another indicator could be the number or percentage of nodes whose offset is above the threshold.
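
As a rough sketch of what the three proposals could report per node, the snippet below computes the summary from a list of signed offsets. The names `OffsetSummary` and `summarize_offsets`, the millisecond unit, and the threshold value are all hypothetical; this is not the existing metrics code.

```python
from dataclasses import dataclass
from statistics import mean, median


@dataclass
class OffsetSummary:
    peers: int                      # item 2: nodes participating in the computation
    mean_ms: float                  # the metric reported today
    min_ms: float                   # item 1
    median_ms: float                # item 1 (50th percentile)
    max_ms: float                   # item 1
    peers_over_threshold: int       # item 3
    fraction_over_threshold: float  # item 3


def summarize_offsets(offsets_ms, threshold_ms):
    """Summarize one node's signed offsets (ms) to its peers.

    `threshold_ms` is an assumed alerting threshold, compared against the
    magnitude of each offset.
    """
    if not offsets_ms:
        raise ValueError("need at least one peer offset")
    over = sum(1 for o in offsets_ms if abs(o) > threshold_ms)
    return OffsetSummary(
        peers=len(offsets_ms),
        mean_ms=mean(offsets_ms),
        min_ms=min(offsets_ms),
        median_ms=median(offsets_ms),
        max_ms=max(offsets_ms),
        peers_over_threshold=over,
        fraction_over_threshold=over / len(offsets_ms),
    )


# Made-up example: three of five peers are beyond a 400 ms threshold.
summary = summarize_offsets([-5.0, 3.0, 410.0, 420.0, 450.0], threshold_ms=400.0)
print(summary)
```

In this made-up example the mean sits well below the threshold, while the median and the fraction of peers over the threshold both show that a majority of peers are already beyond it, which is the condition an alert would want to catch early.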

Jira issue: CRDB-33459

Metadata


Labels

  • A-cluster-observability: Related to cluster observability
  • C-enhancement: Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
  • O-support: Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs
  • P-3: Issues/test failures with no fix SLA
  • T-kv: KV Team
