storeliveness: persist latency metric #155381

@miraradeva

When a store's disk stalls or is slow, store liveness can hang on the disk write it performs right before sending a new heartbeat. This is by design: we don't want a store with an unhealthy disk to heartbeat. The consequences are a hung store liveness goroutine (in the SupportManager) and delays in sending and receiving new messages. This is one of the most common causes of lease churn with leader leases.

We should add a latency metric to measure the time spent in writeProto (potentially broken down by caller):

func writeProto(
	ctx context.Context, rw storage.ReadWriter, key roachpb.Key, msg protoutil.Message,
) error {
	return storage.MVCCPutProto(ctx, rw, key, hlc.Timestamp{}, msg, storage.MVCCWriteOptions{})
}
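
One possible shape for this, as a rough standalone sketch rather than the actual change: wrap the write in a timer and record the elapsed time under a per-caller label. Everything below is hypothetical and exists only for illustration: `latencyRecorder`, `timedWrite`, `errDiskStall`, and the `"heartbeat"` caller label are made-up names, and the in-memory recorder stands in for whatever histogram metric the real code would register. The `write` closure stands in for a call to `writeProto`.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// latencyRecorder is a hypothetical stand-in for a real histogram metric.
// It keeps the raw latency samples observed for each caller label.
type latencyRecorder struct {
	mu      sync.Mutex
	samples map[string][]time.Duration
}

func newLatencyRecorder() *latencyRecorder {
	return &latencyRecorder{samples: make(map[string][]time.Duration)}
}

// record appends one latency sample under the given caller label.
func (r *latencyRecorder) record(caller string, d time.Duration) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.samples[caller] = append(r.samples[caller], d)
}

// count returns the number of samples recorded for caller.
func (r *latencyRecorder) count(caller string) int {
	r.mu.Lock()
	defer r.mu.Unlock()
	return len(r.samples[caller])
}

// maxLatency returns the largest sample recorded for caller, or 0 if none.
func (r *latencyRecorder) maxLatency(caller string) time.Duration {
	r.mu.Lock()
	defer r.mu.Unlock()
	var m time.Duration
	for _, d := range r.samples[caller] {
		if d > m {
			m = d
		}
	}
	return m
}

// timedWrite runs write and records how long it took under caller.
// The sample is recorded whether or not write fails, so slow failing
// writes show up in the metric too. The error is returned unchanged.
func timedWrite(r *latencyRecorder, caller string, write func() error) error {
	start := time.Now()
	err := write()
	r.record(caller, time.Since(start))
	return err
}

// errDiskStall is a hypothetical error used to simulate a failing write.
var errDiskStall = errors.New("simulated disk stall")

func main() {
	rec := newLatencyRecorder()

	// A fast write that succeeds.
	_ = timedWrite(rec, "heartbeat", func() error { return nil })

	// A slow write that fails, simulating a stalled disk.
	err := timedWrite(rec, "heartbeat", func() error {
		time.Sleep(20 * time.Millisecond)
		return errDiskStall
	})

	fmt.Println("samples:", rec.count("heartbeat"))
	fmt.Println("max latency:", rec.maxLatency("heartbeat"))
	fmt.Println("last error:", err)
}
```

Passing the caller as a label keeps a single timing path while still allowing the breakdown by caller suggested above. Note that a timer like this only reports after the write returns, so a write that never completes would not produce a sample.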

Jira issue: CRDB-55415

Metadata

Labels

C-enhancement: Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
O-support: Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs
P-3: Issues/test failures with no fix SLA
T-kv: KV Team
