Labels: C-enhancement (solution expected to add code/behavior and preserve backward-compat; pg compat issues are the exception), O-support (would prevent or help troubleshoot a customer escalation: bugs, missing observability/tooling, docs), P-3 (issues/test failures with no fix SLA), T-kv (KV Team)
Description
When a store experiences a disk stall or slowness, store liveness can hang right before sending a new heartbeat because of the disk issues. This is by design: we don't want a store with an unhealthy disk to heartbeat. The implication is that a hung store liveness goroutine (in the SupportManager) delays the sending and receiving of new messages. This is one of the most common causes of lease churn with leader leases.
We should add a latency metric to measure the time spent in writeProto (potentially broken down by caller); a sketch follows the snippet below.

cockroach/pkg/kv/kvserver/storeliveness/persist.go, lines 91 to 95 in c83c57d:
```go
func writeProto(
	ctx context.Context, rw storage.ReadWriter, key roachpb.Key, msg protoutil.Message,
) error {
	return storage.MVCCPutProto(ctx, rw, key, hlc.Timestamp{}, msg, storage.MVCCWriteOptions{})
}
```
Jira issue: CRDB-55415