Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Metastore client fix + admin test page #3852

Draft
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

aleks-p
Copy link
Contributor

@aleks-p aleks-p commented Jan 17, 2025

Background
We have observed metastore client latency spikes during rollouts / leadership changes.

The client relies on error responses containing the new leader, however there is a mismatch in what the client expects and what metastore nodes return.

Problem
The metastore client discovers servers using the grpc port (9095) and maintains a map with addresses such as
pyroscope-metastore-0.pyroscope-metastore-headless.profiles-dev-003:9095

Metastore nodes return the leader using the Raft port (9099) and the svc.cluster.local domain:
pyroscope-metastore-0.pyroscope-metastore-headless.profiles-dev-003.svc.cluster.local.:9099

This causes a failed check here:

it := c.servers[c.leader]
if (it == nil || override) && len(c.servers) > 0 {
idx := rand.Intn(len(c.servers))

As a result we resort to random selections until we hit the right node.

Fix
The main options are to “hack” the client and massage the data or change the Raft server identity. This PR does the former, it is however messy and ideally we should align the two components. I will give the second option a try as well.

Either way we go the latency becomes stable, however we can't go lower than 51ms because of the fixed backoff here:

backoff = 51 * time.Millisecond

For this, we can consider switching to a more aggressive and maybe exponential backoff.

Bonus
I've added a metastore client test page to make testing easier.

Screenshot 2025-01-17 at 17 53 19

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant