Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chore: fix gateway metrics #5483

Open
wants to merge 2 commits into
base: release/1.42.x
Choose a base branch
from
Open

Conversation

mihir20
Copy link
Contributor

@mihir20 mihir20 commented Feb 6, 2025

Description

We iterate on the jobs and for each job we increase the events by 1 like this:

for _, jws := range jobsWithStats {
	jws.stat.RequestEventsSucceeded(1)

The problem is that, whenever you call RequestEventsSucceeded it is also increasing the requests:

func (ss *SourceStat) RequestEventsSucceeded(num int) {
	ss.events.succeeded += num
	ss.events.total += num
	ss.requests.total++
	ss.requests.succeeded++
}

So we should increment events and requests separately. there are some tags like sourceID which are required for events but do not makes sense for internalBatch request. So introducing new methods EventsSuccess EventsFailed to increment events and requests separately.

Linear Ticket

Security

  • The code changed/added as part of this pull request won't create any security issues with how the software is being used.

@mihir20 mihir20 marked this pull request as ready for review February 6, 2025 11:50
Copy link

codecov bot commented Feb 6, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 74.90%. Comparing base (9bf77b4) to head (1769a3e).
Report is 1 commits behind head on release/1.42.x.

Additional details and impacted files
@@                Coverage Diff                 @@
##           release/1.42.x    #5483      +/-   ##
==================================================
- Coverage           75.03%   74.90%   -0.14%     
==================================================
  Files                 458      458              
  Lines               63266    63279      +13     
==================================================
- Hits                47470    47396      -74     
- Misses              13160    13227      +67     
- Partials             2636     2656      +20     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Collaborator

@fracasula fracasula left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I approved given that you didn't introduce the issue but do keep it in mind please since you also review other people's code, it's important that we design code that is decoupled and efficient. So please do keep an eye out for these issues when you review as well.

Comment on lines -678 to +682
jws.stat.RequestEventsFailed(1, "storeFailed")
jws.stat.EventsFailed(1, "storeFailed")
jws.stat.Report(gw.stats)
}
stat.RequestFailed("storeFailed")
stat.Report(gw.stats)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This whole approach of jobs with stats embedded smells and it's starting to become a problem.

Calling Report each time in a loop doesn't seem like a good idea. I know it was already there but I had missed the first time. As already mentioned here we have to be especially careful about what we're doing in loops.

Report is calling NewTaggedStats which we know uses mutexes under the hood. So imagine doing this for a batch of thousands. We're paying a substantial performance penalty because of bad designed code.

I think this should work the other way around. Instead of having jobs with stats we should have a stats-aware component where you just feed it jobs and their status and then report once at the end.

stats.AddFailedEvent(job, "storeFailed") // gets workspaceID, sourceID from job
stats.AddFailedEvent(job, "storeFailed") // gets workspaceID, sourceID from job
stats.AddFailedEvent(job, "storeFailed") // gets workspaceID, sourceID from job
stats.Report() // stats could have kept maps of tags according to labels like workspaceID etc... and then it reports once per Tags

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants