POC: Compare profiles within a series, to find "novelty" #4087
This is a proof-of-concept attempt at grading the similarity of profiles in order to detect anomalies in a particular workload.
The implementation uses a fairly naive approach: take the top N contributing stack traces or function names (based on their self contribution; configurable using `--dimension`) and record their proportional sizing. Then build a novelty score for each profile seen and try to match it against those proportions; when they match above a particular threshold, they get merged. (I used 0.1, i.e. a 10% match.)
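For illustration, a minimal sketch in Go of what that matching could look like; the `fingerprint` type, the histogram-intersection match, and the running-average merge are assumptions for the sake of the example, not the actual PoC code:

```go
package novelty

import "math"

// fingerprint maps a stack trace (or function name, depending on --dimension)
// to its proportional self contribution within one profile, e.g. 0.12 == 12%.
type fingerprint map[string]float64

// intersection returns the shared proportion mass of two fingerprints
// (histogram intersection): 1 means identical proportions, 0 means disjoint.
func intersection(a, b fingerprint) float64 {
	var match float64
	for k, va := range a {
		if vb, ok := b[k]; ok {
			match += math.Min(va, vb)
		}
	}
	return match
}

// observe compares a new profile's fingerprint against the fingerprints seen
// so far for this series. If it matches one above the threshold (0.1 in the
// PoC, i.e. a 10% match) the two are merged; otherwise it is kept as a new,
// "novel" fingerprint. The second return value is the best match found.
func observe(series []fingerprint, p fingerprint, threshold float64) ([]fingerprint, float64) {
	best, bestIdx := 0.0, -1
	for i, fp := range series {
		if m := intersection(fp, p); m > best {
			best, bestIdx = m, i
		}
	}
	if bestIdx >= 0 && best > threshold {
		// merge: naive running average of the proportions
		for k, v := range p {
			series[bestIdx][k] = (series[bestIdx][k] + v) / 2
		}
		return series, best
	}
	return append(series, p), best
}
```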
As sample data I queried various profiles from Pyroscope, simulating how a single target would send them to us (so essentially going over 15s results).
For services with fairly large codebases that are very dependent on query load (pyroscope-querier, hosted-grafana), I struggled to find a novelty score over 0.05.
For more stable workloads, like the v1 ingester, I managed to get novelty scores of up to 0.25.

I do think this approach already feels like it might get very expensive, and it also has a fairly localised (per-series) stateful component to it. I think this will be costly and its results might not be of the quality that we want to base sampling decisions on.
I think we need to take a different approach if we want to continue with this:
We need to investigate an approach where we don't have to hold individual stack traces/functions in memory in order to compare them.
Potentially this could be a great match for using an appropriate model to create embeddings, which would simplify the data stored per profile to a vector.
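As a rough sketch of that idea (assuming some model, not chosen here, maps each profile to a fixed-size embedding vector), the per-profile state shrinks to that vector and novelty could be judged by distance to previously seen embeddings; the function names below are hypothetical:

```go
package novelty

import "math"

// cosineSimilarity returns the cosine of the angle between two embedding
// vectors; 1 means identical direction, 0 means orthogonal (or empty input).
func cosineSimilarity(a, b []float64) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// noveltyScore is the distance of the new profile's embedding to the closest
// embedding already seen for this series; higher means more novel.
func noveltyScore(seen [][]float64, embedding []float64) float64 {
	best := 0.0
	for _, e := range seen {
		if s := cosineSimilarity(e, embedding); s > best {
			best = s
		}
	}
	return 1 - best
}
```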
This was significantly beyond the time box that I set for this PoC, and I think we might need to look at it as part of a later effort (hackathon, implementation phase) with more time.
A good summary of models that can help compare stack trace similarity (not exactly what we want, but going in the right direction) is: https://arxiv.org/pdf/2412.14802