
Influence Sampled logic #968

Open
aurelijusbanelis opened this issue Oct 7, 2024 · 4 comments
Labels
enhancement PM Project Management Attention

Comments

@aurelijusbanelis

The current Distributed Tracing sampling is based on "magic" (the assumption that the most common request is the most important).

While the business actually runs on questions like:

  • we have an anomaly: how do we debug this particular page?
  • we do not care about the 50% of information-page traffic that is bots; we want to optimize the checkout flow that is paying the bills

[image attached]

Therefore, we want to "teach" the SDK what is important to sample.

Summary

The current sampling logic is not configurable, so developers end up resorting to hacks.
It feels wrong to pay for an observability platform and still need code like:

txn := newrelic.FromContext(ctx)
log := logrus.WithContext(newrelic.NewContext(ctx, txn))

// Hand-rolled escape hatch: emit debug logs only for the pages we are
// currently investigating (debugThisPage is set by our own logic).
if debugThisPage {
	log.WithField("url", u.String()).Debug("Outgoing call")
}

Desired Behaviour

txn := newrelic.FromContext(ctx)
txn.MarkKeyTransaction(true)

or

txn := newrelic.FromContext(ctx)
txn.PreferSampled(true)
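
To make the intent concrete, here is a minimal sketch of how the second proposal could look in an HTTP handler. PreferSampled is the API proposed in this issue (it does not exist in the SDK today), and isCheckoutFlow is a hypothetical helper:

import (
	"net/http"

	"github.com/newrelic/go-agent/v3/newrelic"
)

// isCheckoutFlow is a hypothetical helper: any business rule that
// identifies the traffic we actually care about.
func isCheckoutFlow(r *http.Request) bool {
	return r.URL.Path == "/checkout"
}

func checkoutHandler(w http.ResponseWriter, r *http.Request) {
	txn := newrelic.FromContext(r.Context())
	if isCheckoutFlow(r) {
		// Proposed API from this issue, not part of the SDK today:
		// ask the sampler to prefer keeping this transaction.
		txn.PreferSampled(true)
	}
	// ... handle the request as usual ...
}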

Additional context

@iamemilio iamemilio added the PM Project Management Attention label Oct 7, 2024
@iamemilio
Contributor

Hi, this is an interesting proposal. The way the agent weighs a transaction and the data within it is not tunable at the moment. We do think it would make sense to give you the ability to mark something as important in the SDK when you can detect that importance outside of New Relic. In general, the algorithm prioritizes two things: outliers (runtime, memory, etc.) and errors. It is not able to "learn", and I don't think you want something running inside your application that could.

It sounds like there are two problems here: important transactions are not getting enough weight during sampling, and "junk" transactions are getting too much.

  1. How to elevate the data we want: I can see an API call that lets you elevate a transaction as a possible solution here. We could bump that transaction's weight up by a certain number of points so that it doesn't flood your samples but is far more likely to be kept.

  2. Issues with "junk" data crowding out other transactions: We have some questions here:

  • It looks like one trace type is getting selected at a much higher rate than the others. Is there an obvious reason why? We'd like to understand why the top trace group seems to be crowding the others out.
  • Do you never want to collect certain transactions at all? If you know with certainty that a transaction is junk, it would make sense not to waste valuable space in a harvest sampling it (a sketch follows below).
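
For that second case, here is a minimal sketch of what dropping known junk could look like today, using the agent's existing Transaction.Ignore() API (the junk-path list is a made-up example):

import (
	"net/http"

	"github.com/newrelic/go-agent/v3/newrelic"
)

// Made-up example of paths whose transactions we never want to keep.
var junkPaths = map[string]bool{
	"/health":  true,
	"/metrics": true,
}

// Middleware that discards transactions for known-junk paths.
// Ignored transactions are not reported at all.
func dropJunk(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if junkPaths[r.URL.Path] {
			if txn := newrelic.FromContext(r.Context()); txn != nil {
				txn.Ignore()
			}
		}
		next.ServeHTTP(w, r)
	})
}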

@aurelijusbanelis
Author

How to elevate the data we want

Option 1

I assume New Relic uses head-based sampling, so I would need to add my DebugDownstreamService as a proxy. Since DebugDownstreamService receives requests rarely, the majority of its requests would get sampled: true.

Option 2

Write our own sampler

// Hand-rolled sampler: instrument ~1% of requests, plus any request
// explicitly flagged for debugging.
samplingProbability := 0.01
if rand.Float64() < samplingProbability || requestToDebug {
	segment := txn.StartSegment(name)
	// Note: defer runs at function return, not at the end of this if block.
	defer segment.End()
}

We have similar logic, and it works most of the time (probably because of the anomaly detection in the SDK), but it has the same sampling limitations.

Option 3

To stop feeling like we are doing hacks: ask the New Relic community 😄

@aurelijusbanelis
Author

aurelijusbanelis commented Oct 18, 2024

It looks like one trace type is getting selected for at a much higher rate than others... Is there an obvious reason for why that is?

  • Some are random penetration-testing calls (e.g. vjyhtnm0v/zzyvmce/iqa0of7/t2pm/dwfyzr9d/gg0fqg/j1/v2hxrczbsb)
  • Some are analytics events from mobile apps
  • Some are internal cron jobs (like an audit logger)
  • Some are bot traffic (~10% that we cannot identify as bots)

Those seem to be repetitive and to run during non-peak hours (e.g. at night).

I agree we could fix most of them with:

  • txn.Ignore()
  • ignoring the New Relic Browse UI bug (the New Relic UI does not filter by sampled: true) 🙈
  • ignoring the network overhead and using nrqlDropRules (a sketch follows below)
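
For completeness, a sketch of the nrqlDropRules route via the NerdGraph nrqlDropRulesCreate mutation. The account ID, API key, and the NRQL filter below are placeholders to adapt:

import (
	"bytes"
	"encoding/json"
	"net/http"
)

// Placeholder drop rule: adapt the account ID and the WHERE clause
// to the junk traffic that should never be ingested.
const dropRuleMutation = `mutation {
  nrqlDropRulesCreate(accountId: 1234567, rules: [{
    action: DROP_DATA,
    description: "drop known junk traffic",
    nrql: "SELECT * FROM Transaction WHERE request.uri LIKE '/internal/%'"
  }]) {
    successes { id }
    failures { error { reason description } }
  }
}`

func createDropRule(apiKey string) error {
	body, err := json.Marshal(map[string]string{"query": dropRuleMutation})
	if err != nil {
		return err
	}
	req, err := http.NewRequest("POST", "https://api.newrelic.com/graphql", bytes.NewReader(body))
	if err != nil {
		return err
	}
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("API-Key", apiKey)
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	// ... inspect the response body for failures ...
	return nil
}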

@aurelijusbanelis
Author

aurelijusbanelis commented Oct 18, 2024

If you know with certainty that a transaction is junk

  • The zero-results page works perfectly fine from an Engineering perspective (RAM, CPU, errors), but it is a pain for Business people to debug.
  • We have different tiers of customers (returning, frequent buyers, registered ones, new ones, friendly bots, unfriendly bots, etc.). From an Engineering perspective (RAM, CPU, errors) the responses are identical, but from the Business side the questions are: is it worth optimizing for a particular user segment, and, during a degradation, what should we fix first? It is all about return on investment (a stopgap sketch follows below).
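
As a stopgap for the tier problem, we could tag transactions with the business dimension using the agent's existing txn.AddAttribute API, so it is at least queryable. customerTier below is a hypothetical helper standing in for our own segmentation rules:

import (
	"net/http"

	"github.com/newrelic/go-agent/v3/newrelic"
)

// customerTier is a hypothetical stand-in for our own segmentation
// rules (returning, frequent buyer, friendly bot, ...).
func customerTier(r *http.Request) string {
	if tier := r.Header.Get("X-Customer-Tier"); tier != "" {
		return tier
	}
	return "unknown"
}

// Tag the transaction so it can be faceted in NRQL,
// e.g. ... FACET customerTier.
func tagCustomerTier(txn *newrelic.Transaction, r *http.Request) {
	txn.AddAttribute("customerTier", customerTier(r))
}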
