Skip to content

Spatial aggregation for async instruments with filtering views #7264

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

fandreuz
Copy link
Contributor

@fandreuz fandreuz commented Apr 9, 2025

Fixes #4901

I propose the following changes:

  • AsynchronousMetricStorage now declares void record(Attributes, long) and void record(Attributes, double) instead of a single void record(Measurement), to enable using AggregatorHandle within the class.
  • AsynchronousMetricStorage now contains two long fields (startEpochNanos, epochNanos) which are injected before running the callbacks, as an alternative to passing a Measurement object.
  • Some tests had to change due to the change in the interface of AsynchronousMetricStorage

@fandreuz fandreuz requested a review from a team as a code owner April 9, 2025 22:51
Copy link

linux-foundation-easycla bot commented Apr 9, 2025

CLA Signed

The committers listed above are authorized under a signed CLA.

@fandreuz
Copy link
Contributor Author

fandreuz commented Apr 9, 2025

I'd like to discuss the test here. I would expect a cumulative sum instrument to aggregate all occurrences with empty attributes. Thus, I would expect to see 6 here instead of 3.

To preserve this behavior, I clear the values for all the attributes at every collection, which is why AsynchronousMetricStorageTest#collect_reusableData_reusedObjectsAreReturnedOnSecondCall is failing. I'd like to clarify this aspect before fixing it.

@jack-berg
Copy link
Member

jack-berg commented Apr 14, 2025

I'd like to discuss the test here. I would expect a cumulative sum instrument to aggregate all occurrences with empty attributes. Thus, I would expect to see 6 here instead of 3.

Yes I believe you're correct. The test / logic seem appear to be incorrect currently

Oh I forgot about a fundamental detail of asynchronous instruments: conceptually, when you make an observation with a cumulative instrument, we say that you are observing a cumulative value. Thus, if we have a reader with cumulative temporality which reads twice and invokes the callbacks twice, and the callback records the same value 3 each time, the reported value should indeed be 3 and not 6.

Consider it from the perspective of this instrumentation which records information about classes loaded. If in sequential reads the callback records that 100 classes have been loaded, the output of a cumulative reader should always be 100, not N*100 (where N is the number of times the callback has been invoked). Instrumentation is not responsible for managing state and recording only the delta.

Copy link
Member

@jack-berg jack-berg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this. This is pretty complex code when you consider the combinations of cumulative + delta, reusable + immutable data mode, and our desire to maintain zero memory allocations with reusable data mode after the application reaches steady state.

So reviews need to be slow / careful, but I am definitely interested in getting this issue resolved!

@fandreuz fandreuz force-pushed the 4901-aggregate-async-instruments branch from 00264c9 to 561e479 Compare April 16, 2025 23:11
@fandreuz
Copy link
Contributor Author

fandreuz commented Apr 16, 2025

I'd like to discuss the test here. I would expect a cumulative sum instrument to aggregate all occurrences with empty attributes. Thus, I would expect to see 6 here instead of 3.

Yes I believe you're correct. The test / logic seem appear to be incorrect currently

Oh I forgot about a fundamental detail of asynchronous instruments: conceptually, when you make an observation with a cumulative instrument, we say that you are observing a cumulative value. Thus, if we have a reader with cumulative temporality which reads twice and invokes the callbacks twice, and the callback records the same value 3 each time, the reported value should indeed be 3 and not 6.

Consider it from the perspective of this instrumentation which records information about classes loaded. If in sequential reads the callback records that 100 classes have been loaded, the output of a cumulative reader should always be 100, not N*100 (where N is the number of times the callback has been invoked). Instrumentation is not responsible for managing state and recording only the delta.

Thanks for the clarification, I fixed the test accordingly

@fandreuz
Copy link
Contributor Author

Hi @jack-berg, are the test failure something I should be concerned of? They don't seem to be related to the diff

@jack-berg
Copy link
Member

Hi @jack-berg, are the test failure something I should be concerned of? They don't seem to be related to the diff

The "windows-latest, 21" failure is unrelated, but the "ubunut-latest, 21" failure is: https://github.com/open-telemetry/opentelemetry-java/actions/runs/14524924371/job/40754298012?pr=7264#step:6:6005

This test codifies performance improvements we made with "reusuable data mode", and acts as a barrier to make sure we don't regress. If you compare the output of the benchmark on main to the output on this branch for the DELTA, REUSABLE_DATA, ASYNC_COUNTER case, we see an increase in B/op from 35_592 to 28_016_028. Ouch!

@fandreuz fandreuz force-pushed the 4901-aggregate-async-instruments branch from 3b3f85a to 29f93c5 Compare April 24, 2025 01:22
Copy link

codecov bot commented Apr 24, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 89.90%. Comparing base (8c1e348) to head (44b9825).
Report is 10 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main    #7264      +/-   ##
============================================
- Coverage     90.03%   89.90%   -0.14%     
+ Complexity     6916     6893      -23     
============================================
  Files           787      785       -2     
  Lines         20864    20791      -73     
  Branches       2023     2024       +1     
============================================
- Hits          18785    18692      -93     
- Misses         1441     1460      +19     
- Partials        638      639       +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@fandreuz
Copy link
Contributor Author

Hi @jack-berg, all green finally!

I had to make some changes in DoubleLastValueAggregator because I started getting too many allocation there as well.

The root cause is apparently the boxing due to the usage of AtomicReference<Double> within DoubleLastValueAggregator:

image

Copy link
Member

@jack-berg jack-berg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work. Although you forgot one thing - you've eliminated the need for Measurement and all related code. I pushed a commit which deletes it and removes 345 lines! More functionality with less code!

@jack-berg jack-berg merged commit 7cbcdd6 into open-telemetry:main May 2, 2025
28 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Async instruments don't do spatial reaggregation when attributes are dropped
2 participants