Skip to content

[RFC] Remove/hook unused metrics #2901 #5182

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 10 commits into
base: main
Choose a base branch
from

Conversation

mdhaduk
Copy link

@mdhaduk mdhaduk commented May 5, 2025

Opening this as a Draft PR

Changes

Removed unused metrics; used this script to generally identify which metrics might not be used, then manually removed them

To identify unused metrics, I used the following convention

For SharedIncMetric:

  • Not incremented/added to
  • Not stored
  • Not fetched

For LatencyAggregateMetrics:

  • Not calling record_latency_metrics

Metrics Removed:

  • sync_response_fails
  • sync_vmm_send_timeout_count
  • deprecated_cmd_line_api_calls
  • log_fails
  • device_events
  • rx_partial_writes
  • tx_partial_reads

Reason

Resolving issue #2901 - Remove/hook unused metrics

License Acceptance

By submitting this pull request, I confirm that my contribution is made under
the terms of the Apache 2.0 license. For more information on following Developer
Certificate of Origin and signing off your commits, please check
CONTRIBUTING.md.

PR Checklist

  • I have read and understand CONTRIBUTING.md.
  • I have run tools/devtool checkstyle to verify that the PR passes the
    automated style checks.
  • I have described what is done in these changes, why they are needed, and
    how they are solving the problem in a clear and encompassing way.
  • I have updated any relevant documentation (both in code and in the docs)
    in the PR.
  • I have mentioned all user-facing changes in CHANGELOG.md.
  • If a specific issue led to this PR, this PR closes the issue.
  • When making API changes, I have followed the
    Runbook for Firecracker API changes.
  • I have tested all new and changed functionalities in unit tests and/or
    integration tests.
  • I have linked an issue to every new TODO.

  • This functionality cannot be added in rust-vmm.

block throughput metrics on m8g.metal instances for test scenarios using
the async fio engine and more than 1 vcpu are volatile, so exclude them
from A/B-testing.

Suggested-by: Riccardo Mancini <[email protected]>
Signed-off-by: Patrick Roy <[email protected]>
Signed-off-by: LDagnachew <[email protected]>
@mdhaduk mdhaduk force-pushed the rmv_unused_metrics branch from 9990ff7 to 2cc23e7 Compare May 5, 2025 07:39
@roypat
Copy link
Contributor

roypat commented May 6, 2025

This is great, thanks for tackling this! I think we can probably even include your script in this PR (somewhere under tools) and then run it as part of our regular test suite, so we don't again reintroduce unneeded metrics later on :)

Btw, I think you can catch a few more unused ones with a tiny tweak to your script: Add metrics.rs files to the check that excludes specific use-sites. For example, the rx_partial_writes metric in net/metrics.rs is unused, but there's a .add call due to metrics aggregation inside the metrics files itself that makes your script believe it is in use.

LDagnachew added 6 commits May 6, 2025 20:23
script finds metric files, extracts fields, and reports
validity of each metrics

Signed-off-by: Milan Dhaduk <[email protected]>

Signed-off-by: LDagnachew <[email protected]>
- metric is not used

Signed-off-by: Milan Dhaduk <[email protected]>

Signed-off-by: LDagnachew <[email protected]>
- metric is not used

Signed-off-by: Milan Dhaduk <[email protected]>

Signed-off-by: LDagnachew <[email protected]>
- metric is not used

Signed-off-by: Milan Dhaduk <[email protected]>

Signed-off-by: LDagnachew <[email protected]>
- metric is not used

Signed-off-by: Milan Dhaduk <[email protected]>

Signed-off-by: LDagnachew <[email protected]>
- metric is not used

Signed-off-by: Milan Dhaduk <[email protected]>

Signed-off-by: LDagnachew <[email protected]>
@mdhaduk mdhaduk force-pushed the rmv_unused_metrics branch from 78eac72 to d8bfb5a Compare May 6, 2025 20:28
@mdhaduk mdhaduk marked this pull request as ready for review May 6, 2025 21:06
@mdhaduk
Copy link
Author

mdhaduk commented May 6, 2025

Great! Thank you for all the feedback, and I'm super stoked to potentially have my scripted integrated within the test suite!!

I went ahead and made the change to the script as you recommended which you can see above or here. Additionally, the following metrics may or may not be removed:

  • metrics_count (SharedIncMetric)
  • metrics_fails (SharedIncMetric)
  • missed_metrics_count (SharedIncMetric)

I opted to ask you since removing them would require reworking src/metrics.rs and requests/metrics.rs; if you deem that is okay I will go ahead and make those changes as well!

Thanks!

Copy link

codecov bot commented May 7, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 82.92%. Comparing base (1ccfe46) to head (9d8ad3c).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #5182      +/-   ##
==========================================
+ Coverage   82.88%   82.92%   +0.04%     
==========================================
  Files         250      250              
  Lines       26936    26927       -9     
==========================================
+ Hits        22325    22330       +5     
+ Misses       4611     4597      -14     
Flag Coverage Δ
5.10-c5n.metal 83.36% <ø> (-0.01%) ⬇️
5.10-m5n.metal 83.36% <ø> (+<0.01%) ⬆️
5.10-m6a.metal 82.57% <ø> (-0.01%) ⬇️
5.10-m6g.metal 79.19% <ø> (-0.01%) ⬇️
5.10-m6i.metal 83.36% <ø> (-0.01%) ⬇️
5.10-m7a.metal-48xl 82.56% <ø> (?)
5.10-m7g.metal 79.19% <ø> (-0.01%) ⬇️
5.10-m7i.metal-24xl 83.32% <ø> (?)
5.10-m7i.metal-48xl 83.32% <ø> (?)
5.10-m8g.metal-24xl 79.18% <ø> (?)
5.10-m8g.metal-48xl 79.18% <ø> (?)
6.1-c5n.metal 83.41% <ø> (+<0.01%) ⬆️
6.1-m5n.metal 83.41% <ø> (-0.01%) ⬇️
6.1-m6a.metal 82.62% <ø> (-0.01%) ⬇️
6.1-m6g.metal 79.18% <ø> (-0.01%) ⬇️
6.1-m6i.metal 83.40% <ø> (-0.02%) ⬇️
6.1-m7a.metal-48xl 82.61% <ø> (?)
6.1-m7g.metal 79.19% <ø> (-0.01%) ⬇️
6.1-m7i.metal-24xl 83.41% <ø> (?)
6.1-m7i.metal-48xl 83.42% <ø> (?)
6.1-m8g.metal-24xl 79.18% <ø> (?)
6.1-m8g.metal-48xl 79.18% <ø> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@roypat
Copy link
Contributor

roypat commented May 7, 2025

Ahh, I didn't realize we had files called metrics.rs that dont actually contain definition of metrics structs. How inconvenient haha. Those 3 metrics will need to stick around.

In this case, we can change

        or "metrics.rs" in path

to

        or ("metrics.rs" in path and "vmm" in path)

since all metrics definitions are in the "vmm" crate, and requests/metrics.rs is in the "firecracker" crate, and then it won't get these false-positives anymore :)

btw, I think your last push maybe combined changes from multiple branches into this PR? :D

- metric is not used

Signed-off-by: Milan Dhaduk <[email protected]>
Signed-off-by: LDagnachew <[email protected]>
@mdhaduk mdhaduk force-pushed the rmv_unused_metrics branch from cdfbef8 to 02817e8 Compare May 7, 2025 19:37
- metric is not used

Signed-off-by: Milan Dhaduk <[email protected]>
Signed-off-by: LDagnachew <[email protected]>
@mdhaduk
Copy link
Author

mdhaduk commented May 7, 2025

Ahhh, I see. Thanks for pointing that out; I made the change, and the script seems to be working as intended.

And yes, sorry about that, I went ahead and cleaned up the history of this feature branch to only include changes related to the pr.

Let me know if any additional changes are needed, thank you!

@roypat
Copy link
Contributor

roypat commented May 8, 2025

Ahhh, I see. Thanks for pointing that out; I made the change, and the script seems to be working as intended.

And yes, sorry about that, I went ahead and cleaned up the history of this feature branch to only include changes related to the pr.

Let me know if any additional changes are needed, thank you!

This is great! I think we can squash all the commits about removing individual metrics into a single one. Then, could you include your script under tests/host_tools/ as part of this PR, and add a very simple test in tests/integration_tests/style that just executes the script, failing if any unused metrics are found? :o

@mdhaduk
Copy link
Author

mdhaduk commented May 9, 2025

Sounds good! I will get to that as soon as I finish up some touches on the script!

So to clarify, if a metric is defined and/or aggregated within the vmm crate, but not used anywhere else (excluding tests), then that metric should be marked unused, right?

I ask because I've added a modification to account for metrics like the following that are improperly being marked as unused:

Some((&METRICS.latencies_us.load_snapshot, "load snapshot"))

The modification added rf".+{field}" as a part of the usage patterns to account for instances similar to the above example. This solves the issue, and the only remaining metrics are the following:

  • process_startup_time_us
  • process_startup_time_cpu_us
  • min_us
  • max_us
  • sum_us

All of which are only defined within the "vmm" crate and not used elsewhere in the codebase, so they should be removed, correct?

Let me know what you think! :D

@roypat
Copy link
Contributor

roypat commented May 9, 2025

So to clarify, if a metric is defined and/or aggregated within the vmm crate, but not used anywhere else (excluding tests), then that metric should be marked unused, right?

Yeah, if the only thing a metric is used for is aggregation, or some other logic contained within a metrics.rs file in the vmm crate, then it should be marked as unused, except...

This solves the issue, and the only remaining metrics are the following:

* process_startup_time_us

* process_startup_time_cpu_us

* min_us

* max_us

* sum_us

...these are indeed special .-. The first two only get store()'d in logger/metrics.rs, but it's not actually an aggregation, it's because metrics.rs defines a helper function for setting the metric, and this utility function is then used outside of the metrics module. Similarly the min/max/sum thingies are set in the Drop impl for LatencyMetricsRecorder. I don't think there's a general rule that would catch these, so let's just have an "allowlist" that causes the script to ignore precisely these 5 metrics.

@zulinx86 zulinx86 requested a review from roypat May 14, 2025 07:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants