[RFC] Remove/hook unused metrics #2901 #5182

mdhaduk · 2025-05-05T07:32:29Z

Opening this as a Draft PR

Changes

Removed unused metrics; used this script to generally identify which metrics might not be used, then manually removed them

To identify unused metrics, I used the following convention

For SharedIncMetric:

Not incremented/added to
Not stored
Not fetched

For LatencyAggregateMetrics:

Not calling record_latency_metrics

Metrics Removed:

sync_response_fails
sync_vmm_send_timeout_count
deprecated_cmd_line_api_calls
log_fails
device_events
rx_partial_writes
tx_partial_reads

Reason

Resolving issue #2901 - Remove/hook unused metrics

License Acceptance

By submitting this pull request, I confirm that my contribution is made under
the terms of the Apache 2.0 license. For more information on following Developer
Certificate of Origin and signing off your commits, please check
CONTRIBUTING.md.

PR Checklist

I have read and understand CONTRIBUTING.md.
I have run tools/devtool checkstyle to verify that the PR passes the
automated style checks.
I have described what is done in these changes, why they are needed, and
how they are solving the problem in a clear and encompassing way.
I have updated any relevant documentation (both in code and in the docs)
in the PR.
I have mentioned all user-facing changes in CHANGELOG.md.
If a specific issue led to this PR, this PR closes the issue.
When making API changes, I have followed the
Runbook for Firecracker API changes.
I have tested all new and changed functionalities in unit tests and/or
integration tests.
I have linked an issue to every new TODO.

This functionality cannot be added in rust-vmm.

block throughput metrics on m8g.metal instances for test scenarios using the async fio engine and more than 1 vcpu are volatile, so exclude them from A/B-testing. Suggested-by: Riccardo Mancini <[email protected]> Signed-off-by: Patrick Roy <[email protected]> Signed-off-by: LDagnachew <[email protected]>

roypat · 2025-05-06T06:30:15Z

This is great, thanks for tackling this! I think we can probably even include your script in this PR (somewhere under tools) and then run it as part of our regular test suite, so we don't again reintroduce unneeded metrics later on :)

Btw, I think you can catch a few more unused ones with a tiny tweak to your script: Add metrics.rs files to the check that excludes specific use-sites. For example, the rx_partial_writes metric in net/metrics.rs is unused, but there's a .add call due to metrics aggregation inside the metrics files itself that makes your script believe it is in use.

script finds metric files, extracts fields, and reports validity of each metrics Signed-off-by: Milan Dhaduk <[email protected]> Signed-off-by: LDagnachew <[email protected]>

- metric is not used Signed-off-by: Milan Dhaduk <[email protected]> Signed-off-by: LDagnachew <[email protected]>

mdhaduk · 2025-05-06T21:24:14Z

Great! Thank you for all the feedback, and I'm super stoked to potentially have my scripted integrated within the test suite!!

I went ahead and made the change to the script as you recommended which you can see above or here. Additionally, the following metrics may or may not be removed:

metrics_count (SharedIncMetric)
metrics_fails (SharedIncMetric)
missed_metrics_count (SharedIncMetric)

I opted to ask you since removing them would require reworking src/metrics.rs and requests/metrics.rs; if you deem that is okay I will go ahead and make those changes as well!

Thanks!

codecov · 2025-05-07T08:24:56Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 82.92%. Comparing base (1ccfe46) to head (9d8ad3c).

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #5182      +/-   ##
==========================================
+ Coverage   82.88%   82.92%   +0.04%     
==========================================
  Files         250      250              
  Lines       26936    26927       -9     
==========================================
+ Hits        22325    22330       +5     
+ Misses       4611     4597      -14

Flag	Coverage Δ
5.10-c5n.metal	`83.36% <ø> (-0.01%)`	⬇️
5.10-m5n.metal	`83.36% <ø> (+<0.01%)`	⬆️
5.10-m6a.metal	`82.57% <ø> (-0.01%)`	⬇️
5.10-m6g.metal	`79.19% <ø> (-0.01%)`	⬇️
5.10-m6i.metal	`83.36% <ø> (-0.01%)`	⬇️
5.10-m7a.metal-48xl	`82.56% <ø> (?)`
5.10-m7g.metal	`79.19% <ø> (-0.01%)`	⬇️
5.10-m7i.metal-24xl	`83.32% <ø> (?)`
5.10-m7i.metal-48xl	`83.32% <ø> (?)`
5.10-m8g.metal-24xl	`79.18% <ø> (?)`
5.10-m8g.metal-48xl	`79.18% <ø> (?)`
6.1-c5n.metal	`83.41% <ø> (+<0.01%)`	⬆️
6.1-m5n.metal	`83.41% <ø> (-0.01%)`	⬇️
6.1-m6a.metal	`82.62% <ø> (-0.01%)`	⬇️
6.1-m6g.metal	`79.18% <ø> (-0.01%)`	⬇️
6.1-m6i.metal	`83.40% <ø> (-0.02%)`	⬇️
6.1-m7a.metal-48xl	`82.61% <ø> (?)`
6.1-m7g.metal	`79.19% <ø> (-0.01%)`	⬇️
6.1-m7i.metal-24xl	`83.41% <ø> (?)`
6.1-m7i.metal-48xl	`83.42% <ø> (?)`
6.1-m8g.metal-24xl	`79.18% <ø> (?)`
6.1-m8g.metal-48xl	`79.18% <ø> (?)`

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

roypat · 2025-05-07T08:28:06Z

Ahh, I didn't realize we had files called metrics.rs that dont actually contain definition of metrics structs. How inconvenient haha. Those 3 metrics will need to stick around.

In this case, we can change

        or "metrics.rs" in path

to

        or ("metrics.rs" in path and "vmm" in path)

since all metrics definitions are in the "vmm" crate, and requests/metrics.rs is in the "firecracker" crate, and then it won't get these false-positives anymore :)

btw, I think your last push maybe combined changes from multiple branches into this PR? :D

- metric is not used Signed-off-by: Milan Dhaduk <[email protected]> Signed-off-by: LDagnachew <[email protected]>

mdhaduk · 2025-05-07T20:21:59Z

Ahhh, I see. Thanks for pointing that out; I made the change, and the script seems to be working as intended.

And yes, sorry about that, I went ahead and cleaned up the history of this feature branch to only include changes related to the pr.

Let me know if any additional changes are needed, thank you!

roypat · 2025-05-08T08:40:25Z

Ahhh, I see. Thanks for pointing that out; I made the change, and the script seems to be working as intended.

And yes, sorry about that, I went ahead and cleaned up the history of this feature branch to only include changes related to the pr.

Let me know if any additional changes are needed, thank you!

This is great! I think we can squash all the commits about removing individual metrics into a single one. Then, could you include your script under tests/host_tools/ as part of this PR, and add a very simple test in tests/integration_tests/style that just executes the script, failing if any unused metrics are found? :o

mdhaduk · 2025-05-09T06:41:13Z

Sounds good! I will get to that as soon as I finish up some touches on the script!

So to clarify, if a metric is defined and/or aggregated within the vmm crate, but not used anywhere else (excluding tests), then that metric should be marked unused, right?

I ask because I've added a modification to account for metrics like the following that are improperly being marked as unused:

Some((&METRICS.latencies_us.load_snapshot, "load snapshot"))

The modification added rf".+{field}" as a part of the usage patterns to account for instances similar to the above example. This solves the issue, and the only remaining metrics are the following:

process_startup_time_us
process_startup_time_cpu_us
min_us
max_us
sum_us

All of which are only defined within the "vmm" crate and not used elsewhere in the codebase, so they should be removed, correct?

Let me know what you think! :D

roypat · 2025-05-09T08:17:07Z

So to clarify, if a metric is defined and/or aggregated within the vmm crate, but not used anywhere else (excluding tests), then that metric should be marked unused, right?

Yeah, if the only thing a metric is used for is aggregation, or some other logic contained within a metrics.rs file in the vmm crate, then it should be marked as unused, except...

This solves the issue, and the only remaining metrics are the following:
* process_startup_time_us

* process_startup_time_cpu_us

* min_us

* max_us

* sum_us

...these are indeed special .-. The first two only get store()'d in logger/metrics.rs, but it's not actually an aggregation, it's because metrics.rs defines a helper function for setting the metric, and this utility function is then used outside of the metrics module. Similarly the min/max/sum thingies are set in the Drop impl for LatencyMetricsRecorder. I don't think there's a general rule that would catch these, so let's just have an "allowlist" that causes the script to ignore precisely these 5 metrics.

JackThomson2 · 2025-06-04T14:32:15Z

Hey @mdhaduk are you planning to continue working on this and taking it to the finish line?

mdhaduk · 2025-06-04T17:47:01Z

Hey @JackThomson2, yes, I have been meaning to add the integration test to complete this pr, I've just been busy with my current internship; I was thinking of completing this over the weekend, is that alright?

JackThomson2 · 2025-06-05T09:49:11Z

Hey @mdhaduk of course, that's great thank you! If you need any help feel free to reach out :)

mdhaduk force-pushed the rmv_unused_metrics branch from 9990ff7 to 2cc23e7 Compare May 5, 2025 07:39

LDagnachew added 6 commits May 6, 2025 20:23

Script for finding Unused Metrics

bd6e153

script finds metric files, extracts fields, and reports validity of each metrics Signed-off-by: Milan Dhaduk <[email protected]> Signed-off-by: LDagnachew <[email protected]>

Removed sync_response_fails metric

9dcc7a5

- metric is not used Signed-off-by: Milan Dhaduk <[email protected]> Signed-off-by: LDagnachew <[email protected]>

Removed sync_vmm_send_timeout_count + cleaned up

309727c

- metric is not used Signed-off-by: Milan Dhaduk <[email protected]> Signed-off-by: LDagnachew <[email protected]>

Removed deprecated_cmd_line_api_calls metric

bcd5697

- metric is not used Signed-off-by: Milan Dhaduk <[email protected]> Signed-off-by: LDagnachew <[email protected]>

Removed log_fails metric

de9644b

- metric is not used Signed-off-by: Milan Dhaduk <[email protected]> Signed-off-by: LDagnachew <[email protected]>

Removed device_events metric

99f5049

- metric is not used Signed-off-by: Milan Dhaduk <[email protected]> Signed-off-by: LDagnachew <[email protected]>

mdhaduk force-pushed the rmv_unused_metrics branch from 78eac72 to d8bfb5a Compare May 6, 2025 20:28

mdhaduk marked this pull request as ready for review May 6, 2025 21:06

mdhaduk requested review from xmarcalx, kalyazin, pb8o and Manciukic as code owners May 6, 2025 21:06

roypat assigned mdhaduk May 7, 2025

Removed rx_partial_writes

02817e8

- metric is not used Signed-off-by: Milan Dhaduk <[email protected]> Signed-off-by: LDagnachew <[email protected]>

mdhaduk force-pushed the rmv_unused_metrics branch from cdfbef8 to 02817e8 Compare May 7, 2025 19:37

Removed tx_partial_reads

1e1a61f

- metric is not used Signed-off-by: Milan Dhaduk <[email protected]> Signed-off-by: LDagnachew <[email protected]>

Merge branch 'main' into rmv_unused_metrics

9d8ad3c

zulinx86 requested a review from roypat May 14, 2025 07:44

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[RFC] Remove/hook unused metrics #2901 #5182

[RFC] Remove/hook unused metrics #2901 #5182

Uh oh!

mdhaduk commented May 5, 2025 •

edited

Loading

Uh oh!

roypat commented May 6, 2025

Uh oh!

mdhaduk commented May 6, 2025

Uh oh!

codecov bot commented May 7, 2025 •

edited

Loading

Uh oh!

roypat commented May 7, 2025

Uh oh!

mdhaduk commented May 7, 2025

Uh oh!

roypat commented May 8, 2025

Uh oh!

mdhaduk commented May 9, 2025

Uh oh!

roypat commented May 9, 2025

Uh oh!

JackThomson2 commented Jun 4, 2025

Uh oh!

mdhaduk commented Jun 4, 2025

Uh oh!

JackThomson2 commented Jun 5, 2025

Uh oh!

Uh oh!

[RFC] Remove/hook unused metrics #2901 #5182

Are you sure you want to change the base?

[RFC] Remove/hook unused metrics #2901 #5182

Uh oh!

Conversation

mdhaduk commented May 5, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Changes

Reason

License Acceptance

PR Checklist

Uh oh!

roypat commented May 6, 2025

Uh oh!

mdhaduk commented May 6, 2025

Uh oh!

codecov bot commented May 7, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

roypat commented May 7, 2025

Uh oh!

mdhaduk commented May 7, 2025

Uh oh!

roypat commented May 8, 2025

Uh oh!

mdhaduk commented May 9, 2025

Uh oh!

roypat commented May 9, 2025

Uh oh!

JackThomson2 commented Jun 4, 2025

Uh oh!

mdhaduk commented Jun 4, 2025

Uh oh!

JackThomson2 commented Jun 5, 2025

Uh oh!

Uh oh!

mdhaduk commented May 5, 2025 •

edited

Loading

codecov bot commented May 7, 2025 •

edited

Loading