dcgm-exporter counter value goes down #417

Open
luccabb opened this issue Nov 14, 2024 · 1 comment
Labels
bug Something isn't working

Comments

luccabb commented Nov 14, 2024

What is the version?

3.1.3-3.1.2

What happened?

I'm tracking DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL, which is marked as a counter, but I'm observing the value going down. For example:

timestamp t
DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL{gpu="1"...} 57


timestamp t + 1
DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL{gpu="1"...} 21

I'm also seeing this with other metrics that are reported as counters here. Is this expected behavior for counters?
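
One way to check which counter-typed metrics show this is to compare two consecutive scrapes of the exporter's text output. The following is a minimal sketch, not part of dcgm-exporter: the helper names `parse_scrape` and `decreasing_counters` are hypothetical, and the two sample scrapes are simplified stand-ins built from the values above (the real label set is longer and is elided in this report).

```python
import re

# One sample line of the Prometheus text format: name, optional {labels}, value.
# Simplification: label values containing "}" are not handled.
SAMPLE_RE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(\{[^}]*\})?\s+(\S+)')


def parse_scrape(text):
    """Return (types, samples) for one scrape of the text exposition format."""
    types, samples = {}, {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("# TYPE "):
            parts = line.split()
            if len(parts) >= 4:
                types[parts[2]] = parts[3]  # metric name -> declared type
        elif line and not line.startswith("#"):
            match = SAMPLE_RE.match(line)
            if match:
                key = (match.group(1), match.group(2) or "")
                samples[key] = float(match.group(3))
    return types, samples


def decreasing_counters(before, after):
    """List (name, labels, old, new) for counter-typed series whose value fell."""
    types, old = parse_scrape(before)
    _, new = parse_scrape(after)
    return sorted(
        (name, labels, old[(name, labels)], value)
        for (name, labels), value in new.items()
        if types.get(name) == "counter"
        and (name, labels) in old
        and value < old[(name, labels)]
    )


# Hypothetical, shortened scrapes using the values from this report.
SCRAPE_T = '''# TYPE DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL counter
DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL{gpu="1"} 57
'''
SCRAPE_T1 = '''# TYPE DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL counter
DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL{gpu="1"} 21
'''

print(decreasing_counters(SCRAPE_T, SCRAPE_T1))
```

Running the same comparison over full scrapes would list every series that is declared as a counter but decreased between the two points in time.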

What did you expect to happen?

I'd expect counters to only go up.

Do not use a counter to expose a value that can decrease.

source: https://prometheus.io/docs/concepts/metric_types/#counter
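
This matters in practice because Prometheus interprets any decrease in a counter as a counter reset. The sketch below is a simplified model of that reset handling, not Prometheus's actual implementation: `increase_with_resets` is a hypothetical helper, and it ignores the extrapolation that `rate()` and `increase()` also apply. It shows how the 57 to 21 drop reported above would be counted as growth.

```python
def increase_with_resets(samples):
    """Total increase over samples, treating any drop as a reset to zero."""
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        if cur < prev:
            # A drop is read as "the counter restarted from 0",
            # so the whole new value is counted as growth.
            total += cur
        else:
            total += cur - prev
    return total


print(increase_with_resets([57, 21]))          # 21.0
print(increase_with_resets([10, 57, 21, 30]))  # 47 + 21 + 9 = 77.0
```

Under this model, a value that really fell by 36 is reported as an increase of 21, so queries over a counter-typed metric that behaves like a gauge can give misleading results.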

What is the GPU model?

A100-SXM4-80GB

What is the environment?

bare metal

How did you deploy the dcgm-exporter and what is the configuration?

No response

How to reproduce the issue?

No response

Anything else we need to know?

No response

@luccabb luccabb added the bug Something isn't working label Nov 14, 2024

luccabb commented Nov 14, 2024

If this is expected behavior, should we change the type to gauge?

A gauge is a metric that represents a single numerical value that can arbitrarily go up and down.

https://prometheus.io/docs/concepts/metric_types/#gauge
