feat: add support for gpu sharing metrics in k8s #432
Conversation
Fixes #307
Thank you for the contribution! We will test and review the PR in the coming weeks. Please be aware that we are on the verge of releasing a new 4.0 version of DCGM-Exporter, so this PR will likely need to be updated after that happens.
Thanks @glowkey - I actually may factor out the smaller bugfix into a separate PR. That way, this change just focuses on the new feature.
Moved the bugfix over to #433
Force-pushed from 1808a2b to 61ed255.
We add support for capturing separate metrics when running dcgm-exporter in K8s clusters that have GPU sharing enabled, including time-sharing and MPS. This should now support GPU sharing on MIG devices as well.

We ensure this is supported for both the NVIDIA and GKE device plugins, respectively at:

* https://github.com/NVIDIA/k8s-device-plugin
* https://github.com/GoogleCloudPlatform/container-engine-accelerators

The change is guarded by a new configuration parameter, which can be passed as the flag `--kubernetes-virtual-gpus` or as the environment variable `KUBERNETES_VIRTUAL_GPUS`. If set, the Kubernetes PodMapper transform processor uses a different mechanism to build the device mapping: it creates a copy of the metric for every shared (i.e. virtual) GPU exposed by the device plugin. To disambiguate the generated timeseries, it adds a new label `vgpu` set to the detected shared GPU replica.

This also fixes an issue where pod attributes are not guaranteed to be consistently associated with the same metric. If the podresources API does not return the device IDs in the same order between calls, the device-to-pod association in the map can change between scrapes due to an overwrite that happens in the Process loop.

Ultimately, we may wish to make this the default behavior. However, guarding it behind a flag:

1. Mitigates any risk of the change in case of bugs.
2. Because the feature adds a new label, PromQL queries that aggregate away or match on specific labels (e.g. using `without` or `ignoring`) may break existing dashboards and alerts. Allowing users to opt in via a flag ensures backwards compatibility in these scenarios.

Finally, we update the unit tests to ensure thorough coverage for the changes.
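For illustration, here is a sketch of what a scrape could look like with `--kubernetes-virtual-gpus` enabled and one physical GPU shared across two replicas. The metric name, label set, UUIDs, pod names, and values below are illustrative assumptions, not output from this PR:

```
# The same physical-GPU metric is emitted once per virtual GPU exposed by the
# device plugin; the new "vgpu" label distinguishes the otherwise identical series.
DCGM_FI_DEV_GPU_UTIL{gpu="0", UUID="GPU-5ad4...", pod="inference-a", namespace="default", vgpu="0"} 37
DCGM_FI_DEV_GPU_UTIL{gpu="0", UUID="GPU-5ad4...", pod="inference-b", namespace="default", vgpu="1"} 37
```

Because each replica carries a full copy of the metric, the per-physical-GPU value appears once per `vgpu` series.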
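As a hedged sketch of the backwards-compatibility point above (not taken from any existing dashboard), a query that expects one series per physical GPU can collapse the new label explicitly:

```promql
# Each virtual GPU carries a copy of the metric, so collapsing the new "vgpu" label
# with max (or avg) recovers one series per physical GPU without overcounting.
max without (vgpu) (DCGM_FI_DEV_GPU_UTIL)

# Per-pod attribution remains available through the pod/namespace labels added by
# the PodMapper.
max by (pod, namespace) (DCGM_FI_DEV_GPU_UTIL)
```

Queries written with `without`, `ignoring`, or exact label matching are the ones most likely to need this kind of adjustment when opting in.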