feat(dcgm-exporter): expose per-pod GPU util config in ClusterPolicy#2178
zbennett10 wants to merge 1 commit into NVIDIA:main
Conversation
Wires the dcgm-exporter per-pod GPU utilization feature
(NVIDIA/dcgm-exporter#<PR>) into the ClusterPolicy CRD so GPU Operator
users can enable it with a single field instead of manually patching
DaemonSet args.
## What changes
ClusterPolicy gets a new `spec.dcgmExporter.perPodGPUUtil` stanza:

```yaml
spec:
  dcgmExporter:
    perPodGPUUtil:
      enabled: true
      podResourcesSocketPath: /var/lib/kubelet/pod-resources/kubelet.sock
```
When enabled, the operator automatically:
- Sets the `DCGM_EXPORTER_ENABLE_PER_POD_GPU_UTIL=true` env var
- Mounts `/var/lib/kubelet/pod-resources/` as a read-only `hostPath` volume
- Sets `hostPID: true` (required to resolve `/proc/<pid>/cgroup`)
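Taken together, the resulting dcgm-exporter DaemonSet would look roughly like the sketch below. This is illustrative only, assuming standard DaemonSet fields; the operator generates the actual manifest in `controllers/object_controls.go`:

```yaml
# Sketch of the DaemonSet changes applied when perPodGPUUtil.enabled is true
# (illustrative, not the operator's literal output).
spec:
  template:
    spec:
      hostPID: true                      # needed to resolve /proc/<pid>/cgroup
      containers:
      - name: dcgm-exporter
        env:
        - name: DCGM_EXPORTER_ENABLE_PER_POD_GPU_UTIL
          value: "true"
        volumeMounts:
        - name: pod-resources
          mountPath: /var/lib/kubelet/pod-resources
          readOnly: true
      volumes:
      - name: pod-resources
        hostPath:
          path: /var/lib/kubelet/pod-resources
```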
## Why
Time-slicing is configured via ClusterPolicy (spec.devicePlugin.config)
but the resulting loss of per-pod GPU observability had no equivalent
ClusterPolicy lever to restore it. This closes that gap.
See: NVIDIA/dcgm-exporter#587
## Files changed
- api/nvidia/v1/clusterpolicy_types.go — DCGMExporterPerPodGPUUtilConfig
struct, PerPodGPUUtil field on DCGMExporterSpec, helper methods, constant
- api/nvidia/v1/zz_generated.deepcopy.go — deep copy for new struct
- controllers/object_controls.go — wire perPodGPUUtil into DaemonSet spec
- docs/dcgm-exporter-per-pod-gpu-metrics.md — usage + cost model
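As a rough sketch of the `clusterpolicy_types.go` change, the new struct might look like the following. The field and method names here are assumptions inferred from the PR description; the authoritative definitions are in `api/nvidia/v1/clusterpolicy_types.go`:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// DCGMExporterPerPodGPUUtilConfig is a hypothetical sketch of the new
// ClusterPolicy fields described above (names inferred from this PR).
type DCGMExporterPerPodGPUUtilConfig struct {
	// Enabled turns the per-pod GPU utilization collector on or off.
	Enabled *bool `json:"enabled,omitempty"`
	// PodResourcesSocketPath is the kubelet pod-resources socket path.
	PodResourcesSocketPath string `json:"podResourcesSocketPath,omitempty"`
}

// IsEnabled follows the nil-safe helper-method pattern common in the
// GPU Operator API package: unset means disabled.
func (c *DCGMExporterPerPodGPUUtilConfig) IsEnabled() bool {
	return c != nil && c.Enabled != nil && *c.Enabled
}

func main() {
	enabled := true
	cfg := &DCGMExporterPerPodGPUUtilConfig{
		Enabled:                &enabled,
		PodResourcesSocketPath: "/var/lib/kubelet/pod-resources/kubelet.sock",
	}
	out, _ := json.Marshal(cfg)
	fmt.Println(string(out)) // {"enabled":true,"podResourcesSocketPath":"/var/lib/kubelet/pod-resources/kubelet.sock"}
	fmt.Println(cfg.IsEnabled())
}
```

The `*bool` plus `omitempty` combination lets the CRD distinguish "unset" from an explicit `false`, which is why the helper checks for nil before dereferencing.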
Signed-off-by: Zachary Bennett <bennett.zachary@outlook.com>
(force-pushed e93686e to 2a834bb)
@rajathagasthya can anyone take a look at this? Thanks

@zbennett10 Has this been merged upstream in DCGM Exporter? I see NVIDIA/dcgm-exporter#638 is still open.

Nope, still waiting for a review there. Are you able to help with that, @rajathagasthya? Thanks :) Figured I'd try.

Sorry, we don't maintain that repo! You'd have to get it merged there first, and then we can add support for it once there's a new release of DCGM Exporter.

Sure, I was just wondering if you could help contact them about it since you're also from NVIDIA. A lot of radio silence over there, it seems. Thanks @rajathagasthya

@zbennett10 I've poked some people internally to take a look at this, so hopefully there's some action on it soon!
Summary
Adds `spec.dcgmExporter.perPodGPUUtil` to the `ClusterPolicy` CRD, enabling per-pod GPU SM utilization metrics when CUDA time-slicing is active.
This is the GPU Operator half of a two-part contribution that closes NVIDIA/dcgm-exporter#638 (issue: NVIDIA/dcgm-exporter#587).
The problem
With GPU time-slicing, `dcgm_fi_dev_gpu_util` reports only aggregate device utilization; you cannot tell how much of the GPU the proxy, embeddings, or inference pods are each consuming.
The fix
dcgm-exporter PR #638 adds an opt-in collector that combines NVML per-process utilization with the kubelet pod-resources gRPC API to emit `dcgm_fi_dev_sm_util_per_pod` per `(pod, namespace, container, gpu_uuid)` tuple. This PR wires that feature through `ClusterPolicy` so users can enable it without hand-editing the dcgm-exporter DaemonSet.
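The new metric can be pictured as a Prometheus exposition line. Only the metric name and label keys come from the PR description; the label values and sample value below are hypothetical placeholders:

```
dcgm_fi_dev_sm_util_per_pod{pod="inference-0",namespace="default",container="main",gpu_uuid="GPU-00000000-0000-0000-0000-000000000000"} 42
```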
What GPU Operator does automatically when `enabled: true`:
- Sets the `DCGM_EXPORTER_ENABLE_PER_POD_GPU_UTIL=true` env var in the dcgm-exporter DaemonSet
- Mounts `/var/lib/kubelet/pod-resources/` as a read-only `hostPath` volume
- Sets `hostPID: true` on the DaemonSet so dcgm-exporter can resolve `/proc/<pid>/cgroup`

Compatibility
Security considerations
Enabling `perPodGPUUtil` grants dcgm-exporter:
- Read access to `/var/lib/kubelet/pod-resources/` (lists all GPU-using pods)
- Host PID namespace access (to read `/proc/<pid>/cgroup`)

These are the same permissions used by other node-level monitoring agents.