feat(dcgm-exporter): expose per-pod GPU util config in ClusterPolicy#2178
zbennett10 wants to merge 1 commit into NVIDIA:main
Conversation
Wires the dcgm-exporter per-pod GPU utilization feature
(NVIDIA/dcgm-exporter#<PR>) into the ClusterPolicy CRD so GPU Operator
users can enable it with a single field instead of manually patching
DaemonSet args.
## What changes
ClusterPolicy gets a new `spec.dcgmExporter.perPodGPUUtil` stanza:

```yaml
spec:
  dcgmExporter:
    perPodGPUUtil:
      enabled: true
      podResourcesSocketPath: /var/lib/kubelet/pod-resources/kubelet.sock
```
When enabled, the operator automatically:
- Sets the `DCGM_EXPORTER_ENABLE_PER_POD_GPU_UTIL=true` env var
- Mounts `/var/lib/kubelet/pod-resources/` as a read-only `hostPath` volume
- Sets `hostPID: true` (required to resolve `/proc/<pid>/cgroup`)
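Taken together, the resulting dcgm-exporter DaemonSet would look roughly like the sketch below. This is illustrative only, assuming standard DaemonSet fields; the operator generates the actual manifest in `controllers/object_controls.go`:

```yaml
# Sketch of the DaemonSet changes applied when perPodGPUUtil.enabled is true
# (illustrative, not the operator's literal output).
spec:
  template:
    spec:
      hostPID: true                      # needed to resolve /proc/<pid>/cgroup
      containers:
      - name: dcgm-exporter
        env:
        - name: DCGM_EXPORTER_ENABLE_PER_POD_GPU_UTIL
          value: "true"
        volumeMounts:
        - name: pod-resources
          mountPath: /var/lib/kubelet/pod-resources
          readOnly: true
      volumes:
      - name: pod-resources
        hostPath:
          path: /var/lib/kubelet/pod-resources
```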
## Why
Time-slicing is configured via ClusterPolicy (spec.devicePlugin.config)
but the resulting loss of per-pod GPU observability had no equivalent
ClusterPolicy lever to restore it. This closes that gap.
See: NVIDIA/dcgm-exporter#587
## Files changed
- api/nvidia/v1/clusterpolicy_types.go — DCGMExporterPerPodGPUUtilConfig
struct, PerPodGPUUtil field on DCGMExporterSpec, helper methods, constant
- api/nvidia/v1/zz_generated.deepcopy.go — deep copy for new struct
- controllers/object_controls.go — wire perPodGPUUtil into DaemonSet spec
- docs/dcgm-exporter-per-pod-gpu-metrics.md — usage + cost model
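As a rough sketch of the `clusterpolicy_types.go` change, the new struct might look like the following. The field and method names here are assumptions inferred from the PR description; the authoritative definitions are in `api/nvidia/v1/clusterpolicy_types.go`:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// DCGMExporterPerPodGPUUtilConfig is a hypothetical sketch of the new
// ClusterPolicy fields described above (names inferred from this PR).
type DCGMExporterPerPodGPUUtilConfig struct {
	// Enabled turns the per-pod GPU utilization collector on or off.
	Enabled *bool `json:"enabled,omitempty"`
	// PodResourcesSocketPath is the kubelet pod-resources socket path.
	PodResourcesSocketPath string `json:"podResourcesSocketPath,omitempty"`
}

// IsEnabled follows the nil-safe helper-method pattern common in the
// GPU Operator API package: unset means disabled.
func (c *DCGMExporterPerPodGPUUtilConfig) IsEnabled() bool {
	return c != nil && c.Enabled != nil && *c.Enabled
}

func main() {
	enabled := true
	cfg := &DCGMExporterPerPodGPUUtilConfig{
		Enabled:                &enabled,
		PodResourcesSocketPath: "/var/lib/kubelet/pod-resources/kubelet.sock",
	}
	out, _ := json.Marshal(cfg)
	fmt.Println(string(out)) // {"enabled":true,"podResourcesSocketPath":"/var/lib/kubelet/pod-resources/kubelet.sock"}
	fmt.Println(cfg.IsEnabled())
}
```

The `*bool` plus `omitempty` combination lets the CRD distinguish "unset" from an explicit `false`, which is why the helper checks for nil before dereferencing.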
Signed-off-by: Zachary Bennett <bennett.zachary@outlook.com>
(force-pushed e93686e to 2a834bb)
@rajathagasthya can anyone take a look at this? Thanks

@zbennett10 Has this been merged upstream in DCGM Exporter? I see NVIDIA/dcgm-exporter#638 is still open.

Nope, still waiting for a review there. Are you able to help with that, @rajathagasthya? Thanks :) Figured I'd try.

Sorry, we don't maintain that repo! You'd have to get it merged there first, and then we can add support for it once there's a new release of DCGM Exporter.

Sure, I was just wondering if you could help contact them about it since you're also from NVIDIA. A lot of radio silence over there, it seems. Thanks @rajathagasthya

@zbennett10 I've poked some people internally to take a look at this, so hopefully there's some action on it soon!
Summary
Adds `spec.dcgmExporter.perPodGPUUtil` to the `ClusterPolicy` CRD, enabling per-pod GPU SM utilization metrics when CUDA time-slicing is active.
This is the GPU Operator half of a two-part contribution that closes NVIDIA/dcgm-exporter#638 (issue: NVIDIA/dcgm-exporter#587).
The problem
With GPU time-slicing, `dcgm_fi_dev_gpu_util` reports only aggregate device utilization; you cannot tell how much of the GPU the proxy, embeddings, or inference pods are each consuming.
The fix
dcgm-exporter PR #638 adds an opt-in collector that combines NVML per-process utilization with the kubelet pod-resources gRPC API to emit `dcgm_fi_dev_sm_util_per_pod` per `(pod, namespace, container, gpu_uuid)` tuple. This PR wires that feature through `ClusterPolicy` so users can enable it without hand-editing the dcgm-exporter DaemonSet.
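The new metric can be pictured as a Prometheus exposition line. Only the metric name and label keys come from the PR description; the label values and sample value below are hypothetical placeholders:

```
dcgm_fi_dev_sm_util_per_pod{pod="inference-0",namespace="default",container="main",gpu_uuid="GPU-00000000-0000-0000-0000-000000000000"} 42
```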
What GPU Operator does automatically when `enabled: true`:
- Sets the `DCGM_EXPORTER_ENABLE_PER_POD_GPU_UTIL=true` env var in the dcgm-exporter DaemonSet
- Mounts `/var/lib/kubelet/pod-resources/` as a read-only `hostPath` volume
- Sets `hostPID: true` on the DaemonSet so dcgm-exporter can resolve `/proc/<pid>/cgroup`

Compatibility
Security considerations
Enabling `perPodGPUUtil` grants dcgm-exporter:
- Read access to `/var/lib/kubelet/pod-resources/` (lists all GPU-using pods)
- Host PID namespace access (to read `/proc/<pid>/cgroup`)

These are the same permissions used by other node-level monitoring agents.