Commit e93686e
feat(dcgm-exporter): add perPodGPUUtil ClusterPolicy field for time-slicing
Wires the dcgm-exporter per-pod GPU utilization feature (NVIDIA/dcgm-exporter#<PR>) into the ClusterPolicy CRD so GPU Operator users can enable it with a single field instead of manually patching DaemonSet args.

## What changes

ClusterPolicy gets a new `spec.dcgmExporter.perPodGPUUtil` stanza:

```yaml
spec:
  dcgmExporter:
    perPodGPUUtil:
      enabled: true
      podResourcesSocketPath: /var/lib/kubelet/pod-resources/kubelet.sock
```

When enabled, the operator automatically:

- Sets the `DCGM_EXPORTER_ENABLE_PER_POD_GPU_UTIL=true` env var
- Mounts `/var/lib/kubelet/pod-resources/` as a read-only hostPath volume
- Sets `hostPID: true` (required to resolve `/proc/<pid>/cgroup`)

## Why

Time-slicing is configured via ClusterPolicy (`spec.devicePlugin.config`), but the resulting loss of per-pod GPU observability had no equivalent ClusterPolicy lever to restore it. This closes that gap.

See: NVIDIA/dcgm-exporter#587

## Files changed

- `api/nvidia/v1/clusterpolicy_types.go`: `DCGMExporterPerPodGPUUtilConfig` struct, `PerPodGPUUtil` field on `DCGMExporterSpec`, helper methods, constant
- `api/nvidia/v1/zz_generated.deepcopy.go`: deep copy for new struct
- `controllers/object_controls.go`: wire `perPodGPUUtil` into the DaemonSet spec
- `docs/dcgm-exporter-per-pod-gpu-metrics.md`: usage + cost model
1 parent 09219be

4 files changed: +252 −0

api/nvidia/v1/clusterpolicy_types.go (+52 −0)
```diff
@@ -36,6 +36,8 @@ const (
 	ClusterPolicyCRDName = "ClusterPolicy"
 	// DefaultDCGMJobMappingDir is the default directory for DCGM Exporter HPC job mapping files
 	DefaultDCGMJobMappingDir = "/var/lib/dcgm-exporter/job-mapping"
+	// DefaultDCGMPodResourcesSocket is the default kubelet pod-resources socket path
+	DefaultDCGMPodResourcesSocket = "/var/lib/kubelet/pod-resources/kubelet.sock"
 )

 // ClusterPolicySpec defines the desired state of ClusterPolicy
@@ -969,6 +971,38 @@ type DCGMExporterSpec struct {
 	// +operator-sdk:gen-csv:customresourcedefinitions.specDescriptors.displayName="HPC Job Mapping Configuration"
 	// +operator-sdk:gen-csv:customresourcedefinitions.specDescriptors.x-descriptors="urn:alm:descriptor:com.tectonic.ui:advanced"
 	HPCJobMapping *DCGMExporterHPCJobMappingConfig `json:"hpcJobMapping,omitempty"`
+
+	// Optional: Per-pod GPU utilization metrics for CUDA time-slicing workloads.
+	// When enabled, dcgm-exporter emits dcgm_fi_dev_sm_util_per_pod gauges that
+	// attribute SM utilization to individual pods sharing a GPU via time-slicing.
+	// Requires dcgm-exporter v3.4.0+ built with --enable-per-pod-gpu-util support.
+	// See: https://github.com/NVIDIA/dcgm-exporter/issues/587
+	// +kubebuilder:validation:Optional
+	// +operator-sdk:gen-csv:customresourcedefinitions.specDescriptors=true
+	// +operator-sdk:gen-csv:customresourcedefinitions.specDescriptors.displayName="Per-Pod GPU Utilization Metrics"
+	// +operator-sdk:gen-csv:customresourcedefinitions.specDescriptors.x-descriptors="urn:alm:descriptor:com.tectonic.ui:advanced"
+	PerPodGPUUtil *DCGMExporterPerPodGPUUtilConfig `json:"perPodGPUUtil,omitempty"`
+}
+
+// DCGMExporterPerPodGPUUtilConfig configures per-pod GPU SM utilization metrics.
+// This feature is useful when CUDA time-slicing is active and multiple pods share
+// one physical GPU; standard per-device metrics lose per-workload attribution.
+type DCGMExporterPerPodGPUUtilConfig struct {
+	// Enable per-pod GPU utilization collection via NVML process utilization API.
+	// Requires hostPID: true (automatically set when enabled).
+	// +kubebuilder:validation:Optional
+	// +operator-sdk:gen-csv:customresourcedefinitions.specDescriptors=true
+	// +operator-sdk:gen-csv:customresourcedefinitions.specDescriptors.displayName="Enable Per-Pod GPU Utilization"
+	// +operator-sdk:gen-csv:customresourcedefinitions.specDescriptors.x-descriptors="urn:alm:descriptor:com.tectonic.ui:booleanSwitch"
+	Enabled *bool `json:"enabled,omitempty"`
+
+	// PodResourcesSocketPath is the path to the kubelet pod-resources gRPC socket.
+	// Defaults to /var/lib/kubelet/pod-resources/kubelet.sock.
+	// +kubebuilder:validation:Optional
+	// +operator-sdk:gen-csv:customresourcedefinitions.specDescriptors=true
+	// +operator-sdk:gen-csv:customresourcedefinitions.specDescriptors.displayName="Pod Resources Socket Path"
+	// +operator-sdk:gen-csv:customresourcedefinitions.specDescriptors.x-descriptors="urn:alm:descriptor:com.tectonic.ui:text"
+	PodResourcesSocketPath string `json:"podResourcesSocketPath,omitempty"`
 }

 // DCGMExporterHPCJobMappingConfig defines HPC job mapping configuration for NVIDIA DCGM Exporter
@@ -2101,6 +2135,24 @@ func (e *DCGMExporterSpec) GetHPCJobMappingDirectory() string {
 	return e.HPCJobMapping.Directory
 }

+// IsPerPodGPUUtilEnabled returns true if per-pod GPU utilization metrics are enabled.
+// This feature attributes SM utilization to individual pods when CUDA time-slicing is active.
+func (e *DCGMExporterSpec) IsPerPodGPUUtilEnabled() bool {
+	if e.PerPodGPUUtil == nil || e.PerPodGPUUtil.Enabled == nil {
+		return false
+	}
+	return *e.PerPodGPUUtil.Enabled
+}
+
+// GetPerPodGPUUtilSocketPath returns the kubelet pod-resources socket path for per-pod GPU util.
+// Falls back to DefaultDCGMPodResourcesSocket if not explicitly configured.
+func (e *DCGMExporterSpec) GetPerPodGPUUtilSocketPath() string {
+	if e.PerPodGPUUtil == nil || e.PerPodGPUUtil.PodResourcesSocketPath == "" {
+		return DefaultDCGMPodResourcesSocket
+	}
+	return e.PerPodGPUUtil.PodResourcesSocketPath
+}
+
 // IsEnabled returns true if gpu-feature-discovery is enabled(default) through gpu-operator
 func (g *GPUFeatureDiscoverySpec) IsEnabled() bool {
 	if g.Enabled == nil {
```
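The nil-safe defaulting of the two accessors can be exercised standalone. The sketch below copies their bodies into a minimal program; the struct definitions are trimmed to just the fields these helpers touch.

```go
package main

import "fmt"

// DefaultDCGMPodResourcesSocket mirrors the constant added in clusterpolicy_types.go.
const DefaultDCGMPodResourcesSocket = "/var/lib/kubelet/pod-resources/kubelet.sock"

// Trimmed copies of the CRD types, keeping only the fields the helpers use.
type DCGMExporterPerPodGPUUtilConfig struct {
	Enabled                *bool
	PodResourcesSocketPath string
}

type DCGMExporterSpec struct {
	PerPodGPUUtil *DCGMExporterPerPodGPUUtilConfig
}

// IsPerPodGPUUtilEnabled is nil-safe: an absent stanza or absent Enabled means off.
func (e *DCGMExporterSpec) IsPerPodGPUUtilEnabled() bool {
	if e.PerPodGPUUtil == nil || e.PerPodGPUUtil.Enabled == nil {
		return false
	}
	return *e.PerPodGPUUtil.Enabled
}

// GetPerPodGPUUtilSocketPath falls back to the default when unset.
func (e *DCGMExporterSpec) GetPerPodGPUUtilSocketPath() string {
	if e.PerPodGPUUtil == nil || e.PerPodGPUUtil.PodResourcesSocketPath == "" {
		return DefaultDCGMPodResourcesSocket
	}
	return e.PerPodGPUUtil.PodResourcesSocketPath
}

func main() {
	on := true
	empty := &DCGMExporterSpec{}
	custom := &DCGMExporterSpec{PerPodGPUUtil: &DCGMExporterPerPodGPUUtilConfig{
		Enabled:                &on,
		PodResourcesSocketPath: "/custom/pod-resources/kubelet.sock",
	}}

	fmt.Println(empty.IsPerPodGPUUtilEnabled())  // false: stanza absent
	fmt.Println(empty.GetPerPodGPUUtilSocketPath())
	fmt.Println(custom.IsPerPodGPUUtilEnabled()) // true
	fmt.Println(custom.GetPerPodGPUUtilSocketPath())
}
```

An empty spec yields `false` and the default socket path, so omitting the stanza keeps today's DaemonSet unchanged.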

api/nvidia/v1/zz_generated.deepcopy.go (+25 −0)

Generated deep-copy code for the new struct; diff not rendered.

controllers/object_controls.go (+33 −0)
```diff
@@ -1785,6 +1785,39 @@ func TransformDCGMExporter(obj *appsv1.DaemonSet, config *gpuv1.ClusterPolicySpe
 		obj.Spec.Template.Spec.Volumes = append(obj.Spec.Template.Spec.Volumes, jobMappingVol)
 	}

+	// configure per-pod GPU utilization metrics when enabled (for CUDA time-slicing workloads)
+	// See: https://github.com/NVIDIA/dcgm-exporter/issues/587
+	if config.DCGMExporter.IsPerPodGPUUtilEnabled() {
+		// enable the feature flag in dcgm-exporter
+		setContainerEnv(&(obj.Spec.Template.Spec.Containers[0]), "DCGM_EXPORTER_ENABLE_PER_POD_GPU_UTIL", "true")
+
+		// resolve pod→GPU mapping via kubelet pod-resources gRPC API
+		socketPath := config.DCGMExporter.GetPerPodGPUUtilSocketPath()
+		socketDir := socketPath[:strings.LastIndex(socketPath, "/")]
+
+		podResourcesVolMount := corev1.VolumeMount{
+			Name:      "pod-resources",
+			ReadOnly:  true,
+			MountPath: socketDir,
+		}
+		obj.Spec.Template.Spec.Containers[0].VolumeMounts = append(
+			obj.Spec.Template.Spec.Containers[0].VolumeMounts, podResourcesVolMount)
+
+		podResourcesVol := corev1.Volume{
+			Name: "pod-resources",
+			VolumeSource: corev1.VolumeSource{
+				HostPath: &corev1.HostPathVolumeSource{
+					Path: socketDir,
+					Type: ptr.To(corev1.HostPathDirectory),
+				},
+			},
+		}
+		obj.Spec.Template.Spec.Volumes = append(obj.Spec.Template.Spec.Volumes, podResourcesVol)
+
+		// per-pod attribution requires resolving PIDs via /proc/<pid>/cgroup
+		obj.Spec.Template.Spec.HostPID = true
+	}
+
 	// mount configmap for custom metrics if provided by user
 	if config.DCGMExporter.MetricsConfig != nil && config.DCGMExporter.MetricsConfig.Name != "" {
 		metricsConfigVolMount := corev1.VolumeMount{Name: "metrics-config", ReadOnly: true, MountPath: MetricsConfigMountPath, SubPath: MetricsConfigFileName}
```
docs/dcgm-exporter-per-pod-gpu-metrics.md (+142 −0)
# Per-Pod GPU Utilization with DCGM Exporter (Time-Slicing)

## Overview

When GPU time-slicing is enabled via `ClusterPolicy`, multiple pods share a
single physical GPU device. Standard DCGM metrics report aggregate utilization
for the whole device: `dcgm_fi_dev_gpu_util` cannot tell you how much of the
GPU the proxy, embeddings, and inference pods are each using.

GPU Operator v24.x+ integrates with dcgm-exporter's per-pod GPU utilization
feature to restore workload-level attribution without requiring MIG.

## Prerequisite: dcgm-exporter v3.4.0+

This feature requires dcgm-exporter v3.4.0 or later, which adds the
`--enable-per-pod-gpu-util` flag and the `dcgm_fi_dev_sm_util_per_pod` metric.

See: [NVIDIA/dcgm-exporter#587](https://github.com/NVIDIA/dcgm-exporter/issues/587)
## Enabling Time-Slicing + Per-Pod Metrics

A complete `ClusterPolicy` for a T4 cluster running three shared workloads:

```yaml
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: gpu-cluster-policy
spec:
  # 1. Configure time-slicing: 3 virtual slices per physical GPU
  devicePlugin:
    config:
      name: time-slicing-config
      default: any

  # 2. Enable per-pod GPU utilization metrics in dcgm-exporter
  dcgmExporter:
    perPodGPUUtil:
      enabled: true
      # Optional: custom path (default: /var/lib/kubelet/pod-resources/kubelet.sock)
      # podResourcesSocketPath: /var/lib/kubelet/pod-resources/kubelet.sock
```
The time-slicing ConfigMap referenced above must be deployed separately:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    flags:
      migStrategy: none
    sharing:
      timeSlicing:
        renameByDefault: false
        resources:
          - name: nvidia.com/gpu
            replicas: 3
```
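With `replicas: 3`, the device plugin advertises each physical GPU as three schedulable `nvidia.com/gpu` resources. A quick sketch of the resulting node capacity (the function and node counts are illustrative, not operator code):

```go
package main

import "fmt"

// advertisedGPUs returns how many nvidia.com/gpu resources a node exposes
// under time-slicing: each physical GPU is advertised `replicas` times.
func advertisedGPUs(physicalGPUs, replicas int) int {
	return physicalGPUs * replicas
}

func main() {
	// A g4dn.xlarge has one physical T4; with replicas: 3 the node can
	// schedule three pods that each request one nvidia.com/gpu.
	fmt.Println(advertisedGPUs(1, 3)) // 3
}
```

The slices are time-multiplexed, not partitioned: each pod still sees the whole GPU, which is exactly why per-pod attribution needs the metric below.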
## What GPU Operator does automatically

When `dcgmExporter.perPodGPUUtil.enabled: true` is set, GPU Operator:

1. Sets `DCGM_EXPORTER_ENABLE_PER_POD_GPU_UTIL=true` in the dcgm-exporter
   DaemonSet environment.
2. Mounts `/var/lib/kubelet/pod-resources/` as a read-only `hostPath` volume
   so dcgm-exporter can reach the kubelet pod-resources gRPC socket.
3. Sets `hostPID: true` on the DaemonSet so dcgm-exporter can read
   `/proc/<pid>/cgroup` to resolve NVML PIDs to containers.
## Emitted metric

```
# HELP dcgm_fi_dev_sm_util_per_pod SM utilization attributed to a pod (time-slicing)
# TYPE dcgm_fi_dev_sm_util_per_pod gauge
dcgm_fi_dev_sm_util_per_pod{
  gpu="0",
  uuid="GPU-abc123",
  pod="synapse-proxy-7f9d4b-xkz2p",
  namespace="synapse-staging",
  container="proxy"
} 42
dcgm_fi_dev_sm_util_per_pod{...,pod="synapse-jina-...",container="jina"} 18
dcgm_fi_dev_sm_util_per_pod{...,pod="synapse-vllm-...",container="vllm"} 35
```

(The first sample's labels are wrapped for readability; the exposition format
emits each series on a single line.)
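To illustrate how these series carry attribution, here is a small standalone parser that sums per-pod SM utilization per GPU from sample exposition lines. The regex and sample values are illustrative, not part of dcgm-exporter; a real consumer would use a Prometheus client library.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// seriesRe extracts the gpu label, pod label, and value from a single-line
// per-pod utilization sample.
var seriesRe = regexp.MustCompile(`gpu="([^"]+)".*pod="([^"]+)"[^}]*\} (\d+)`)

// sumPerGPU totals the per-pod gauges for each GPU; under time-slicing the
// per-pod values should roughly add up to the device-level utilization.
func sumPerGPU(lines []string) map[string]int {
	totals := map[string]int{}
	for _, l := range lines {
		m := seriesRe.FindStringSubmatch(l)
		if m == nil {
			continue
		}
		v, _ := strconv.Atoi(m[3])
		totals[m[1]] += v
	}
	return totals
}

func main() {
	lines := []string{
		`dcgm_fi_dev_sm_util_per_pod{gpu="0",pod="proxy-xkz2p"} 42`,
		`dcgm_fi_dev_sm_util_per_pod{gpu="0",pod="jina-abc12"} 18`,
		`dcgm_fi_dev_sm_util_per_pod{gpu="0",pod="vllm-def34"} 35`,
	}
	fmt.Println(sumPerGPU(lines)["0"]) // 95
}
```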
## Example Prometheus alert

```yaml
groups:
- name: per-pod-gpu
  rules:
  - alert: PodGPUHighUtilization
    expr: dcgm_fi_dev_sm_util_per_pod > 80
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "{{ $labels.namespace }}/{{ $labels.pod }} using >80% GPU SM"
```
## Cost model (example: g4dn.xlarge T4)

| Setup | Nodes | Cost/day |
|-------|-------|----------|
| 3 workloads, no time-slicing | 3 × g4dn.xlarge | ~$38/day |
| 3 workloads, time-slicing (3 replicas) | 1 × g4dn.xlarge | ~$13/day |
| **Savings** | | **~$25/day (~$9,000/year)** |

Time-slicing is appropriate for inference and embedding workloads that do not
fully saturate the GPU. For compute-bound training workloads, MIG or dedicated
GPUs remain the right choice.
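The table's figures are consistent with on-demand pricing of roughly $0.526/hour for g4dn.xlarge. That rate is an assumption for illustration; actual pricing varies by region and purchase option.

```go
package main

import "fmt"

func main() {
	// Assumed g4dn.xlarge on-demand rate in $/hour (illustrative; varies by
	// region and purchase option).
	const hourly = 0.526
	perNodeDay := hourly * 24 // ~$12.6/node/day

	before := 3 * perNodeDay // one node per workload
	after := 1 * perNodeDay  // three workloads time-sliced onto one node
	savingsDay := before - after

	fmt.Printf("before: ~$%.0f/day, after: ~$%.0f/day\n", before, after)
	fmt.Printf("savings: ~$%.0f/day (~$%.0f/year)\n", savingsDay, savingsDay*365)
}
```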
## Security considerations

Enabling `perPodGPUUtil` grants dcgm-exporter:

- Read access to `/var/lib/kubelet/pod-resources/` (lists all GPU-using pods)
- Host PID namespace access (to read `/proc/<pid>/cgroup`)

These are the same permissions used by other node-level monitoring agents
(e.g., node-exporter, cAdvisor). Review your security policy before enabling
this feature in sensitive environments.
## Compatibility

| GPU Operator | dcgm-exporter | Feature available |
|--------------|---------------|-------------------|
| < v24.x | any | No |
| ≥ v24.x | < v3.4.0 | Field accepted but no-op |
| ≥ v24.x | ≥ v3.4.0 | Yes |
## Related

- dcgm-exporter feature: [docs/per-pod-gpu-metrics.md](https://github.com/NVIDIA/dcgm-exporter/blob/main/docs/per-pod-gpu-metrics.md)
- Time-slicing setup: [GPU Sharing with Time-Slicing](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-sharing.html)
- Issue: [NVIDIA/dcgm-exporter#587](https://github.com/NVIDIA/dcgm-exporter/issues/587)
