Conversation
@mu8086 commented Jul 23, 2025

- Allow configuring a custom `runtimeClass.name` to avoid conflict with NVIDIA's default runtime
- Topology server namespace in `nvidia-smi` is now configurable via the `TOPOLOGY_CM_NAMESPACE` environment variable instead of being hardcoded to `gpu-operator`

### Example Helm upgrade command

```bash
helm upgrade --install fake-gpu-operator ~/git/fake-gpu-operator/deploy/fake-gpu-operator \
  --namespace runai --create-namespace \
  --set runtimeClass.name=fake-nvidia
```
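
The second change means the chart has to tell `nvidia-smi` which namespace the topology server lives in. A minimal sketch of how the variable might be injected, assuming the exporter template uses the Helm release namespace (field placement here is illustrative, not the merged template):

```yaml
# Hypothetical container env entry in the chart: point nvidia-smi at the
# topology server in the release namespace instead of a hardcoded one.
env:
  - name: TOPOLOGY_CM_NAMESPACE
    value: {{ .Release.Namespace | quote }}
```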

### Example: Verified Pod Spec

This Pod verifies that the custom runtimeClass and the dynamic topology namespace injection work correctly.

```yaml
# pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: "1"
spec:
  runtimeClassName: fake-nvidia
  containers:
  - name: ubuntu
    image: ubuntu:22.04
    command: ["/bin/bash", "-c"]
    args:
      - |
        sleep infinity;
    resources:
      limits:
        nvidia.com/gpu: 1
    env:
      - name: NODE_NAME
        valueFrom:
          fieldRef:
            fieldPath: spec.nodeName
```
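
One way to exercise this spec (the commands are a sketch; the pod keeps the name `1` from the metadata above):

```bash
# Apply the pod and confirm it was admitted under the custom RuntimeClass.
kubectl apply -f pod.yaml
kubectl get pod 1 -o jsonpath='{.spec.runtimeClassName}'

# If the operator injects its fake nvidia-smi into GPU pods, the topology
# lookup should now resolve against the configured namespace.
kubectl exec 1 -- nvidia-smi
```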

@mu8086 changed the title from feat(fake-gpu-operator): support custom runtimeClass and topology nam… to feat: support custom runtimeClass and topology nam… on Jul 23, 2025
@mu8086 changed the title from feat: support custom runtimeClass and topology nam… to feat: support custom runtimeClass and topology namespace on Jul 23, 2025
@gshaibi (Contributor) left a comment

Thank you very much @mu8086 for your contribution!

From what I understand from your comment, you wish to run both the Fake GPU Operator and the original one together on the same cluster.
Unfortunately this is not supported yet.
I'd love to hear more about this use case.

Regardless, configuring the RuntimeClass name and respecting the release namespace in `nvidia-smi` both seem reasonable - I left a couple of comments.


```diff
 // Send http request to topology-server to get the topology
-topologyUrl := "http://topology-server.gpu-operator/topology/nodes/" + nodeName
+topologyUrl := fmt.Sprintf("http://topology-server.%s/topology/nodes/%s",
```
@gshaibi (Contributor): Please inject and use a `FAKE_GPU_OPERATOR_NAMESPACE` environment variable instead.
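
A minimal sketch of that suggestion, assuming a small helper and a fallback to the previous hardcoded default (the function shape is illustrative, not the merged code):

```go
package main

import (
	"fmt"
	"os"
)

// topologyURL builds the topology-server endpoint for a node. The namespace
// comes from the FAKE_GPU_OPERATOR_NAMESPACE variable the reviewer suggests
// injecting; the fallback preserves the previously hardcoded "gpu-operator".
func topologyURL(nodeName string) string {
	namespace := os.Getenv("FAKE_GPU_OPERATOR_NAMESPACE")
	if namespace == "" {
		namespace = "gpu-operator"
	}
	return fmt.Sprintf("http://topology-server.%s/topology/nodes/%s", namespace, nodeName)
}

func main() {
	fmt.Println(topologyURL("node-1"))
}
```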

```diff
 kind: RuntimeClass
 metadata:
-  name: nvidia
+  name: {{ .Values.runtimeClass.name | default "fake-nvidia" }}
```
@gshaibi (Contributor): I suggest we keep the default `nvidia` to better fake the NVIDIA GPU Operator behavior.
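
Applied to the template above, that suggestion would look something like this (a sketch, not the merged change):

```yaml
name: {{ .Values.runtimeClass.name | default "nvidia" }}
```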

```diff
 COMPONENTS?=device-plugin status-updater kwok-gpu-device-plugin status-exporter topology-server mig-faker jupyter-notebook

-DOCKER_REPO_BASE=gcr.io/run-ai-lab/fake-gpu-operator
+DOCKER_REPO_BASE?=gcr.io/run-ai-lab/fake-gpu-operator
```
@gshaibi (Contributor): 💯
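
For context on the `?=` change: GNU Make assigns with `?=` only when the variable is not already set, so the image repo can now be overridden from the environment without editing the Makefile (the target name below is illustrative):

```bash
# The environment value wins over the ?= default in the Makefile.
DOCKER_REPO_BASE=ghcr.io/example/fake-gpu-operator make build
```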
