feat: support custom runtimeClass and topology namespace #127
base: main
Conversation
- Allow configuring a custom `runtimeClass.name` to avoid conflicts with NVIDIA's default runtime.
- The topology server namespace used by `nvidia-smi` is now configurable via the `TOPOLOGY_CM_NAMESPACE` environment variable instead of being hardcoded to `gpu-operator`.

### Example Helm upgrade command

```bash
helm upgrade --install fake-gpu-operator ~/git/fake-gpu-operator/deploy/fake-gpu-operator \
  --namespace runai --create-namespace \
  --set runtimeClass.name=fake-nvidia
```
Thank you very much @mu8086 for your contribution!
From what I understand from your comment, you wish to run both the Fake GPU Operator and the original one together on the same cluster.
Unfortunately this is not supported yet.
I'd love to hear more about this use case.
Regardless, configuring the RuntimeClass name and respecting the release namespace in `nvidia-smi` both seem reasonable; I left a couple of comments.
```diff
 // Send http request to topology-server to get the topology
-topologyUrl := "http://topology-server.gpu-operator/topology/nodes/" + nodeName
+topologyUrl := fmt.Sprintf("http://topology-server.%s/topology/nodes/%s",
```
Please inject and use a `FAKE_GPU_OPERATOR_NAMESPACE` environment variable instead.
```diff
 kind: RuntimeClass
 metadata:
-  name: nvidia
+  name: {{ .Values.runtimeClass.name | default "fake-nvidia" }}
```
I suggest we keep the default `nvidia` to better fake the NVIDIA GPU Operator's behavior.
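Even with `nvidia` kept as the chart default, users who need to coexist with the real NVIDIA GPU Operator could still override it at install time. A minimal values sketch, assuming the key name from the template diff above (the override value is illustrative):

```yaml
# values.yaml override -- only needed when the real NVIDIA GPU Operator
# already owns the "nvidia" RuntimeClass name on the cluster
runtimeClass:
  name: fake-nvidia
```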
```diff
 COMPONENTS?=device-plugin status-updater kwok-gpu-device-plugin status-exporter topology-server mig-faker jupyter-notebook
-DOCKER_REPO_BASE=gcr.io/run-ai-lab/fake-gpu-operator
+DOCKER_REPO_BASE?=gcr.io/run-ai-lab/fake-gpu-operator
```
💯
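The change from `=` to `?=` matters because GNU make's `?=` only assigns when the variable is not already set, so the image repo becomes overridable from the environment or the make command line. A quick standalone demonstration (the throwaway `demo.mk` file and `REPO` variable are illustrative, not part of the repo):

```shell
# '?=' assigns only if the variable is not already defined
printf 'REPO?=gcr.io/run-ai-lab/fake-gpu-operator\nprint:\n\t@echo $(REPO)\n' > demo.mk
make -f demo.mk print                            # falls back to the default repo
make -f demo.mk print REPO=ghcr.io/me/fake-gpu   # command-line value wins
```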
### Example: Verified Pod Spec
This Pod verifies that the custom `runtimeClass` and dynamic topology namespace injection work correctly.
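The original `pod.yaml` was behind a collapsed section and is not reproduced here. A minimal hypothetical Pod exercising the custom RuntimeClass might look like the following; every name, image, and namespace below is illustrative, not taken from the PR:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: fake-gpu-test          # illustrative name
  namespace: runai
spec:
  runtimeClassName: fake-nvidia     # the custom RuntimeClass from this PR
  containers:
    - name: gpu-check
      image: ubuntu:22.04           # illustrative image
      command: ["nvidia-smi"]       # served by the fake operator's stub binary
      resources:
        limits:
          nvidia.com/gpu: 1         # fake GPU resource exposed by the device plugin
```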