Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fetch current container runtime config #686

Merged
merged 1 commit into from
Oct 10, 2024
Merged

Conversation

tariq1890
Copy link
Contributor

@tariq1890 tariq1890 commented Sep 9, 2024

Summary of changes made in this PR:

  1. In CRI-O and Containerd config engines, retrieve the container runtime config commands via the following commands:
    i. crio status config
    ii. containerd config dump

  2. When configuring CRI-O runtime, we prioritise the runtime designated as the default_runtime in the config when setting up the nvidia runtime handler as opposed to favouring the runc runtime handler at all times. This is needed as vanilla cri-o packages have crun as the default low-level-runtime. Live-swapping the low-level runtime causes the running containers in a cluster to break

  3. Add the full path of the low-level runtime to the nvidia-container-runtime.runtimes config value, so that nvidia-container-runtime binds to a low-level runtime binary that cannot be resolved in the PATH.

@tariq1890 tariq1890 force-pushed the get-config-from-cmdline branch 3 times, most recently from 1a5292f to 56b7927 Compare September 10, 2024 17:41
@tariq1890 tariq1890 force-pushed the get-config-from-cmdline branch 8 times, most recently from 310ee8a to adf9026 Compare September 19, 2024 03:16
pkg/config/engine/api.go Outdated Show resolved Hide resolved
pkg/config/source.go Outdated Show resolved Hide resolved
pkg/config/source.go Outdated Show resolved Hide resolved
pkg/config/toml.go Outdated Show resolved Hide resolved
pkg/config/engine/api.go Outdated Show resolved Hide resolved
Comment on lines 456 to 458
toolkitRuntimeList := getNvidiaContainerRuntimeList(cfg, opts.ContainerRuntimeRuntimes.Value())
if len(toolkitRuntimeList) > 0 {
configValues["nvidia-container-runtime.runtimes"] = toolkitRuntimeList
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should no longer be required.

We will ONLY set the binaries from the runtimes. Note that if opts.ContainerRuntimeRuntimes is set we will overwrite the value we set here as part of processing the "options" config options.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We will ONLY set the binaries from the runtimes.

I don't think we can do that. If you look at the default containerd config toml, the BinaryName fields are empty, meaning that the PATH-resolvable low-level-runtime binary paths will be used. So we still need the default list ["docker-runc", "runc", "crun"]

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we implement it so that if the slice is empty we don't update the config. Alternatively see the comment below.

Copy link
Member

@elezar elezar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @tariq1890.

I think this has come together very nicely. I have some additional comments.

@elezar
Copy link
Member

elezar commented Oct 9, 2024

Thanks for the work on this @tariq1890.

One question that I had was whether we should update how we apply the modified config. At present we're going to generate a file that includes the full config. This may not be desirable when we are using drop-in files in crio, for example. I don't think it's a blocker for this iteration, but it would be good to put some thought into how the system will behave under these conditions.

@tariq1890 tariq1890 force-pushed the get-config-from-cmdline branch 3 times, most recently from a0ddda5 to 6266ef3 Compare October 9, 2024 19:51
@tariq1890
Copy link
Contributor Author

At present we're going to generate a file that includes the full config. This may not be desirable when we are using drop-in files in crio, for example. I don't think it's a blocker for this iteration, but it would be good to put some thought into how the system will behave under these conditions.

Agreed, one way is to modify the command line config source to only return the runtimes config block and everything under it. Although, this could easily apply to every other config source

Copy link
Member

@elezar elezar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the patience @tariq1890.

Let's get this in!

Signed-off-by: Tariq Ibrahim <[email protected]>

add default runtime binary path to runtimes field of toolkit config toml

Signed-off-by: Tariq Ibrahim <[email protected]>

[no-relnote] Get low-level runtimes consistently

We ensure that we use the same low-level runtimes regardless
of the runtime engine being configured. This ensures consistent
behaviour.

Signed-off-by: Evan Lezar <[email protected]>

Co-authored-by: Evan Lezar <[email protected]>

address review comment

Signed-off-by: Tariq Ibrahim <[email protected]>
@tariq1890 tariq1890 force-pushed the get-config-from-cmdline branch from 3e6c2b7 to f477dc0 Compare October 10, 2024 08:13
Copy link
Member

@elezar elezar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚢

@elezar elezar merged commit a06d838 into main Oct 10, 2024
10 checks passed
@elezar elezar deleted the get-config-from-cmdline branch October 10, 2024 09:58
mikemckiernan added a commit to mikemckiernan/cloud-native-docs that referenced this pull request Oct 31, 2024
CNT-5327

Documents NVIDIA/nvidia-container-toolkit#686

Signed-off-by: Mike McKiernan <[email protected]>
mikemckiernan added a commit to NVIDIA/cloud-native-docs that referenced this pull request Oct 31, 2024
* Update Kubernetes supported versions
Documents CNT-5342

* Support for Net Op 24.7.0
Documents CNT-5346

* MIG profiles for GH200 144GB HBM3e
Documents NVIDIA/gpu-operator#1057.

* Troubleshooting taints and tolerations

* Tolerations for clean up and upgrade CRD jobs
Documents these PRs:
- NVIDIA/gpu-operator#967
- NVIDIA/gpu-operator#960

* RBAC improvements
CNT-5289
Documents:
- NVIDIA/gpu-operator#890
- NVIDIA/gpu-operator#986

* Fixed issue with CRI-O
CNT-5327
Documents NVIDIA/nvidia-container-toolkit#686

* Automatic upgrade of CRDs
CNT-5328
Documents NVIDIA/gpu-operator#1024

* Driver pod name and IP env vars
CNT-5329
4567038
Documents NVIDIA/gpu-operator#1026

* Precompiled driver containers are GA
Signed-off-by: Mike McKiernan <[email protected]>

* KubeVirt and OSV vGPU v17.4

* GH200 NVL2 144GB HBM3e name

* Add 565 driver

* NFD version and driver container envvars

Signed-off-by: Mike McKiernan <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants