
Conversation

Contributor

@Pacho20 Pacho20 commented Nov 14, 2025

- Description of the problem which is fixed/What is the use case
The DaemonSet installation mode requires manual node reboots, which complicates both installation and uninstallation. This confuses users and results in a poor user experience. The changes introduced here aim to eliminate the need for reboots and make the process more seamless, although the current solution is not yet complete.

- What I did
Removed the configuration script and DaemonSet, as they made the installation process unnecessarily complex. Added rpm-ostree apply-live to enable Kata on worker nodes without rebooting. Extended the controller to schedule the installation process so that nodes are updated one at a time.
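
At a high level, the per-node sequence looks roughly like the sketch below; the package name and exact command set are illustrative assumptions, not necessarily what the operator runs verbatim:

  # Hedged sketch of the per-node install flow (package name illustrative)
  rpm-ostree install kata-containers   # stage the RPM in a new deployment
  rpm-ostree apply-live                # merge the new /usr content into the running system
  systemctl daemon-reload
  systemctl restart crio               # pick up the new runtime handler (see below)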

Initially, I tried to use systemctl reload for CRI-O instead of a full restart. This would have been a better solution because it avoids interrupting both CRI-O and kubelet and does not rely on their state restoration (which can fail in some cases). While reload works and CRI-O reloads its configuration, it fails to locate the Kata runtime executable, returning the error:
stat: no such file or directory for /usr/bin/containerd-shim-kata.
The binary exists and works, but CRI-O cannot find it. I investigated multiple possibilities: checking the file in CRI-O's namespace using nsenter, verifying permissions, SELinux flags, mount options, and kernel parameters. Everything suggests CRI-O should be able to invoke the binary. I still have a few ideas for checking the interaction between CRI-O and the kernel during this lookup. If you have any insight into why this happens or how to fix it, that would greatly simplify the installation process.
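
For reference, checks along these lines can reproduce what is described above; the shim path is the one that appears in the logs later in this thread, and the PID lookup is illustrative:

  CRIO_PID=$(pgrep -x crio | head -n1)
  # Is the shim visible from CRI-O's mount namespace?
  nsenter -t "$CRIO_PID" -m -- stat /usr/bin/containerd-shim-kata-v2
  # Is CRI-O still holding descriptors on the old /usr content?
  lsof -p "$CRIO_PID" | grep /usr
  # Does CRI-O's view of the mounts differ from a fresh shell's?
  diff /proc/"$CRIO_PID"/mountinfo /proc/self/mountinfo | grep ' /usr '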

As a fallback, I verified that CRI-O can invoke the Kata runtime after a full restart, so that is the current approach. However, this is not ideal because restarting CRI-O also triggers a kubelet restart (at least on ROKS). Installation works reliably because rpm-ostree install takes time, giving kubelet a chance to recover. Uninstallation, however, fails: although the script waits for the node to be in a "Ready" state, the node never becomes "NotReady" during kubelet restart. This means uninstall runs on other nodes while kubelet is still recovering, and triggers kubelet restarts on those nodes too, leaving the cluster in a broken state where most pods enter ImagePullBackOff or CrashLoopBackOff. Recovery is possible by restarting pods in the right order, but this should not happen. I could not find a reliable way to detect when kubelet and CRI-O are fully restored, so for now I reintroduced manual reboot for uninstallation.
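
For context, these are the kinds of local checks one might reach for when deciding whether kubelet and CRI-O have recovered; the ports and tools below are defaults, and, as noted above, nothing here has been validated as a sufficient signal:

  # Illustrative local health checks (default ports/tools; not verified to be sufficient)
  systemctl is-active --quiet crio
  systemctl is-active --quiet kubelet
  curl -sf http://127.0.0.1:10248/healthz   # kubelet's local healthz endpoint (default port)
  crictl info >/dev/null                    # CRI runtime answers status queries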

One reason I pursued this approach is that the community operator uses a similar method successfully. They use the kata-deploy script - see:
https://github.com/kata-containers/kata-containers/blob/main/tools/packaging/kata-deploy/scripts/kata-deploy.sh#L764C10-L778.

Currently, I see three possible paths forward:

  • Figure out how to make systemctl reload work for CRI-O without breaking the Kata runtime.
  • Find a reliable way to ensure kubelet and CRI-O are fully recovered before proceeding.
  • Drain nodes before installation and uninstallation (a rough sketch follows this list). This approach is already implemented in other operators and would ensure stability, but I wanted to avoid it because it adds significant operational complexity and increases the overall duration (draining and uncordoning nodes takes considerable time).
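
The drain-based sequence would look roughly like this per node (standard oc commands; the node name is illustrative):

  NODE=worker-0   # illustrative
  oc adm cordon "$NODE"
  oc adm drain "$NODE" --ignore-daemonsets --delete-emptydir-data --timeout=10m
  # ...install or uninstall the Kata artifacts, restart CRI-O...
  oc adm uncordon "$NODE"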

- How to verify it
Build the operator using the updated scripts/kata-install/Dockerfile. Apply the KataConfig CR and wait until all nodes reach the "installed" status.
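
For example (the sample path and CR name are illustrative and may differ in the repo):

  oc apply -f config/samples/example-kataconfig.yaml
  oc get kataconfig example-kataconfig -o yaml -w   # watch the status until all nodes report installed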

- Description for the changelog
Change DaemonSet mode to eliminate node reboots (installation only; uninstallation still requires reboot for now).

EDIT: this is expected to fix https://issues.redhat.com/browse/KATA-4233

Added a check to the osc-kata-install script to ensure the installation only
proceeds when the NODE_LABEL environment variable is set.
This prevents unintended behavior during daemonset deployment.

Signed-off-by: Patrik Fodor <[email protected]>
…emonSet

- Migrated peer-pods configuration handling into the osc-rpm DaemonSet.
- Prepares for the transition where config files are bundled with the rpm package.
- Simplifies the overall installation process and operator logic.
- Lays groundwork for installing Kata Containers without requiring node reboot.

Signed-off-by: Patrik Fodor <[email protected]>
@openshift-ci openshift-ci bot requested review from pmores and vvoronko November 14, 2025 14:12
@openshift-ci openshift-ci bot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Nov 14, 2025

openshift-ci bot commented Nov 14, 2025

Hi @Pacho20. Thanks for your PR.

I'm waiting for a github.com member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Contributor

c3d commented Nov 17, 2025

Hey @Pacho20, I may be mistaken, but I think that live-apply works by remounting /usr with a new overlayfs. So I see it as a possibility that an existing process might still see the old mount and old directory if it kept an old file descriptor for the old directory, a bit in the same way that a process can see a deleted file if it still has a file descriptor to it. Just a hypothesis, but you could patch crio to show the content of the filesystem to validate it.

That would explain why you need to kill and restart the process.

Contributor

c3d commented Nov 17, 2025

The title of the first commit is a bit long. And GitHub still does not have a way to comment directly on commit messages.

exit 1
}

[[ -z "${NODE_LABEL:-}" ]] && {
Contributor

What is the point of :- here?

Contributor Author

The -u option is set, so referencing an unset variable would abort with an error. We therefore don't strictly need :-, but with it the expansion falls back to an empty string, so the script can print its own error message instead of bash's unbound-variable error.
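
For illustration, the difference under set -u (a minimal sketch, not the script itself):

  set -u
  echo "${NODE_LABEL}"     # aborts: NODE_LABEL: unbound variable
  echo "${NODE_LABEL:-}"   # expands to an empty string, execution continues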

Contributor

I meant: you are adding a test to check whether NODE_LABEL is set, with an error message that is almost exactly what bash itself would give you.

I see no functional difference between

ERROR: NODE_LABEL env var must be set

and

NODE_LABEL: unbound variable

So that code segment seems pointless to me.

Alternative: make the error message better, and send it to stderr instead of stdout.
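
Something along these lines would cover that (a sketch, not the exact script):

  # Fail early with a descriptive message on stderr if NODE_LABEL is unset or empty
  if [[ -z "${NODE_LABEL:-}" ]]; then
      echo "ERROR: NODE_LABEL must be set to the label selecting the target nodes" >&2
      exit 1
  fi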

RUN mkdir -p /scripts

ADD osc-kata-install.sh osc-configs-script.sh osc-log-level.sh lib.sh /scripts/
ADD osc-kata-install.sh osc-log-level.sh lib.sh /scripts/
Contributor

I don't really understand why merging two scripts into one makes things simpler. Isn't it clearer what each step does when they are separated? Aren't there cases where you would want to reconfigure without reinstalling?

Contributor Author

I removed most of the content from that script, so I thought moving it into the other one wouldn’t be a problem. I don’t think there are any cases that require reconfiguration. The two functions I moved to the other script will be removed anyway, since these config files will be part of the RPM package. Nevertheless, I think you’re right - it makes more sense to keep them in separate scripts.

rm -rf /host/tmp/extensions/

# Copy configs
copy_kata_remote_config_files
Contributor

If the goal is to simplify the workflow, the same overall simplification could be achieved by simply invoking the config script, no?

Contributor Author

Yup, you're right about that.

for _, node := range nodes {
	r.Log.Info("node must be rebooted", "node", node.Name)
}
//r.scheduleInstallation(UninstallKata)
Contributor

@c3d c3d Nov 17, 2025

Is that a TODO or a leftover? AFAICT Uninstall is implemented, so looks to me like that code should be uncommented and the code above it removed?

Contributor

Ah, I think the problem is that uninstall without reboot won't work. Add a comment here, then.

Correct. Uninstall ran into issues, so it's just disabled. Since the primary use case I want fixed with this is worker updates, I'm fine leaving uninstall as it is: the customer must manually reboot the worker to finish the uninstall. It was left as something we may implement later.

exec_on_host "systemctl daemon-reload"
exec_on_host "systemctl restart crio"

wait_till_node_is_ready
Contributor

Can we add a timeout on the various wait and error out if we exceed it?

If you want to use the timeout command, you will need to either export the wait functions or put them in some separate script. Or you can add your own custom timeout. But having no node reach the ready state seems like a condition we should be ready to deal with gracefully.
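
A custom timeout could look roughly like this; node_is_ready is a hypothetical single-shot check (for example, factored out of wait_till_node_is_ready), and the default deadline is arbitrary:

  # Sketch of a bounded wait; node_is_ready is a hypothetical single-shot check
  wait_till_node_is_ready_or_fail() {
      local timeout=${1:-600} interval=10 elapsed=0
      until node_is_ready; do
          if (( elapsed >= timeout )); then
              echo "ERROR: node did not become Ready within ${timeout}s" >&2
              return 1
          fi
          sleep "$interval"
          (( elapsed += interval ))
      done
  }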

@gcoon151

Hey @Pacho20, I may be mistaken, but I think that live-apply works by remounting /usr with a new overlayfs. So I see it as a possibility that an existing process might still see the old mount and old directory if it kept an old file descriptor for the old directory, a bit in the same way that a process can see a deleted file if it still has a file descriptor to it. Just a hypothesis, but you could patch crio to show the content of the filesystem to validate it.

That would explain why you need to kill and restart the process.

I suggested that node debug and lsof might answer this as well.

Contributor Author

Pacho20 commented Nov 21, 2025

Hey @Pacho20, I may be mistaken, but I think that live-apply works by remounting /usr with a new overlayfs. So I see it as a possibility that an existing process might still see the old mount and old directory if it kept an old file descriptor for the old directory, a bit in the same way that a process can see a deleted file if it still has a file descriptor to it. Just a hypothesis, but you could patch crio to show the content of the filesystem to validate it.

That would explain why you need to kill and restart the process.

I had a similar hypothesis. I checked the file descriptors with lsof as @gcoon151 suggested, along with other checks. Everything seemed to indicate that CRI-O should see the new binary. So instead of using the script, I performed the reload manually, and it seems that after live-apply CRI-O needs some time to recognize the new mount. If I wait a few seconds, that part of the reload works: I can see in the logs that the kata-remote runtime handler loads. It does, however, log an error message:

:39:01.361971609Z" level=error msg="Getting /usr/bin/containerd-shim-kata-v2 OCI runtime features failed: io.containerd.kata.v2: shim namespace cannot be empty: exit status 1" file="config/config.go:1379"

I checked where it comes from, and the config validator seems to ignore it and load the new runtime anyway. The relevant part of crio status config's output:

[crio.runtime.runtimes.kata-remote]
  runtime_config_path = "/opt/kata/configuration-remote.toml"
  runtime_path = "/usr/bin/containerd-shim-kata-v2"
  runtime_type = "vm"
  runtime_root = "/run/vc"
  privileged_without_host_devices = true
  allowed_annotations = ["io.kubernetes.cri-o.Devices"]
  runtime_pull_image = true
  container_min_memory = "12MiB"
  no_sync_log = false

But even with successful reloading, when kubelet tries to invoke the RunPodSandbox function, CRI-O returns this error:

Nov 21 15:04:16 kube-d4e4pr6w0o0iqm3niodg-fodoroscdev-default-00000203 kubenswrapper[2570]: I1121 15:04:16.321546    2570 kuberuntime_sandbox.go:65] "Running pod with runtime handler" pod="openshift-sandboxed-containers-operator/helloworld-887478c6-hjgbc" runtimeHandler="kata-remote"
Nov 21 15:04:16 kube-d4e4pr6w0o0iqm3niodg-fodoroscdev-default-00000203 kubenswrapper[2570]: E1121 15:04:16.322549    2570 log.go:32] "RunPodSandbox from runtime service failed" err="rpc error: code = Unknown desc = failed to find runtime handler kata-remote from runtime list map[crun:0xc0004cbb00 runc:0xc0004cb680]"
Nov 21 15:04:16 kube-d4e4pr6w0o0iqm3niodg-fodoroscdev-default-00000203 kubenswrapper[2570]: E1121 15:04:16.322616    2570 kuberuntime_sandbox.go:72] "Failed to create sandbox for pod" err="rpc error: code = Unknown desc = failed to find runtime handler kata-remote from runtime list map[crun:0xc0004cbb00 runc:0xc0004cb680]" pod="openshift-sandboxed-containers-operator/helloworld-887478c6-hjgbc"
Nov 21 15:04:16 kube-d4e4pr6w0o0iqm3niodg-fodoroscdev-default-00000203 kubenswrapper[2570]: E1121 15:04:16.322637    2570 kuberuntime_manager.go:1237] "CreatePodSandbox for pod failed" err="rpc error: code = Unknown desc = failed to find runtime handler kata-remote from runtime list map[crun:0xc0004cbb00 runc:0xc0004cb680]" pod="openshift-sandboxed-containers-operator/helloworld-887478c6-hjgbc"
Nov 21 15:04:16 kube-d4e4pr6w0o0iqm3niodg-fodoroscdev-default-00000203 kubenswrapper[2570]: E1121 15:04:16.322680    2570 pod_workers.go:1301] "Error syncing pod, skipping" err="failed to \"CreatePodSandbox\" for \"helloworld-887478c6-hjgbc_openshift-sandboxed-containers-operator(f9d52ed0-f321-4400-848f-046e943b408a)\" with CreatePodSandboxError: \"Failed to create sandbox for pod \\\"helloworld-887478c6-hjgbc_openshift-sandboxed-containers-operator(f9d52ed0-f321-4400-848f-046e943b408a)\\\": rpc error: code = Unknown desc = failed to find runtime handler kata-remote from runtime list map[crun:0xc0004cbb00 runc:0xc0004cb680]\"" pod="openshift-sandboxed-containers-operator/helloworld-887478c6-hjgbc" podUID="f9d52ed0-f321-4400-848f-046e943b408a"

You can find the code related to the error message here and the method using it, which is used in every RunPodSandbox implementation. So it seems that for some reason the Server does not have the same runtime list. I didn’t have time to figure out the exact reason for that, but for now, reload is still not an option.

The other thing I noticed (which is probably worse) while checking the kubelet logs is that the openshift-sandboxed-containers-monitor pods can't start after the live-apply. The error is: Error: container create failed: write to /proc/self/attr/keycreate: Invalid argument. I couldn't find much about it, but it seems to be an SELinux-related problem. It goes away after a node reboot, which is exactly what we're trying to avoid. I still need to understand the exact cause. There is a possibility that this could prevent installation without reboots, but we'll know more after further investigation.

If anyone has more insight, I’d gladly hear it.

Contributor

c3d commented Nov 26, 2025

@Pacho20 In the current state, this clearly needs quite a bit more work. Would you mind flagging it as do-not-merge until the PR is in a more complete state?

Contributor Author

Pacho20 commented Nov 27, 2025

@c3d Yeah I know it's not ready. @gkurz asked me to open the PR so we can figure this out together.
I don’t know how to flag the PR, but I can convert it to a draft instead if that’s fine.

@Pacho20 Pacho20 marked this pull request as draft November 27, 2025 17:07
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 27, 2025
This change introduces the use of rpm-ostree apply-live and restarts CRI-O,
allowing both kata and kata-remote to function without requiring node reboots.

Signed-off-by: Patrik Fodor <[email protected]>
@Pacho20 Pacho20 force-pushed the daemonset-install-without-reboot branch from 09b0a6c to 47d28ac Compare December 3, 2025 13:15
@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 3, 2025
@openshift-merge-robot

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
