Try to use pidfd and epoll to wait init process exit #4517

Open
wants to merge 2 commits into
base: main
38 changes: 13 additions & 25 deletions delete.go
@@ -5,25 +5,11 @@ import (
 	"fmt"
 	"os"
 	"path/filepath"
-	"time"

 	"github.com/opencontainers/runc/libcontainer"
 	"github.com/urfave/cli"
-
-	"golang.org/x/sys/unix"
 )

-func killContainer(container *libcontainer.Container) error {
-	_ = container.Signal(unix.SIGKILL)
-	for range 100 {
-		time.Sleep(100 * time.Millisecond)
-		if err := container.Signal(unix.Signal(0)); err != nil {
-			return container.Destroy()
-		}
-	}
-	return errors.New("container init still running")
-}
-
 var deleteCommand = cli.Command{
 	Name: "delete",
 	Usage: "delete any resources held by the container often used with detached container",
@@ -65,25 +51,27 @@ status of "ubuntu01" as "stopped" the following will delete resources held for
 			}
 			return err
 		}
-		// When --force is given, we kill all container processes and
-		// then destroy the container. This is done even for a stopped
-		// container, because (in case it does not have its own PID
-		// namespace) there may be some leftover processes in the
-		// container's cgroup.
-		if force {
-			return killContainer(container)
-		}
 		s, err := container.Status()
 		if err != nil {
 			return err
 		}
 		switch s {
 		case libcontainer.Stopped:
-			return container.Destroy()
+			// If the container is stopped, we can just destroy it.
 		case libcontainer.Created:
-			return killContainer(container)
+			if err := container.EnsureKilled(); err != nil {
+				return err
+			}
 		default:
-			return fmt.Errorf("cannot delete container %s that is not stopped: %s", id, s)
+			if !force {
+				return fmt.Errorf("cannot delete container %s that is not stopped: %s", id, s)
+			}
+			// When --force is given, we kill all container processes and
+			// then destroy the container.
+			if err := container.EnsureKilled(); err != nil {
+				return err
+			}
 		}
+		return container.Destroy()
 	},
 }
11 changes: 11 additions & 0 deletions internal/linux/linux.go
@@ -72,3 +72,14 @@ func Sendmsg(fd int, p, oob []byte, to unix.Sockaddr, flags int) error {
 	})
 	return os.NewSyscallError("sendmsg", err)
 }
+
+// EpollWait wraps [unix.EpollWait].
+func EpollWait(epfd int, events []unix.EpollEvent, msec int) (n int, err error) {
+	n, err = retryOnEINTR2(func() (int, error) {
+		return unix.EpollWait(epfd, events, msec)
+	})
+	if err != nil {
+		return 0, os.NewSyscallError("epollwait", err)
Review comment (Member):

epoll_wait returns -1 on error. Let's return -1 here too.

+	}
+	return n, nil
+}
3 changes: 3 additions & 0 deletions libcontainer/README.md
@@ -230,6 +230,9 @@ container.Resume()
 // send signal to container's init process.
 container.Signal(signal)

+// send signal to container's init process and waits for the kernel to finish killing it.
+container.EnsureKilled()
+
 // update container resource constraints.
 container.Set(config)

94 changes: 92 additions & 2 deletions libcontainer/container_linux.go
@@ -21,6 +21,7 @@ import (
 	"golang.org/x/sys/unix"

 	"github.com/opencontainers/cgroups"
+	"github.com/opencontainers/runc/internal/linux"
 	"github.com/opencontainers/runc/libcontainer/configs"
 	"github.com/opencontainers/runc/libcontainer/exeseal"
 	"github.com/opencontainers/runc/libcontainer/intelrdt"
@@ -377,9 +378,13 @@ func (c *Container) start(process *Process) (retErr error) {

 // Signal sends a specified signal to container's init.
 //
-// When s is SIGKILL and the container does not have its own PID namespace, all
-// the container's processes are killed. In this scenario, the libcontainer
+// When s is SIGKILL:
+// 1. If the container does not have its own PID namespace, all the
+// container's processes are killed. In this scenario, the libcontainer
 // user may be required to implement a proper child reaper.
+// 2. Otherwise, we just send the SIGKILL signal to the init process,
+// but we don't wait for the init process to disappear. If you want to
+// wait, please use c.EnsureKilled instead.
 func (c *Container) Signal(s os.Signal) error {
 	c.m.Lock()
 	defer c.m.Unlock()
@@ -431,6 +436,91 @@ func (c *Container) signal(s os.Signal) error {
 	return nil
 }

+func (c *Container) killViaPidfd() error {
+	c.m.Lock()
+	defer c.m.Unlock()
+
+	// To avoid a PID reuse attack, don't kill non-running container.
+	if !c.hasInit() {
+		return ErrNotRunning
+	}
+
+	pidfd, err := unix.PidfdOpen(c.initProcess.pid(), 0)
Review comment (Member):

I know the old code was killing only the init process. But would it make sense to kill all the processes in the cgroup instead?

We can do it as another PR, if that makes sense.

Reply (Member Author):

As you mentioned above, we can consider this order:

  1. Use cgroup.kill if the kernel supports it
  2. Use pidfd if the kernel supports it
  3. Just send a signal otherwise

But I think we still need to consider whether the container has a private PID namespace or not.
I think it's reasonable to kill only the init process for a container with a private PID namespace.

Reply (Member):

Taking this into account: #4517 (comment).

It seems simpler to kill pid1 if it has its own pidns. Otherwise, use cgroup.kill or, if that is not supported, send a signal.

Does that make sense?

Reply (Member Author):

I don't think we need to use cgroup.kill to kill every process in the cgroup just to kill pid1 when it has its own private pidns. We only need to kill the pid1 process itself.

Reply (Member):

Exactly what I said in the last comment :)

+	if err != nil {
+		return err
+	}
+	defer unix.Close(pidfd)
+
+	epollfd, err := unix.EpollCreate1(unix.EPOLL_CLOEXEC)
+	if err != nil {
+		return err
+	}
+	defer unix.Close(epollfd)
+
+	event := unix.EpollEvent{
+		Events: unix.EPOLLIN,
Review comment (Member):

I think doing epoll with only one fd is kind of overkill; I wonder if there are simpler solutions. But if, as I mentioned in another comment, we can kill all the processes in the cgroup, then the epoll might be worth it?

+		Fd: int32(pidfd),
+	}
+	if err := unix.EpollCtl(epollfd, unix.EPOLL_CTL_ADD, pidfd, &event); err != nil {
+		return err
+	}
+
+	if err := unix.PidfdSendSignal(pidfd, unix.SIGKILL, nil, 0); err != nil {
+		return err
+	}
+
+	events := make([]unix.EpollEvent, 1)
+	// Set the timeout to 10s, the same as in kill below.
+	n, err := linux.EpollWait(epollfd, events, 10000)
+	if err != nil {
+		return err
+	}
+	if n > 0 {
+		for i := range n {
+			event := events[i]
+			if event.Fd == int32(pidfd) {
+				return nil
+			}
+		}
+	}
+	return errors.New("container init still running")
+}
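
The ordering the reviewers converge on in the thread attached to `unix.PidfdOpen` can be sketched as a pure selection function. This is a reading of the discussion, not code from the PR: `chooseKillMethod`, `killMethod`, and the three boolean inputs are hypothetical names, and how the capabilities are detected is left out. The private-pidns branch relies on the kernel behavior that killing the init process of a PID namespace also kills every other process in that namespace.

```go
package main

import "fmt"

// killMethod names the ways of delivering SIGKILL mentioned in the thread.
type killMethod string

const (
	viaCgroupKill killMethod = "cgroup.kill"
	viaPidfd      killMethod = "pidfd"
	viaSignal     killMethod = "signal"
)

// chooseKillMethod is a hypothetical helper that picks a method from the
// container's PID namespace setup and the (assumed) kernel capabilities.
func chooseKillMethod(privatePidns, hasCgroupKill, hasPidfd bool) killMethod {
	if privatePidns {
		// Only pid1 has to be targeted; prefer a pidfd to avoid PID reuse.
		if hasPidfd {
			return viaPidfd
		}
		return viaSignal
	}
	// Shared PID namespace: every process in the cgroup has to go.
	if hasCgroupKill {
		return viaCgroupKill
	}
	return viaSignal
}

func main() {
	fmt.Println(chooseKillMethod(true, true, true))    // pidfd
	fmt.Println(chooseKillMethod(false, true, true))   // cgroup.kill
	fmt.Println(chooseKillMethod(false, false, false)) // signal
}
```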

+func (c *Container) kill() error {
+	_ = c.Signal(unix.SIGKILL)
+
+	// For containers running in a low load machine, we only need to wait about 1ms.
+	time.Sleep(time.Millisecond)
+	if err := c.Signal(unix.Signal(0)); err != nil {
+		return nil
+	}
+
+	// For some containers in a heavy load machine, we need to wait more time.
+	logrus.Debugln("We need more time to wait the init process exit.")
+	for i := 0; i < 100; i++ {
+		time.Sleep(100 * time.Millisecond)
+		if err := c.Signal(unix.Signal(0)); err != nil {
+			return nil
+		}
+	}
+	return errors.New("container init still running")
+}

+// EnsureKilled kills the container and waits for the kernel to finish killing it.
+func (c *Container) EnsureKilled() error {
+	// When a container doesn't have a private pidns, we have to kill all processes
Review comment (Member):

s/doesn't/does/?

Reply (Member Author):

No.

Reply (Member):

Reopening this, sorry, but the `if` checks that the container does have a private PID namespace, while the comment says it doesn't. Something seems odd. Am I missing something?

+	// in the cgroup, it's more simpler to use `cgroup.kill` or `unix.Kill`.
+	if c.config.Namespaces.IsPrivate(configs.NEWPID) {
+		var err error
+		if err = c.killViaPidfd(); err == nil {
+			return nil
+		}
+
+		logrus.Debugf("pidfd & epoll failed, falling back to unix.Signal: %v", err)
+	}
+	return c.kill()
Review comment (Member), on lines +515 to +521:

I can't wrap my head around this either. Why don't we use a pidfd in c.Signal() when it is available (falling back to sending a signal to the PID number when it is not) and create a new function that waits for the process to die? Again, if a pidfd is available, we wait on that; if it's not, we fall back to waiting as we do now.

Having a (public/exported) Kill() and KillViaPidfd() seems like something that should be abstracted.

What we are doing now seems kind of complex:

  • If it has a pidns, then try to kill pid1 with a pidfd
  • If it doesn't have a pidns, then try to kill the process by sending a signal to the PID number (even if pidfd is supported, why?). The function we call here also handles the case of having a private pidns, which makes this more tricky.

If what I propose doesn't seem okay, I'm open to other ways to simplify this :)

+}

 func (c *Container) createExecFifo() (retErr error) {
 	rootuid, err := c.config.HostRootUID()
 	if err != nil {