-
Notifications
You must be signed in to change notification settings - Fork 2.2k
Try to use pidfd and epoll to wait init process exit #4517
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -21,6 +21,7 @@ import ( | |
"golang.org/x/sys/unix" | ||
|
||
"github.com/opencontainers/cgroups" | ||
"github.com/opencontainers/runc/internal/linux" | ||
"github.com/opencontainers/runc/libcontainer/configs" | ||
"github.com/opencontainers/runc/libcontainer/exeseal" | ||
"github.com/opencontainers/runc/libcontainer/intelrdt" | ||
|
@@ -377,9 +378,13 @@ func (c *Container) start(process *Process) (retErr error) { | |
|
||
// Signal sends a specified signal to container's init. | ||
// | ||
// When s is SIGKILL and the container does not have its own PID namespace, all | ||
// the container's processes are killed. In this scenario, the libcontainer | ||
// When s is SIGKILL: | ||
// 1. If the container does not have its own PID namespace, all the | ||
// container's processes are killed. In this scenario, the libcontainer | ||
// user may be required to implement a proper child reaper. | ||
// 2. Otherwise, we just send the SIGKILL signal to the init process, | ||
// but we don't wait for the init process to disappear. If you want to | ||
// wait, please use c.EnsureKilled instead. | ||
func (c *Container) Signal(s os.Signal) error { | ||
c.m.Lock() | ||
defer c.m.Unlock() | ||
|
@@ -431,6 +436,91 @@ func (c *Container) signal(s os.Signal) error { | |
return nil | ||
} | ||
|
||
func (c *Container) killViaPidfd() error { | ||
c.m.Lock() | ||
defer c.m.Unlock() | ||
|
||
// To avoid a PID reuse attack, don't kill non-running container. | ||
if !c.hasInit() { | ||
return ErrNotRunning | ||
} | ||
|
||
pidfd, err := unix.PidfdOpen(c.initProcess.pid(), 0) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I know the old code was killing only the init process. But would it make sense to kill all the processes in the cgroup instead? We can do it as another PR, if that makes sense. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. As you mentioned in the above, we can consider this order:
But I think we still need to consider whether the container has a private pid ns or not. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Taking into account this: #4517 (comment). It seems simpler to kill pid1 if it has it's own pidns. Otherwise, cgroup.kill or, if not supported, send a signal. Does it make sense? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I don't think it's no need to use There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Exactly what I said in the last comment :) |
||
if err != nil { | ||
return err | ||
} | ||
defer unix.Close(pidfd) | ||
|
||
epollfd, err := unix.EpollCreate1(unix.EPOLL_CLOEXEC) | ||
if err != nil { | ||
return err | ||
} | ||
defer unix.Close(epollfd) | ||
|
||
event := unix.EpollEvent{ | ||
Events: unix.EPOLLIN, | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think doing epoll with only one fd is kind of overkill, I wonder if there are simpler solutions. But if, as I mentioned in another comment, we can kill all the processes in the cgroup, then the epoll might be worth it? |
||
Fd: int32(pidfd), | ||
} | ||
if err := unix.EpollCtl(epollfd, unix.EPOLL_CTL_ADD, pidfd, &event); err != nil { | ||
return err | ||
} | ||
|
||
if err := unix.PidfdSendSignal(pidfd, unix.SIGKILL, nil, 0); err != nil { | ||
return err | ||
} | ||
|
||
events := make([]unix.EpollEvent, 1) | ||
// Set the timeout to 10s, the same as in kill below. | ||
n, err := linux.EpollWait(epollfd, events, 10000) | ||
if err != nil { | ||
return err | ||
} | ||
if n > 0 { | ||
for i := range n { | ||
event := events[i] | ||
if event.Fd == int32(pidfd) { | ||
return nil | ||
lifubang marked this conversation as resolved.
Show resolved
Hide resolved
|
||
} | ||
} | ||
} | ||
return errors.New("container init still running") | ||
} | ||
|
||
func (c *Container) kill() error { | ||
_ = c.Signal(unix.SIGKILL) | ||
|
||
// For containers running in a low load machine, we only need to wait about 1ms. | ||
time.Sleep(time.Millisecond) | ||
if err := c.Signal(unix.Signal(0)); err != nil { | ||
return nil | ||
} | ||
|
||
// For some containers in a heavy load machine, we need to wait more time. | ||
logrus.Debugln("We need more time to wait the init process exit.") | ||
for i := 0; i < 100; i++ { | ||
time.Sleep(100 * time.Millisecond) | ||
if err := c.Signal(unix.Signal(0)); err != nil { | ||
return nil | ||
} | ||
} | ||
return errors.New("container init still running") | ||
} | ||
|
||
// EnsureKilled kills the container and waits for the kernel to finish killing it. | ||
func (c *Container) EnsureKilled() error { | ||
// When a container doesn't have a private pidns, we have to kill all processes | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. s/doesn't/does/? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. no There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Opening again, sorry, but the if checks that it does have a private PID namespace. The comment says it doesn't. Something seems odd. Am I missing something? |
||
// in the cgroup, it's more simpler to use `cgroup.kill` or `unix.Kill`. | ||
if c.config.Namespaces.IsPrivate(configs.NEWPID) { | ||
kolyshkin marked this conversation as resolved.
Show resolved
Hide resolved
lifubang marked this conversation as resolved.
Show resolved
Hide resolved
|
||
var err error | ||
if err = c.killViaPidfd(); err == nil { | ||
return nil | ||
} | ||
|
||
logrus.Debugf("pidfd & epoll failed, falling back to unix.Signal: %v", err) | ||
} | ||
return c.kill() | ||
Comment on lines
+515
to
+521
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I can't wrap my head around this either. Why don't we use a pidfd in Having a (public/exported) What we are doing now seems kind of complex:
If what I propose doesn't seem okay, I'm open to other ways to simplify this :) |
||
} | ||
|
||
func (c *Container) createExecFifo() (retErr error) { | ||
rootuid, err := c.config.HostRootUID() | ||
if err != nil { | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
epoll_wait returns -1 on error.Let's return -1 here too.