[hyperactor_mesh] two-phase shutdown for HostMesh #3210

Open
mariusae wants to merge 2 commits into gh/mariusae/323/base from gh/mariusae/323/head


mariusae (Member) commented Mar 25, 2026

Stack from ghstack (oldest at bottom):

HostMesh::shutdown and HostMesh::stop now use a two-phase approach:

  1. Terminate all user procs concurrently across hosts via a new
    TerminateChildren message. Service infrastructure (host agent,
    comm proc, networking) stays alive so forwarder flushes can
    still reach remote hosts.
  2. Shut down/stop hosts concurrently. No user procs remain, so
    this is fast and cannot deadlock on cross-host flush timeouts.
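
The ordering described in the two steps above can be sketched as follows. This is a minimal illustration, not the code in this PR: `Host`, `EventLog`, `terminate_children`, `shutdown_host`, `for_each_concurrently`, and `two_phase_shutdown` are all hypothetical stand-ins built on the standard library, and the real implementation sends a `TerminateChildren` message to host agents rather than calling functions directly. The point it shows is the barrier between phases: no host begins phase 2 until every host has finished phase 1.

```rust
use std::sync::Mutex;
use std::thread;

/// Hypothetical stand-in for one host in the mesh (not the real HostMesh type).
pub struct Host {
    pub name: String,
    pub user_procs: usize,
    pub networking_up: bool,
}

/// Shared, ordered record of events so the phase ordering can be inspected.
pub type EventLog = Mutex<Vec<String>>;

/// Phase 1 on one host: terminate user procs, leave service infrastructure up.
fn terminate_children(host: &mut Host, log: &EventLog) {
    host.user_procs = 0;
    log.lock().unwrap().push(format!("terminated:{}", host.name));
}

/// Phase 2 on one host: shut the host down, including its networking.
fn shutdown_host(host: &mut Host, log: &EventLog) {
    host.networking_up = false;
    log.lock().unwrap().push(format!("shutdown:{}", host.name));
}

/// Runs `f` on every host concurrently. `thread::scope` joins all spawned
/// threads before returning, which acts as the barrier between phases.
fn for_each_concurrently(hosts: &mut [Host], log: &EventLog, f: fn(&mut Host, &EventLog)) {
    thread::scope(|s| {
        for host in hosts.iter_mut() {
            s.spawn(move || f(host, log));
        }
    });
}

/// Two-phase shutdown: all user procs first, then all hosts.
/// Returns the event log in the order events were recorded.
pub fn two_phase_shutdown(hosts: &mut [Host]) -> Vec<String> {
    let log: EventLog = Mutex::new(Vec::new());
    // Phase 1: networking stays up everywhere while user procs terminate.
    for_each_concurrently(hosts, &log, terminate_children);
    // Phase 2: no user procs remain, so teardown has nothing left to flush.
    for_each_concurrently(hosts, &log, shutdown_host);
    let events = log.lock().unwrap().clone();
    events
}
```

Within each phase the per-host order is unspecified, but every `terminated:` event is recorded before any `shutdown:` event.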

Previously, each host's ShutdownHost handler terminated children
and then tore down networking atomically. Under concurrent shutdown,
one host could destroy its networking while another host's dying
procs were still flushing forwarders to it, causing hangs until
MESSAGE_DELIVERY_TIMEOUT expired.
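
To make the failure mode concrete, here is a small deterministic model of the two schedules. It is a sketch under simplifying assumptions, not a reproduction of the real system: `Step`, `stalled_flushes`, `old_schedule`, and `two_phase_schedule` are hypothetical names, each host's dying procs are assumed to flush to every other host, and a "stalled" flush stands in for a send that would block until MESSAGE_DELIVERY_TIMEOUT.

```rust
/// One event in a shutdown schedule (hypothetical model, indexed by host).
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Step {
    /// User procs on this host terminate and flush forwarders to all other hosts.
    TerminateChildren(usize),
    /// This host tears down its networking.
    TeardownNetworking(usize),
}

/// Replays a schedule and returns every (from, to) flush that targeted a
/// host whose networking had already been torn down.
pub fn stalled_flushes(num_hosts: usize, schedule: &[Step]) -> Vec<(usize, usize)> {
    let mut networking_up = vec![true; num_hosts];
    let mut stalled = Vec::new();
    for step in schedule {
        match *step {
            Step::TerminateChildren(from) => {
                for to in 0..num_hosts {
                    if to != from && !networking_up[to] {
                        stalled.push((from, to));
                    }
                }
            }
            Step::TeardownNetworking(host) => networking_up[host] = false,
        }
    }
    stalled
}

/// One possible interleaving of the previous behavior: each host terminates
/// its children and tears down networking back to back.
pub fn old_schedule(num_hosts: usize) -> Vec<Step> {
    (0..num_hosts)
        .flat_map(|h| [Step::TerminateChildren(h), Step::TeardownNetworking(h)])
        .collect()
}

/// The two-phase ordering: every host terminates children before any host
/// tears down networking.
pub fn two_phase_schedule(num_hosts: usize) -> Vec<Step> {
    (0..num_hosts)
        .map(Step::TerminateChildren)
        .chain((0..num_hosts).map(Step::TeardownNetworking))
        .collect()
}
```

With two hosts, `stalled_flushes(2, &old_schedule(2))` reports the flush from host 1 to host 0 as stalled, while `stalled_flushes(2, &two_phase_schedule(2))` reports none.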

Also bumps comm test timeouts from 60s to 120s to accommodate
stress-run CPU contention.

Differential Revision: D98180932

NOTE FOR REVIEWERS: This PR has internal Meta-specific changes or comments, please review them on Phabricator!

mariusae added a commit that referenced this pull request Mar 25, 2026
ghstack-source-id: 357664288
Pull Request resolved: #3210
meta-cla bot added the CLA Signed label Mar 25, 2026
Labels

CLA Signed, fb-exported, meta-exported
