Fix MeshControllerActor sending multiple messages to a reduce OncePort #3316

Open
pzhan9 wants to merge 1 commit into meta-pytorch:main from pzhan9:export-D98814620

pzhan9 (Contributor) commented Mar 31, 2026

Summary:
`Invocation::complete()` and `set_exception()` in the MeshControllerActor
were sending individual `PythonMessage` results per rank to a reduce
`OncePort`. However, `OncePort` only accepts a single delivery — the first
send consumed the oneshot sender, and subsequent sends found the port
closed, causing an "address not routable: port not bound in mailbox" error.
This cascaded into actor tree failure and a `KeyboardInterrupt` on the
test thread, manifesting as a crash in `test_value_mesh`.
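
The failure mode can be illustrated with a toy model. This is a minimal sketch, not Monarch's API: `ToyOncePort`, `PortClosed`, and `send_per_rank` are hypothetical names that only model a port accepting one delivery and rejecting every later send.

```python
class PortClosed(Exception):
    """Raised when a message is sent to a port that already received one."""


class ToyOncePort:
    """Hypothetical stand-in for a once-only port: accepts exactly one message."""

    def __init__(self):
        self._delivered = False
        self.value = None

    def send(self, message):
        if self._delivered:
            # Loosely models the "port not bound in mailbox" failure in the summary.
            raise PortClosed("port already consumed")
        self._delivered = True
        self.value = message


def send_per_rank(port, results):
    """The buggy pattern: one send per rank to a single-delivery port."""
    for rank, result in sorted(results.items()):
        port.send((rank, result))


port = ToyOncePort()
try:
    send_per_rank(port, {0: "a", 1: "b"})
    failed = False
except PortClosed:
    # The send for rank 1 lands here: rank 0 already consumed the port.
    failed = True
```

In this model only the result for rank 0 is delivered; the send for rank 1 fails, which mirrors the cascade described above.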

The fix accumulates all per-rank results into a single `PythonMessage`
using `ValueOverlay` merge (the same mechanism used by the reduce port's
`PythonResponseMessageAccumulator`), then sends it as one message.
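
The accumulate-then-send shape of the fix can be sketched as follows. This is an illustration under stated assumptions, not the PR's Rust code: the overlay is modeled as a plain dict from rank to result, and `ToyOncePort`, `merge_overlay`, and `send_accumulated` are hypothetical names. The real `ValueOverlay` representation is not shown in this description.

```python
class ToyOncePort:
    """Hypothetical once-only port that counts deliveries."""

    def __init__(self):
        self.sends = 0
        self.value = None

    def send(self, message):
        if self.sends >= 1:
            raise RuntimeError("port already consumed")
        self.sends += 1
        self.value = message


def merge_overlay(accumulated, update):
    """Return a new overlay containing both inputs (assumed dict model)."""
    merged = dict(accumulated)
    merged.update(update)
    return merged


def send_accumulated(port, per_rank_results):
    """Merge every per-rank result first, then deliver a single message."""
    overlay = {}
    for rank, result in per_rank_results.items():
        overlay = merge_overlay(overlay, {rank: result})
    port.send(overlay)


port = ToyOncePort()
send_accumulated(port, {0: "a", 1: "b", 2: "c"})
```

Because the port sees exactly one send regardless of how many ranks report, the single-delivery constraint is never violated.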

Also makes `PythonMessage::into_overlay` public so it can be used from
`monarch_extension`.

Additionally, moves the `ValueMesh` iteration in `test_value_mesh` outside
the `activate()` context to avoid FakeTensor dispatch intercepting
`aten.set_` during tensor unpickling (meta vs cpu device mismatch).

Differential Revision: D98814620
meta-cla bot added the CLA Signed label Mar 31, 2026

meta-codesync bot commented Mar 31, 2026

@pzhan9 has exported this pull request. If you are a Meta employee, you can view the originating Diff in D98814620.

Labels

CLA Signed, fb-exported, meta-exported
