Fix MeshControllerActor sending multiple messages to a reduce OncePort by pzhan9 · Pull Request #3316 · meta-pytorch/monarch

pzhan9 · 2026-03-31T02:01:46Z

Summary:
Invocation::complete() and set_exception() in the MeshControllerActor
were sending individual PythonMessage results per rank to a reduce
OncePort. However, a OncePort accepts only a single delivery: the first
send consumed the oneshot sender, and subsequent sends found the port
closed, causing an "address not routable: port not bound in mailbox" error.
This cascaded into actor tree failure and a KeyboardInterrupt on the
test thread, manifesting as a crash in test_value_mesh.
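
The failure mode can be illustrated with a minimal std-only sketch. The `OncePort`, `SendError`, and `send_per_rank` names below are hypothetical stand-ins written for illustration, not the actual hyperactor or monarch types; the only assumption carried over from the summary is that the first send consumes the sender.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};

/// Hypothetical stand-in for a one-shot port: the sender is consumed by
/// the first send, so the port can deliver at most one message.
pub struct OncePort<T> {
    sender: Option<Sender<T>>,
}

#[derive(Debug, PartialEq)]
pub enum SendError {
    /// Models the "port not bound in mailbox" failure from the summary.
    PortClosed,
}

impl<T> OncePort<T> {
    pub fn open() -> (Self, Receiver<T>) {
        let (tx, rx) = channel();
        (OncePort { sender: Some(tx) }, rx)
    }

    pub fn send(&mut self, msg: T) -> Result<(), SendError> {
        // take() leaves None behind, so any later call sees a closed port.
        let tx = self.sender.take().ok_or(SendError::PortClosed)?;
        tx.send(msg).map_err(|_| SendError::PortClosed)
    }
}

/// The buggy pattern: one send per rank. Returns how many sends failed.
pub fn send_per_rank(port: &mut OncePort<String>, results: &[(usize, String)]) -> usize {
    let mut failed = 0;
    for (_rank, value) in results {
        if port.send(value.clone()).is_err() {
            failed += 1;
        }
    }
    failed
}
```

In this model, sending results for N ranks delivers only the first one and fails N - 1 times, which mirrors the error described above.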

The fix accumulates all per-rank results into a single PythonMessage
using ValueOverlay merge (the same mechanism used by the reduce port's
PythonResponseMessageAccumulator), then sends it as one message.
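
A rough sketch of the accumulate-then-send-once shape follows. It assumes a much simplified overlay (a rank-to-payload map) and a hypothetical `Message` type; the real `ValueOverlay` and `PythonMessage` are richer, and the `into_overlay` signature shown here is illustrative only.

```rust
use std::collections::BTreeMap;

/// Hypothetical simplification of an overlay: rank -> serialized payload.
pub type Overlay = BTreeMap<usize, Vec<u8>>;

/// Hypothetical message type carrying an overlay of per-rank results.
#[derive(Debug, Default, PartialEq)]
pub struct Message {
    pub overlay: Overlay,
}

impl Message {
    /// Build a message holding the result of a single rank.
    pub fn single(rank: usize, payload: Vec<u8>) -> Self {
        let mut overlay = Overlay::new();
        overlay.insert(rank, payload);
        Message { overlay }
    }

    /// Consume the message and expose its overlay for merging.
    pub fn into_overlay(self) -> Overlay {
        self.overlay
    }
}

/// Merge all per-rank messages into one message.
pub fn accumulate(per_rank: Vec<Message>) -> Message {
    let mut merged = Overlay::new();
    for msg in per_rank {
        merged.extend(msg.into_overlay());
    }
    Message { overlay: merged }
}

/// Send exactly one merged message. Taking an `FnOnce` makes a second
/// send impossible to express through this function.
pub fn complete<F>(per_rank: Vec<Message>, send_once: F)
where
    F: FnOnce(Message),
{
    send_once(accumulate(per_rank));
}
```

Passing the send operation as an `FnOnce` is a design choice of this sketch: it encodes the single-delivery contract in the type, so the per-rank loop only ever feeds the accumulator.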

Also makes PythonMessage::into_overlay public so it can be used from
monarch_extension.

Additionally, moves the ValueMesh iteration in test_value_mesh outside
the activate() context to avoid FakeTensor dispatch intercepting
aten.set_ during tensor unpickling (meta vs cpu device mismatch).

Differential Revision: D98814620


meta-codesync · 2026-03-31T02:01:53Z

@pzhan9 has exported this pull request. If you are a Meta employee, you can view the originating Diff in D98814620.

meta-cla bot added the CLA Signed label Mar 31, 2026

meta-codesync bot added fb-exported meta-exported labels Mar 31, 2026

pzhan9 wants to merge 1 commit into meta-pytorch:main from pzhan9:export-D98814620
