Fix MeshControllerActor sending multiple messages to a reduce OncePort #3316
Open
pzhan9 wants to merge 1 commit into meta-pytorch:main from
Summary:

`Invocation::complete()` and `set_exception()` in the MeshControllerActor were sending an individual `PythonMessage` result per rank to a reduce `OncePort`. However, `OncePort` accepts only a single delivery: the first send consumed the oneshot sender, and subsequent sends found the port closed, causing an "address not routable: port not bound in mailbox" error. This cascaded into actor tree failure and a `KeyboardInterrupt` on the test thread, manifesting as a crash in `test_value_mesh`.

The fix accumulates all per-rank results into a single `PythonMessage` using `ValueOverlay` merge (the same mechanism used by the reduce port's `PythonResponseMessageAccumulator`), then sends it as one message.

Also makes `PythonMessage::into_overlay` public so it can be used from `monarch_extension`.

Additionally, moves the `ValueMesh` iteration in `test_value_mesh` outside the `activate()` context to avoid FakeTensor dispatch intercepting `aten.set_` during tensor unpickling (meta vs cpu device mismatch).

Differential Revision: D98814620
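For reviewers unfamiliar with the failure mode, here is a minimal, self-contained sketch of the single-delivery problem and the accumulate-then-send fix described in the summary. It does not use the Monarch API: `OncePort`, `PortError`, `Overlay`, `send_per_rank`, and `send_merged` are hypothetical stand-ins, and a `BTreeMap` keyed by rank stands in for the `ValueOverlay` merge.

```rust
use std::collections::BTreeMap;
use std::sync::mpsc::{channel, Receiver, Sender};

/// Hypothetical error type for this sketch (not Monarch's error).
#[derive(Debug, PartialEq)]
enum PortError {
    Closed,
}

/// Hypothetical stand-in for a single-delivery port: the first send
/// takes the inner sender, so every later send sees a closed port.
struct OncePort<T> {
    sender: Option<Sender<T>>,
}

impl<T> OncePort<T> {
    fn new() -> (Self, Receiver<T>) {
        let (tx, rx) = channel();
        (OncePort { sender: Some(tx) }, rx)
    }

    fn send(&mut self, msg: T) -> Result<(), PortError> {
        match self.sender.take() {
            Some(tx) => tx.send(msg).map_err(|_| PortError::Closed),
            None => Err(PortError::Closed),
        }
    }
}

/// Assumption: a rank -> value map stands in for the overlay payload.
type Overlay = BTreeMap<usize, String>;

/// Buggy pattern: one message per rank. Only the first send succeeds.
fn send_per_rank(
    port: &mut OncePort<Overlay>,
    results: &[(usize, String)],
) -> Vec<Result<(), PortError>> {
    results
        .iter()
        .map(|(rank, value)| {
            let mut single = Overlay::new();
            single.insert(*rank, value.clone());
            port.send(single)
        })
        .collect()
}

/// Fixed pattern: merge all per-rank results first, then send once.
fn send_merged(
    port: &mut OncePort<Overlay>,
    results: &[(usize, String)],
) -> Result<(), PortError> {
    let merged: Overlay = results.iter().cloned().collect();
    port.send(merged)
}

fn main() {
    let results = vec![(0, "r0".to_string()), (1, "r1".to_string())];

    // Per-rank sends: the second send fails and its result is lost.
    let (mut port, rx) = OncePort::new();
    let outcomes = send_per_rank(&mut port, &results);
    assert_eq!(outcomes, vec![Ok(()), Err(PortError::Closed)]);
    assert_eq!(rx.try_recv().unwrap().len(), 1);

    // Merged send: one delivery carries every rank's result.
    let (mut port, rx) = OncePort::new();
    assert_eq!(send_merged(&mut port, &results), Ok(()));
    assert_eq!(rx.try_recv().unwrap().len(), 2);
}
```

In this sketch the merge happens on the sender side, which keeps the port's single-delivery contract intact instead of relaxing it.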