Releases: meta-pytorch/monarch
v0.4.0
Monarch v0.4 Release Notes
New Features
Networking & RDMA
- EFA support for RDMA – RDMA over AWS's libefa (Elastic Fabric Adapter).
- TCP fallback for RDMA – when RDMA is unavailable, the data plane automatically falls back to TCP, broadening hardware compatibility (#2999).
- ROCm / HIP support for the RDMA stack, enabling AMD GPU deployments (#2891).
- The channel transport layer was rewritten around a typed session lifecycle and unified NetLink dispatch, improving reconnect reliability and adding duplex-mode channels.
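The TCP fallback can be pictured with a small, self-contained sketch. This is illustrative only, not Monarch's actual transport API: attempt the RDMA transport first and drop to TCP when no RDMA-capable NIC is present.

```python
class RdmaUnavailable(RuntimeError):
    """Raised when no RDMA-capable NIC (e.g. EFA) is present."""

def open_rdma(nic_supports_rdma: bool) -> str:
    # Stand-in for probing the RDMA device; fails on plain Ethernet hosts.
    if not nic_supports_rdma:
        raise RdmaUnavailable("no RDMA device found")
    return "rdma-channel"

def open_data_plane(nic_supports_rdma: bool) -> str:
    """Prefer RDMA; fall back to TCP so the job still runs without RDMA hardware."""
    try:
        return open_rdma(nic_supports_rdma)
    except RdmaUnavailable:
        return "tcp-channel"

print(open_data_plane(True))   # rdma-channel
print(open_data_plane(False))  # tcp-channel
```

The key design point is that the fallback is automatic: callers ask for a data plane, not a specific transport.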
Distributed Telemetry & Dashboard
Monarch now ships a built-in observability dashboard. The new distributed telemetry system collects actor, mesh, host, proc, and message-level data in real time and exposes it through both a web UI and a schema-first REST API (OpenAPI 3.1). An OTLP-compatible metrics, logs, and trace exporter makes it straightforward to integrate with Grafana, Jaeger, or any OpenTelemetry collector in Kubernetes deployments.
Admin TUI & Live Diagnostics
A new terminal UI (admin_tui) provides live introspection of running meshes, procs, and actors via an HTTP admin server. It includes a built-in py-spy integration that can capture Python stack traces from any running actor directly in the TUI, making it much easier to diagnose stalls and performance issues in production.
Kubernetes
KubernetesJob gained Python-native provisioning, removing the dependency on an external Go controller for mesh creation. A new optional labels parameter on add_mesh() enables integration with Kueue and other label-based Kubernetes controllers (#2693).
Python API Changes
- allocate_nonblocking, from_alloc, and host_mesh are renamed to private methods; use attach_to_workers and the KubernetesJob/ProcessJob APIs instead (#2971).
- NUMA bindings are now exposed for proc mesh spawning (#2996).
Bug Fixes & Performance Improvements
Supervision & Fault Tolerance
- Controller supervision – a single child torchstore controller failure no longer poisons the parent and all siblings. Each child is now isolated, fixing a critical bug where one failed session could block all subsequent get_or_spawn_controller() calls (#2835).
- Orphaned mesh cleanup – child actors now detect when their parent is unreachable and self-terminate, preventing leaked GPU resources (#2198).
- Clean Python shutdown – proc exit now calls Py_FinalizeEx, giving Python objects a chance to run destructors and eliminating the pybind11::dec_ref GIL crashes seen during shutdown (#2524).
- Reliable proc_mesh.stop() – stop now flushes pending messages and acks before exiting, fixing races that caused spurious errors in CI and user code (#2658).
Performance
- Lazy ValueMesh unpickling – values returned from accumulate are now deserialized on access rather than eagerly, reducing latency for large results (#2983).
- RLE-compressed OnceBuffer accumulation – repeated identical values are run-length encoded during accumulation, cutting memory and network cost for common broadcast patterns (#2989).
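Both optimizations can be illustrated with toy models. These are sketches of the ideas only, not Monarch's implementation: a wrapper that defers pickle.loads until first access, and a run-length accumulator that collapses repeated identical values.

```python
import pickle

class LazyValue:
    """Hold serialized bytes; deserialize only when first accessed."""
    def __init__(self, payload: bytes):
        self._payload = payload
        self._cached = None

    def get(self):
        if self._cached is None:          # pay the deserialization cost lazily
            self._cached = pickle.loads(self._payload)
        return self._cached

def rle_accumulate(values):
    """Collapse runs of identical values into (value, count) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1] = (v, runs[-1][1] + 1)
        else:
            runs.append((v, 1))
    return runs

lazy = LazyValue(pickle.dumps([1, 2, 3]))         # nothing decoded yet
print(lazy.get())                                  # [1, 2, 3]
print(rle_accumulate(["ok", "ok", "ok", "err"]))   # [('ok', 3), ('err', 1)]
```

For a broadcast where most replicas return the same result, the run-length form stores one entry per run instead of one per rank.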
- Telemetry overhead was significantly reduced by demoting internal spans and gating channel-level tracing behind DEBUG.
Build & Packaging
- Official aarch64 (ARM64) release binaries are now published alongside x86_64 on PyPI.
0.3.0
Monarch 0.3.0 Release Notes
New Features
Kubernetes Job Support
Monarch now supports running distributed training workloads on Kubernetes clusters. The new KubernetesJob API connects to pre-provisioned GPU pods managed by the https://github.com/meta-pytorch/monarch-kubernetes/ repository, enabling seamless multi-node DDP training on Kubernetes.
Key Capabilities:
- Connect to Kubernetes pods using KubernetesJob
- Provision GPU workers via the MonarchMesh Custom Resource Definition
- Run multi-node DDP training using SPMDActor
Example:
from monarch.job.kubernetes import KubernetesJob
from monarch.spmd import SPMDActor
k8s_job = KubernetesJob(namespace="monarch-tests")
k8s_job.add_mesh("ddpmesh", num_replicas=2)
job_state = k8s_job.state()
proc_mesh = job_state.ddpmesh.spawn_procs({"gpus": 4})
spmd_actors = proc_mesh.spawn("_SPMDActor", SPMDActor)
See the full tutorial: https://meta-pytorch.org/monarch/generated/examples/ddp/kubernetes_ddp.html
We also publish Docker packages; see https://github.com/meta-pytorch/monarch/pkgs/container/monarch
monarch.spmd and monarch.job.spmd (SPMDJob)
The new monarch.job.spmd module provides serve() and run_spmd() for an interactive SPMD development workflow:
- Reserve once, iterate many times: Allocate hosts once, then call run_spmd() repeatedly without reprovisioning
- Remote debugging: Add breakpoint() in your training script and attach with monarch debug
- Job caching: Reload cached job state and re-run on the same reserved hosts
Example:
from monarch.job.spmd import serve
job = serve(
["torchrun", "--nproc-per-node=4", "--standalone", "train.py"],
scheduler="local_cwd",
)
job.run_spmd()
# Later, reload and re-run without reprovisioning:
job = job_load(".monarch/job_state.pkl")
job.run_spmd()
This supports single-node training with command lists and multi-node training with TorchX AppDef on schedulers like Slurm.
See the example: https://meta-pytorch.org/monarch/generated/examples/ddp/spmd_job.html
Experimental Queue Dispatch Mode (Performance)
A new actor dispatch mode where Rust enqueues messages to a channel for Python to process, rather than Rust acquiring the GIL directly. This can improve throughput for message-heavy workloads.
from monarch.config import configure
configure(actor_queue_dispatch=True)
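The idea can be sketched with plain Python threading. This is a conceptual model of queue dispatch, not the actual Rust/Python boundary: a producer enqueues messages and a single consumer thread drains the queue, so the producer never blocks on the handler.

```python
import queue
import threading

def run_queue_dispatch(messages):
    """Producer enqueues; one consumer thread processes messages in order."""
    inbox: queue.Ueue = None  # type: ignore[attr-defined]
    inbox = queue.Queue()
    results = []

    def consumer():
        while True:
            msg = inbox.get()
            if msg is None:                  # sentinel: no more messages
                return
            results.append(msg.upper())      # stand-in for endpoint dispatch

    worker = threading.Thread(target=consumer)
    worker.start()
    for m in messages:                       # producer never waits on handlers
        inbox.put(m)
    inbox.put(None)
    worker.join()
    return results

print(run_queue_dispatch(["ping", "pong"]))  # ['PING', 'PONG']
```

Decoupling enqueue from processing is what can raise throughput for message-heavy workloads: senders hand off work and move on.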
Real this_proc() for Local Spawning
The this_proc() function returns a handle to the current singleton process, enabling actors to spawn other actors locally. Remote actors can use this_proc() to spawn actors on their own host, enabling patterns like handing out references to a local proc and having remote actors spawn resources on it.
from monarch.actor import Actor, endpoint, this_proc
class ManagerActor(Actor):
@endpoint
def spawn_helper(self) -> HelperActor:
# Spawns HelperActor in the same process as ManagerActor
return this_proc().spawn("helper", HelperActor)
Zero-Copy Messaging Path from Python
A new Buffer class enables zero-copy message serialization from Python. Large writes (≥256 bytes) are stored as references to Python bytes objects rather than being copied, integrating with multipart serialization for efficient vectored I/O.
from monarch._rust_bindings.monarch_hyperactor.buffers import Buffer
from monarch.config import configure
buffer = Buffer()
buffer.write(b"small") # copied into pending buffer
buffer.write(b"x" * 1000) # stored as zero-copy reference
# Configure the threshold via:
configure(small_write_threshold=256) # default
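A toy model of the copy-versus-reference decision (illustrative only; the real Buffer is implemented in Rust and its internals may differ):

```python
class SketchBuffer:
    """Writes below the threshold are copied; larger ones are kept by reference."""
    def __init__(self, small_write_threshold: int = 256):
        self.threshold = small_write_threshold
        self.pending = bytearray()   # small writes accumulate here (copied)
        self.segments = []           # large payloads stored as references

    def write(self, data: bytes):
        if len(data) >= self.threshold:
            if self.pending:                       # flush copied bytes first
                self.segments.append(bytes(self.pending))
                self.pending = bytearray()
            self.segments.append(data)             # reference, no copy
        else:
            self.pending += data

buf = SketchBuffer()
big = b"x" * 1000
buf.write(b"small")
buf.write(big)
print(buf.segments[-1] is big)  # True: the large payload was not copied
```

Keeping large payloads as separate segments is what lets the transport hand them to vectored I/O without a concatenating copy.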
Principles of Ownership in Supervision
This release improves the supervision model for error handling in meshes, built on four core principles:
- Owned meshes: Creating new meshes always results in an owned mesh
- Single ownership: All meshes are owned by at most one actor (no transfer or suspension)
- Lifecycle binding: A mesh cannot outlive its ownerβwhen the owner dies, so does the mesh
- Graceful cleanup: Stopped meshes drain pending messages before cleanup; owned meshes clean up before their owner
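These principles can be modeled with a few toy classes. This is a sketch of the semantics only, not Monarch's implementation:

```python
class Mesh:
    def __init__(self):
        self.pending = ["msg1", "msg2"]
        self.stopped = False

    def stop(self):
        self.pending.clear()   # graceful cleanup: drain before stopping
        self.stopped = True

class Owner:
    """Single ownership: each mesh belongs to exactly one owner."""
    def __init__(self):
        self._meshes = []

    def create_mesh(self) -> Mesh:
        mesh = Mesh()              # creating a mesh yields an owned mesh
        self._meshes.append(mesh)
        return mesh

    def die(self):
        for mesh in self._meshes:  # lifecycle binding: meshes die first
            mesh.stop()
        self._meshes.clear()

owner = Owner()
mesh = owner.create_mesh()
owner.die()
print(mesh.stopped, mesh.pending)  # True []
```

Because ownership is never transferred or suspended, cleanup order is always well defined: owned meshes stop before their owner exits.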
Actors can now implement __supervise__ to handle failures from owned meshes.
Example:
class ManagerActor(Actor):
def __supervise__(self, failure: MeshFailure) -> bool:
logging.error(f"failure encountered: {failure}")
# Return truthy to mark the failure handled; falsey (here None) propagates it
return None
See the documentation: https://meta-pytorch.org/monarch/actors.html#error-handling-in-meshes
SkyPilot Integration (Community Contribution)
SkyPilotJob enables running Monarch on Kubernetes and cloud VMs across 20+ cloud providers (AWS, GCP, Azure, CoreWeave, Nebius, etc.) via https://skypilot.readthedocs.io/.
import sky
from monarch_skypilot import SkyPilotJob
job = SkyPilotJob(
meshes={"trainers": 2},
resources=sky.Resources(accelerators="A100:1"),
cluster_name="my-monarch-cluster",
)
state = job.state()
trainers = state.trainers # HostMesh with 2 nodes
Features:
- Automatic cluster provisioning and teardown
- Autostop for idle clusters
- Workdir sync and custom file mounts
- Default PyPI install or custom Docker images
Install with:
pip install torchmonarch-nightly "skypilot[kubernetes]"
Getting Started
Install Monarch 0.3.0:
pip install torchmonarch==0.3.0
0.2.0
Monarch Release Notes
Overview
This release focuses on correctness, robustness, and operational maturity. Major improvements span supervision and shutdown semantics, logging and observability, Kubernetes readiness, SPMD workflows, test hygiene, and build compatibility. Monarch is now more predictable under failure, easier to debug, and better suited for long-running and large-scale deployments.
Supervision & Shutdown
Actor supervision and shutdown behavior has been significantly hardened and clarified.
Key Improvements
- Strict supervision hierarchy
  - Every actor or process has exactly one parent (except the root).
  - Child actors can no longer persist after their parent faults or stops.
- Reliable recursive shutdown
  - Asking an actor to stop deterministically stops its entire subtree.
  - Shutdown cases are documented, tested, and log spam has been audited.
- Improved fault propagation
  - Supervision errors now describe the full hierarchy of exits.
  - Endpoint failures surface clearer context, including actor and endpoint names.
- HostMesh lifecycle control
  - HostMesh can be cleanly stopped (disconnect clients and kill workers).
  - HostMesh can be force-killed, causing worker loops to exit immediately.
  - Persistent allocations remain usable for reconnects after stop.
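The strict hierarchy and recursive shutdown can be pictured with a minimal tree. This is an illustrative sketch, not Monarch's actor runtime:

```python
class Node:
    """Every node has exactly one parent; stopping a node stops its subtree."""
    def __init__(self, name: str, parent=None):
        self.name = name
        self.parent = parent
        self.children = []
        self.stopped = False
        if parent is not None:
            parent.children.append(self)

    def stop(self):
        for child in self.children:  # children never outlive their parent
            child.stop()
        self.stopped = True

root = Node("root")
proc = Node("proc", parent=root)
actor = Node("actor", parent=proc)
root.stop()
print(actor.stopped)  # True: stopping the root stopped the whole subtree
```

Because every node has exactly one parent, a stop request has a single well-defined subtree to tear down, which is what makes shutdown deterministic.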
Logging
Logging has been refactored to improve clarity, reduce noise, and clearly separate user-facing signals from system internals.
Key Improvements
- Clear separation of logs
  - Monarch system logs and user logs are cleanly separated.
  - User-visible faults are communicated only via exceptions and supervision events.
- Improved error clarity
  - Errors are categorized (e.g., user, system, infrastructure).
  - Actor names are reported in user-understandable syntax.
  - Actor failure reports include richer context and causal chaining.
- Structured logging
  - Errors emit structured log records suitable for filtering and aggregation.
  - Supervision events follow a defined schema.
- Reduced default noise
  - Log forwarding, aggregation, and enrichment are disabled by default.
  - Log messages have been audited for signal quality.
Observability
Observability has been expanded across actors, meshes, and endpoints.
Key Improvements
- Comprehensive metrics
  - Endpoint latency, throughput, payload size, and error counts are universally available.
  - Metrics are collected on both client and server sides.
- Lifecycle instrumentation
  - Actor, process, and mesh state changes emit structured events.
  - Supervision events are fully instrumented.
- Root-cause visibility
  - The first triggering event in a failure cascade is surfaced.
  - User-parseable actor IDs are linked to internal actor identifiers.
- Tracing
  - Distributed spans cover message send and receive paths.
  - Traces can be visualized via Perfetto and standard tracing backends.
- Performance awareness
  - Instrumentation overhead has been reduced and made configurable.
Build Hygiene & Compatibility
Build and dependency management has been simplified.
Key Improvements
- RDMA and tensor engine support are dynamically loaded; the same wheel can be installed whether or not those capabilities are present on the system.
- Monarch no longer has a binary dependency on PyTorch; PyTorch is required only at the Python layer.
- Startup time and binary size are significantly reduced.
Networking
Networking reliability has improved, with a focus on Lightning integration.
Key Improvements
- Lightning integration works on HostMesh v1.
- Networking behavior is documented and standardized for OSS usage.
Deprecation
The legacy v0 codepath has been removed.
0.1.0
Monarch v0.1.0 – Initial Release
We're excited to announce the first public release of Monarch, a distributed programming framework for PyTorch, built around scalable actor messaging and direct memory access.
Monarch brings together ideas from actor-based concurrency, fault-tolerant supervision, and high-performance tensor communication to make distributed training simpler, more explicit, and faster.
Highlights
- Actor-Based Programming for PyTorch
Define Python classes that run remotely as actors, send them messages, and coordinate distributed work using a clean, imperative API.
from monarch.actor import Actor, endpoint, this_host
training_procs = this_host().spawn_procs({"gpus": 8})
class Trainer(Actor):
@endpoint
def train(self, step: int): ...
trainers = training_procs.spawn("trainers", Trainer)
trainers.train.call(step=0).get()
- Scalable Messaging and Meshes
Actors are organized into meshes – collections that support broadcast, gather, and other scalable communication primitives.
- Supervision and Fault Tolerance
Monarch adopts supervision trees for error handling and recovery. Failures propagate predictably, allowing fine-grained restart and robust distributed workflows.
- High-Performance RDMA Transfers
Full RDMA integration for CPU and GPU memory via libibverbs, providing zero-copy, one-sided tensor communication across processes and hosts.
- Distributed Tensors
Native support for tensors sharded across processes, enabling distributed compute without custom data movement code.
Monarch is experimental and under active development.
Expect incomplete APIs, rapid iteration, and evolving interfaces.
We welcome contributions β please discuss significant changes or ideas via issues before submitting PRs.
v0.0.0
First Monarch Release!
https://pypi.org/project/torchmonarch/0.0.0/