Releases: meta-pytorch/monarch

v0.4.0

26 Mar 20:52

Monarch v0.4 Release Notes

New Features

Networking & RDMA

  • EFA support for RDMA: RDMA over AWS's libefa (Elastic Fabric Adapter) library.
  • TCP fallback for RDMA: when RDMA is unavailable, the data plane automatically falls back to TCP, broadening hardware compatibility (#2999).
  • ROCm / HIP support for the RDMA stack, enabling AMD GPU deployments (#2891).
  • The channel transport layer was rewritten around a typed session lifecycle and unified NetLink dispatch, improving reconnect reliability and adding duplex-mode channels.

Distributed Telemetry & Dashboard

Monarch now ships a built-in observability dashboard. The new distributed telemetry system collects actor, mesh, host, proc, and message-level data in real time and exposes it through both a web UI and a schema-first REST API (OpenAPI 3.1). An OTLP-compatible metrics, logs, and trace exporter makes it straightforward to integrate with Grafana, Jaeger, or any OpenTelemetry collector in Kubernetes deployments.

Admin TUI & Live Diagnostics

A new terminal UI (admin_tui) provides live introspection of running meshes, procs, and actors via an HTTP admin server. It includes a built-in py-spy integration that can capture Python stack traces from any running actor directly in the TUI, making it much easier to diagnose stalls and performance issues in production.

Kubernetes

KubernetesJob gained Python-native provisioning, removing the dependency on an external Go controller for mesh creation. A new optional labels parameter on add_mesh() enables integration with Kueue and other label-based Kubernetes controllers (#2693).
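To make the label-based integration concrete, here is a hedged sketch of how user-supplied labels might be merged into pod metadata. The `pod_labels` helper and the `monarch.example/mesh` base label are hypothetical, not Monarch's internals; `kueue.x-k8s.io/queue-name` is Kueue's standard local-queue label.

```python
# Hypothetical sketch: merging labels (as passed to add_mesh) into pod
# metadata. The helper and base label key are illustrative only.
def pod_labels(mesh_name, user_labels=None):
    labels = {"monarch.example/mesh": mesh_name}  # hypothetical base label
    labels.update(user_labels or {})              # user labels layered on top
    return labels

# Layering Kueue's queue label on so a label-based controller admits the pods:
labels = pod_labels("trainers", {"kueue.x-k8s.io/queue-name": "gpu-queue"})
print(labels["kueue.x-k8s.io/queue-name"])  # gpu-queue
```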

Python API Changes

  • allocate_nonblocking, from_alloc, and host_mesh are now private; use attach_to_workers and the KubernetesJob / ProcessJob APIs instead (#2971).
  • NUMA bindings are now exposed for proc mesh spawning (#2996).

Bug Fixes & Performance Improvements

Supervision & Fault Tolerance

  • ControllerController supervision: a single child torchstore controller failure no longer poisons the parent and all siblings. Each child is now isolated, fixing a critical bug where one failed session could block all subsequent get_or_spawn_controller() calls (#2835).
  • Orphaned mesh cleanup: child actors now detect when their parent is unreachable and self-terminate, preventing leaked GPU resources (#2198).
  • Clean Python shutdown: proc exit now calls Py_FinalizeEx, giving Python objects a chance to run destructors and eliminating the pybind11::dec_ref GIL crashes seen during shutdown (#2524).
  • Reliable proc_mesh.stop(): stop now flushes pending messages and acks before exiting, fixing races that caused spurious errors in CI and user code (#2658).
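The isolation fix above can be illustrated with a small plain-Python sketch (not Monarch's actual code): each child session carries its own failure state, so one failed session neither blocks nor poisons its siblings.

```python
# Plain-Python sketch (not Monarch's code) of per-child failure isolation:
# each controller session tracks its own failed flag, so one failed session
# no longer blocks get_or_spawn calls for siblings or retries.
class ControllerRegistry:
    def __init__(self):
        self._sessions = {}

    def get_or_spawn(self, name, factory):
        entry = self._sessions.get(name)
        if entry is None or entry["failed"]:
            # Failed children are respawned instead of poisoning the parent.
            self._sessions[name] = {"controller": factory(), "failed": False}
        return self._sessions[name]["controller"]

    def mark_failed(self, name):
        self._sessions[name]["failed"] = True

registry = ControllerRegistry()
a = registry.get_or_spawn("store-a", object)
registry.mark_failed("store-a")
b = registry.get_or_spawn("store-b", object)   # siblings are unaffected
a2 = registry.get_or_spawn("store-a", object)  # failed child is respawned
assert b is not None and a2 is not a
```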

Performance

  • Lazy ValueMesh unpickling: values returned from accumulate are now deserialized on access rather than eagerly, reducing latency for large results (#2983).
  • RLE-compressed OnceBuffer accumulation: repeated identical values are run-length encoded during accumulation, cutting memory and network cost for common broadcast patterns (#2989).
  • Telemetry overhead was significantly reduced by demoting internal spans and gating channel-level tracing behind the DEBUG log level.
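To make the run-length-encoding idea concrete, here is a minimal sketch (not Monarch's implementation) of RLE accumulation: instead of one entry per rank, repeated identical values collapse into (value, count) runs.

```python
# Minimal sketch (not Monarch's implementation) of run-length-encoded
# accumulation over per-rank results.
def rle_accumulate(values):
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1] = (v, runs[-1][1] + 1)  # extend the current run
        else:
            runs.append((v, 1))              # start a new run
    return runs

# A broadcast where four ranks return "ok" and one returns "err" stores
# two runs instead of five entries:
print(rle_accumulate(["ok"] * 4 + ["err"]))  # [('ok', 4), ('err', 1)]
```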

Build & Packaging

  • Official aarch64 (ARM64) release binaries are now published alongside x86_64 on PyPI.

0.3.0

30 Jan 22:27

Monarch 0.3.0 Release Notes

New Features

Kubernetes Job Support

Monarch now supports running distributed training workloads on Kubernetes clusters. The new KubernetesJob API connects to pre-provisioned GPU pods managed by the https://github.com/meta-pytorch/monarch-kubernetes/ repository, enabling seamless multi-node DDP training on Kubernetes.

Key Capabilities:

  • Connect to Kubernetes pods using KubernetesJob
  • Provision GPU workers via the MonarchMesh Custom Resource Definition
  • Run multi-node DDP training using SPMDActor

Example:

  from monarch.job.kubernetes import KubernetesJob
  from monarch.spmd import SPMDActor

  k8s_job = KubernetesJob(namespace="monarch-tests")
  k8s_job.add_mesh("ddpmesh", num_replicas=2)

  job_state = k8s_job.state()
  proc_mesh = job_state.ddpmesh.spawn_procs({"gpus": 4})
  spmd_actors = proc_mesh.spawn("_SPMDActor", SPMDActor)

See the full tutorial: https://meta-pytorch.org/monarch/generated/examples/ddp/kubernetes_ddp.html

We also publish Docker images; see https://github.com/meta-pytorch/monarch/pkgs/container/monarch


monarch.spmd and monarch.job.spmd (SPMDJob)

The new monarch.job.spmd module provides serve() and run_spmd() for an interactive SPMD development workflow:

  • Reserve once, iterate many times: Allocate hosts once, then call run_spmd() repeatedly without reprovisioning
  • Remote debugging: Add breakpoint() in your training script and attach with monarch debug
  • Job caching: Reload cached job state and re-run on the same reserved hosts

Example:

  from monarch.job.spmd import serve

  job = serve(
      ["torchrun", "--nproc-per-node=4", "--standalone", "train.py"],
      scheduler="local_cwd",
  )
  job.run_spmd()

  # Later, reload and re-run without reprovisioning:
  job = job_load(".monarch/job_state.pkl")
  job.run_spmd()

This supports single-node training with command lists and multi-node training with TorchX AppDef on schedulers like Slurm.

See the example: https://meta-pytorch.org/monarch/generated/examples/ddp/spmd_job.html


Experimental Queue Dispatch Mode (Performance)

A new actor dispatch mode where Rust enqueues messages to a channel for Python to process, rather than Rust acquiring the GIL directly. This can improve throughput for message-heavy workloads.

  from monarch.config import configure

  configure(actor_queue_dispatch=True)

Real this_proc() for Local Spawning

The this_proc() function returns a handle to the current singleton process, enabling actors to spawn other actors locally. Remote actors can use this_proc() to spawn actors on their own host, enabling patterns like handing out references to a local proc and having remote actors spawn resources on it.

  from monarch.actor import Actor, endpoint, this_proc

  class ManagerActor(Actor):
      @endpoint
      def spawn_helper(self) -> "HelperActor":
          # Spawns HelperActor in the same process as ManagerActor
          return this_proc().spawn("helper", HelperActor)

Zero-Copy Messaging Path from Python

A new Buffer class enables zero-copy message serialization from Python. Large writes (≥256 bytes) are stored as references to Python bytes objects rather than being copied, integrating with multipart serialization for efficient vectored I/O.

  from monarch._rust_bindings.monarch_hyperactor.buffers import Buffer
  from monarch.config import configure

  buffer = Buffer()
  buffer.write(b"small")       # copied into pending buffer
  buffer.write(b"x" * 1000)    # stored as zero-copy reference

  # Configure the threshold via:
  configure(small_write_threshold=256)  # default

Principles of Ownership in Supervision

This release improves the supervision model for error handling in meshes, built on four core principles:

  1. Owned meshes: Creating a new mesh always results in an owned mesh
  2. Single ownership: Every mesh is owned by at most one actor (no transfer or suspension)
  3. Lifecycle binding: A mesh cannot outlive its owner; when the owner dies, so does the mesh
  4. Graceful cleanup: Stopped meshes drain pending messages before cleanup; owned meshes clean up before their owner
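The principles above can be sketched in a few lines of plain Python (this is an illustration of the ownership model, not Monarch's implementation):

```python
# Minimal sketch (not Monarch's implementation) of the ownership principles.
class Mesh:
    def __init__(self):
        self.stopped = False

    def stop(self):
        self.stopped = True  # a real mesh would drain pending messages first

class OwnerActor:
    def __init__(self):
        self._owned = []

    def create_mesh(self):
        mesh = Mesh()             # principle 1: new meshes are always owned
        self._owned.append(mesh)  # principle 2: one owner, no transfer
        return mesh

    def stop(self):
        for mesh in self._owned:  # principles 3-4: owned meshes are cleaned
            mesh.stop()           # up before their owner goes away

owner = OwnerActor()
mesh = owner.create_mesh()
owner.stop()
assert mesh.stopped  # the mesh did not outlive its owner
```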

Actors can now implement __supervise__ to handle failures from owned meshes.

Example:

  import logging

  from monarch.actor import Actor

  class ManagerActor(Actor):
      def __supervise__(self, failure: "MeshFailure") -> bool:
          logging.error(f"failure encountered: {failure}")
          # Return truthy to handle the failure; falsey (including None)
          # lets it propagate.
          return None

See the documentation: https://meta-pytorch.org/monarch/actors.html#error-handling-in-meshes


SkyPilot Integration (Community Contribution)

SkyPilotJob enables running Monarch on Kubernetes and cloud VMs across 20+ cloud providers (AWS, GCP, Azure, CoreWeave, Nebius, etc.) via https://skypilot.readthedocs.io/.

  import sky
  from monarch_skypilot import SkyPilotJob

  job = SkyPilotJob(
      meshes={"trainers": 2},
      resources=sky.Resources(accelerators="A100:1"),
      cluster_name="my-monarch-cluster",
  )
  state = job.state()
  trainers = state.trainers  # HostMesh with 2 nodes

Features:

  • Automatic cluster provisioning and teardown
  • Autostop for idle clusters
  • Workdir sync and custom file mounts
  • Default PyPI install or custom Docker images

Install with:

pip install torchmonarch-nightly "skypilot[kubernetes]"


Getting Started

Install Monarch 0.3.0:

pip install monarch==0.3.0

0.2.0

22 Dec 20:54

Monarch Release Notes

Overview

This release focuses on correctness, robustness, and operational maturity. Major improvements span supervision and shutdown semantics, logging and observability, Kubernetes readiness, SPMD workflows, test hygiene, and build compatibility. Monarch is now more predictable under failure, easier to debug, and better suited for long-running and large-scale deployments.


Supervision & Shutdown

Actor supervision and shutdown behavior has been significantly hardened and clarified.

Key Improvements

  • Strict supervision hierarchy

    • Every actor or process has exactly one parent (except the root).
    • Child actors can no longer persist after their parent faults or stops.
  • Reliable recursive shutdown

    • Asking an actor to stop deterministically stops its entire subtree.
    • Shutdown cases are documented, tested, and log spam has been audited.
  • Improved fault propagation

    • Supervision errors now describe the full hierarchy of exits.
    • Endpoint failures surface clearer context, including actor and endpoint names.
  • HostMesh lifecycle control

    • HostMesh can be cleanly stopped (disconnect clients and kill workers).
    • HostMesh can be force-killed, causing worker loops to exit immediately.
    • Persistent allocations remain usable for reconnects after stop.
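The hierarchy and recursive-shutdown guarantees above can be illustrated with a plain-Python sketch (not Monarch's internals): every actor has exactly one parent, and stopping an actor deterministically stops its entire subtree.

```python
# Plain-Python sketch (not Monarch's internals) of strict supervision
# hierarchy with reliable recursive shutdown.
class SupervisedActor:
    def __init__(self, parent=None):
        self.children = []
        self.stopped = False
        if parent is not None:
            parent.children.append(self)  # exactly one parent (except root)

    def stop(self):
        for child in self.children:
            child.stop()   # children cannot persist after their parent stops
        self.stopped = True

root = SupervisedActor()
mid = SupervisedActor(parent=root)
leaf = SupervisedActor(parent=mid)
root.stop()
print(leaf.stopped, mid.stopped)  # True True
```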

Logging

Logging has been refactored to improve clarity, reduce noise, and clearly separate user-facing signals from system internals.

Key Improvements

  • Clear separation of logs

    • Monarch system logs and user logs are cleanly separated.
    • User-visible faults are communicated only via exceptions and supervision events.
  • Improved error clarity

    • Errors are categorized (e.g., user, system, infrastructure).
    • Actor names are reported in user-understandable syntax.
    • Actor failure reports include richer context and causal chaining.
  • Structured logging

    • Errors emit structured log records suitable for filtering and aggregation.
    • Supervision events follow a defined schema.
  • Reduced default noise

    • Log forwarding, aggregation, and enrichment are disabled by default.
    • Log messages have been audited for signal quality.

Observability

Observability has been expanded across actors, meshes, and endpoints.

Key Improvements

  • Comprehensive metrics

    • Endpoint latency, throughput, payload size, and error counts are universally available.
    • Metrics are collected on both client and server sides.
  • Lifecycle instrumentation

    • Actor, process, and mesh state changes emit structured events.
    • Supervision events are fully instrumented.
  • Root-cause visibility

    • The first triggering event in a failure cascade is surfaced.
    • User-parseable actor IDs are linked to internal actor identifiers.
  • Tracing

    • Distributed spans cover message send and receive paths.
    • Traces can be visualized via Perfetto and standard tracing backends.
  • Performance awareness

    • Instrumentation overhead has been reduced and made configurable.

Build Hygiene & Compatibility

Build and dependency management has been simplified.

Key Improvements

  • RDMA and tensor engine support are dynamically loaded, so the same wheel can be installed whether or not those components are available on the host.
  • Monarch no longer has a binary dependency on PyTorch.
    • PyTorch is required only at the Python layer.
    • Startup time and binary size are significantly reduced.

Networking

Networking reliability has improved, with a focus on Lightning integration.

Key Improvements

  • Lightning integration works on HostMesh v1.
  • Networking behavior is documented and standardized for OSS usage.

Deprecation

The legacy v0 codepath has been removed.

0.1.0

22 Oct 05:02

🦋 Monarch v0.1.0 – Initial Release
We're excited to announce the first public release of Monarch, a distributed programming framework for PyTorch built around scalable actor messaging and direct memory access.
Monarch brings together ideas from actor-based concurrency, fault-tolerant supervision, and high-performance tensor communication to make distributed training simpler, more explicit, and faster.

🚀 Highlights

  1. Actor-Based Programming for PyTorch
    Define Python classes that run remotely as actors, send them messages, and coordinate distributed work using a clean, imperative API.

  from monarch.actor import Actor, endpoint, this_host

  training_procs = this_host().spawn_procs({"gpus": 8})

  class Trainer(Actor):
      @endpoint
      def train(self, step: int): ...

  trainers = training_procs.spawn("trainers", Trainer)
  trainers.train.call(step=0).get()

  2. Scalable Messaging and Meshes
    Actors are organized into meshes: collections that support broadcast, gather, and other scalable communication primitives.
  3. Supervision and Fault Tolerance
    Monarch adopts supervision trees for error handling and recovery. Failures propagate predictably, allowing fine-grained restart and robust distributed workflows.
  4. High-Performance RDMA Transfers
    Full RDMA integration for CPU and GPU memory via libibverbs, providing zero-copy, one-sided tensor communication across processes and hosts.
  5. Distributed Tensors
    Native support for tensors sharded across processes, enabling distributed compute without custom data movement code.

⚠️ Early Development Notice
Monarch is experimental and under active development.
Expect incomplete APIs, rapid iteration, and evolving interfaces.
We welcome contributions; please discuss significant changes or ideas via issues before submitting PRs.

v0.0.0

03 Sep 17:15

Pre-release