Skip to content
Merged
Show file tree
Hide file tree
Changes from 11 commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
11 changes: 10 additions & 1 deletion CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -8,11 +8,20 @@ The format follows [Keep a Changelog](https://keepachangelog.com/en/1.1.0/). The

### Added

- **Per-instance fan-out resume contract** (proposal 0009 / spec v0.18.0). The engine now writes a checkpoint record at every `completed` event inside a fan-out instance (in addition to the existing outermost-graph + subgraph-internal + fan-out node completion saves). On resume the engine consults the saved record's `fan_out_progress` field and treats each instance as `completed` (skip, contribution rolls forward), `in_flight` (re-run from subgraph entry), or `not_started` (dispatch normally). The `append` reducer's no-double-merge guarantee holds across resume because `completed` is a one-shot accumulator state.
- **`FanOutProgress` and `FanOutInstanceProgress` public dataclasses** on `openarmature.checkpoint`. The `CheckpointRecord.fan_out_progress` field is now `tuple[FanOutProgress, ...]` (default empty tuple), with per-instance state, result, and `completed_inner_positions` observability. Was a `None` placeholder under proposal 0008.
- **`FanOutInternalSaveBatching` config** on `InMemoryCheckpointer`. Backends MAY opt into batching scoped to fan-out instance internal saves to bound the write volume of high-instance-count fan-outs. Outermost-graph, subgraph-internal, and the fan-out node's own completion save remain synchronous regardless. Default off. Buffered-but-unflushed saves are lost on crash by design; on resume, instances whose `completed` state was only buffered revert and re-run. Surfaces a new optional `save_fan_out_internal` / `save_fan_out_in_flight_failure` Checkpointer Protocol seam; backends that don't implement either fall back to the standard `save`.
Comment thread
chris-colinsky marked this conversation as resolved.
Outdated
- **Patterns docs section** at `docs/patterns/`, sibling to Concepts. Seeded with four recipes drawn from downstream usage and proposal 0008's alternatives section: parameterized entry point, tool-dispatch-as-node, session-as-checkpoint-resume, and bypass-if-output-exists. Patterns are user-level how-to recipes composing existing primitives, not framework contracts; new patterns can be added without spec coordination. Each page follows a problem / approach / snippet / when this is the right pattern / when it isn't / cross-references structure.

### Changed

- **Fan-out resume behavior** flipped from atomic restart (0008's v1 contract) to per-instance resume. A crash mid-fan-out used to re-run the entire fan-out on resume; now only the instances that did not complete-and-record their contribution re-run. The economics matter for large fan-outs of expensive work (LLM calls, long extractions): an 80% complete fan-out crash now restores 80% of its results rather than discarding them.
- **`SQLiteCheckpointer` schema** picks up a new `fan_out_progress_blob` column (added via `ALTER TABLE` for backward compatibility with pre-0009 databases). Pre-0009 rows back-fill as NULL on load and round-trip as the empty-tuple default. Both `pickle` and `json` serialization modes round-trip the new field.

### Notes

- **Pinned spec version bumped to v0.17.1.** Proposal 0019 (multi-provider wire-format extension) reframes llm-provider §8 as a catalog of wire-format mappings, with the existing OpenAI-compatible body nested under §8.1. Purely textual on the spec side — no behavioral change, no fixture changes. Code and doc references to §8.X updated to match the new structure (§8.1 → §8.1.1, §8.2 → §8.1.2, §8.3 → §8.1.3, §8.5.1 → §8.1.5.1, §8.1.1 → §8.1.1.1). All existing conformance fixtures continue to pass.
- **Pinned spec version bumped from v0.17.1 to v0.18.0.** Proposal 0009 (per-instance fan-out resume) is the v0.18.0 driver: pipeline-utilities §10.3 / §10.7 revised, §10.11 added with per-instance state machine + composition rules + configurable batching. The `append` reducer no-double-merge invariant from §10.11.1 is the load-bearing correctness story.
- **Pinned spec version bumped to v0.17.1.** Proposal 0019 (multi-provider wire-format extension) reframes llm-provider §8 as a catalog of wire-format mappings, with the existing OpenAI-compatible body nested under §8.1. Purely textual on the spec side, no behavioral change, no fixture changes. Code and doc references to §8.X updated to match the new structure (§8.1 to §8.1.1, §8.2 to §8.1.2, §8.3 to §8.1.3, §8.5.1 to §8.1.5.1, §8.1.1 to §8.1.1.1). All existing conformance fixtures continue to pass.
Comment thread
chris-colinsky marked this conversation as resolved.
Outdated
Comment thread
chris-colinsky marked this conversation as resolved.
Outdated

Comment thread
chris-colinsky marked this conversation as resolved.
## [0.8.0] — 2026-05-23

Expand Down
29 changes: 21 additions & 8 deletions docs/concepts/checkpointing.md
Original file line number Diff line number Diff line change
Expand Up @@ -26,9 +26,12 @@ graph = (
```

The engine writes a record at every `completed` event for outermost-
graph nodes and subgraph-internal nodes. **Fan-out instance internal
events do NOT save** in the shipping version. Atomic-restart is the
fan-out contract.
graph nodes, subgraph-internal nodes, and fan-out instance internal
nodes. **Per-instance fan-out resume** is the contract: on resume the
engine re-runs only the instances that did not complete-and-record
their contribution into the fan-out accumulator in the prior run;
completed instances skip and their contributions roll forward to the
fan-in step.

## Saves are synchronous-by-contract

Expand Down Expand Up @@ -79,8 +82,8 @@ class CheckpointRecord:
completed_positions: tuple[NodePosition, ...]
parent_states: tuple[Any, ...]
last_saved_at: float
schema_version: str = CHECKPOINT_SCHEMA_VERSION
fan_out_progress: None = field(default=None)
schema_version: str = ""
fan_out_progress: tuple[FanOutProgress, ...] = field(default=())
```

Field framing worth getting right:
Expand All @@ -103,9 +106,19 @@ Field framing worth getting right:
Outermost first; empty for an outer-level save. Inner-node saves
populate it so resume can re-enter a subgraph from the right
depth without re-projecting.
- **`fan_out_progress: None` is reserved** for a future per-instance
fan-out resume mode (planned, not yet shipped). In the shipping
version it's always `None`.
- **`fan_out_progress` carries per-fan-out-node progress** when one
or more fan-outs are in flight at save time. Each `FanOutProgress`
entry records the fan-out's name, namespace, instance count, and a
per-instance state machine (`not_started` / `in_flight` /
`completed`) plus the recorded contribution for finalized
instances. On resume the engine consults this field to decide
which instances skip (their contributions roll forward) vs re-run
(re-execute from the inner subgraph's declared entry node). Empty
tuple when no fan-outs are in flight. See
[Resume semantics](fan-out.md#resume-semantics) on the fan-out
page for the full per-instance contract including reducer
composition, error_policy semantics, and the optional
fan-out-internal save batching.

## The Checkpointer Protocol

Expand Down
61 changes: 53 additions & 8 deletions docs/concepts/fan-out.md
Original file line number Diff line number Diff line change
Expand Up @@ -125,14 +125,59 @@ namespace.
## Resume semantics

A fan-out node's `completed` event triggers a save like any other
outermost-graph or subgraph-internal node. **Per-instance internal
events do NOT save** in the shipping version; on resume, the
fan-out re-runs end-to-end if it hadn't completed (atomic restart).

A per-instance fan-out resume mode is planned but not yet shipped.
The `fan_out_progress` field on `CheckpointRecord` is reserved for
its eventual contents. Until it lands, atomic restart is the
shipping behavior.
outermost-graph or subgraph-internal node. Per-instance internal
events also save, and the resume contract is **per-instance**: the
engine consults the saved record's `fan_out_progress` entry for
this fan-out and treats each instance as one of three states:

- **`completed`**: the instance ran to completion in the prior run
and recorded its contribution into the accumulator. The engine
skips re-execution on resume; the contribution rolls forward to
the fan-in step.
- **`in_flight`**: the instance began execution but its terminal
inner node had not yet fired `completed` at save time, so no
contribution was recorded. On resume the engine re-runs the
instance from the subgraph's declared entry node.
`completed_inner_positions` on the saved record are observational
only; they do NOT serve as a per-inner-node resume point.
- **`not_started`**: the instance was not dispatched at save time.
On resume the engine dispatches it normally.

The `append` reducer's no-double-merge guarantee holds because
`completed` is a one-shot accumulator state: every completed
instance's contribution rolls forward exactly once at fan-in.

Under `error_policy: collect`, a failed instance's error record IS
a `completed` contribution (the error rolls forward through the
`errors_field` bucket rather than `target_field`). Under
`error_policy: fail_fast`, a failed instance leaves the saved
record with that instance in `in_flight` state; cancelled siblings
are `in_flight` or `not_started`. None are `completed`, so resume
re-runs them all.

Per-instance saves can be high-volume in fan-outs with many
instances or many inner nodes per instance. `Checkpointer` backends
MAY opt into **configurable batching** scoped to fan-out instance
internal saves; outermost-graph, subgraph-internal, and the fan-out
node's own completion save remain synchronous. The in-memory
backend exposes the knob via:

```python
from openarmature.checkpoint import (
InMemoryCheckpointer,
FanOutInternalSaveBatching,
)

cp = InMemoryCheckpointer(
fan_out_internal_save_batching=FanOutInternalSaveBatching(flush_every=10),
)
```

Buffered-but-unflushed saves are lost on crash by design:
instances whose `completed` state was only buffered revert to
`in_flight` / `not_started` on resume and re-run. The trade-off is
explicit (fewer writes per fan-out instance vs some redundant
re-execution on crash recovery); default is no batching.

## When to reach for fan-out

Expand Down
22 changes: 22 additions & 0 deletions docs/examples/05-fan-out-with-retry.md
Original file line number Diff line number Diff line change
Expand Up @@ -46,6 +46,28 @@ sentinel headline that always raises `ProviderUnavailable`; under
`error_policy` at runtime. Inner-instance events carry
`fan_out_index` but not the config.

## Composing with checkpointing

This example doesn't register a `Checkpointer`, but the fan-out
pattern composes cleanly with checkpoint resume. When a fan-out
runs under a registered backend, the resume contract is
**per-instance**: instances that completed in the prior run skip
re-execution and their contributions roll forward through the
fan-in step; instances that were `in_flight` at save time re-run
from the subgraph's entry node; not-started instances dispatch
normally. The `append` reducer's no-double-merge guarantee holds
across resume because `completed` is a one-shot accumulator state.

Composition with `instance_middleware` (retry): on resume, an
instance's `attempt_index` resets to 0 (a fresh retry budget) per
spec graph-engine §6's resume semantics. So a retry-exhausted
instance whose `in_flight` state was saved gets a fresh budget on
the resumed run.

See [Resume semantics in fan-out](../concepts/fan-out.md#resume-semantics)
and the [Checkpointing concept page](../concepts/checkpointing.md)
for the full contract.

## How to run

```bash
Expand Down
2 changes: 1 addition & 1 deletion openarmature-spec
Submodule openarmature-spec updated 23 files
+36 −0 CHANGELOG.md
+2 −7 README.md
+134 −0 docs/open-questions.md
+1 −1 docs/proposals.md
+3 −0 mkdocs.yml
+3 −2 proposals/0009-pipeline-utilities-per-instance-fan-out-resume.md
+0 −44 spec/pipeline-utilities/conformance/028-checkpoint-fan-out-atomic-restart.md
+0 −83 spec/pipeline-utilities/conformance/028-checkpoint-fan-out-atomic-restart.yaml
+46 −0 spec/pipeline-utilities/conformance/048-checkpoint-fan-out-per-instance-resume-skips-completed.md
+96 −0 spec/pipeline-utilities/conformance/048-checkpoint-fan-out-per-instance-resume-skips-completed.yaml
+45 −0 spec/pipeline-utilities/conformance/049-checkpoint-fan-out-per-instance-resume-append-reducer.md
+92 −0 spec/pipeline-utilities/conformance/049-checkpoint-fan-out-per-instance-resume-append-reducer.yaml
+47 −0 spec/pipeline-utilities/conformance/050-checkpoint-fan-out-in-flight-instance-restart.md
+109 −0 spec/pipeline-utilities/conformance/050-checkpoint-fan-out-in-flight-instance-restart.yaml
+48 −0 spec/pipeline-utilities/conformance/051-checkpoint-fan-out-fail-fast-resume.md
+96 −0 spec/pipeline-utilities/conformance/051-checkpoint-fan-out-fail-fast-resume.yaml
+49 −0 spec/pipeline-utilities/conformance/052-checkpoint-fan-out-collect-errors-resume.md
+110 −0 spec/pipeline-utilities/conformance/052-checkpoint-fan-out-collect-errors-resume.yaml
+50 −0 spec/pipeline-utilities/conformance/053-checkpoint-fan-out-instance-middleware-retry-resume.md
+98 −0 spec/pipeline-utilities/conformance/053-checkpoint-fan-out-instance-middleware-retry-resume.yaml
+58 −0 spec/pipeline-utilities/conformance/054-checkpoint-fan-out-batching-buffered-saves-lost-on-crash.md
+139 −0 spec/pipeline-utilities/conformance/054-checkpoint-fan-out-batching-buffered-saves-lost-on-crash.yaml
+245 −71 spec/pipeline-utilities/spec.md
2 changes: 1 addition & 1 deletion pyproject.toml
Original file line number Diff line number Diff line change
Expand Up @@ -48,7 +48,7 @@ Repository = "https://github.com/LunarCommand/openarmature-python"
Specification = "https://github.com/LunarCommand/openarmature-spec"

[tool.openarmature]
spec_version = "0.17.1"
spec_version = "0.18.1"

[dependency-groups]
dev = [
Expand Down
2 changes: 1 addition & 1 deletion src/openarmature/__init__.py
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
"""OpenArmature: workflow framework for LLM pipelines and tool-calling agents."""

__version__ = "0.8.0"
__spec_version__ = "0.17.1"
__spec_version__ = "0.18.1"
7 changes: 6 additions & 1 deletion src/openarmature/checkpoint/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -20,7 +20,7 @@
restores from a prior record.
"""

from .backends import InMemoryCheckpointer, SerializationMode, SQLiteCheckpointer
from .backends import FanOutInternalSaveBatching, InMemoryCheckpointer, SerializationMode, SQLiteCheckpointer
from .errors import (
CheckpointError,
CheckpointNotFound,
Expand All @@ -36,6 +36,8 @@
CheckpointFilter,
CheckpointRecord,
CheckpointSummary,
FanOutInstanceProgress,
FanOutProgress,
NodePosition,
)

Expand All @@ -51,6 +53,9 @@
"CheckpointStateMigrationMissing",
"CheckpointSummary",
"Checkpointer",
"FanOutInstanceProgress",
"FanOutInternalSaveBatching",
"FanOutProgress",
"InMemoryCheckpointer",
"MigrationRegistry",
"NodePosition",
Expand Down
3 changes: 2 additions & 1 deletion src/openarmature/checkpoint/backends/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -17,10 +17,11 @@
from openarmature.checkpoint import InMemoryCheckpointer, SQLiteCheckpointer
"""

from .memory import InMemoryCheckpointer
from .memory import FanOutInternalSaveBatching, InMemoryCheckpointer
from .sqlite import SerializationMode, SQLiteCheckpointer

__all__ = [
"FanOutInternalSaveBatching",
"InMemoryCheckpointer",
"SQLiteCheckpointer",
"SerializationMode",
Expand Down
Loading