feat(agents): v0.32 prep - Memory Stores, Multiagent, Outcomes, Skills publisher, Agent SDK runtime, Tool Search, Webhooks, Computer Use, Trajectory Replay, Elicitation by gizmax · Pull Request #224 · gizmax/Sandcastle

gizmax · 2026-05-16T17:09:21Z

Summary

v0.32 "Claude Agents Deep Integration" preparation. Surfaces every Anthropic Managed Agents primitive added under the managed-agents-2026-04-01 beta umbrella (Memory Stores, Multiagent coordinator, Outcomes, Webhooks), plus separate Agent SDK and Computer Use integrations, plus Sandcastle-side differentiators (Trajectory Replay, Skills publisher, Tool Search).

Built by 9 parallel subagents across a Phase 1-3 workflow, then test fixtures were aligned with the audit PR #217.

What's in

Tier 1 - Wire fixes (1 commit)

tools_enabled propagated to API (was ignored)
temperature, max_tokens, thinking_budget on ManagedAgentConfig
stream config field actually used
Pricing table (Opus 4.7 / Sonnet 4.6 / Haiku 4.5)
Fallback chain (list[str] up to 5)
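
For reviewers, the Tier 1 config surface can be pictured roughly as below. This is a minimal sketch, not the actual ManagedAgentConfig: the class name, the `model` field, the method names, and the choice to drop unset sampling params from the request are all assumptions made for illustration. Only the field names temperature, max_tokens, thinking_budget, stream and the "list[str] up to 5" limit come from the list above.

```python
from dataclasses import dataclass, field
from typing import Any, Optional

MAX_FALLBACK_MODELS = 5  # "up to 5" per the Tier 1 list


@dataclass
class ManagedAgentConfigSketch:
    """Hypothetical stand-in for ManagedAgentConfig (illustration only)."""

    model: str  # assumed field; placeholder model names are used below
    temperature: Optional[float] = None
    max_tokens: Optional[int] = None
    thinking_budget: Optional[int] = None
    stream: bool = False
    fallback_chain: list[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Reject chains longer than the documented limit.
        if len(self.fallback_chain) > MAX_FALLBACK_MODELS:
            raise ValueError(
                f"fallback_chain accepts at most {MAX_FALLBACK_MODELS} models, "
                f"got {len(self.fallback_chain)}"
            )

    def models_in_order(self) -> list[str]:
        # Primary model first, then fallbacks in the order given.
        return [self.model, *self.fallback_chain]

    def to_request_kwargs(self) -> dict[str, Any]:
        # Assumption: unset sampling params are omitted rather than sent as null.
        optional = {
            "temperature": self.temperature,
            "max_tokens": self.max_tokens,
            "thinking_budget": self.thinking_budget,
        }
        kwargs: dict[str, Any] = {k: v for k, v in optional.items() if v is not None}
        kwargs["stream"] = self.stream
        return kwargs
```

Validating the chain length at construction time means an over-long fallback list fails when the workflow is loaded rather than midway through a run.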

Tier 2 - Anthropic primitives (5 modules, ~70 tests)

Memory Stores client + attach_to_session_payload helper, /mnt/memory/ versioned, redact endpoint, 100kB / 8-store limits
Multiagent coordinator + 3 pre-baked templates (research-and-write, code-review-and-test, analyst-with-translator)
Webhooks subscriber + HMAC handler + FastAPI router for session lifecycle events
Tool Search registry + 1-5 example convention + docs/tool-examples-convention.md
Outcomes API client + composite aggregator (user.define_outcome events + span.outcome_evaluation_end capture)
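
The HMAC handler in the webhooks module follows the usual verify-before-parse pattern. The sketch below shows that pattern only; the PR text does not state the signature scheme, so hex-encoded HMAC-SHA256 over the raw request body is an assumption, and `sign_payload` / `verify_signature` are hypothetical names rather than functions from this PR.

```python
import hashlib
import hmac


def sign_payload(secret: bytes, body: bytes) -> str:
    # Assumed scheme: hex-encoded HMAC-SHA256 over the raw request body.
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify_signature(secret: bytes, body: bytes, signature_hex: str) -> bool:
    expected = sign_payload(secret, body)
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected.encode("utf-8"), signature_hex.encode("utf-8"))
```

The signature is checked against the raw bytes as received, before any JSON parsing, because re-serializing the payload can change whitespace or key order and invalidate the digest.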

Tier 3 - Differentiators (3 modules + Agent SDK alt runtime)

Trajectory Replay step type with SHA-256 checksum + diff_trajectories + replay_score - leverages our audit-chain to make replays cryptographically verifiable
Skills Publisher with tar.gz SKILL.md package + sandcastle publish-skills CLI - Sandcastle becomes an Anthropic Skills publisher
Computer Use integration helper + safety pre-flight (computer-use-2025-11-24 beta)
Agent SDK runtime as runtime: "agent-sdk" alternative (in-process, no Managed Agents infra needed)
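
To make the Trajectory Replay idea concrete, here is a rough sketch of a checksum, a diff, and a score over recorded steps. The names diff_trajectories and replay_score appear in this PR, but their real signatures and semantics are not described here; the step shape (plain dicts), the canonical-JSON hashing, and the "fraction of matching positions" score are assumptions for illustration.

```python
import hashlib
import json
from typing import Any

Step = dict[str, Any]  # assumed step shape: a JSON-serializable dict


def trajectory_checksum(steps: list[Step]) -> str:
    # Canonical JSON (sorted keys, fixed separators) so equal content hashes equally.
    canonical = json.dumps(steps, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def diff_trajectories(recorded: list[Step], replayed: list[Step]) -> list[int]:
    # Indices where the two runs differ, including positions present in only one.
    length = max(len(recorded), len(replayed))
    diffs = []
    for i in range(length):
        a = recorded[i] if i < len(recorded) else None
        b = replayed[i] if i < len(replayed) else None
        if a != b:
            diffs.append(i)
    return diffs


def replay_score(recorded: list[Step], replayed: list[Step]) -> float:
    # Assumed metric: share of positions that match exactly (1.0 = identical).
    length = max(len(recorded), len(replayed))
    if length == 0:
        return 1.0
    return 1.0 - len(diff_trajectories(recorded, replayed)) / length
```

Hashing a canonical serialization is what lets a stored checksum be compared against a replay later without keeping both trajectories in memory.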

Wiring (4 commits)

New step types trajectory-replay and computer-use registered (VALID_STEP_TYPES 22 -> 24)
managed-agent step accepts memory_stores, multiagent, outcomes config fields
agent_webhooks router mounted in main.py
sandcastle publish-skills [--upload] [--dir] subcommand
runtime: "agent-sdk" dispatch in RUNTIMES registry
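
The runtime dispatch can be thought of as a name-to-callable registry. The sketch below is illustrative only: `RUNTIMES_SKETCH`, `register_runtime`, `dispatch`, the "default" runtime key, and the step dict shape are hypothetical; only the "agent-sdk" key is taken from this PR.

```python
from typing import Any, Callable

RuntimeFn = Callable[[dict[str, Any]], str]

DEFAULT_RUNTIME = "default"  # hypothetical name for the pre-existing runtime
RUNTIMES_SKETCH: dict[str, RuntimeFn] = {}


def register_runtime(name: str) -> Callable[[RuntimeFn], RuntimeFn]:
    # Decorator that records a runner under its runtime name.
    def decorator(fn: RuntimeFn) -> RuntimeFn:
        RUNTIMES_SKETCH[name] = fn
        return fn

    return decorator


@register_runtime(DEFAULT_RUNTIME)
def run_default(step: dict[str, Any]) -> str:
    return f"default:{step['prompt']}"


@register_runtime("agent-sdk")
def run_agent_sdk(step: dict[str, Any]) -> str:
    # Placeholder body; a real runner would call into the Agent SDK in-process.
    return f"agent-sdk:{step['prompt']}"


def dispatch(step: dict[str, Any]) -> str:
    name = step.get("runtime", DEFAULT_RUNTIME)
    try:
        runner = RUNTIMES_SKETCH[name]
    except KeyError:
        known = sorted(RUNTIMES_SKETCH)
        raise ValueError(f"unknown runtime {name!r}; known: {known}") from None
    return runner(step)
```

With this shape, adding a runtime is a single registration and an unknown `runtime:` value fails loudly instead of silently falling back.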

Dashboard

Live "Agent Reasoning" panel with SSE event stream on RunDetailPage
Supports 7 event types (thinking, tool_use, message, etc) + thread grouping + error states
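
For context on what the panel consumes, the sketch below parses a Server-Sent Events stream and groups event types by thread. It is written in Python for consistency with the rest of the PR, not as the dashboard's actual code. The JSON payload shape, the `thread_id` field, and the "main" fallback thread are assumptions; the event names thinking, tool_use and message come from the list above.

```python
import json
from collections import defaultdict
from typing import Any


def parse_sse(stream: str) -> list[dict[str, Any]]:
    # Frames are separated by a blank line; each frame has event:/data: fields.
    events = []
    for frame in stream.split("\n\n"):
        event_type = "message"  # SSE default when no event: field is present
        data_lines = []
        for line in frame.splitlines():
            if line.startswith("event:"):
                event_type = line[len("event:"):].strip()
            elif line.startswith("data:"):
                data_lines.append(line[len("data:"):].strip())
        if data_lines:
            # Assumption: every data payload is a JSON object.
            events.append({"type": event_type, "data": json.loads("\n".join(data_lines))})
    return events


def group_by_thread(events: list[dict[str, Any]]) -> dict[str, list[str]]:
    # Assumption: events carry an optional thread_id; ungrouped ones go to "main".
    grouped: dict[str, list[str]] = defaultdict(list)
    for event in events:
        grouped[event["data"].get("thread_id", "main")].append(event["type"])
    return dict(grouped)
```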

Test counts

Suite	Result
Phase 1 (wire fixes)	18 new tests passing
Phase 2 (9 modules in isolation)	156 new tests passing
Phase 3 (e2e wiring)	13 new tests passing
Total v0.32 prep tests	169 passing in 1.8s
Dashboard build + vitest	clean + 794 passing

What's NOT in

No version bump - this PR is "prep" because the audit PR #217 ("chore: 2026 stack audit - SEO + deps + A2A v1.0 + retired models + MCP elicitation") went out first as the v0.31.x foundation.
No CHANGELOG / WhatsNew - those land with the v0.32.0 release commit.
No PyPI publish - waiting for release sign-off.

Risk

Medium. The PR adds 24 new files and extends 5 existing core files (executor.py, dag.py, mcp_server.py, agent_runtime.py, main.py). Test coverage is high, but full-suite pollution (#218) makes it harder to tell flakes from regressions. The architectural prerequisite, PR #223 (StaticPool), lands separately.

How this rebases

Cleanly rebased on top of audit PR #217 (commit b244eda). Zero conflicts despite both PRs touching executor.py and generator.py - audit changes were small and orthogonal to the v0.32 wiring.

Follow-ups

After merge:

v0.32.0 release commit (version bump + CHANGELOG + WhatsNew page)
Build + twine upload + tag
Site updates (homepage v0.31 -> v0.32, new Sandcastle Lite story)
Cherry-pick #223 ("fix(db): StaticPool for in-memory SQLite + module-cache reset between tests") ahead of v0.32.0 if not already merged

gizmax · 2026-05-16T17:18:52Z

CI results

Job	Result	Notes
Dashboard build + tests	✓ PASS	794/794 vitest
Python tests	15,176 passed / 72 failed / 1 error	net +167 passing vs main baseline

Delta vs main (`b244eda`)

Passed: 15,009 → 15,176 = +167 new green tests (Phase 1 + Phase 2 + Phase 3 wiring tests)
Failed: 70 → 72 = +2 (within pollution-baseline noise band, same test_workflow_stats_endpoint / test_workflow_api_a2a_v27 cluster)
Error: 1 (same pre-existing test_race_all_fail_fallback timeout)

Zero new categories of failures. Both new failures fall in the documented #218 pollution baseline.

Verification

pytest tests/test_managed_agent_wires.py tests/test_memory_stores.py tests/test_multiagent.py tests/test_agent_webhooks.py tests/test_tool_search.py tests/test_outcomes.py tests/test_trajectory_replay.py tests/test_agent_skills.py tests/test_computer_use.py tests/test_agent_sdk_runtime.py tests/test_v032_wiring.py -> 169/169 passing in 1.8s locally
Rebase on top of audit PR #217 ("chore: 2026 stack audit - SEO + deps + A2A v1.0 + retired models + MCP elicitation") was clean (zero conflicts)
Dashboard build OK after rebase

Ready for review. Recommend merging #223 (StaticPool foundation) first since it's independent, then this PR.

Tomas Pflanzer added 15 commits May 16, 2026 19:08

fix(agents): wire tools_enabled + sampling params + stream + pricing table + fallback chain

2fb8cc8

feat(agents): Anthropic Memory Stores client (v0.32 prep)

25bef6e

feat(agents): tool search + tool use examples registry (v0.32 prep)

d67b44b

feat(agents): webhooks subscriber + HMAC handler for session lifecycle (v0.32 prep)

2eb8fbc

feat(agents): multiagent coordinator helper + 3 pre-baked templates (v0.32 prep)

a947e76

feat(agents): Computer Use integration helper + safety pre-flight (v0.32 prep)

48020d7

feat(agents): Anthropic Outcomes API client + composite aggregator (v0.32 prep)

6896d2f

feat(agents): trajectory replay step type primitives (v0.32 prep)

20abc5e

feat(agents): Claude Agent SDK alternative runtime (v0.32 prep)

940bde7

feat(agents): Anthropic Skills publisher - workflows as uploadable Skills (v0.32 prep)

1cafa60

feat(dashboard): live Agent Reasoning panel with SSE event stream (v0.32 prep)

54cf82d

feat(agents): register trajectory-replay + computer-use as step types (v0.32 prep)

0017927

feat(agents): managed-agent step gains memory_stores + multiagent + outcomes wiring (v0.32 prep)

ae665ae

feat(agents): mount webhooks router + publish-skills CLI + agent-sdk runtime dispatch (v0.32 prep)

9bc64d7

test(agents): end-to-end wiring tests for v0.32 prep modules

6a616e6

gizmax merged commit 4314b72 into main May 16, 2026
1 of 2 checks passed

gizmax deleted the feat/v0.32-agents-deep branch May 16, 2026 17:39

gizmax mentioned this pull request May 16, 2026

release: v0.32.0 - Claude Agents Deep Integration #225

Merged

gizmax merged 15 commits into main from feat/v0.32-agents-deep
