Skip to content

feat(agents): v0.32 prep - Memory Stores, Multiagent, Outcomes, Skills publisher, Agent SDK runtime, Tool Search, Webhooks, Computer Use, Trajectory Replay, Elicitation#224

Merged
gizmax merged 15 commits into
mainfrom
feat/v0.32-agents-deep
May 16, 2026
Merged

Conversation

@gizmax
Copy link
Copy Markdown
Owner

@gizmax gizmax commented May 16, 2026

Summary

v0.32 "Claude Agents Deep Integration" preparation. Surfaces every Anthropic Managed Agents primitive added under the managed-agents-2026-04-01 beta umbrella (Memory Stores, Multiagent coordinator, Outcomes, Webhooks), plus separate Agent SDK and Computer Use integrations, plus Sandcastle-side differentiators (Trajectory Replay, Skills publisher, Tool Search).

Built by 9 parallel subagents over Phase 1-3 workflow, then test-fixture-aligned to the audit PR #217.

What's in

Tier 1 - Wire fixes (1 commit)

  • tools_enabled propagated to API (was ignored)
  • temperature, max_tokens, thinking_budget on ManagedAgentConfig
  • stream config field actually used
  • Pricing table (Opus 4.7 / Sonnet 4.6 / Haiku 4.5)
  • Fallback chain (list[str] up to 5)

Tier 2 - Anthropic primitives (5 modules, ~70 tests)

  • Memory Stores client + attach_to_session_payload helper, /mnt/memory/ versioned, redact endpoint, 100kB / 8-store limits
  • Multiagent coordinator + 3 pre-baked templates (research-and-write, code-review-and-test, analyst-with-translator)
  • Webhooks subscriber + HMAC handler + FastAPI router for session lifecycle events
  • Tool Search registry + 1-5 example convention + docs/tool-examples-convention.md
  • Outcomes API client + composite aggregator (user.define_outcome events + span.outcome_evaluation_end capture)

Tier 3 - Differentiators (3 modules + Agent SDK alt runtime)

  • Trajectory Replay step type with SHA-256 checksum + diff_trajectories + replay_score - leverages our audit-chain to make replays cryptographically verifiable
  • Skills Publisher with tar.gz SKILL.md package + sandcastle publish-skills CLI - Sandcastle becomes an Anthropic Skills publisher
  • Computer Use integration helper + safety pre-flight (computer-use-2025-11-24 beta)
  • Agent SDK runtime as runtime: "agent-sdk" alternative (in-process, no Managed Agents infra needed)

Wiring (4 commits)

  • New step types trajectory-replay and computer-use registered (VALID_STEP_TYPES 22 -> 24)
  • managed-agent step accepts memory_stores, multiagent, outcomes config fields
  • agent_webhooks router mounted in main.py
  • sandcastle publish-skills [--upload] [--dir] subcommand
  • runtime: "agent-sdk" dispatch in RUNTIMES registry

Dashboard

  • Live "Agent Reasoning" panel with SSE event stream on RunDetailPage
  • Supports 7 event types (thinking, tool_use, message, etc) + thread grouping + error states

Test counts

Suite Result
Phase 1 (wire fixes) 18 new tests passing
Phase 2 (9 modules in isolation) 156 new tests passing
Phase 3 (e2e wiring) 13 new tests passing
Total v0.32 prep tests 169 passing in 1.8s
Dashboard build + vitest clean + 794 passing

What's NOT in

Risk

Medium. The PR adds 24 new files + extends 5 existing core files (executor.py, dag.py, mcp_server.py, agent_runtime.py, main.py). Test coverage is high but full-suite pollution (#218) makes flake-vs-regression distinction harder. Architectural prerequisite PR #223 (StaticPool) lands separately.

How this rebases

Cleanly rebased on top of audit PR #217 (commit b244eda). Zero conflicts despite both PRs touching executor.py and generator.py - audit changes were small and orthogonal to the v0.32 wiring.

Follow-ups

After merge:

@gizmax
Copy link
Copy Markdown
Owner Author

gizmax commented May 16, 2026

CI results

Job Result Notes
Dashboard build + tests ✓ PASS 794/794 vitest
Python tests 15,176 passed / 72 failed / 1 error net +167 passing vs main baseline

Delta vs main (b244eda)

  • Passed: 15,009 → 15,176 = +167 new green tests (Phase 1 + Phase 2 + Phase 3 wiring tests)
  • Failed: 70 → 72 = +2 (within pollution-baseline noise band, same test_workflow_stats_endpoint / test_workflow_api_a2a_v27 cluster)
  • Error: 1 (same pre-existing test_race_all_fail_fallback timeout)

Zero new categories of failures. Both new failures fall in the documented #218 pollution baseline.

Verification

  • pytest tests/test_managed_agent_wires.py tests/test_memory_stores.py tests/test_multiagent.py tests/test_agent_webhooks.py tests/test_tool_search.py tests/test_outcomes.py tests/test_trajectory_replay.py tests/test_agent_skills.py tests/test_computer_use.py tests/test_agent_sdk_runtime.py tests/test_v032_wiring.py -> 169/169 passing in 1.8s locally
  • Rebase on top of audit PR chore: 2026 stack audit - SEO + deps + A2A v1.0 + retired models + MCP elicitation #217 was clean (zero conflicts)
  • Dashboard build OK after rebase

Ready for review. Recommend merging #223 (StaticPool foundation) first since it's independent, then this PR.

@gizmax gizmax merged commit 4314b72 into main May 16, 2026
1 of 2 checks passed
@gizmax gizmax deleted the feat/v0.32-agents-deep branch May 16, 2026 17:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant