Cut over to stoa native tool calling and structured outputs (follow-up to flarexio/stoa#46) #6

@flarexium

Goal

Migrate the bookkeeping agent off the prompt-engineered tool-call envelope onto stoa's new native function-calling + response_format: json_schema protocol delivered by flarexio/stoa#46. Eliminate the "two JSON shapes" instruction block from the prompt and let find_accounts flow through the provider's native tools / tool_calls channel.

Background

Today the bookkeeping agent reaches the OpenAI adapter through stoa's ReasoningResult[Intent] envelope (agent/agent.go:85, agent/prompt.go:122-129). The prompt asks the model to "return JSON in ONE of these two shapes": a tool_calls envelope or an intent envelope. The adapter sets response_format: json_object (lenient mode) and parses whatever string the model emits.

Observed failure mode with GPT-5.4 mini: the model emits the same JSON object twice in one turn ({...}{...}), or fills both intent and tool_calls, or fills neither. JSONDecoder.Decode then rejects the payload and the run aborts. Larger models hide the same weakness behind better instruction-following; the underlying schema and protocol are the root cause.
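
The three failure shapes above can be reproduced with the standard library alone. The sketch below is an illustration, not stoa's parser: the `envelope` type and `parseEnvelope` helper are hypothetical stand-ins for the old two-shape reply, and the intent payloads are made up.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// envelope is a hypothetical stand-in for the old two-shape reply; the real
// stoa ReasoningResult type may differ.
type envelope struct {
	Intent    json.RawMessage   `json:"intent,omitempty"`
	ToolCalls []json.RawMessage `json:"tool_calls,omitempty"`
}

// parseEnvelope decodes one reply and enforces "exactly one of intent or
// tool_calls", the rule the old prompt could only request in prose.
func parseEnvelope(raw string) (envelope, error) {
	var env envelope
	// json.Unmarshal rejects trailing data, so "{...}{...}" fails here.
	if err := json.Unmarshal([]byte(raw), &env); err != nil {
		return env, fmt.Errorf("decode reply: %w", err)
	}
	hasIntent := len(env.Intent) > 0
	hasTools := len(env.ToolCalls) > 0
	switch {
	case hasIntent && hasTools:
		return env, errors.New("reply fills both intent and tool_calls")
	case !hasIntent && !hasTools:
		return env, errors.New("reply fills neither intent nor tool_calls")
	}
	return env, nil
}

func main() {
	replies := []string{
		`{"intent":{"kind":"reject"}}`,
		`{"intent":{"kind":"reject"}}{"intent":{"kind":"reject"}}`,
		`{"intent":{"kind":"reject"},"tool_calls":[{"name":"find_accounts"}]}`,
		`{}`,
	}
	for _, r := range replies {
		_, err := parseEnvelope(r)
		fmt.Printf("ok=%v err=%v\n", err == nil, err)
	}
}
```

Only the first reply passes. The other three are syntactically plausible model output, which is why the check cannot be left to prompt wording.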

flarexio/stoa#46 fixes this at the source: ReasoningResult[TIntent] is replaced by a discriminated ReasoningOutput[TIntent] that carries either Intent or ToolCalls (never both); the OpenAI adapter switches to json_schema strict mode for intents and registers tools through params.Tools. The breaking change lands in stoa as a v1.x → v2.0 bump.
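
As a rough picture of what "either Intent or ToolCalls (never both)" means on the Go side, here is a minimal sketch. The field names, the `ToolCall` type, the `Validate` method, and the placeholder `Intent` are all assumptions made for illustration; the types that flarexio/stoa#46 actually ships may look different.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// ToolCall and ReasoningOutput are hypothetical shapes for illustration; the
// types shipped by flarexio/stoa#46 may differ in names and fields.
type ToolCall struct {
	ID   string
	Name string
	Args json.RawMessage
}

type ReasoningOutput[TIntent any] struct {
	Intent    *TIntent   // set when the model produced a final intent
	ToolCalls []ToolCall // set when the model asked for tools instead
}

// Validate enforces the discriminated-union invariant: exactly one arm is set.
func (o ReasoningOutput[TIntent]) Validate() error {
	switch {
	case o.Intent != nil && len(o.ToolCalls) > 0:
		return errors.New("output carries both an intent and tool calls")
	case o.Intent == nil && len(o.ToolCalls) == 0:
		return errors.New("output carries neither an intent nor tool calls")
	}
	return nil
}

// Intent is a placeholder for the bookkeeping intent type.
type Intent struct {
	Kind string
}

func main() {
	final := ReasoningOutput[Intent]{Intent: &Intent{Kind: "reject"}}
	lookup := ReasoningOutput[Intent]{ToolCalls: []ToolCall{{ID: "call_1", Name: "find_accounts"}}}
	empty := ReasoningOutput[Intent]{}
	fmt.Println(final.Validate(), lookup.Validate(), empty.Validate())
}
```

With native tool calling the adapter builds this value from two separate response channels, so the invariant is established in code rather than requested from the model.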

This issue tracks the downstream cutover in accounting once that stoa release is available.

Scope

Allowed changes:

  • Bump github.com/flarexio/stoa to the version that ships flarexio/stoa#46 ("Move tool calls and intent output to provider-native structured outputs (revisits #29 deferral)").
  • Replace agent.accountTools (map[string]loop.ToolHandler) with the new []loop.Tool shape, declaring find_accounts's args JSON schema alongside its handler.
  • Provide the Intent JSON schema to the OpenAI adapter (hand-written json.RawMessage per the stoa#46 v1 plan), kept next to the Intent type in bookkeeping/intent.go so it cannot drift from the Go shape.
  • Strip the "two shapes" instruction block from agent/prompt.go — drop toolCallJSONShape, intentEnvelopeShape, and the "Return JSON with this exact shape" text. The prompt should describe what to do, not how to format the answer.
  • Simplify bookkeeperSystemPrompt accordingly (no more "Output JSON only").
  • Update agent/tools.go and any tests that construct tool registrations.
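
To make the registration change concrete, the sketch below pairs a tool's args schema with its handler and renders it as an entry of the Chat Completions `tools` array. The `Tool` struct is a hypothetical stand-in for the new loop.Tool (its field names are assumptions), and the `query` argument and the placeholder lookup are invented, since the real find_accounts arguments are not spelled out here.

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"
)

// Tool is a hypothetical stand-in for the new loop.Tool; the field names are
// assumptions, not stoa's confirmed API.
type Tool struct {
	Name        string
	Description string
	ArgsSchema  json.RawMessage
	Handler     func(ctx context.Context, args json.RawMessage) (string, error)
}

// findAccountsArgsSchema is an illustrative args schema; the real
// find_accounts arguments are not specified in this issue.
var findAccountsArgsSchema = json.RawMessage(`{
  "type": "object",
  "properties": {"query": {"type": "string"}},
  "required": ["query"],
  "additionalProperties": false
}`)

// accountTools declares the schema alongside the handler that consumes it.
func accountTools() []Tool {
	return []Tool{{
		Name:        "find_accounts",
		Description: "Look up ledger accounts by name or code.",
		ArgsSchema:  findAccountsArgsSchema,
		Handler: func(ctx context.Context, args json.RawMessage) (string, error) {
			var in struct {
				Query string `json:"query"`
			}
			if err := json.Unmarshal(args, &in); err != nil {
				return "", fmt.Errorf("decode find_accounts args: %w", err)
			}
			return "accounts matching " + in.Query, nil // placeholder lookup
		},
	}}
}

// toWire renders a Tool as one entry of the Chat Completions "tools" array.
func toWire(t Tool) map[string]any {
	return map[string]any{
		"type": "function",
		"function": map[string]any{
			"name":        t.Name,
			"description": t.Description,
			"parameters":  t.ArgsSchema,
			"strict":      true,
		},
	}
}

func main() {
	for _, t := range accountTools() {
		b, err := json.Marshal(toWire(t))
		if err != nil {
			panic(err)
		}
		fmt.Println(string(b))
	}
}
```

Keeping the schema and handler in one value means a change to the handler's argument struct and a change to the advertised schema land in the same place during review.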

Out of scope:

  • Changing the Intent discriminated-union shape or any domain validator behaviour.
  • Anthropic or other-provider adapters — track separately once stoa adds them.
  • New scripted-engine test coverage: update the tests that cover the old envelope to the new contract, but do not extend them.
  • Renderer changes beyond removing the format-shape text.

Requirements

  • Bump the stoa dependency to the release containing flarexio/stoa#46 ("Move tool calls and intent output to provider-native structured outputs (revisits #29 deferral)").
  • find_accounts is registered through params.Tools end-to-end; resp.Choices[0].Message.ToolCalls is the path that drives tool execution.
  • post_journal, reverse_journal, reject are emitted as JSON validated by an Intent JSON schema with strict: true.
  • The prompt no longer contains JSON shape examples for the model's output; intent payload skeletons (postJournalArgsShape etc.) may stay if they document the domain, but not as response-format instructions.
  • The integration test suite under agent/integration_test.go passes against GPT-5.4 mini without the "two JSON objects" failure mode (run manually; CI may still gate only on scripted engine).
  • No regression in scripted-engine tests.
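
For the strict-schema requirement, a small sketch of how a hand-written schema might be wrapped into a `response_format` payload and checked against its Go type. Everything domain-specific is assumed: the flattened `Intent` struct, its `reason` field, and the `schemaDrift` helper are hypothetical, and the real Intent is a discriminated union with a richer shape. The schema lists every property as required and sets `additionalProperties` to false, which strict mode expects.

```go
package main

import (
	"encoding/json"
	"fmt"
	"reflect"
	"sort"
)

// Intent is a hypothetical, flattened stand-in; the real bookkeeping Intent
// is a discriminated union whose fields are not shown in this issue.
type Intent struct {
	Kind   string `json:"kind"`
	Reason string `json:"reason"`
}

// intentSchema is hand-written and lives beside the type it describes.
var intentSchema = json.RawMessage(`{
  "type": "object",
  "properties": {
    "kind":   {"type": "string", "enum": ["post_journal", "reverse_journal", "reject"]},
    "reason": {"type": "string"}
  },
  "required": ["kind", "reason"],
  "additionalProperties": false
}`)

// schemaDrift returns the JSON field names present in only one of the struct
// and the schema's "properties". It assumes plain json tags without options.
func schemaDrift(schema json.RawMessage, v any) ([]string, error) {
	var s struct {
		Properties map[string]json.RawMessage `json:"properties"`
	}
	if err := json.Unmarshal(schema, &s); err != nil {
		return nil, fmt.Errorf("decode schema: %w", err)
	}
	var drift []string
	seen := map[string]bool{}
	t := reflect.TypeOf(v)
	for i := 0; i < t.NumField(); i++ {
		name := t.Field(i).Tag.Get("json")
		seen[name] = true
		if _, ok := s.Properties[name]; !ok {
			drift = append(drift, name)
		}
	}
	for name := range s.Properties {
		if !seen[name] {
			drift = append(drift, name)
		}
	}
	sort.Strings(drift)
	return drift, nil
}

// responseFormat wraps the schema in the Chat Completions response_format
// payload for strict structured outputs.
func responseFormat(schema json.RawMessage) map[string]any {
	return map[string]any{
		"type": "json_schema",
		"json_schema": map[string]any{"name": "intent", "strict": true, "schema": schema},
	}
}

func main() {
	drift, err := schemaDrift(intentSchema, Intent{})
	if err != nil {
		panic(err)
	}
	fmt.Println("drift:", drift)
	b, err := json.Marshal(responseFormat(intentSchema))
	if err != nil {
		panic(err)
	}
	fmt.Println(string(b))
}
```

A check like `schemaDrift` could run as an ordinary unit test in the package that owns the Intent type, so a field added to the struct without a matching schema entry fails `go test ./...` instead of surfacing at request time.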

Acceptance Criteria

  • go.mod references the stoa release containing flarexio/stoa#46 ("Move tool calls and intent output to provider-native structured outputs (revisits #29 deferral)").
  • agent/prompt.go no longer contains toolCallJSONShape or the "Return JSON with this exact shape" instruction text.
  • find_accounts is registered via the new loop.Tool mechanism and is invoked through OpenAI native tool-calling against a live model in manual verification.
  • go test ./... passes.
  • A manual run of the bookkeeping TUI against gpt-5.4-mini completes the canonical "cash sale of goods, tax included" (現金銷售商品 / 含稅) scenario from the issue discussion without parse errors.
  • PR description records which manual scenarios were run and the model used.
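
The native tool-calling path named in these criteria boils down to reading the tool calls off the first choice's message and routing each one to a registered handler. The sketch below uses local structs that mirror the JSON shape of a Chat Completions tool call rather than any SDK's types; `dispatch`, `toolResult`, and `echoArgs` are hypothetical helpers, and the arguments payload is made up.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// These types mirror the JSON shape of a Chat Completions tool call
// (choices[0].message.tool_calls); they are local stand-ins, not SDK types.
type functionCall struct {
	Name      string `json:"name"`
	Arguments string `json:"arguments"` // arguments arrive as a JSON-encoded string
}

type toolCall struct {
	ID       string       `json:"id"`
	Type     string       `json:"type"`
	Function functionCall `json:"function"`
}

// toolResult pairs a handler's output with the call ID it answers, so it can
// be sent back to the model as a "tool" role message.
type toolResult struct {
	ToolCallID string
	Content    string
}

type handler func(args json.RawMessage) (string, error)

// echoArgs is a placeholder handler that returns its raw arguments.
func echoArgs(args json.RawMessage) (string, error) {
	return "args=" + string(args), nil
}

// dispatch runs every requested tool call and fails on unknown tool names.
func dispatch(calls []toolCall, registry map[string]handler) ([]toolResult, error) {
	results := make([]toolResult, 0, len(calls))
	for _, c := range calls {
		h, ok := registry[c.Function.Name]
		if !ok {
			return nil, fmt.Errorf("unknown tool %q", c.Function.Name)
		}
		out, err := h(json.RawMessage(c.Function.Arguments))
		if err != nil {
			return nil, fmt.Errorf("tool %q: %w", c.Function.Name, err)
		}
		results = append(results, toolResult{ToolCallID: c.ID, Content: out})
	}
	return results, nil
}

func main() {
	raw := `[{"id":"call_1","type":"function","function":{"name":"find_accounts","arguments":"{\"query\":\"cash\"}"}}]`
	var calls []toolCall
	if err := json.Unmarshal([]byte(raw), &calls); err != nil {
		panic(err)
	}
	results, err := dispatch(calls, map[string]handler{"find_accounts": echoArgs})
	fmt.Println(results, err)
}
```

During manual verification, logging the tool call ID and name at the dispatch point is one way to confirm that find_accounts arrived through this channel and not through text in the message content.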

Verification

Automated:

go test ./...

Manual (record results in the PR):

  1. Run the TUI against gpt-5.4-mini with a representative non-trivial post (e.g. taxable sales with multi-line debits/credits).
  2. Run the same scenario against a larger model (e.g. gpt-5.4) to confirm no regression.
  3. Trigger a deliberate validation failure (e.g. closed period) and confirm the model's reject intent flows through the new schema.

Expected Output

When complete, the worker should:

  1. Create a branch.
  2. Commit changes.
  3. Push the branch.
  4. Open exactly one PR linked to this issue for review.
  5. Leave the PR unmerged.
  6. Comment with the PR URL, summary, tests run, models verified against, and any remaining risks.

Dependencies

Blocked by flarexio/stoa#46. Do not start until the corresponding stoa release is tagged.
