diff --git a/drafts/2026-05-04T003422Z.md b/drafts/2026-05-04T003422Z.md new file mode 100644 index 0000000..63e0989 --- /dev/null +++ b/drafts/2026-05-04T003422Z.md @@ -0,0 +1,34 @@ +# HN reply draft: $38k AWS Bedrock runaway, hook-layer vs LLM-call-layer cap + +**Status:** draft (pending manual post) + +**HN:** https://news.ycombinator.com/item?id=47933355 +(top-level reply to OP, not a comment reply) + +**Story:** Self-post titled "$38k AWS Bedrock bill caused by a simple prompt caching miss" by user `Zephyr0x`, 5 days old, 8 points, 0 comments. + +**OP:** Coding-agent workflow (Droid -> LiteLLM -> Bedrock -> Claude Opus 4.6) ran for an extended period with prompt caching mostly missing; final bill was ~$37.9k, of which ~$35.6k was uncached input tokens (~6.47B tokens) compounding across many turns. OP frames this as a runaway-cost failure mode of the platform, lists what they wish existed (IAM principal monthly cap, per-model call cap, per-workflow uncached-token cap, hard stop at budget cross), and asks: "Has anyone here built reliable guardrails for this? IAM deny rules? API gateways? token-budget proxies? per-workflow kill switches?" + +## The post (summary) + +OP put a metered Opus model behind a daily local coding-agent workflow without verifying that prompt caching was actually working through every layer of the chain. Caching support is advertised at every hop (Anthropic, Bedrock, LiteLLM, Droid) but nothing in the chain told them the actual cache hit rate was ~25% of input tokens. By the time their AWS bill came in the agent had already shipped ~6.5B uncached input tokens. Their argument: budget alarms / credits / "supported" checkboxes are soft signals dressed up as safety boundaries; for unattended agents at metered LLM prices, you need a hard stop. They explicitly call out four mechanisms they would like (IAM principal $/month cap, per-model call/day cap, per-workflow uncached-tokens/hour cap, hard stop on budget crossing) and ask the thread for actual implementations. + +## My reply + +``` +(disclosure: I work on FailProof AI: https://github.com/exospherehost/failproofai) + +The expensive line item was uncached input tokens compounding inside the LLM call, so the cap has to live at the layer that sees those tokens: LiteLLM with `max_input_tokens` per route, or an IAM Bedrock rate cap. Budget alarms run after the fact. Claude Code's hook layer (where FailProof sits) only sees the tool-call seam: Bash, Read, Write, MCP, etc. It cannot reason about token spend on a single Bedrock call. Where the hook layer does help is the per-workflow kill switch you listed: a custom PreToolUse policy that counts tool invocations against a per-session ceiling and denies past it. That bounds how many turns a runaway can attempt; it does not bound a single megaturn that ships 5GB of context. +``` + +## Insight for the FailProof team + +The OP frames their problem as "guardrails missing from the platform," but the four mechanisms they list span two distinct layers: (a) **LLM-call layer** (IAM principal $/month, per-model call/day, per-workflow uncached-tokens/hour) and (b) **agent-workflow layer** (per-workflow kill switch). FailProof addresses (b) cleanly via PreToolUse counting and Stop policies; it has *nothing* to say about (a). When prospects ask "why isn't FailProof a token-budget guardrail?" we should answer this exact way every time: "FailProof sits on the agent's tool-call seam, not on the LLM-call seam. For LLM-call costs, use a LiteLLM-style proxy or an IAM rate cap." Worth a short blog post: a 2x2 of (cost runaway vs destructive op) x (LLM layer vs tool layer) to position FailProof's actual surface vs the proxy/IAM surface so we never get pulled into pretending we solve token-budget problems. Also: most "agent went rogue" stories on HN are tool-layer stories (rm -rf, drop database, force push); this is the rare one where the loss happens entirely upstream of the tool surface, and that distinction is genuinely useful framing in posts and demos. + +## Notes / findings + +- The thread is 5 days old, 8 points, no comments yet. Reply form is open. Visibility is low; OP gets notified, broader audience minimal. This is fine - the comment is shaped as targeted help to the OP, not as broadcast pitch. +- OP's stack (Droid -> LiteLLM -> Bedrock -> Opus 4.6) is the clue that prompt caching went bad: Bedrock's prompt caching has TTL and cache-point semantics that LiteLLM's pass-through can quietly violate when prompts shift slightly across turns; LiteLLM's `cache_control` on the wrong block is the usual culprit. Worth not litigating in the reply but noting here for follow-up engagement if OP responds. +- Strict gate check: this thread is at the LLM-call layer rather than the tool-call layer FailProof addresses. The reply is shaped to acknowledge that mismatch upfront ("the layer that sees those tokens") and only mentions FailProof for the narrow per-workflow kill switch slice OP explicitly listed. No install commands, no policy-name comma list, no feature/version talk, no dashboard plug, single repo link in disclosure. Body is ~138 words. +- ASCII punctuation throughout: hyphens in compound nouns (per-workflow, per-session, tool-call), colons for the "here is the layer" structure, semicolon for the "bounds X; does not bound Y" contrast. No em-dashes, en-dashes, fancy ellipses, curly quotes, or unicode arrows. +- Cross-thread duplicate guard: searched local `drafts/` and `comments/` for `item?id=47933355` (none) and scanned all open PR bodies via `gh pr view --json body` for the same string (none). Body content does not paraphrase any prior FailProof reply in this repo - the LLM-call vs tool-call layer framing is unique to this thread.