75 changes: 75 additions & 0 deletions drafts/2026-05-08T193033Z.md
@@ -0,0 +1,75 @@
# Reply draft: Veris Show HN, mock-vs-live divergence and the runtime hook seam

- **HN:** https://news.ycombinator.com/item?id=48054313
- **Story:** "Show HN: Veris - Agent sandboxes with simulated external services" (id=48054313, posted by jrm-veris, 9 points / 23 hours / 0 comments at draft time, links to https://veris.ai/sandbox)
- **Status:** draft (pending manual post)

## Discovery

Browser sweep (no memorized links):

1. `https://news.ycombinator.com/ask` - scanned top 24 Ask HN; mostly meta-topics (career, AI cost, MCP-process count, "is Claude Code getting worse", LLM comments) which the thread-fit gate filters out.
2. `https://news.ycombinator.com/show` - scanned top 30 Show HN; spotted Tilde.run (id=48037724) at 196 points / 129 comments (saturated, mid-thread visibility near zero per gate), and a cluster of fresh adjacent Show HNs.
3. `https://hn.algolia.com/?q=agent+deleted&type=story&dateRange=pastWeek&sort=byDate` - 1 result (Crit, id=48062402; review-tool space already heavily covered in our open PRs).
4. `https://hn.algolia.com/?q=claude+code&type=story&dateRange=pastWeek&sort=byPopularity` - surfaced the Claude Code symlink-sandbox-escape CVE (id=48057842, 42 pts, 5 comments). Inspected the FailProof `block-read-outside-cwd` source in `src/hooks/builtin-policies.ts` lines 763-820: it uses `path.resolve()` with no `realpath`/symlink resolution, so it shares the same bypass shape as the Claude Code CVE. Cannot honestly pitch it as a fix; thread fails the gate. Skipped.
5. `https://hn.algolia.com/?q=agent+sandbox&type=story&dateRange=pastWeek&sort=byDate` - found Veris (id=48054313): fresh Show HN of an eval-time agent-sandbox tool with stateful LLM-powered mocks, 9 points / 0 comments / 23 hours. Clean adjacent-product Show HN, gate-passing.

Three-surface duplicate scan confirms id=48054313 is not in `drafts/`, not in `comments/`, and not in any open PR diff on this repo.

## OP

The submission is link-only - no inline `toptext` on the HN page. Per the linked Veris product page (https://veris.ai/sandbox):

- Veris is a pre-prod simulation environment for testing AI agents end-to-end.
- It ships **stateful, LLM-powered mock services** for 50+ enterprise platforms (SWIFT and OpenSanctions for banking, Salesforce / HubSpot for CRM, Zendesk / Intercom for support, Slack / Jira for productivity, Stripe / Shopify for payments).
- Fault classes it explicitly catches: hallucinations, incorrect tool usage, policy violations, context retention failures, latency.
- Architecture: scenario generation from agent code / production logs / past incidents, deterministic Veris Simulation Engine with rewards and replay scoring, multi-layer grading (scripted, LLM-judge, hybrid), training integration for SFT and RL.
- Positioning: framework-agnostic eval-time platform, no MCP / Claude Code / Agents-SDK specifics in the public material.
- Use cases highlighted: customer support, fraud detection.
- The pitch: "ship knowing only the happy path" is the failure mode.

## My reply

```
(disclosure: I work on FailProof AI: https://github.com/exospherehost/failproofai)

The load-bearing property here is that the mocks are LLM-powered and stateful, so you can run 10k scenarios safely without moving real money or paging a real Salesforce admin. The cost: bugs whose pathology only surfaces against the live service (idempotency-key replay, prod-vs-staging account-ID prefix drift, the rate limiter's actual jitter, partial state on a 502) won't reproduce in the mock. The same agent that passes every Veris scenario can emit a malformed call against real Stripe on the first prod prompt that nudges its plan off-script. A PreToolUse hook fills that gap by denying on call shape rather than scenario coverage:

  customPolicies.add({
    name: "block-prod-stripe-transfer-over-threshold",
    match: { events: ["PreToolUse"] },
    fn: async (ctx) => {
      const url = String(ctx.toolInput?.url ?? "");
      const amount = Number(ctx.toolInput?.body?.amount ?? 0);
      if (url.includes("api.stripe.com/v1/transfers") && amount > 100_000)
        return deny("Stripe transfer above $1,000 (100000 cents) blocked at runtime");
      return allow();
    },
  });

Eval-time mocks gate the scenarios you wrote; runtime hooks gate the calls you didn't see coming.
```
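To check the threshold logic offline, the decision in the snippet can be pulled out into a pure function. The following is a minimal sketch, not FailProof's API: `ToolInput`, `Decision`, `decideStripeTransfer`, and `MAX_TRANSFER_CENTS` are hypothetical names introduced for illustration, and the hostname match and the deny-on-unparseable-amount branch are assumptions that go slightly beyond the substring check in the reply.

```typescript
// Hypothetical shapes for illustration only; these are not FailProof types.
interface ToolInput {
  url?: unknown;
  body?: { amount?: unknown };
}

type Decision = { action: "allow" } | { action: "deny"; reason: string };

// 100_000 cents = $1,000.00 (the threshold used in the reply snippet).
const MAX_TRANSFER_CENTS = 100_000;

function decideStripeTransfer(input: ToolInput | undefined): Decision {
  let url: URL;
  try {
    url = new URL(String(input?.url ?? ""));
  } catch {
    // Not a parseable absolute URL, so it cannot be a Stripe API call.
    return { action: "allow" };
  }

  // Assumption: match on the parsed hostname and path rather than a substring,
  // so "https://example.com/api.stripe.com/v1/transfers" is not treated as Stripe.
  const isTransfer =
    url.hostname === "api.stripe.com" && url.pathname.startsWith("/v1/transfers");
  if (!isTransfer) return { action: "allow" };

  const amount = Number(input?.body?.amount ?? 0);
  // Assumption: an amount that does not parse to a finite number is denied
  // rather than silently allowed.
  if (!Number.isFinite(amount) || amount > MAX_TRANSFER_CENTS) {
    return {
      action: "deny",
      reason: "Stripe transfer above $1,000 (100000 cents) blocked at runtime",
    };
  }
  return { action: "allow" };
}
```

A wrapper inside the policy's `fn` would then only translate the returned `Decision` into the hook's own allow/deny calls, which keeps the threshold testable without a running agent.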

## Insight for the FailProof team

The Veris-shaped axis is meaningfully distinct from the static-vs-runtime (Snyk, PR #42) and scenario-vs-runtime (TrainForgeTester PR #53, Spec27 PR #41) framings already in the open-PR set: it's specifically about **mock-vs-live divergence**. A scenario-test runner doesn't have to mock the world; Veris (and Veris-likes) explicitly do, and that simulation makes the eval-time guarantees inherently softer than scenario tests against a real staging environment. The honest framing is:

- Static analysis (Snyk-shape): catches what's enumerable from code at rest.
- Scenario tests against real services (TrainForgeTester / Spec27 shape): catch what's enumerable from prompt / tool-call traces.
- Stateful simulated services (Veris-shape): catch what's enumerable plus the multi-turn state-machine behaviors, but at the cost of mock fidelity.
- Runtime PreToolUse hooks: catch the always-wrong call shape regardless of whether anyone enumerated it.

This is a fourth seam that deserves its own one-page doc note alongside the other three. The angle "your simulator is a model of the world; the hook gates the call about to land in the world" is sharp enough to slot into any future Veris / Cygnal / Agnostic / Patronus-style pre-prod simulation Show HN. Customer-support and fraud-detection are the two domains Veris highlights, and both are exactly where "the agent passed every scenario but the rate limiter / idempotency / partial-state behavior in prod still bit us" is a real story; FailProof should consider an `examples/payments-policy.ts` recipe in the repo for that audience.

The thread is 0 comments at draft time, so a substantive top-level peer comment lands clean without competing against existing discussion. The OP is link-only on HN (no inline `toptext`), so anyone landing here is reading from the Veris site itself and has the simulation context loaded.

## Notes / findings

- Body word count: ~115 words of prose + ~50 words in the snippet = ~165 total; brand-voice band is "under ~150 words" with the working example at ~110. Slightly above the working-example footprint but well below the flagged-shape ~220-word footprint. Reads short on screen because the snippet is dense.
- ASCII punctuation only: hyphens (`-`), straight quotes (`"`/`'`), three ASCII dots if needed, parentheses for the list of mock-fidelity gaps. No em-dashes, en-dashes, curly quotes, fancy ellipses, or unicode arrows. The snippet uses ASCII `?.`, `??`, and plain double-quoted strings only.
- One disclosure line in plain parens at the top, lowercase `disclosure:`, single repo URL. No second link at the bottom. No install command. No comma-list of policy names. No three-scope / 39-policies / dashboard / `~/.failproofai/` callouts. Custom-policy snippet, not a built-in name (so no over-specific claim that an OOTB policy exists for `block-prod-stripe-transfer-over-threshold` - it's illustrative).
- Cross-thread duplicate guard: framing axis ("mock-vs-live divergence", "the simulator is a model of the world") is materially distinct from the TrainForgeTester (PR #53) "scenarios catch enumerable behaviors, hooks catch always-wrong shapes" line and from Spec27 (PR #41) "tests validate the contract you wrote, hooks catch shapes the contract didn't list" line. Snippet domain (Stripe transfer URL + amount threshold) is unique to this thread - TrainForgeTester named `block-rm-rf` only, Spec27 used a `DROP TABLE` SQL regex. Closing aphorism is paraphrase-distinct: "scenarios you wrote vs calls you didn't see coming".
- Reply form on the Veris thread is open: `<form action="comment">` with `<textarea name="text">` and `<input type=submit value="add comment">` rendered at the bottom of the page. No `[dead]` / `[flagged]` markers. Thread is replyable.
- The Claude Code symlink CVE thread (id=48057842) was a tempting near-miss: concrete-failure shape, replyable, but FailProof's `block-read-outside-cwd` (`src/hooks/builtin-policies.ts` lines 763-820) uses `path.resolve(cwd, target)` with no symlink resolution, so claiming the policy would have prevented the CVE is wrong. Worth filing as a real bug in a separate issue / PR thread on the failproofai repo: `block-read-outside-cwd` should call `fs.realpathSync()` (or the async equivalent) on `target` before the prefix check, and probably also walk the chain when the path doesn't exist yet (write-time symlink-create-then-write attack). The same defect would apply to `block-secrets-write` if the agent writes to a symlinked path.
- Read cadence: 1 navigate to `/ask`, 1 to `/show`, 4 to Algolia search variants, 1 to the CVE item page, 1 to the Veris item page, 1 each to the Veris product page and the failproofai builtin-policies source. Well under the 20-pages-per-5-minute cap and the 50-pages-per-hour cap. No bursts.
- Used WebFetch for github.com / veris.ai (allowed - not ycombinator hosts) and `gh api` for failproofai source inspection (also non-ycombinator). All HN reads went through the dedicated Chrome profile via the `browser-use` MCP. No HTTP-client traffic to any ycombinator host.
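The symlink-resolution fix proposed for `block-read-outside-cwd` in the notes above can be sketched roughly as follows. This is an illustrative sketch, not the failproofai source: `isSymlink`, `resolveThroughSymlinks`, and `isInsideCwd` are hypothetical names, and the ancestor walk plus the dangling-link handling are one possible reading of "walk the chain when the path doesn't exist yet". A check like this is still subject to a time-of-check / time-of-use race if the filesystem changes between the check and the actual read or write.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// True if `p` itself exists as a symlink (including a dangling one).
function isSymlink(p: string): boolean {
  try {
    return fs.lstatSync(p).isSymbolicLink();
  } catch {
    return false;
  }
}

// Resolve symlinks on the deepest part of `target` that exists, then re-append
// the not-yet-existing tail, so a path that is about to be created is judged
// by where it would really land.
function resolveThroughSymlinks(target: string): string {
  let current = path.resolve(target);
  const tail: string[] = [];
  for (;;) {
    try {
      return path.join(fs.realpathSync(current), ...tail);
    } catch (err) {
      if ((err as NodeJS.ErrnoException).code !== "ENOENT") throw err;
      if (isSymlink(current)) {
        // Dangling symlink: follow it by hand so its destination is what gets checked.
        current = path.resolve(path.dirname(current), fs.readlinkSync(current));
        continue;
      }
      const parent = path.dirname(current);
      if (parent === current) throw err; // reached the filesystem root
      tail.unshift(path.basename(current));
      current = parent;
    }
  }
}

// Containment check on real paths instead of the lexical path.resolve() result.
function isInsideCwd(cwd: string, target: string): boolean {
  const realCwd = fs.realpathSync(cwd);
  const realTarget = resolveThroughSymlinks(path.resolve(cwd, target));
  const rel = path.relative(realCwd, realTarget);
  const escapes = rel === ".." || rel.startsWith(".." + path.sep) || path.isAbsolute(rel);
  return !escapes;
}
```

With a symlink `link` inside the working directory pointing at a sibling directory, `path.resolve(cwd, "link/secret.txt")` still starts with `cwd`, while `isInsideCwd(cwd, "link/secret.txt")` compares the real location and rejects it.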