55 changes: 55 additions & 0 deletions drafts/2026-05-04T104002Z.md
@@ -0,0 +1,55 @@
# Reply draft: TrainForgeTester Show HN, scenario-tests vs in-loop hook seam

- **HN:** https://news.ycombinator.com/item?id=48000135
- **Story:** "Show HN: TrainForgeTester – deterministic scenario tests for AI agents" (id=48000135, posted by alcray, 2 points / 16 hours / 0 comments at draft time)
- **Status:** draft (pending manual post)

## OP

> Hi guys,
>
> I have built TrainForgeTester, an open-source scenario test runner for AI agents that take actions (call tools).
>
> The idea: test how agents perform in company specific scenarios and not just on general benchmarks. More specifically test taking the wrong actions, skipping a required step, calling the wrong tool, or passing the wrong arguments.
>
> TrainForgeTester lets you run multi-turn scenarios (you create this scenarios based on your personal use case and data following the provided scenario schema) and check:
>
> * tool calls and arguments
> * strict or unordered tool execution
> * expected responses
> * regressions after model, prompt, or tool changes
>
> This scenario tester is the first part of the project (like v 0.1.0)
>
> I'm now working on the next part: a "scenario generator" that takes messy historical company data (customer support logs, agent traces, tool calls, transcripts, etc.) and turns them into testable scenarios for this framework. Again trying to make this as deterministic as possible
>
> Repo: https://github.com/TrainForge/TrainForgeTester
>
> I'd love feedback on:
>
> * real agent-testing use cases this does not cover yet (browser use, audio, video, mouse use)
> * whether this direction makes sense
> * where this could go as a product/devtool
> * issues, edge cases, or missing features in the repo

## My reply

```
(disclosure: I work on FailProof AI: https://github.com/exospherehost/failproofai)

Scenario tests and an in-loop policy layer feel like complements with different remits. Scenarios test correctness ("for this prompt, the agent should call A then B not C") - they catch what you can enumerate. What they can't catch is the long tail in production: a regression introduces a new path the test didn't seed, and the agent reaches a destructive call you wouldn't have predicted. A PreToolUse hook is the catch-net for that tail; it intercepts based on the shape of the call about to fire, not on whether a matching scenario exists. Something like block-rm-rf denies any bash call whose text matches rm -rf regardless of which prompt got the agent there. Tests gate intended behaviors, hooks gate always-wrong call shapes - and shipping both layers together is more honest than either alone.
```
Comment on lines +37 to +41

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add a language tag to the fenced block to satisfy markdown lint.

Line 37 opens a code fence without a language, which triggers MD040. Please mark it as plain text.

<details>
<summary>Suggested fix</summary>

-```
+```text
 (disclosure: I work on FailProof AI: https://github.com/exospherehost/failproofai)

 Scenario tests and an in-loop policy layer feel like complements with different remits. Scenarios test correctness ("for this prompt, the agent should call A then B not C") - they catch what you can enumerate. What they can't catch is the long tail in production: a regression introduces a new path the test didn't seed, and the agent reaches a destructive call you wouldn't have predicted. A PreToolUse hook is the catch-net for that tail; it intercepts based on the shape of the call about to fire, not on whether a matching scenario exists. Something like block-rm-rf denies any bash call whose text matches rm -rf regardless of which prompt got the agent there. Tests gate intended behaviors, hooks gate always-wrong call shapes - and shipping both layers together is more honest than either alone.
</details>

<details>
<summary>🧰 Tools</summary>

<details>
<summary>🪛 markdownlint-cli2 (0.22.1)</summary>

[warning] 37-37: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

</details>

</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

Verify each finding against the current code and only fix it if needed.

In @drafts/2026-05-04T104002Z.md around lines 37 - 41, The fenced code block
starting with "(disclosure: I work on FailProof AI:
https://github.com/exospherehost/failproofai)" is missing a language tag
(MD040); update the opening fence from `` ``` `` to `` ```text `` so the block is explicitly
marked as plain text and the markdown linter stops flagging it. Ensure only the
opening triple backticks are changed to include "text" and leave the block
content and closing fence unchanged.


</details>


<!-- This is an auto-generated comment by CodeRabbit -->


## Insight for the FailProof team

The "test-time vs runtime" seam is the same conversation pattern that landed on the `Spec27` (id=47959984) and `AgentRQ` (id=47958608) threads, and it's worth canonizing as a one-paragraph FailProof framing for any agent-eval / scenario-testing Show HN that lands. The honest version is "tests are necessary but not sufficient": they catch enumerated regressions but can't prove the absence of bad call shapes on novel inputs. The runtime gate covers the unfalsifiable side. A short blog post titled something like "Scenario tests vs runtime policies for tool-using agents" (or similar) would slot directly into these threads without rewriting the framing each time, and the eval / test-runner space is going to keep growing - latitude-dev's Evals Skills (id=48006381) just landed today on /newest, and TrainForgeTester is hinting at a v0.2 "scenario generator" path that imports historical traces, which is exactly the surface where prod-derived scenarios meet runtime safety.
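
To make "call shape" concrete for that framing, here is a minimal sketch of a shape-based pre-tool-use check. It is an illustration under assumptions, not FailProof AI's actual API or policy code: `ToolCall`, `block_rm_rf`, and the regex are hypothetical names chosen for this example, and the pattern is deliberately simple (it does not catch split flags such as `rm -r -f`).

```python
import re
from dataclasses import dataclass


# Hypothetical type for illustration; not FailProof AI's real API.
@dataclass(frozen=True)
class ToolCall:
    tool: str  # tool name, e.g. "bash"
    text: str  # raw command text the agent is about to run


# Matches `rm` followed by a single flag cluster containing both r/R and f,
# e.g. -rf, -fr, -Rf. Assumption: a plain text match is enough for the sketch.
_RM_RF = re.compile(r"\brm\s+-(?=[a-zA-Z]*[rR])(?=[a-zA-Z]*f)[a-zA-Z]+\b")


def block_rm_rf(call: ToolCall) -> bool:
    """Return True if the call should be denied, judged only by its shape."""
    return call.tool == "bash" and _RM_RF.search(call.text) is not None


# The check never asks which prompt or scenario produced the call.
novel_calls = [
    ToolCall("bash", "cd /tmp && rm -rf build"),
    ToolCall("bash", "rm -r ./cache"),
    ToolCall("bash", "ls -la"),
]
decisions = [block_rm_rf(c) for c in novel_calls]
```

The design point is that the function takes only the call, so it applies equally to inputs no scenario enumerated.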

The OP is also explicitly soliciting "where this could go as a product/devtool" - someone on the FailProof side could DM-style email the maintainer about composing the two layers (TrainForgeTester runs scenarios; FailProof's policy fires inside the simulated tool call so you can assert "if a prompt mutates such that the agent ever calls rm -rf, the runtime gate denies it before the assertion runs"). That composition would be a real story to tell in both projects' READMEs.
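
The composition described above could look roughly like the following sketch. Everything here is hypothetical: `SimulatedTools`, `run_scenario`, `deny_rm_rf`, and `mutated_agent` are stand-ins invented for this note, and neither TrainForgeTester's scenario schema nor FailProof's hook interface is assumed to look like this.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Recorded:
    tool: str
    text: str
    denied: bool


@dataclass
class SimulatedTools:
    """Hypothetical simulated tool layer with a gate in front of every call."""
    gate: Callable[[str, str], bool]  # returns True to deny the call
    log: List[Recorded] = field(default_factory=list)

    def call(self, tool: str, text: str) -> str:
        denied = self.gate(tool, text)  # gate fires before the tool "runs"
        self.log.append(Recorded(tool, text, denied))
        return "DENIED by policy" if denied else f"ok: {tool}"


def deny_rm_rf(tool: str, text: str) -> bool:
    # Stand-in policy for the sketch: plain substring match on bash calls.
    return tool == "bash" and "rm -rf" in text


def run_scenario(agent: Callable[[SimulatedTools], None],
                 tools: SimulatedTools) -> List[Recorded]:
    agent(tools)
    return tools.log


def mutated_agent(tools: SimulatedTools) -> None:
    # Simulates a prompt mutation that leads the agent to a destructive call.
    tools.call("bash", "ls build")
    tools.call("bash", "rm -rf build")


log = run_scenario(mutated_agent, SimulatedTools(gate=deny_rm_rf))
executed = [r.text for r in log if not r.denied]

# Scenario-level assertion, evaluated after the gate has already decided.
assert "rm -rf build" not in executed
```

Because the gate sits inside the simulated tool call, the scenario assertion only ever sees what the gate let through, which is the ordering the paragraph above describes.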

## Notes / findings

- 16-hour-old Show HN with 0 comments, so a measured first comment on the thread should land cleanly without competing against existing discussion. Per CLAUDE.md "thread-fit gate", this fits the "Show HN of an adjacent product (sandbox, gateway, hook manager, policy engine) where the OP solicits design discussion" path: test runners for tool-calling agents are squarely adjacent to FailProof's runtime hook layer, and the OP literally lists "where this could go as a product/devtool" as a feedback ask.
- Disclosure line is at the top, lowercased "disclosure:" inside parens, single repo URL in the disclosure, no second link at the bottom. ASCII punctuation only (hyphens, straight quotes, no em-dashes / curly quotes / unicode arrows). One policy named (`block-rm-rf`) tied directly to the OP's "wrong actions" framing - no comma-list of policies, no install command, no `~/.failproofai/` path callout, no three-scope or 39-policies talk, no dashboard plug. Body is ~135 words, in the working-shape band (`comments/2026-04-29T043958Z.md` was ~110, the flagged `drafts/2026-05-01T184439Z.md` was ~220).
- Adjacent-product thread cluster on HN this week is strong: AgentRQ (id=47958608, PR #40), Spec27 (id=47959984, PR #41), Snyk agent-scan (id=47999709, PR #42), latitude-dev Evals Skills (id=48006381, fresh on /newest, no PR yet). TrainForgeTester is the scenario-test counterpart to those static / scan / spec products, and the static-vs-runtime framing is the through-line.
- Sweep observation: `/newest` is heavy with low-engagement coding-agent / skill / memory Show HNs in the last 24h (Generate SKILL.md, Sync agent skills, See how much you spend per AI agent, Local semantic memory for coding agents, Stigmem federated knowledge fabric, Interpretable AutoResearch, TrainForgeTester) - mostly at 1-8 points and 0 comments. Of that batch, only the test-runner one (this thread) cleanly fits FailProof's gate; the rest are memory / skill / cost-tracking threads, and a comment mentioning FailProof there would read as keyword-hunting.
- Concrete-failure threads from the past month are saturated in this repo: Cursor / Railway prod-volume delete is covered (PRs #15, #16, #19), AI-wants-to-nuke-DB is covered (PR #44), GTFOBins claude-code allowlist bypass is covered (PR #21). No fresh concrete-failure post is sitting unaddressed - the field has shifted to Show HNs of adjacent products, which is what this draft targets.