[claude-hackernews] Reply draft: TrainForgeTester Show HN, scenario-tests vs in-loop hook seam (id=48000135) #53
Open
NiveditJain wants to merge 1 commit into main from luv-62
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,55 @@ | ||
| # Reply draft: TrainForgeTester Show HN, scenario-tests vs in-loop hook seam | ||
|
|
||
| - **HN:** https://news.ycombinator.com/item?id=48000135 | ||
| - **Story:** "Show HN: TrainForgeTester – deterministic scenario tests for AI agents" (id=48000135, posted by alcray, 2 points / 16 hours / 0 comments at draft time) | ||
| - **Status:** draft (pending manual post) | ||
|
|
||
| ## OP | ||
|
|
||
| > Hi guys, | ||
| > | ||
| > I have built TrainForgeTester, an open-source scenario test runner for AI agents that take actions (call tools). | ||
| > | ||
| > The idea: test how agents perform in company specific scenarios and not just on general benchmarks. More specifically test taking the wrong actions, skipping a required step, calling the wrong tool, or passing the wrong arguments. | ||
| > | ||
| > TrainForgeTester lets you run multi-turn scenarios (you create this scenarios based on your personal use case and data following the provided scenario schema) and check: | ||
| > | ||
| > * tool calls and arguments | ||
| > * strict or unordered tool execution | ||
| > * expected responses | ||
| > * regressions after model, prompt, or tool changes | ||
| > | ||
| > This scenario tester is the first part of the project (like v 0.1.0) | ||
| > | ||
| > I'm now working on the next part: a "scenario generator" that takes messy historical company data (customer support logs, agent traces, tool calls, transcripts, etc.) and turns them into testable scenarios for this framework. Again trying to make this as deterministic as possible | ||
| > | ||
| > Repo: https://github.com/TrainForge/TrainForgeTester | ||
| > | ||
| > I'd love feedback on: | ||
| > | ||
| > * real agent-testing use cases this does not cover yet (browser use, audio, video, mouse use) | ||
| > * whether this direction makes sense | ||
| > * where this could go as a product/devtool | ||
| > * issues, edge cases, or missing features in the repo | ||
|
|
||
| ## My reply | ||
|
|
||
| ``` | ||
| (disclosure: I work on FailProof AI: https://github.com/exospherehost/failproofai) | ||
|
|
||
| Scenario tests and an in-loop policy layer feel like complements with different remits. Scenarios test correctness ("for this prompt, the agent should call A then B not C") - they catch what you can enumerate. What they can't catch is the long tail in production: a regression introduces a new path the test didn't seed, and the agent reaches a destructive call you wouldn't have predicted. A PreToolUse hook is the catch-net for that tail; it intercepts based on the shape of the call about to fire, not on whether a matching scenario exists. Something like block-rm-rf denies any bash call whose text matches rm -rf regardless of which prompt got the agent there. Tests gate intended behaviors, hooks gate always-wrong call shapes - and shipping both layers together is more honest than either alone. | ||
| ``` | ||
|
|
||
| ## Insight for the FailProof team | ||
|
|
||
| The "test-time vs runtime" seam is the same conversation pattern that landed on the `Spec27` (id=47959984) and `AgentRQ` (id=47958608) threads, and it's worth canonizing as a one-paragraph FailProof framing for any agent-eval / scenario-testing Show HN that lands. The honest version is "tests are necessary but not sufficient": they catch enumerated regressions but can't prove the absence of bad call shapes on novel inputs. The runtime gate covers the unfalsifiable side. A short blog post titled something like "Scenario tests vs runtime policies for tool-using agents" (or similar) would slot directly into these threads without rewriting the framing each time, and the eval / test-runner space is going to keep growing - latitude-dev's Evals Skills (id=48006381) just landed today on /newest, and TrainForgeTester is hinting at a v0.2 "scenario generator" path that imports historical traces, which is exactly the surface where prod-derived scenarios meet runtime safety. | ||
|
|
||
| The OP is also explicitly soliciting "where this could go as a product/devtool" - someone on the FailProof side could DM-style email the maintainer about composing the two layers (TrainForgeTester runs scenarios; FailProof's policy fires inside the simulated tool call so you can assert "if a prompt mutates such that the agent ever calls rm -rf, the runtime gate denies it before the assertion runs"). That composition would be a real story to tell in both projects' READMEs. | ||
|
|
||
| ## Notes / findings | ||
|
|
||
| - 16-hour-old Show HN with 0 comments, so a measured first comment on the thread should land cleanly without competing against existing discussion. Per CLAUDE.md "thread-fit gate", this fits the "Show HN of an adjacent product (sandbox, gateway, hook manager, policy engine) where the OP solicits design discussion" path: test runners for tool-calling agents are squarely adjacent to FailProof's runtime hook layer, and the OP literally lists "where this could go as a product/devtool" as a feedback ask. | ||
| - Disclosure line is at the top, lowercased "disclosure:" inside parens, single repo URL in the disclosure, no second link at the bottom. ASCII punctuation only (hyphens, straight quotes, no em-dashes / curly quotes / unicode arrows). One policy named (`block-rm-rf`) tied directly to the OP's "wrong actions" framing - no comma-list of policies, no install command, no `~/.failproofai/` path callout, no three-scope or 39-policies talk, no dashboard plug. Body is ~135 words, in the working-shape band (`comments/2026-04-29T043958Z.md` was ~110, the flagged `drafts/2026-05-01T184439Z.md` was ~220). | ||
| - Adjacent-product thread cluster on HN this week is strong: AgentRQ (id=47958608, PR #40), Spec27 (id=47959984, PR #41), Snyk agent-scan (id=47999709, PR #42), latitude-dev Evals Skills (id=48006381, fresh on /newest, no PR yet). TrainForgeTester is the scenario-test counterpart to those static / scan / spec products, and the static-vs-runtime framing is the through-line. | ||
| - Sweep observation: `/newest` is heavy with low-engagement coding-agent / skill / memory Show HNs in the last 24h (Generate SKILL.md, Sync agent skills, See how much you spend per AI agent, Local semantic memory for coding agents, Stigmem federated knowledge fabric, Interpretable AutoResearch, TrainForgeTester) - mostly at 1-8 points and 0 comments. Of that batch, only the test-runner one (this thread) cleanly fits FailProof's gate; the rest are memory / skill / cost-tracking and would read as keyword-hunting if a reply mentioned FailProof. | ||
| - Concrete-failure threads from the past month are saturated in this repo: Cursor / Railway prod-volume delete is covered (PRs #15, #16, #19), AI-wants-to-nuke-DB is covered (PR #44), GTFOBins claude-code allowlist bypass is covered (PR #21). No fresh concrete-failure post is sitting unaddressed - the field has shifted to Show HN of adjacent products, which is what this draft hits. | ||
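
The composition described in the draft (a scenario runner asserting on tool calls, with a pre-tool-use gate firing before the assertion) can be sketched roughly as follows. This is a minimal illustration only: `ToolCall`, `pre_tool_use`, and `run_scenario` are hypothetical names, not the FailProof AI policy API or the TrainForgeTester scenario schema, and the regex is an assumed stand-in for a `block-rm-rf`-style policy.

```python
import re
from dataclasses import dataclass


@dataclass
class ToolCall:
    """Hypothetical record of a tool call an agent is about to make."""
    tool: str
    text: str


# Assumed pattern for a block-rm-rf style policy: literal "rm -rf" in a bash call.
RM_RF = re.compile(r"\brm\s+-rf\b")


def pre_tool_use(call: ToolCall) -> bool:
    """Return True to allow the call, False to deny it.

    The decision depends only on the shape of the call,
    not on which prompt or scenario produced it.
    """
    if call.tool == "bash" and RM_RF.search(call.text):
        return False
    return True


def run_scenario(calls, expected_tools):
    """Gate every call first, then assert on what was allowed to run.

    Returns (passed, denied): whether the allowed tools match the expected
    strict order, and the list of calls the gate denied.
    """
    executed, denied = [], []
    for call in calls:
        if pre_tool_use(call):
            executed.append(call.tool)
        else:
            denied.append(call)
    return executed == expected_tools, denied
```

In this sketch the gate runs before the scenario assertion, so a destructive call that no scenario anticipated is denied and reported separately instead of being executed.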
Add a language tag to the fenced block to satisfy markdown lint.
Line 37 opens a code fence without a language, which triggers MD040. Please mark it as plain text.
Suggested fix
Verify each finding against the current code and only fix it if needed.
In @drafts/2026-05-04T104002Z.md around lines 37 - 41, The fenced code block starting with "(disclosure: I work on FailProof AI: https://github.com/exospherehost/failproofai)" is missing a language tag (MD040); update the opening fence from "```" to "```text" so the block is explicitly marked as plain text and the markdown linter stops flagging it. Ensure only the opening triple backticks are changed to include "text" and leave the block content and closing fence unchanged.