Skip to content
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
50 changes: 50 additions & 0 deletions drafts/2026-05-08T212431Z.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,50 @@
**HN:** https://news.ycombinator.com/item?id=48046023 (parent: https://news.ycombinator.com/item?id=48051949)

**Story / OP:** Show HN: Agent-skills-eval - Test whether Agent Skills improve outputs (link-only submission to https://github.com/darkrishabh/agent-skills-eval; no toptext on the HN post). 1 day old, 72 points, 36 comments.

**Status:** draft (pending manual post)

**The post:**

OP is a link-only Show HN of an eval harness that A/B tests whether Claude Code Skills improve outputs over the baseline. No body text on the HN submission itself. The substantive sub-thread is reedlaw's chain about hooks not being able to enforce routing (parent comment below).

**Parent comment (reedlaw, id=48051949), verbatim:**

> I tried to create a hook that would detect when token usage was running out and write HANDOFF.md so I could switch to another agent and finish the current task. It never worked reliably. To make a hook for db queries, it would need to run before each bash call, check if it looks like a query, and then exit with a new prompt, e.g.: "Use tidewave's execute_sql_query for DB access". But then it could just ignore the prompt the same as CLAUDE.me. What if I really wanted to use bash for a specific task? The real issue is that prompts are not tightly coupled with capabilities. If we admit that, then skills are over hyped.

(The grandparent, reedlaw's id=48050489, gives the concrete failure: Opus 4.7 ignores a 720-byte CLAUDE.md telling it to route DB queries through tidewave's MCP, and instead does `Bash(DATABASE_URL=$(grep ... .env) echo "ok")`.)

**My reply:**

```
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add a language tag to the fenced code block.

Line 19 opens a fenced block without a language, which triggers markdownlint MD040.

Suggested fix
-```
+```text
 (disclosure: I work on FailProof AI: https://github.com/exospherehost/failproofai)
 ...
-```
+```
🧰 Tools
🪛 markdownlint-cli2 (0.22.1)

[warning] 19-19: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@drafts/2026-05-08T212431Z.md` at line 19, The fenced code block opened at the
start of the snippet is missing a language tag (triggers markdownlint MD040);
update the opening fence used in the draft (the triple-backtick block that
contains "(disclosure: I work on FailProof AI...)" ) to include an explicit
language tag such as text (e.g., change ``` to ```text) so the block is properly
tagged and MD040 is resolved.

(disclosure: I work on FailProof AI: https://github.com/exospherehost/failproofai)

The "ignore the prompt the same as CLAUDE.md" part is where deny() and instruct() diverge. instruct(msg) injects guidance and lets the tool call proceed; deny(msg) returns a tool-error to the agent, so the bash literally does not run and the model has to pick another path. For "agent routes around tidewave" the shape you want is deny, not instruct. Roughly:

customPolicies.add({
name: "route-db-via-tidewave",
match: { events: ["PreToolUse"] },
fn: async (ctx) => {
if (ctx.toolName !== "Bash") return allow();
const cmd = ctx.toolInput?.command ?? "";
if (/\b(psql|DATABASE_URL|SELECT |INSERT |UPDATE |DELETE )/i.test(cmd))
return deny("Use tidewave's execute_sql_query MCP tool for DB access.");
return allow();
},
});

"What if I really do want bash" is just another branch in the regex. Capability coupling lives in the hook return value rather than in the prompt text.
```

**Insight for the FailProof team:**

The reedlaw thread is the cleanest articulation I've seen on HN of the *routing* (vs. blocking) use case for PreToolUse - "ignored CLAUDE.md => need a hook that intercepts the bypass path and force-redirects to the MCP tool". That's a different conversation from the destructive-ops gating story we usually pitch. Worth a short blog post: "deny vs instruct: when to fail-the-tool-call vs when to inject-and-continue" with this exact tidewave routing example. Also: rirze (id=48051665) pushes back that hooks are hard "since the default approach it's using is call the URL directly" - which conflates hook-availability with hook-coverage. The point is the hook *can* match on URL-shaped Bash invocations too; pattern-match the agent's actual call, not the platonic call. A second blog beat. Both pieces would slot naturally into the next time someone on HN asks "how do I keep agent X from doing Y instead of Z."

**Notes / findings:**

- Thread is 1 day old, last comment 22 minutes before drafting - reply window is wide open.
- The grandparent (reedlaw id=48050489) is a concrete-failure-mode comment, exactly the shape the thread-fit gate accepts. No pitch-vibe risk if the reply stays on the deny-vs-instruct distinction.
- Show HN around it is about Skills evaluation, not policy enforcement, so I am replying mid-thread to a sub-conversation about hooks rather than at the top level. This stays on-topic for the sub-thread (hooks for routing) without hijacking the OP's product (skills eval).
- ASCII-only check: no em-dashes, en-dashes, fancy ellipses, curly quotes, or unicode arrows in the reply body.
- Reply form on /reply?id=48051949 returns the login wall for the unauthenticated profile, as expected. Posting happens on the user's side.
- Cross-thread duplicate guard: deny()-vs-instruct() framing has not appeared in earlier drafts. Earlier PRs covered transport-vs-hook (Lilith), MCP-surface-vs-PreToolUse (Faz), Docker-vs-intent (Armorer), workflow-shape-vs-invariant (BetterClaw), etc. - none of them led with the deny-as-tool-error semantics.