31 changes: 31 additions & 0 deletions benchmarks/dataset-agent/README.md
@@ -47,6 +47,37 @@ benchmark stays cheap and bounded. Set `COLLECTION_AGENT_ENABLE_AGENT=true` to
opt in; Agent polling is capped by `AGENT_POLL_TIMEOUT_MS`, or by
`COLLECTION_AGENT_POLL_TIMEOUT_MS` when the generic timeout is unset.
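The opt-in flag and timeout precedence described above could be resolved roughly as follows. This is a minimal sketch, not the runner's actual code: the helper names, the strict `"true"` comparison, the treatment of invalid values as unset, and the `fallbackMs` parameter are all assumptions for illustration.

```typescript
// Hypothetical helpers; only the three environment variable names come from the README.
type Env = Record<string, string | undefined>;

// Treat missing, empty, non-numeric, or non-positive values as "unset" (assumption).
function parsePositiveMs(raw: string | undefined): number | undefined {
  if (raw === undefined || raw.trim() === "") return undefined;
  const value = Number(raw);
  return Number.isFinite(value) && value > 0 ? value : undefined;
}

// The generic AGENT_POLL_TIMEOUT_MS wins; COLLECTION_AGENT_POLL_TIMEOUT_MS applies
// only when the generic timeout is unset. fallbackMs is a hypothetical default.
function resolveAgentPollTimeoutMs(env: Env, fallbackMs: number): number {
  return (
    parsePositiveMs(env.AGENT_POLL_TIMEOUT_MS) ??
    parsePositiveMs(env.COLLECTION_AGENT_POLL_TIMEOUT_MS) ??
    fallbackMs
  );
}

// Agent stays off unless explicitly opted in (strict comparison is an assumption).
function isAgentEnabled(env: Env): boolean {
  return env.COLLECTION_AGENT_ENABLE_AGENT === "true";
}
```

Keeping the lookup in one pure function makes the precedence easy to test without touching the real process environment.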

When Agent is off and triage finds pages that need browser, form, or
detail-page follow-up, the collection runner emits a non-fatal capability
diagnostic. Healthy rows can still pass self-healing validation, with this
diagnostic attached as a warning. Benchmark failures show the same diagnostic
as the failure message, so the result says "turn Agent on for this prompt"
instead of misreporting an auth, credits, or generic zero-row failure.
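A rough sketch of that routing, assuming a hypothetical `RunOutcome` shape and a placeholder message string (the real diagnostic text and types live in the collection runner and are not reproduced here):

```typescript
// Hypothetical types for illustration; field names are assumptions.
type RunOutcome = {
  agentEnabled: boolean;
  triageNeedsFollowUp: boolean; // browser, form, or detail-page follow-up requested
  rowsHealthy: boolean;
};

type DiagnosticReport = {
  fatal: boolean;
  warnings: string[];
  failureMessage?: string;
};

// Placeholder text, not the runner's exact message.
const CAPABILITY_DIAGNOSTIC =
  "Capability diagnostic: Agent disabled; triage requested follow-up";

function reportCapability(outcome: RunOutcome): DiagnosticReport {
  // The diagnostic itself never aborts the run.
  const report: DiagnosticReport = { fatal: false, warnings: [] };
  // Nothing to report when Agent is on or no follow-up was requested.
  if (outcome.agentEnabled || !outcome.triageNeedsFollowUp) return report;
  if (outcome.rowsHealthy) {
    // Healthy rows still pass; the diagnostic rides along as a warning.
    report.warnings.push(CAPABILITY_DIAGNOSTIC);
  } else {
    // A failing benchmark reports the diagnostic as its failure message.
    report.failureMessage = CAPABILITY_DIAGNOSTIC;
  }
  return report;
}
```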

Use this canary when checking whether Agent/browser follow-up fixes the current
source-evidence misses:

```bash
COLLECTION_AGENT_ENABLE_AGENT=true \
COLLECTION_AGENT_POLL_TIMEOUT_MS=480000 \
COLLECTION_AGENT_PIPELINE_MODULE=./backend/BigSet_Data_Collection_Agent/src/orchestrator/pipeline.ts \
BIGSET_COLLECTION_BENCHMARK_RUNNER_MODULE=./backend/src/pipeline/collection-agent-runner.ts \
node benchmarks/dataset-agent/run-benchmark.mjs \
--prompt-ids mcp-docs-pages \
--timeout-ms 900000 \
--system collection-self-heal='node --import ./backend/node_modules/tsx/dist/esm/index.mjs benchmarks/dataset-agent/adapters/collection-self-healing-adapter.mjs'
```

Latest `mcp-docs-pages` Agent-enabled canary evidence:

- artifact: `benchmark-results/collection-agent-canary-mcp-20260523-001`
- status: failed, not blocked
- rows/evidence: 3 rows, 12 evidence quotes, 10 source URLs
- cost: about `$0.053552`
- signal: Agent runs complete and claim support reaches `1.0`, but domain
accuracy stays `0.667`; next fix is source/domain coherence, not more Agent
plumbing.

App and CLI collection-runtime runs use the same runner shape, but load it from
`POPULATE_COLLECTION_RUNNER_MODULE` when `POPULATE_AGENT_RUNTIME=collection`.

120 changes: 82 additions & 38 deletions docs/data-collection-agent-migration-plan.md
@@ -19,15 +19,19 @@ the collection pipeline is migrated into BigSet.
- PR #41 adds a `collection-self-heal` benchmark lane that wraps the collection
runtime inside `SelfHealingPopulateRecipeService`. This is the benchmark
socket Meteor can use once the real collection runner is available.
- `feat/data-collection-agent-v14` vendors the collection pipeline under
`backend/BigSet_Data_Collection_Agent` and includes the memory module.
- Clean `feat/data-collection-agent-v14` tests pass once ignored backend
dependencies are present, but `npm --prefix backend run build` still fails on
TypeScript/API integration issues:
- TinyFish run status is typed too narrowly.
- OpenRouter provider return type leaks private declaration details.
- Backend compile depends on generated frontend Convex API output.
- AI SDK `maxTokens` option no longer matches the installed SDK type.
- PR #43 ports the real vendored collection pipeline behind
`runCollectionPopulatePipeline(input)`, so the collection benchmark lane now
runs the BigSet-wrapped collection runner instead of a fake injected runner.
- PR #44 keeps TinyFish Agent/browser work opt-in and bounded by a per-run poll
timeout. This preserves cheap cron/benchmark reruns as the default path.
- PR #45 improves collection source targeting for official-source prompts
without injecting answer-key URLs at runtime.
- PR #46 surfaces no-Agent browser/form/detail follow-up as a safe capability
diagnostic instead of hiding it as generic bad data or infra failure.
- `feat/data-collection-agent-v14` is no longer the branch to build on directly.
It was the source of the collection pipeline port. New work should branch on
top of the current draft stack, not edit Meteor's branch or the dirty main
checkout.

## Target Shape

@@ -77,25 +81,30 @@ The current layer now can:

- run an injected collection runner through the same self-healing runtime
boundary and benchmark harness as Mastra
- run the real vendored collection pipeline through that same boundary
- preserve `recipe.runtimeInstructions`, required columns, and benchmark
metadata through the collection runner
- emit a capability diagnostic when no-Agent mode sees pages that need browser,
form, or detail-page follow-up

The current layer does not yet:

- run the real vendored collection pipeline as its runtime in this stack
- generate Playwright scripts as a durable production recipe
- run a green live Convex canary in this local environment
- prove quality on a full real benchmark for the collection runtime
- prove Agent-enabled collection quality on a full real benchmark
- prove the collection runtime should replace Mastra as the default app runtime

## Migration Sequence

1. Branch from the top of the self-healing stack.
- For any new collection-runner work, base on
`codex/collection-self-healing-benchmark` so PR #39, #40, and #41 stay in
the path.
- Do not edit `main` or `feat/data-collection-agent-v14` directly.
- For new collection-runner or benchmark work, base on
`codex/collection-capability-diagnostics` unless that PR has been
superseded.
- Do not edit `main`, the dirty local checkout, or
`feat/data-collection-agent-v14` directly.

2. Fix the collection branch as a clean build source.
- Port only the needed collection pipeline files into the fresh branch.
- Fix the TypeScript/API issues listed above.
- Status: done in PR #43 for the BigSet-wrapped collection runner path.
- Keep vendored code isolated until the adapter is green.
- Preserve the current backend Convex boundary: do not reintroduce imports
from `frontend/convex/_generated` into backend compile. Use the existing
@@ -142,6 +151,8 @@ The current layer does not yet:
6. Run quality gates in increasing cost order.
- `make verify-self-healing`
- 2-prompt real benchmark
- 1-prompt Agent-enabled capability canary for prompts that need browser or
detail follow-up
- full benchmark only after the 2-prompt run is not obviously broken
- live `--dataset-id` dry-run only after Convex/env prerequisites are ready
- `--commit` only on a throwaway dataset first
@@ -177,6 +188,9 @@ Before any merge:
- benchmark evidence comes from the collection runtime wrapped inside the
self-healing service, not the direct collection pipeline alone
- real benchmark artifacts are linked in the PR when runtime quality is claimed
- capability diagnostics are treated as warnings for healthy rows and as honest
benchmark failure messages when no-Agent mode cannot complete browser/form
follow-up
- live dataset commit is tested only on a throwaway dataset
- backend build does not depend on `frontend/convex/_generated`

@@ -205,29 +219,59 @@ node benchmarks/dataset-agent/run-benchmark.mjs \
--system collection-self-heal='node --import ./backend/node_modules/tsx/dist/esm/index.mjs benchmarks/dataset-agent/adapters/collection-self-healing-adapter.mjs'
```

For prompts that likely require browser/detail follow-up, run the same lane with
Agent explicitly enabled:

```bash
COLLECTION_AGENT_ENABLE_AGENT=true \
COLLECTION_AGENT_POLL_TIMEOUT_MS=480000 \
COLLECTION_AGENT_PIPELINE_MODULE=./backend/BigSet_Data_Collection_Agent/src/orchestrator/pipeline.ts \
BIGSET_COLLECTION_BENCHMARK_RUNNER_MODULE=./backend/src/pipeline/collection-agent-runner.ts \
node benchmarks/dataset-agent/run-benchmark.mjs \
--prompt-ids mcp-docs-pages \
--timeout-ms 900000 \
--system collection-self-heal='node --import ./backend/node_modules/tsx/dist/esm/index.mjs benchmarks/dataset-agent/adapters/collection-self-healing-adapter.mjs'
```

No-Agent `mcp-docs-pages` evidence from PR #46:

- artifact: `benchmark-results/collection-capability-diagnostics-mcp-20260523-001`
- result: 3 rows, 6 evidence quotes, cost about `$0.007287`
- status: failed with
`Capability diagnostic: TinyFish Agent disabled; triage requested browser/form/detail follow-up...`.
That is not a pass, but it is useful: it tells us the next benchmark should
turn Agent on and measure whether browser/detail follow-up fixes the source
evidence miss.

Agent-enabled `mcp-docs-pages` evidence from the stack-handoff branch:

- artifact: `benchmark-results/collection-agent-canary-mcp-20260523-001`
- result: 3 rows, 12 evidence quotes, 10 source URLs, 3 Agent runs
- cost: about `$0.053552`
- status: failed, not blocked
- score: factual accuracy `0.933`, entity coverage `1.0`, claim support `1.0`,
domain accuracy `0.667`
- conclusion: Agent/browser follow-up runs successfully and improves claim
support, but source/domain evidence still misses. The next code target is
source coherence: keep each row's docs URL/evidence/source URLs aligned with
that entity's official docs domain instead of merging discovery/blog/course
evidence across vendors.
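One way to picture the source-coherence target is a per-row filter that keeps only source URLs on the entity's official docs domain. The sketch below is illustrative only: the `Row` shape, the `officialDomain` argument, and the suffix-match rule are assumptions, not the pipeline's actual types or logic.

```typescript
// Hypothetical row shape for illustration.
type Row = { entity: string; docsUrl: string; sourceUrls: string[] };

// Returns the lowercase hostname, or undefined for unparseable URLs.
function hostOf(url: string): string | undefined {
  try {
    return new URL(url).hostname.toLowerCase();
  } catch {
    return undefined;
  }
}

// True when the URL's host is the official domain or one of its subdomains.
function isOnDomain(url: string, officialDomain: string): boolean {
  const host = hostOf(url);
  const domain = officialDomain.toLowerCase();
  return host !== undefined && (host === domain || host.endsWith("." + domain));
}

// Drops source URLs that belong to other vendors, blogs, or course sites.
function keepCoherentSources(row: Row, officialDomain: string): Row {
  return {
    ...row,
    sourceUrls: row.sourceUrls.filter((url) => isOnDomain(url, officialDomain)),
  };
}
```

Matching on a leading dot before the domain avoids accepting look-alike hosts such as `notexample.com` for `example.com`.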

## Next Engineering Move

Create a fresh branch from `codex/collection-self-healing-benchmark` and port the
real collection runner behind the existing adapter boundary:

1. Add a runner module, likely `backend/src/pipeline/collection-agent-runner.ts`,
that exports `runCollectionPopulatePipeline(input)`.
2. Port only the collection pipeline files needed by that runner from
`feat/data-collection-agent-v14`.
3. Convert `CollectionPopulatePipelineInput` into the collection pipeline's
prompt/spec. Include `input.prompt`, `input.recipeInstructions`,
`input.requiredColumns`, prompt id/quality, persona, and expected-stress
benchmark context when available.
4. Convert the collection pipeline output into `PopulateRuntimeResult`: rows,
source URLs, evidence quotes, usage, metrics, and debug captured sources.
5. Keep Convex writes, auth, cron scheduling, and durable recipe storage outside
the collection runner.
6. Fix build blockers while porting: TinyFish status typing, OpenRouter provider
declaration leak, backend dependency on generated frontend Convex API, and
AI SDK `maxTokens`.
7. Gate in this order: `npm --prefix backend test`, `npm --prefix backend run
build`, `make verify-self-healing`, 2-prompt `collection-self-heal`
benchmark, then full benchmark only if the 2-prompt run is not obviously
Create a fresh branch from `codex/collection-capability-diagnostics` and fix
source coherence before running the full benchmark:

1. Keep `COLLECTION_AGENT_ENABLE_AGENT=false` as the default.
2. Add focused tests around record merge/source selection so a row does not gain
evidence for a populated field from another record unless the incoming row
value supports the existing value.
3. Tighten docs/official-source selection so docs prompts prefer docs/developers
pages over blogs, news, courses, directories, or third-party discovery pages.
4. Re-run the Agent-enabled `mcp-docs-pages` canary.
5. If domain accuracy reaches `1.0`, run the 4-prompt focused benchmark from
PR #45.
6. Run the full prompt pack only after the focused benchmark is not obviously
broken.
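The merge rule in step 2 could be covered by a test against something like the following. It is a sketch under stated assumptions: `FieldValue` and `mergeField` are hypothetical names, and "supports" is approximated here as case-insensitive, whitespace-trimmed equality, which the real merge logic may define differently.

```typescript
// Hypothetical field shape: a value plus the evidence quotes that back it.
type FieldValue = { value: string; evidence: string[] };

function normalizeValue(value: string): string {
  return value.trim().toLowerCase();
}

function mergeField(existing: FieldValue | undefined, incoming: FieldValue): FieldValue {
  // An unpopulated field simply takes the incoming value and evidence.
  if (existing === undefined || existing.value.trim() === "") {
    return { value: incoming.value, evidence: incoming.evidence.slice() };
  }
  // A populated field ignores evidence from a record whose value disagrees.
  if (normalizeValue(existing.value) !== normalizeValue(incoming.value)) {
    return existing;
  }
  // Values agree: combine evidence, dropping exact duplicates.
  const combined = existing.evidence.concat(incoming.evidence);
  return {
    value: existing.value,
    evidence: combined.filter((quote, index) => combined.indexOf(quote) === index),
  };
}
```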

When testing the real app or CLI path, set: