Dataset Builder Agent Harness
MVP Decision
Start with a backend-owned planning harness before table creation gets clever.
The first shippable slice is:
- Turn a natural-language dataset request into a fixed schema.
- Ask missing-input questions before browser/form automation when possible.
- Try TinyFish Search + Fetch first.
- Escalate only hard pages or form flows to TinyFish Agent/browser automation.
- Validate cells against schema and source URL requirements.
- Replace values for the same identity row on refresh instead of appending duplicates.
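The Search + Fetch first, escalate later rule can be sketched as a small decision function. This is only an illustration: `FetchOutcome`, `ExecutionTier`, `shouldEscalate`, and `splitByTier` are hypothetical names, not part of the TinyFish API or the current scaffold, and the sketch assumes that "hard page" means a page that needs interaction or that left required columns empty.

```typescript
// Hypothetical names throughout; nothing here is a TinyFish API.
type ExecutionTier = "search_fetch" | "agent";

interface FetchOutcome {
  url: string;
  // How many required columns the Search + Fetch pass filled.
  filledRequired: number;
  totalRequired: number;
  // True when the page needs a login, a form submission, or similar interaction.
  needsInteraction: boolean;
}

// Escalate only when the cheap pass could not do the job.
function shouldEscalate(outcome: FetchOutcome): boolean {
  if (outcome.needsInteraction) return true;
  return outcome.filledRequired < outcome.totalRequired;
}

// Group URLs by the tier that should handle them next.
function splitByTier(outcomes: FetchOutcome[]): Record<ExecutionTier, string[]> {
  const result: Record<ExecutionTier, string[]> = { search_fetch: [], agent: [] };
  for (const outcome of outcomes) {
    const tier: ExecutionTier = shouldEscalate(outcome) ? "agent" : "search_fetch";
    result[tier].push(outcome.url);
  }
  return result;
}
```

Keeping the decision pure makes it easy to unit test without calling any external service.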
History, user trust flags, public dataset resale, column editing, and Convex sync stay future scope.
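The validation and refresh rules from the first slice (check cells against the schema and source URL, then replace the row with the same identity instead of appending) might look roughly like the sketch below. All type and function names are hypothetical; the real definitions live in types.ts and may differ. The sketch assumes every non-empty cell must carry an http(s) source URL and that identity is a case-insensitive match on the identity columns.

```typescript
// Hypothetical shapes for illustration only.
interface DatasetColumn {
  name: string;
  type: "string" | "number" | "url";
  required: boolean;
  isIdentity: boolean; // part of the key that marks "the same row"
}

interface DatasetCell {
  value: string | number | null;
  sourceUrl: string | null;
}

type DatasetRow = Record<string, DatasetCell>;

function isHttpUrl(text: string): boolean {
  return /^https?:\/\/\S+$/i.test(text);
}

// Returns a list of human-readable problems; an empty list means the row is valid.
function validateRow(columns: DatasetColumn[], row: DatasetRow): string[] {
  const errors: string[] = [];
  for (const col of columns) {
    const cell = row[col.name];
    if (cell === undefined || cell.value === null) {
      if (col.required) errors.push(`${col.name}: missing required value`);
      continue;
    }
    const expected = col.type === "number" ? "number" : "string";
    if (typeof cell.value !== expected) errors.push(`${col.name}: expected ${expected}`);
    if (col.type === "url" && typeof cell.value === "string" && !isHttpUrl(cell.value)) {
      errors.push(`${col.name}: expected http(s) URL`);
    }
    if (cell.sourceUrl === null || !isHttpUrl(cell.sourceUrl)) {
      errors.push(`${col.name}: missing or invalid source URL`);
    }
  }
  return errors;
}

function identityKey(columns: DatasetColumn[], row: DatasetRow): string {
  return columns
    .filter((c) => c.isIdentity)
    .map((c) => String(row[c.name]?.value ?? "").trim().toLowerCase())
    .join("|");
}

// Valid incoming rows overwrite the existing row with the same identity key;
// invalid incoming rows are skipped so a bad refresh cannot clobber good data.
function refreshRows(columns: DatasetColumn[], existing: DatasetRow[], incoming: DatasetRow[]): DatasetRow[] {
  const byKey = new Map<string, DatasetRow>();
  for (const row of existing) byKey.set(identityKey(columns, row), row);
  for (const row of incoming) {
    if (validateRow(columns, row).length > 0) continue;
    byKey.set(identityKey(columns, row), row);
  }
  return Array.from(byKey.values());
}
```

Because a `Map` keeps the position of a key that is overwritten, refreshed rows stay where they were in the table rather than moving to the end.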
Current Scaffold
- backend/src/dataset-builder/types.ts defines dataset schema, plan, clarifying questions, harness stages, and run artifacts.
- backend/src/dataset-builder/planner.ts creates a deterministic draft plan from a user request.
- backend/src/dataset-builder/openrouter.ts optionally refines the draft through OpenRouter chat completions.
- backend/src/dataset-builder/agent-harness.ts converts a plan into TinyFish agent goals and output schemas.
- backend/src/dataset-builder/tinyfish-cli.ts is a local prototype adapter for TinyFish Search, Fetch, and Agent CLI runs.
- backend/src/routes/dataset-builder.ts exposes POST /api/dataset-builder/plan behind Better Auth.
- backend/src/schema.ts now has dataset and dataset_run metadata tables for plan/run storage.
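To make the plan-to-goal step concrete, here is a minimal sketch of how a plan could be turned into an agent goal string and a JSON Schema for the output. It is not the code in agent-harness.ts: `SketchPlan`, `toAgentGoal`, `toOutputSchema`, and the `source_url` property are assumptions made for illustration.

```typescript
// Hypothetical plan shape; the real types in types.ts may differ.
interface SketchPlanColumn {
  name: string;
  type: "string" | "number" | "url";
  description: string;
  required: boolean;
}

interface SketchPlan {
  userRequest: string;
  columns: SketchPlanColumn[];
  providedInputs: Record<string, string>;
}

interface SketchSchemaProperty {
  type: "string" | "number";
  description: string;
  format?: "uri";
}

interface SketchOutputSchema {
  type: "array";
  items: { type: "object"; properties: Record<string, SketchSchemaProperty>; required: string[] };
}

// Builds a plain-language goal the agent can follow.
function toAgentGoal(plan: SketchPlan): string {
  const columnList = plan.columns.map((c) => `${c.name} (${c.description})`).join("; ");
  const inputs = Object.keys(plan.providedInputs)
    .map((key) => `${key}: ${plan.providedInputs[key]}`)
    .join("; ");
  const lines = [
    `Build a dataset for: ${plan.userRequest}.`,
    `Return one row per result with these columns: ${columnList}.`,
    "Include the source URL for every row.",
  ];
  if (inputs.length > 0) lines.push(`Known inputs: ${inputs}.`);
  return lines.join("\n");
}

// Builds a JSON Schema describing an array of rows, one property per column.
function toOutputSchema(plan: SketchPlan): SketchOutputSchema {
  const properties: Record<string, SketchSchemaProperty> = {
    source_url: { type: "string", description: "Page the row was read from", format: "uri" },
  };
  const required: string[] = ["source_url"];
  for (const col of plan.columns) {
    const prop: SketchSchemaProperty = {
      type: col.type === "number" ? "number" : "string",
      description: col.description,
    };
    if (col.type === "url") prop.format = "uri";
    properties[col.name] = prop;
    if (col.required) required.push(col.name);
  }
  return { type: "array", items: { type: "object", properties, required } };
}
```

Deriving the goal and the schema from the same plan keeps the two from drifting apart when columns change.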
Prototype Command
From backend/:
npm run builder:plan -- "restaurants in Menlo Park that serve Coca-Cola"
Use OpenRouter when a key is loaded:
npm run builder:plan -- "car insurance quotes for a 2021 Honda Civic in Menlo Park" --use-openrouter
The command prints the plan, generated TinyFish agent goal, and output schema. It never prints API keys.
API Contract
POST /api/dataset-builder/plan
{
  "userRequest": "latest blog posts from my competitors",
  "updateCadence": "daily",
  "planningMode": "openrouter",
  "providedInputs": {
    "competitors": "exa.ai, perplexity.ai"
  },
  "preferredColumns": ["latest post URL"]
}
Response includes:
- planId
- plan
- tinyFishAgentGoal
- tinyFishAgentOutputSchema
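A client for this contract could be typed along the following lines. This is a sketch under assumptions: the field names come from the example above, but which request fields are optional and the types of the response fields are not stated here, so the sketch treats only `userRequest` as required and leaves response values as `unknown`. `buildPlanRequest` and `isPlanResponse` are hypothetical helpers.

```typescript
// Assumption: only userRequest is required; the other fields are optional.
interface PlanRequestBody {
  userRequest: string;
  updateCadence?: string;
  planningMode?: string;
  providedInputs?: Record<string, string>;
  preferredColumns?: string[];
}

// Response value types are not documented here, so they stay unknown.
interface PlanResponseBody {
  planId: unknown;
  plan: unknown;
  tinyFishAgentGoal: unknown;
  tinyFishAgentOutputSchema: unknown;
}

interface PlanRequestInit {
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Builds the options object for a JSON POST to /api/dataset-builder/plan.
function buildPlanRequest(body: PlanRequestBody): PlanRequestInit {
  return {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(body),
  };
}

// Checks that a parsed response carries the four documented keys.
function isPlanResponse(value: unknown): value is PlanResponseBody {
  if (typeof value !== "object" || value === null) return false;
  const record = value as Record<string, unknown>;
  return ["planId", "plan", "tinyFishAgentGoal", "tinyFishAgentOutputSchema"].every(
    (key) => key in record,
  );
}

// Usage (the route sits behind Better Auth, so the call needs an authenticated session):
//   const res = await fetch("/api/dataset-builder/plan", buildPlanRequest({ userRequest: "..." }));
//   const data: unknown = await res.json();
//   if (!isPlanResponse(data)) throw new Error("unexpected response shape");
```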
Next Tickets
- Persist generated plans in dataset.
- Add POST /api/datasets/:id/runs to run the harness and write dataset_run artifacts.
- Decide whether TinyFish execution should use direct HTTP APIs, keeping the CLI only for local experiments.
- Add a DB-backed queue/lease before cron refresh jobs.
- Add frontend create-dataset flow once Divya's table UI is ready.
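For the queue/lease ticket, the core idea is that only one worker may refresh a dataset at a time, and a crashed worker's claim must expire. The sketch below shows that logic with an in-memory `Map` standing in for a database table. `RunLease` and `LeaseTable` are hypothetical names, and a real DB-backed version would need to perform the check-and-write atomically (for example in a single statement or transaction).

```typescript
// In-memory stand-in for a lease table; names are hypothetical.
interface RunLease {
  datasetId: string;
  owner: string;
  expiresAt: number; // epoch milliseconds
}

class LeaseTable {
  private leases = new Map<string, RunLease>();

  // Returns true when the caller now holds the lease for datasetId.
  // An unexpired lease held by another owner blocks acquisition;
  // the current owner may call this again to extend its own lease.
  tryAcquire(datasetId: string, owner: string, nowMs: number, ttlMs: number): boolean {
    const current = this.leases.get(datasetId);
    if (current !== undefined && current.expiresAt > nowMs && current.owner !== owner) {
      return false;
    }
    this.leases.set(datasetId, { datasetId, owner, expiresAt: nowMs + ttlMs });
    return true;
  }

  // Only the current owner can release, so a slow worker cannot
  // drop a lease that has since passed to someone else.
  release(datasetId: string, owner: string): void {
    const current = this.leases.get(datasetId);
    if (current !== undefined && current.owner === owner) {
      this.leases.delete(datasetId);
    }
  }
}
```

Passing the clock in as `nowMs` keeps the logic deterministic in tests; a cron job would pass `Date.now()`.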