Skip to content
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
41 commits
Select commit Hold shift + click to select a range
66767ee
Add AI-powered dataset populate with web search and CRUD tools
May 22, 2026
1429f0d
Address CodeRabbit review: authz, logging, timeouts, type alignment
May 22, 2026
04435f8
Move dataset ownership check inside try/catch for error handling
May 22, 2026
4de7ad7
Stabilize populate agent branch
giaphutran12 May 22, 2026
095c1b1
Add Mastra populate benchmark runtime
giaphutran12 May 22, 2026
9b6df1f
Ignore benchmark result artifacts
giaphutran12 May 22, 2026
fe55fc2
Add structured row recovery for Mastra populate
giaphutran12 May 22, 2026
72bb0ba
Add Mastra populate self-healing runtime
giaphutran12 May 22, 2026
823fa38
Wire Mastra populate through self-healing
giaphutran12 May 22, 2026
0efaf9d
Add self-healing populate cron runner
giaphutran12 May 22, 2026
17c4b97
Load self-healing cron context by dataset id
giaphutran12 May 22, 2026
21ca069
Add self-healing stack verifier
giaphutran12 May 22, 2026
f0d89b7
Document data collection agent migration plan
giaphutran12 May 22, 2026
1b2af8b
Add collection populate runtime adapter
giaphutran12 May 22, 2026
eeebdc4
Wire populate runtime selection
giaphutran12 May 22, 2026
6cacc56
Add collection self-healing benchmark lane
giaphutran12 May 22, 2026
aa4bb53
Refresh collection migration handoff plan
giaphutran12 May 22, 2026
346a20e
Address migration plan review gaps
giaphutran12 May 22, 2026
41767eb
Carry benchmark metadata through collection contract
giaphutran12 May 22, 2026
c2383b1
Load collection runner modules from runtime env
giaphutran12 May 22, 2026
ca90366
Port collection pipeline runner into self-healing path
giaphutran12 May 22, 2026
d476174
Harden collection runner wiring
giaphutran12 May 22, 2026
5d6a5f3
Bound collection agent runtime defaults
giaphutran12 May 22, 2026
4aaa209
Pass collection Agent timeout per run
giaphutran12 May 22, 2026
0f7c48e
Improve collection source targeting
giaphutran12 May 22, 2026
514591d
Surface collection capability diagnostics
giaphutran12 May 22, 2026
3cb4146
Document collection agent canary result
giaphutran12 May 22, 2026
cef8d39
Improve collection source coherence
giaphutran12 May 22, 2026
4265d23
Improve collection evidence support
giaphutran12 May 22, 2026
3348ae3
Fix collection URL-field source evidence
giaphutran12 May 22, 2026
c00eef8
Persist self-healing process traces
giaphutran12 May 22, 2026
8bafe26
Gate Playwright candidate readiness
giaphutran12 May 22, 2026
08bce46
Ingest collection browser action traces
giaphutran12 May 22, 2026
05f2e9b
Preserve Agent browser actions in reports
giaphutran12 May 22, 2026
c9f8438
Expose self-healing benchmark diagnostics
giaphutran12 May 22, 2026
f5a6e77
Gate benchmark runs on Playwright readiness
giaphutran12 May 23, 2026
25be451
Surface Agent run provenance diagnostics
giaphutran12 May 23, 2026
43cb7a3
Gate rejected self-healing benchmark candidates
giaphutran12 May 23, 2026
d2b4a75
Refresh current self-healing stack plan
giaphutran12 May 23, 2026
c9097a8
Cap self-healing row commits
giaphutran12 May 23, 2026
b92345f
Merge main into self-healing stack rollup
giaphutran12 May 23, 2026
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
8 changes: 8 additions & 0 deletions .env.example
Original file line number Diff line number Diff line change
Expand Up @@ -9,11 +9,19 @@ CLERK_SECRET_KEY=sk_test_...
# Generate at https://openrouter.ai/settings/keys
OPENROUTER_API_KEY=sk-or-...

# TinyFish — required by populate agent web search/fetch.
# Generate at https://agent.tinyfish.ai/api-keys
TINYFISH_API_KEY=

# Generate once after the first `make dev` with:
# docker compose exec convex ./generate_admin_key.sh
# Used by the backend container to call internal Convex functions.
CONVEX_SELF_HOSTED_ADMIN_KEY=

# Durable store for self-healing populate recipe manifests.
# Docker dev overrides this to /app/.bigset/populate-recipes on a named volume.
POPULATE_RECIPE_STORE_DIR=.bigset/populate-recipes

# PostHog (optional — leave blank to disable analytics entirely in local dev).
# Get from https://us.posthog.com/project/settings/general.
NEXT_PUBLIC_POSTHOG_KEY=
Expand Down
5 changes: 4 additions & 1 deletion .gitignore
Original file line number Diff line number Diff line change
@@ -1,5 +1,6 @@
.DS_Store
node_modules/
backend/node_modules
.env
.env.local
Project_BigSet_brief.md
Expand All @@ -14,16 +15,18 @@ Project_BigSet_brief.md
*.log
npm-debug.log*
yarn-debug.log*
/benchmark-results/

# Local-only files
*.bak
tmp/
temp/

.mastra
.bigset/

# Local tarballs
*.tgz

# Internal docs
BigSet Technical Specs & Goals.md
BigSet Technical Specs & Goals.md
8 changes: 7 additions & 1 deletion CLAUDE.md
Original file line number Diff line number Diff line change
Expand Up @@ -29,7 +29,7 @@ Backend is Fastify + Mastra. Fastify serves the HTTP API (Clerk JWT auth on prot

The schema inference pipeline: frontend calls `POST /infer-schema` → Fastify verifies the Clerk JWT → calls `inferSchema()` in `backend/src/pipeline/schema-inference.ts` → Claude Sonnet 4.6 via OpenRouter → returns a Zod-validated `DatasetSchema` → frontend maps it to editable columns in the wizard.

The populate pipeline: frontend calls `POST /populate` with `{ datasetId, datasetName, description, columns }` → Fastify verifies the Clerk JWT → triggers `populateWorkflow` which: (1) clears existing rows, (2) builds a prompt from the schema, (3) runs the populate agent (Claude Sonnet 4.6) which searches the web via TinyFish APIs, then inserts rows into Convex one by one. Rows appear in realtime on the frontend via Convex reactive queries.
The populate pipeline: frontend calls `POST /populate` with `{ datasetId, datasetName, description, columns }` → Fastify verifies the Clerk JWT → runs the self-healing populate service. The service builds or reuses a recipe, runs the Mastra populate runtime against TinyFish search/fetch, validates source-backed rows, repairs bad recipes, promotes the passing recipe, then atomically replaces the dataset rows in Convex. Rows appear in realtime on the frontend via Convex reactive queries.

Convex functions use `ctx.auth.getUserIdentity()` to get the authenticated user. The `ownerId` field on datasets stores `identity.subject` (Clerk user ID). Do not pass `ownerId` from the client.

Expand All @@ -49,4 +49,10 @@ Convex is self-hosted — it does NOT hot-reload when you edit files in `fronten

In CI/prod, run `npx convex deploy` with `CONVEX_SELF_HOSTED_URL` and `CONVEX_SELF_HOSTED_ADMIN_KEY` set as env vars.

## Self-Healing Verification

Run `make verify-self-healing` before handing the stack to another agent. It runs backend tests, backend build, adapter syntax checks, and a no-key benchmark smoke that should block cleanly without spending API credits.

Use `bash scripts/verify-self-healing-stack.sh --real-benchmark` for the 2-prompt real Mastra benchmark, and `bash scripts/verify-self-healing-stack.sh --convex-push --dataset-id <dataset-id>` for a live app dataset dry-run. Export the required env vars before live modes; the verifier does not parse secret files itself. Add `--commit` only when you intentionally want to replace rows.

This is an open-source (AGPL) project. Do not commit secrets, API keys, or internal docs.
1 change: 1 addition & 0 deletions backend/.env.example
Original file line number Diff line number Diff line change
@@ -1,6 +1,7 @@
CLIENT_ORIGIN=http://localhost:3500
CONVEX_URL=http://localhost:3210
PORT=3501
POPULATE_RECIPE_STORE_DIR=.bigset/populate-recipes

# Required once the backend starts writing rows via internal Convex mutations.
# Generate with: docker compose exec convex ./generate_admin_key.sh
Expand Down
114 changes: 114 additions & 0 deletions backend/BigSet_Data_Collection_Agent/src/acquisition/link-follow.ts
Original file line number Diff line number Diff line change
@@ -0,0 +1,114 @@
import type { FetchedPage } from "../models/schemas.js";
import type { WorkflowMemory } from "../memory/types.js";
import { domainMemoryBoost } from "../memory/workflow-memory.js";
import { getDomain, normalizeUrl } from "../utils/url.js";

const SKIP_HOST =
/(?:facebook|twitter|x\.com|instagram|youtube|tiktok|pinterest|reddit\.com\/r\/|linkedin\.com\/in\/|accounts\.google|login|signin|signup|register|cookie|privacy|terms|cdn\.|static\.|fonts\.)/i;
const SKIP_EXT = /\.(?:pdf|zip|png|jpe?g|gif|svg|webp|css|js|woff2?|xml|mp4|mp3)(?:\?|$)/i;
const POSITIVE_PATH =
/\/(?:blog|news|docs|documentation|pricing|billing|investor|investors|earnings|financial|reports|press|release|releases|mcp|model-context-protocol|agents|company|companies|startup|startups|portfolio|team|about|careers|jobs|directory|list|batch|founder|org|organization|profile|detail|view)(?:\/|$|\?)/i;
const NEGATIVE_PATH =
/\/(?:tag|tags|category|categories|author|feed|rss|search|wp-admin|wp-content)(?:\/|$|\?)/i;

export interface LinkFollowOptions {
pages: FetchedPage[];
excludeUrls: Set<string>;
focusFields?: string[];
maxTotal: number;
maxPerSource: number;
memory?: WorkflowMemory;
}

function pathTokensFromFields(fields?: string[]): string[] {
if (!fields?.length) return [];
return fields
.flatMap((field) =>
field
.split(/[_\s-]+/)
.map((part) => part.toLowerCase())
.filter((part) => part.length > 3),
)
.slice(0, 12);
}

function scoreLink(
link: string,
sourceDomain: string,
focusTokens: string[],
memory?: WorkflowMemory,
): number {
let score = 0;

try {
const parsed = new URL(link);
const host = parsed.hostname.toLowerCase();
const path = `${parsed.pathname}${parsed.search}`.toLowerCase();

if (SKIP_HOST.test(host) || SKIP_EXT.test(path)) return -1000;
if (NEGATIVE_PATH.test(path)) score -= 2;
if (POSITIVE_PATH.test(path)) score += 4;

const linkDomain = getDomain(link);
if (linkDomain === sourceDomain) score += 3;
else if (linkDomain.endsWith(`.${sourceDomain}`) || sourceDomain.endsWith(`.${linkDomain}`)) {
score += 2;
}

for (const token of focusTokens) {
if (path.includes(token)) score += 2;
}

if (memory) score += domainMemoryBoost(memory, linkDomain);

if (path.length > 120) score -= 1;
if (parsed.hash.length > 1) score -= 1;
} catch {
return -1000;
}

return score;
}

/** Pick outbound links from high-value pages using URL heuristics only. */
export function selectOutboundLinksToFollow(
options: LinkFollowOptions,
): string[] {
const focusTokens = pathTokensFromFields(options.focusFields);
const selected: string[] = [];
const selectedSet = new Set<string>();

const pagesWithLinks = options.pages
.filter((page) => !page.error && page.outbound_links && page.outbound_links.length > 0)
.sort((a, b) => (b.outbound_links?.length ?? 0) - (a.outbound_links?.length ?? 0));

for (const page of pagesWithLinks) {
const sourceUrl = normalizeUrl(page.final_url || page.url);
const sourceDomain = getDomain(sourceUrl);
let perSource = 0;

const ranked = [...(page.outbound_links ?? [])]
.map((link) => ({
link,
score: scoreLink(link, sourceDomain, focusTokens, options.memory),
}))
.filter((item) => item.score > 0)
.sort((a, b) => b.score - a.score);

for (const { link } of ranked) {
if (selected.length >= options.maxTotal) return selected;
if (perSource >= options.maxPerSource) break;

const normalized = normalizeUrl(link);
if (options.excludeUrls.has(normalized)) continue;
if (selectedSet.has(normalized)) continue;
if (normalized === sourceUrl) continue;

selectedSet.add(normalized);
selected.push(link);
perSource += 1;
}
}

return selected;
}
64 changes: 64 additions & 0 deletions backend/BigSet_Data_Collection_Agent/src/agents/agent-goal.ts
Original file line number Diff line number Diff line change
@@ -0,0 +1,64 @@
import { completeJson } from "../integrations/openrouter.js";
import {
memoryContextForAgents,
type WorkflowMemory,
} from "../memory/index.js";
import { agentGoalSchema, type AgentGoal } from "../models/schemas.js";
import type { DatasetSpec, SourceTriageResult } from "../models/schemas.js";

const AGENT_GOAL_SYSTEM = `You are the Navigation Task Agent for a web data collection pipeline.

Write a Tinyfish Agent goal: a clear natural-language instruction for browser automation on the given URL.

The agent must navigate the site and return structured JSON with extracted data matching the dataset schema.

Rules:
- Be specific about what to click, search, filter, or paginate.
- State the exact JSON shape to return: { "records": [ { column_name: value, ... } ] }
- Include column names from the schema in the goal.
- For forms: describe fields to fill and how to submit.
- For detail follow-up: explain how to open each item and which fields to collect.
- Limit scope (e.g. first 25 rows) to keep runs reliable.
- Do not invent data; extract only what is visible on the site.
- When workflow_memory is provided, reuse goal patterns from agent_goal_stats_top (high avg_completeness/confidence); avoid domains in domain_stats_weak unless diagnosis says otherwise.
- If latest_diagnosis.prefer_tinyfish_agent or agent_strategy_notes exist, follow them.
- Return ONLY JSON with fields: goal, rationale`;

export async function generateAgentGoal(options: {
userPrompt: string;
spec: DatasetSpec;
triage: SourceTriageResult;
focusFields?: string[];
memory?: WorkflowMemory;
}): Promise<AgentGoal> {
const columnList = options.spec.columns
.map((c) => `${c.name} (${c.type}${c.required ? ", required" : ""})`)
.join(", ");

return completeJson({
label: `agent_goal:${options.triage.final_url}`,
schema: agentGoalSchema,
messages: [
{ role: "system", content: AGENT_GOAL_SYSTEM },
{
role: "user",
content: JSON.stringify({
user_prompt: options.userPrompt,
triage_status: options.triage.status,
triage_reasoning: options.triage.reasoning,
suggested_action: options.triage.suggested_action,
page_url: options.triage.final_url,
page_title: options.triage.title,
row_grain: options.spec.row_grain,
columns: columnList,
focus_fields: options.focusFields ?? [],
extraction_hints: options.spec.extraction_hints,
workflow_memory: options.memory
? memoryContextForAgents(options.memory)
: undefined,
output_shape: { goal: "string", rationale: "string" },
}),
},
],
});
}
94 changes: 94 additions & 0 deletions backend/BigSet_Data_Collection_Agent/src/agents/benchmark-spec.ts
Original file line number Diff line number Diff line change
@@ -0,0 +1,94 @@
import type { ColumnDef, DatasetSpec } from "../models/schemas.js";
import { normalizeSpecColumnOrder } from "./dataset-spec.js";

/** Benchmark harness fields from prompts.json (via env in adapters). */
export interface BenchmarkSpecContext {
promptId?: string;
promptQuality?: string;
persona?: string;
expectedStress?: string;
requiredColumns: string[];
}

export function hasBenchmarkRequiredColumns(
context?: BenchmarkSpecContext,
): context is BenchmarkSpecContext & { requiredColumns: string[] } {
return Boolean(context?.requiredColumns?.length);
}

/** Parse comma-separated column names (CLI flag or benchmark env). */
export function parseRequiredColumns(value: string): string[] {
const columns = value
.split(",")
.map((name) => name.trim())
.filter(Boolean);
if (columns.length === 0) {
throw new Error(
"Required columns must include at least one non-empty column name.",
);
}
return columns;
}

/**
* Ensures every benchmark-required column name exists on the spec as required.
* Types and descriptions come from the dataset-spec LLM when present; otherwise
* minimal placeholders (no per-column name heuristics).
*/
export function mergeSpecWithBenchmarkRequiredColumns(
spec: DatasetSpec,
context: BenchmarkSpecContext,
): DatasetSpec {
const requiredColumns = context.requiredColumns;
const columnsByName = new Map(spec.columns.map((column) => [column.name, column]));

const requiredColumnDefs: ColumnDef[] = requiredColumns.map((name) => {
const existing = columnsByName.get(name);
if (existing) {
return { ...existing, required: true };
}
return {
name,
type: "string",
description: name,
required: true,
};
});

const optionalExtras = spec.columns.filter(
(column) => !requiredColumns.includes(column.name),
);

const columns = [...requiredColumnDefs, ...optionalExtras];
const columnNames = new Set(columns.map((column) => column.name));

const isEntityLikeColumn = (name: string): boolean =>
/(entity|company|organization|business|restaurant|bakery|provider|product|name|title)/i.test(
name,
);

const dedupeKey =
requiredColumns.find(
(name) => columnNames.has(name) && isEntityLikeColumn(name),
) ??
spec.dedupe_keys.find((key) => columnNames.has(key)) ??
requiredColumns.find((name) => columnNames.has(name)) ??
spec.dedupe_keys[0];

const extractionHints = [
spec.extraction_hints,
`Benchmark required columns (use as exact row keys): ${requiredColumns.join(", ")}.`,
context.expectedStress
? `Benchmark stress note: ${context.expectedStress}`
: undefined,
]
.filter(Boolean)
.join("\n");

return normalizeSpecColumnOrder({
...spec,
columns,
dedupe_keys: dedupeKey ? [dedupeKey] : spec.dedupe_keys,
extraction_hints: extractionHints,
});
}
Loading