Three-tier populate agent: triage-extract + investigate subagents#91
Three-tier populate agent: triage-extract + investigate subagents#91MMeteorL wants to merge 12 commits into
Conversation
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
First batch: 5 searches → top 5 URLs → 5 parallel extract_rows. Subsequent batches: up to 20 searches (from leads + new angles) → top 10 URLs → up to 10 parallel extract_rows → list_rows → repeat. Mirrors the original investigate_row dispatch discipline but scaled for the triage-extract architecture. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughThis PR refactors the dataset population pipeline from a single investigate-row flow into a tri-agent orchestration: a populate orchestrator dispatches URL batches to an extract agent that inserts/updates rows, while a list_rows tool tracks completeness and an investigate_entity tool fills missing columns in parallel. The extract tool suite introduces per-run shared row indexing, confidence-aware merging, and deduplication via a pendingInserts set. Two new Convex mutations provide atomic per-field merging (mergeUpdate) and incomplete-row cleanup (deleteIncomplete). A new model-middleware module repairs double-encoded JSON in tool-call arguments via global JSON.parse patching and per-stream interception. The populate agent now accepts a targetRows cap (default 20), enforced in the extract_rows tool, and the background handler prunes incomplete rows after population completes. Search results optionally include site_name, and comprehensive documentation updates describe the new architecture and merge semantics. Sequence Diagram(s)sequenceDiagram
participant Orchestrator
participant SearchWeb
participant ExtractRows
participant ExtractAgent
participant RowIndex as RowIndex<br/>(in-memory)
participant Convex
participant ListRows
participant InvestigateEntity
participant InvestigateAgent
loop Per iteration until done
Orchestrator->>SearchWeb: parallel searches
SearchWeb-->>Orchestrator: leads with URLs
Orchestrator->>ExtractRows: batch promising URLs
ExtractRows->>RowIndex: refresh from Convex
ExtractRows->>ExtractAgent: extract from batch
ExtractAgent->>Convex: batch_insert_rows (new/merges)
Convex->>RowIndex: sync inserted/merged rows
Orchestrator->>ListRows: report progress
ListRows-->>Orchestrator: completed vs incomplete count
par Investigate incomplete rows
Orchestrator->>InvestigateEntity: research missing columns
InvestigateEntity->>RowIndex: get row context
InvestigateEntity->>InvestigateAgent: fill via search
InvestigateAgent->>Convex: update_row_by_key (confidence merge)
Convex->>RowIndex: sync updated cells
end
end
Orchestrator-->>Orchestrator: stop (targetRows reached or stagnant)
Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 6
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
backend/src/mastra/agents/investigate.ts (1)
10-65:⚠️ Potential issue | 🟠 Major | 🏗️ Heavy liftThis refactor breaks the repo’s investigate-subagent contract.
The instructions and tool wiring now enforce an
update_row_by_key-only agent, but this file is still documented to expose the four-tool investigate surface (insert_row,list_rows,search_web,fetch_page) frombuildInvestigateAgent(authorizedDatasetId, authContext, columns). If this contract is intentionally changing, the repository guideline and any dependent callers/tests need to move in the same PR; otherwise this agent should keep the required interface.
As per coding guidelines, "Use buildInvestigateAgent(authorizedDatasetId, authContext, columns) factory to build per-entity subagents with captured dataset id for capability scoping" and "Investigate subagent must have exactly 4 tools:insert_row,list_rows,search_web,fetch_pageand must return structured feedback (INSERTED/SUMMARY/CLUES/REASON)".Also applies to: 79-93
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@backend/src/mastra/agents/investigate.ts` around lines 10 - 65, The refactor changed the investigate agent's behavior to only call update_row_by_key, breaking the required investigate-subagent contract; restore the original 4-tool interface and structured feedback by updating buildInvestigateInstructions and the agent factory to document and expose the four tools (insert_row, list_rows, search_web, fetch_page) and ensure the agent implementation created by buildInvestigateAgent(authorizedDatasetId, authContext, columns) uses those tools (not just update_row_by_key) and returns the exact final structured output (INSERTED/SUMMARY/CLUES/REASON); if the contract is intentionally changing, instead update all dependent callers/tests and repository guidelines consistently to reflect the new single-tool behavior but do not leave this file inconsistent.
🧹 Nitpick comments (1)
backend/package.json (1)
30-30: Confirmtypescript@5.8.3validity/security; flag non-security compiler regressions
typescript@5.8.3is a published npm version; with^5.8.3it will stay on the5.xline (won’t float to6.x).- No known security vulnerabilities/CVEs are reported for
typescript@5.8.3in common vulnerability feeds.- Reported issues for TS 5.8.3 include compiler/typecheck crashes/regressions (reliability, not security); if CI/build failures occur, consider upgrading within the
5.xline.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@backend/package.json` at line 30, The package.json currently specifies "typescript": "^5.8.3"; confirm this version is acceptable by (1) checking official npm/CVE/advisory feeds for any reported security vulnerabilities for typescript@5.8.3 and documenting that none were found, (2) running the project CI/build and full test/typecheck to detect any compiler/regression issues caused by 5.8.3, and (3) if build/typecheck failures occur, update package.json to a newer safe 5.x release (and update lockfile) rather than jumping to 6.x; if everything passes, leave "^5.8.3" but note in the PR that regressions are a known non-security risk.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@backend/src/env.ts`:
- Around line 28-31: POPULATE_TARGET_ROWS is being set via
Number(process.env.BIGSET_POPULATE_TARGET_ROWS||"20") without validation, which
can yield NaN or non-positive values; update the initialization so you parse and
validate BIGSET_POPULATE_TARGET_ROWS (use parseInt or Number), ensure
Number.isInteger(value) and value > 0, and if invalid either fallback to the
default 20 or throw a clear error/log (modify the POPULATE_TARGET_ROWS
assignment and add validation logic around
process.env.BIGSET_POPULATE_TARGET_ROWS to enforce a positive integer).
In `@backend/src/mastra/agents/populate.ts`:
- Around line 12-72: The current orchestrator text in
buildOrchestratorInstructions changes the agent surface by referencing
extract_rows and list_rows and delegating fetches/writes; revert it to the
repository contract by using buildPopulateAgent(authorizedDatasetId,
authContext, columns) to construct the orchestrator and ensure the instructions
only reference the three authorized tools: search_web, fetch_page, and
investigate_row (remove references to extract_rows, list_rows, and any
delegation that performs fetches or writes); update any logic or stop-condition
wording to rely on list_rows/fetch_page/investigate_row behavior prescribed by
buildPopulateAgent and preserve the search-only boundary.
In `@backend/src/mastra/tools/investigate-tool.ts`:
- Around line 155-183: The insert path can overshoot targetRows because the
extract_rows check runs elsewhere; inside the execute handler for insert_row
(the async execute with params { primary_key, confidence, sources, data }) add a
guard right before calling convex.mutation(internal.datasetRows.insert, ...)
that checks the current completed count (use rowIndex.size or the same
completed-row counter you maintain) against targetRows and aborts the insert if
the cap is reached (log via logCtx and return a failure/skip result). Ensure you
perform this check before the mutation and only call rowIndex.set(primary_key,
...) after a successful insert; reference: execute, primary_key, targetRows,
rowIndex, convex.mutation(internal.datasetRows.insert), authorizedDatasetId.
- Around line 431-452: The investigate subagent is being spawned with
buildInvestigateAgent and limited to maxSteps: 20; update the launch to use
buildInvestigateTool(authorizedDatasetId, authContext, columns) and call its
generate with maxSteps: 25 instead of 20, then pass the resulting text into
parseInvestigateOutput as before (ensure you're replacing buildInvestigateAgent
and the agent.generate(..., { maxSteps: 20 }) call with the
buildInvestigateTool(...) invocation and generate(..., { maxSteps: 25 }) so the
subagent conforms to the required 25-step budget).
In `@frontend/convex/datasetRows.ts`:
- Around line 199-202: The completeness check in isComplete currently tests
every name in args.columnNames against data, but the contract excludes fields
starting with '_' from being required; update the logic in the isComplete
computation to first filter args.columnNames to exclude any column name that
starts with '_' (e.g., col.startsWith('_')) and then run the every(...) check
against data so underscore-prefixed fields are never considered required when
determining completeness.
---
Outside diff comments:
In `@backend/src/mastra/agents/investigate.ts`:
- Around line 10-65: The refactor changed the investigate agent's behavior to
only call update_row_by_key, breaking the required investigate-subagent
contract; restore the original 4-tool interface and structured feedback by
updating buildInvestigateInstructions and the agent factory to document and
expose the four tools (insert_row, list_rows, search_web, fetch_page) and ensure
the agent implementation created by buildInvestigateAgent(authorizedDatasetId,
authContext, columns) uses those tools (not just update_row_by_key) and returns
the exact final structured output (INSERTED/SUMMARY/CLUES/REASON); if the
contract is intentionally changing, instead update all dependent callers/tests
and repository guidelines consistently to reflect the new single-tool behavior
but do not leave this file inconsistent.
---
Nitpick comments:
In `@backend/package.json`:
- Line 30: The package.json currently specifies "typescript": "^5.8.3"; confirm
this version is acceptable by (1) checking official npm/CVE/advisory feeds for
any reported security vulnerabilities for typescript@5.8.3 and documenting that
none were found, (2) running the project CI/build and full test/typecheck to
detect any compiler/regression issues caused by 5.8.3, and (3) if
build/typecheck failures occur, update package.json to a newer safe 5.x release
(and update lockfile) rather than jumping to 6.x; if everything passes, leave
"^5.8.3" but note in the PR that regressions are a known non-security risk.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 32a5a17c-21f2-4a1e-a10e-f543d896dc3b
⛔ Files ignored due to path filters (1)
backend/package-lock.jsonis excluded by!**/package-lock.json
📒 Files selected for processing (10)
backend/package.jsonbackend/src/env.tsbackend/src/index.tsbackend/src/mastra/agents/investigate.tsbackend/src/mastra/agents/populate.tsbackend/src/mastra/agents/triage-extract.tsbackend/src/mastra/tools/investigate-tool.tsbackend/src/mastra/tools/web-tools.tsbackend/src/mastra/workflows/populate.tsfrontend/convex/datasetRows.ts
Remove the hard top-5 / top-10 count limits per batch. Instead, dispatch every URL that clears a quality bar (relevance, data value, source authority, novelty against already-complete rows). Consolidate the redundant second/subsequent batch sections into a single loop. Steer searches using list_rows output to avoid re-searching complete entities. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
backend/src/mastra/agents/populate.ts (1)
91-98:⚠️ Potential issue | 🟠 Major | ⚡ Quick winHarden
targetRowsparsing to prevent disabled caps on invalid env values.
Number(process.env.BIGSET_POPULATE_TARGET_ROWS || "20")can yieldNaN/0/negative values, which can break cap/stop behavior and allow runaway extraction.💡 Minimal fix
+const DEFAULT_TARGET_ROWS = (() => { + const parsed = Number(process.env.BIGSET_POPULATE_TARGET_ROWS); + return Number.isFinite(parsed) && parsed > 0 ? Math.floor(parsed) : 20; +})(); + export function buildPopulateAgent( authorizedDatasetId: string, authContext: AuthContext, columns: PopulateColumn[], - targetRows: number = Number(process.env.BIGSET_POPULATE_TARGET_ROWS || "20"), + targetRows: number = DEFAULT_TARGET_ROWS, ): Agent {🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@backend/src/mastra/agents/populate.ts` around lines 91 - 98, The targetRows env parsing is unsafe and can produce NaN/0/negative values which disable extraction caps; update the parsing for targetRows (the variable passed into buildExtractTool) to robustly parse and validate the environment value (use parseInt/Number, ensure it's a finite positive integer, floor it) and fall back to the default 20 when the parsed value is invalid or <= 0; optionally clamp to a sensible maximum before passing into buildExtractTool to avoid runaway extraction.
♻️ Duplicate comments (1)
backend/src/mastra/agents/populate.ts (1)
31-56:⚠️ Potential issue | 🟠 Major | 🏗️ Heavy liftRestore the orchestrator’s required tool surface (
search_web,fetch_page,investigate_row).This change moves orchestration to
extract_rows/list_rows, which violates the required populate-agent contract and removes the mandated no-write boundary at the orchestrator layer.As per coding guidelines, "Populate orchestrator agent must have exactly 3 tools:
search_web,fetch_page,investigate_rowwith no write access".Also applies to: 105-109
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@backend/src/mastra/agents/populate.ts` around lines 31 - 56, The orchestrator removed the mandated tool surface and must be restored to expose exactly the three read-only tools: search_web, fetch_page, and investigate_row and must not perform any write operations itself; revert or refactor the current orchestration flow that moved responsibilities into extract_rows/list_rows so that populate (or the populate-agent orchestrator function) only calls search_web to get URL candidates, calls fetch_page to retrieve page content, and calls investigate_row to analyze content and emit leads, and ensure extract_rows/list_rows remain downstream workers (or are called by investigate_row) rather than replacing the orchestrator tools; locate usages of extract_rows, list_rows, and the populate-orchestrator function and change them so the orchestrator exposes and calls the three tool functions (search_web, fetch_page, investigate_row) and contains no writes.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@backend/src/mastra/agents/populate.ts`:
- Around line 91-98: The targetRows env parsing is unsafe and can produce
NaN/0/negative values which disable extraction caps; update the parsing for
targetRows (the variable passed into buildExtractTool) to robustly parse and
validate the environment value (use parseInt/Number, ensure it's a finite
positive integer, floor it) and fall back to the default 20 when the parsed
value is invalid or <= 0; optionally clamp to a sensible maximum before passing
into buildExtractTool to avoid runaway extraction.
---
Duplicate comments:
In `@backend/src/mastra/agents/populate.ts`:
- Around line 31-56: The orchestrator removed the mandated tool surface and must
be restored to expose exactly the three read-only tools: search_web, fetch_page,
and investigate_row and must not perform any write operations itself; revert or
refactor the current orchestration flow that moved responsibilities into
extract_rows/list_rows so that populate (or the populate-agent orchestrator
function) only calls search_web to get URL candidates, calls fetch_page to
retrieve page content, and calls investigate_row to analyze content and emit
leads, and ensure extract_rows/list_rows remain downstream workers (or are
called by investigate_row) rather than replacing the orchestrator tools; locate
usages of extract_rows, list_rows, and the populate-orchestrator function and
change them so the orchestrator exposes and calls the three tool functions
(search_web, fetch_page, investigate_row) and contains no writes.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: b0a10655-2fca-4ffc-82f7-b3c3f0950c14
📒 Files selected for processing (1)
backend/src/mastra/agents/populate.ts
- env.ts: validate BIGSET_POPULATE_TARGET_ROWS — NaN, zero, and negative values now fall back to the default of 20 instead of silently breaking the agent's stop condition. - datasetRows.mergeUpdate (new Convex internalMutation): atomic per-field blank-aware merge. Blank cells are always filled with any non-empty incoming value regardless of confidence; non-blank cells are only overwritten when newConfidence > existing row confidence. The authoritative check lives in Convex so two concurrent investigate agents can't both pass a stale client-side confidence check and race to write — the compare-and-merge is serialized inside a single transaction. Quota is charged only on actual changes; history entries are recorded per changed field. - investigate-tool.ts / buildUpdateRowByKeyTool: drop the row-wide confidence pre-check (which blocked filling blank columns on partial high-confidence rows) and call mergeUpdate instead of update. The local rowIndex is updated optimistically using the same per-field rules so subsequent calls within the run stay consistent without an extra Convex round-trip. - datasetRows.deleteIncomplete: skip _-prefixed keys (_confidence, _sources, etc.) in the completeness check — they are internal fields that are never required columns. - backend/CLAUDE.md: rewrite Mastra section to reflect the tri-agent architecture and new confidence/merge semantics. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
One textual conflict in investigate-tool.ts: commit 1e9cc8b on main ("Unify local env loading") fixed a redundant `?? false` in the old parseInvestigateResult return block. Our branch had already replaced that entire function with parseTriageExtractOutput and parseInvestigateOutput. Git's heuristic matched the `return { ... }` structure across the two different functions and flagged a false conflict. Resolution: keep HEAD (our new parsers are correct; the old function no longer exists). All other files merged cleanly: - backend/src/env.ts: main's dotenv path consolidation + our POPULATE_TARGET_ROWS validation both land without conflict. - frontend/convex/datasetRows.ts: main's rowCount denormalized counter (insert/remove/clearByDataset) merges alongside our mergeUpdate mutation and deleteIncomplete fix. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
PR #85 ("Make populate status truthful") refactored the /populate route from synchronous (awaiting the workflow inline) to async: it now claims the dataset, creates a run, fires runPopulateWorkflowInBackground() as a background void promise, and returns 202 immediately. Our branch still had the old synchronous pattern with the post-workflow notification and status-update logic inline in the route handler — including our deleteIncomplete pruning addition. The conflict was the entire old inline block vs main's single 202 return. Resolution: - Take main's 202 return — the async pattern is strictly better. - Move the deleteIncomplete pruning into runPopulateWorkflowInBackground() (already introduced by main), placed before the countByDataset call so we count only complete rows when deciding whether to set status "live". All other files (populate.ts workflow, frontend status/schema changes) merged cleanly. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
backend/src/env.ts (1)
37-40:⚠️ Potential issue | 🟠 Major | ⚡ Quick winFix
POPULATE_TARGET_ROWSvalidation to reject non-integers before flooring.Current logic accepts fractional values and can produce
0(e.g.,0.5 -> 0), which can prematurely terminate populate runs. It also contradicts the comment that non-integers fallback to20.💡 Suggested patch
POPULATE_TARGET_ROWS: (() => { const parsed = Number(process.env.BIGSET_POPULATE_TARGET_ROWS); - return Number.isFinite(parsed) && parsed > 0 ? Math.floor(parsed) : 20; + return Number.isInteger(parsed) && parsed > 0 ? parsed : 20; })(),🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@backend/src/env.ts` around lines 37 - 40, POPULATE_TARGET_ROWS currently accepts fractional numbers because it checks Number.isFinite(parsed) then floors, which lets values like 0.5 become 0; change the validation so only integer positive values are accepted (use Number.isInteger(parsed) && parsed > 0) and return parsed as-is (or Math.floor only if you want to coerce negatives handled) otherwise fall back to 20; update the initialization for POPULATE_TARGET_ROWS in env.ts to perform Number.isInteger(parsed) && parsed > 0 before using the value.frontend/convex/datasetRows.ts (1)
334-360:⚠️ Potential issue | 🟠 Major | ⚡ Quick win
deleteIncompletedoes not update the cachedrowCount, causing stale dashboard counts.
insert,remove, andclearByDatasetall maintain the denormalizedrowCountfield on the dataset document. However,deleteIncompletecallsctx.db.delete(row._id)directly without decrementing the counter. After pruning, the dashboard will display a higher row count than actually exists until the next mutation that self-heals.🐛 Proposed fix: update rowCount after deleting rows
export const deleteIncomplete = internalMutation({ args: { datasetId: v.id("datasets"), columnNames: v.array(v.string()), }, handler: async (ctx, args) => { const rows = await ctx.db .query("datasetRows") .withIndex("by_dataset", (q) => q.eq("datasetId", args.datasetId)) .collect(); let deletedCount = 0; for (const row of rows) { const data = row.data as Record<string, unknown>; const isComplete = args.columnNames.every((col) => { if (col.startsWith("_")) return true; // skip internal fields (_confidence, _sources, etc.) const val = data[col]; return val !== null && val !== undefined && val !== ""; }); if (!isComplete) { await ctx.db.delete(row._id); deletedCount++; } } + + // Update the cached rowCount to stay in sync with actual rows. + if (deletedCount > 0) { + const dataset = await ctx.db.get(args.datasetId); + if (dataset) { + const newCount = + typeof dataset.rowCount === "number" + ? Math.max(0, dataset.rowCount - deletedCount) + : rows.length - deletedCount; // self-heal if rowCount was missing + await ctx.db.patch(args.datasetId, { rowCount: newCount }); + } + } + return { deletedCount }; }, });🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@frontend/convex/datasetRows.ts` around lines 334 - 360, The deleteIncomplete mutation removes rows but never updates the dataset's denormalized rowCount, leaving stale counts; modify deleteIncomplete to mirror the behavior of insert/remove/clearByDataset by decrementing the dataset document's rowCount after deletions (use the same DB update/patch API used by insert/remove to subtract deletedCount for args.datasetId), or update the rowCount incrementally as you delete each row so the dataset's rowCount stays consistent with the actual rows.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@backend/src/env.ts`:
- Around line 37-40: POPULATE_TARGET_ROWS currently accepts fractional numbers
because it checks Number.isFinite(parsed) then floors, which lets values like
0.5 become 0; change the validation so only integer positive values are accepted
(use Number.isInteger(parsed) && parsed > 0) and return parsed as-is (or
Math.floor only if you want to coerce negatives handled) otherwise fall back to
20; update the initialization for POPULATE_TARGET_ROWS in env.ts to perform
Number.isInteger(parsed) && parsed > 0 before using the value.
In `@frontend/convex/datasetRows.ts`:
- Around line 334-360: The deleteIncomplete mutation removes rows but never
updates the dataset's denormalized rowCount, leaving stale counts; modify
deleteIncomplete to mirror the behavior of insert/remove/clearByDataset by
decrementing the dataset document's rowCount after deletions (use the same DB
update/patch API used by insert/remove to subtract deletedCount for
args.datasetId), or update the rowCount incrementally as you delete each row so
the dataset's rowCount stays consistent with the actual rows.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: fa06ba5d-6848-4387-adb5-d3e8897a9229
📒 Files selected for processing (5)
backend/package.jsonbackend/src/env.tsbackend/src/index.tsbackend/src/mastra/workflows/populate.tsfrontend/convex/datasetRows.ts
✅ Files skipped from review due to trivial changes (1)
- backend/package.json
…igation
Replace the triage-extract agent with a leaner extract agent, move
investigate_entity to the orchestrator tier, and add pendingInserts
dedup for concurrent extract agents.
Architecture changes:
- triage-extract.ts → extract.ts (buildExtractAgent): no triage step, no
investigate spawning; receives a batch of 1–5 URLs, fetches all in
parallel, calls batch_insert_rows once with all entities combined, returns
LEADS/SOURCE_QUALITY to the orchestrator
- Orchestrator now owns investigate_entity: after all parallel extract_rows
calls finish, it calls list_rows then emits all investigate_entity calls
simultaneously for every incomplete row
- buildExtractTool now returns { extractRowsTool, listRowsTool,
investigateEntityTool } — the orchestrator receives all three
- pendingInserts Set added to closure: prevents two concurrent extract agents
from double-inserting the same primary key (JS event-loop atomicity
makes the check+add race-safe without Convex schema changes)
- extract_rows input accepts 1–5 URLs per call (was exactly 1)
- buildInsertRowTool removed (dead code)
- BIGSET_POPULATE_TARGET_ROWS env var wired through env.ts → workflow →
buildPopulateAgent → buildExtractTool
- patchMastraSanitizeToolCallInput called at startup in index.ts
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Nitpick comments (2)
backend/src/mastra/tools/model-middleware.ts (2)
57-88: 💤 Low valueConsider removing or reducing the verbosity of debug logging.
Lines 62 and 74 emit console.log for every stream and every tool-call chunk. In production with many concurrent agent runs, this will generate significant log noise. Consider removing line 62 entirely (it fires on every stream) and gating the log on line 74 behind a debug flag or only logging when repair actually occurs.
♻️ Proposed cleanup
export function withToolCallRepair(model: any): any { return wrapLanguageModel({ model, middleware: { wrapStream: async ({ doStream }: any) => { - console.log("[model-middleware] wrapStream called"); const result = await doStream(); const { stream, ...rest } = result; const fixedStream = stream.pipeThrough( new TransformStream({ transform(chunk: any, controller: any) { if ( chunk != null && chunk.type === "tool-call" && typeof chunk.input === "string" ) { - console.log(`[model-middleware] tool-call chunk: ${chunk.toolName} input starts with: ${chunk.input.slice(0, 30)}`); const fixedInput = tryUnwrapDoubleEncodedInput(chunk.input); + if (fixedInput !== chunk.input) { + console.log(`[model-middleware] Repaired tool-call input for ${chunk.toolName}`); + } controller.enqueue({ ...chunk, input: fixedInput }); } else { controller.enqueue(chunk); } }, }), );🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@backend/src/mastra/tools/model-middleware.ts` around lines 57 - 88, Remove the noisy unconditional logging in withToolCallRepair: delete the console.log inside wrapStream that logs every stream, and change the console.log inside the TransformStream transform to only run when a repair actually occurs or when a debug flag is enabled; use the existing tryUnwrapDoubleEncodedInput result to determine if a repair happened (i.e., only log when fixedInput differs from chunk.input) or check a runtime debug flag before logging, keeping the rest of the wrapStream/TransformStream logic intact.
103-146: ⚖️ Poor tradeoffGlobal
JSON.parsemonkey-patching is risky and duplicates logic.Patching the global
JSON.parseaffects every JSON operation in the application—Convex client, Fastify body parsing, all libraries. While the patch is defensive (only activates on parse failure with"-prefixed strings), this pattern makes debugging harder and could have subtle unintended effects.Additionally, lines 123–136 duplicate the extraction logic already in
tryUnwrapDoubleEncodedInput. Reuse the existing helper to maintain a single source of truth.♻️ Proposed refactor to reuse helper
export async function patchMastraSanitizeToolCallInput(): Promise<void> { try { const originalJsonParse = JSON.parse; (JSON as any).parse = function patchedJsonParse(text: string, ...rest: any[]) { try { return originalJsonParse.call(this, text, ...rest); } catch (err) { - // If JSON.parse fails on a string that starts with `"`, try the - // double-encoding recovery: extract the JSON object between the - // first `{` and last `}`. - if (typeof text === "string" && text.startsWith('"')) { - const firstBrace = text.indexOf("{"); - const lastBrace = text.lastIndexOf("}"); - if (firstBrace !== -1 && lastBrace > firstBrace) { - const candidate = text.slice(firstBrace, lastBrace + 1); - try { - const recovered = originalJsonParse.call(this, candidate, ...rest); - console.log( - `[model-middleware/patch] Recovered double-encoded JSON (${candidate.length} chars): ${candidate.slice(0, 60)}...`, - ); - return recovered; - } catch { - // Recovery also failed — re-throw the original error - } - } + if (typeof text === "string") { + const repaired = tryUnwrapDoubleEncodedInput(text); + if (repaired !== text) { + return originalJsonParse.call(this, repaired, ...rest); + } } throw err; } };🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@backend/src/mastra/tools/model-middleware.ts` around lines 103 - 146, patchMastraSanitizeToolCallInput currently monkey-patches global JSON.parse (risking wide side-effects) and reimplements double-encoding extraction logic; instead, remove the global monkey-patch and change patchMastraSanitizeToolCallInput to call the existing tryUnwrapDoubleEncodedInput helper when a parse fails in the local Mastra scope (i.e., detect failure in the context where Mastra processes tool calls), use originalJsonParse for the primary parse attempt, and on failure pass the input to tryUnwrapDoubleEncodedInput to attempt recovery and logging; keep references to originalJsonParse and ensure errors are re-thrown if recovery fails.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@backend/src/mastra/tools/investigate-tool.ts`:
- Around line 236-243: Remove the client-side short-circuit that skips updating
when confidence <= existingEntry.confidence inside the loop in investigate-tool
(the block that pushes skipped.push(primary_key)); instead always call the
server-side atomic mergeUpdate (the function/mutation named mergeUpdate) for
existingEntry cases so blank-field fills and atomic per-field merge rules are
enforced by the Convex mutation; similarly remove the duplicate short-circuit at
the later check (the other confidence comparison around lines referenced) so all
decisioning happens server-side.
- Around line 795-803: The code currently only updates rowIndex when
convexConfidence > existingEntry.confidence, which ignores equal-confidence
updates that may fill blanks; in extract_rows ensure rowIndex is refreshed from
Convex at the start of the function to pick up parallel-agent writes, and change
the update condition inside the loop (where convexConfidence, existingEntry,
rowIndex.set are used) to treat equal-confidence as an update (use >= or
explicitly check for equality and compare cells) so you still overwrite/merge
cells when convexConfidence === existingEntry.confidence.
- Around line 215-227: The intra-batch dedup currently drops later rows when
seenInBatch.has(primary_key) is true; instead, change the logic in
investigate-tool.ts around the loop over rows so duplicates are merged into the
first-seen record: when encountering a row with an existing primary_key, do not
push to skipped, but merge its non-empty fields into the stored record (e.g.,
for scalar fields prefer the non-empty value or choose the max confidence),
merge array/collection fields like sources by deduplicating and concatenating,
and deep-merge the data object fields preferring non-null values; update the
stored entry referenced by primary_key (or maintain a map keyed by primary_key)
so the final collection contains merged entries for persistence.
---
Nitpick comments:
In `@backend/src/mastra/tools/model-middleware.ts`:
- Around line 57-88: Remove the noisy unconditional logging in
withToolCallRepair: delete the console.log inside wrapStream that logs every
stream, and change the console.log inside the TransformStream transform to only
run when a repair actually occurs or when a debug flag is enabled; use the
existing tryUnwrapDoubleEncodedInput result to determine if a repair happened
(i.e., only log when fixedInput differs from chunk.input) or check a runtime
debug flag before logging, keeping the rest of the wrapStream/TransformStream
logic intact.
- Around line 103-146: patchMastraSanitizeToolCallInput currently monkey-patches
global JSON.parse (risking wide side-effects) and reimplements double-encoding
extraction logic; instead, remove the global monkey-patch and change
patchMastraSanitizeToolCallInput to call the existing
tryUnwrapDoubleEncodedInput helper when a parse fails in the local Mastra scope
(i.e., detect failure in the context where Mastra processes tool calls), use
originalJsonParse for the primary parse attempt, and on failure pass the input
to tryUnwrapDoubleEncodedInput to attempt recovery and logging; keep references
to originalJsonParse and ensure errors are re-thrown if recovery fails.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: f7506fc9-a943-4524-a4dc-9ae3c32de468
📒 Files selected for processing (8)
.env.examplebackend/CLAUDE.mdbackend/src/index.tsbackend/src/mastra/agents/extract.tsbackend/src/mastra/agents/populate.tsbackend/src/mastra/tools/investigate-tool.tsbackend/src/mastra/tools/model-middleware.tsbackend/src/mastra/workflows/populate.ts
✅ Files skipped from review due to trivial changes (1)
- backend/CLAUDE.md
| if (existingEntry) { | ||
| // ── Update path: row already exists ──────────────────────────────── | ||
| if (confidence <= existingEntry.confidence) { | ||
| // Equal or higher confidence already on record — nothing to do. | ||
| skipped.push(primary_key); | ||
| continue; | ||
| } | ||
|
|
There was a problem hiding this comment.
Remove the local confidence short-circuit before calling mergeUpdate.
Skipping on cached confidence <= existingEntry.confidence prevents valid blank-field fills that the atomic server merge would accept, and makes decisions from stale in-memory state.
Suggested fix
- if (confidence <= existingEntry.confidence) {
- // Equal or higher confidence already on record — nothing to do.
- skipped.push(primary_key);
- continue;
- }
-
console.log(
`[batch_insert_rows] ${logCtx} pk="${primary_key}" updating ` +
`(confidence ${existingEntry.confidence.toFixed(2)}→${confidence.toFixed(2)})`,
);
try {
- await convex.mutation(internal.datasetRows.mergeUpdate, {
+ const result = await convex.mutation(internal.datasetRows.mergeUpdate, {
id: existingEntry.rowId as any,
expectedDatasetId: authorizedDatasetId,
newData: cleanedData,
newConfidence: confidence,
newSources: sources,
});
+ if (!result.merged) {
+ skipped.push(primary_key);
+ continue;
+ }As per coding guidelines, "Enforce per-field blank-aware merge rules atomically in Convex mutations, not in client-side tool code, to prevent race conditions during parallel agent runs".
Also applies to: 249-255
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@backend/src/mastra/tools/investigate-tool.ts` around lines 236 - 243, Remove
the client-side short-circuit that skips updating when confidence <=
existingEntry.confidence inside the loop in investigate-tool (the block that
pushes skipped.push(primary_key)); instead always call the server-side atomic
mergeUpdate (the function/mutation named mergeUpdate) for existingEntry cases so
blank-field fills and atomic per-field merge rules are enforced by the Convex
mutation; similarly remove the duplicate short-circuit at the later check (the
other confidence comparison around lines referenced) so all decisioning happens
server-side.
… call - All three agents (orchestrator, extract, investigate) now use deepseek/deepseek-v4-pro via OpenRouter - extract_rows input reverted to exactly 1 URL (max(1)) — the orchestrator dispatches one parallel extract_rows call per URL; no batching needed since the orchestrator handles investigate_entity directly - Extract agent instructions updated accordingly (single fetch, single page) - Orchestrator instructions updated: "one URL per call" instead of batches Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…iterations - maxSteps: 40 → 20 for extract agents — prevents paginated browse pages (metacritic /browse/) from spiraling to 38-40 steps (~10 min per call); they now exit at 20 steps and return LEADS for the next iteration instead - Replace full buildExistingRowsText() dump with a compact count + 30-key sample in each extract agent prompt — row dedup is handled by the tool layer (rowIndex + pendingInserts), not by the agent, so the full 300-row dump was wasted context that inflated model processing time - Add URL QUALITY guidance to orchestrator Phase 2 instructions: prefer single-page editorial/list sources (Wikipedia, "best of", rankings); avoid paginated browse/catalog URLs (/browse/, ?page=, ?sort=, etc.) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ise orchestrator steps
Extract agent (agents/extract.ts):
- Hard budget of 2 tool calls: exactly 1 fetch_page + 1 batch_insert_rows
- Explicit: no pagination, no following links — add pagination URLs to LEADS instead
- HARD BUDGET section at top of instructions so the constraint is unmissable
Investigate agent (agents/investigate.ts):
- Remove second search round ("if first round didn't fill everything, search more")
- Procedure is now: 1-2 parallel searches → 1-2 fetches → 1 update_row_by_key → done
- Borrowed from main branch: shorter, more decisive, no retry loop
investigate-tool.ts:
- Extract agent maxSteps: 20 → 5 (2 tool calls + 3 buffer)
- Investigate agent maxSteps: 20 → 8 (search + fetch + update + buffer)
Orchestrator (agents/populate.ts):
- Phase 2: hard cap of ceil(targetRows/4) extract_rows calls per iteration
(was soft "up to" guidance — LLM was ignoring it)
- Phase 4: hard batch of 20 investigate_entity calls per iteration, emitted in
a single parallel response, prioritised by fewest missing columns
- Both caps are called out as HARD LIMITS in the instructions and the RULES block
Workflow (workflows/populate.ts):
- Orchestrator maxSteps: 80 → 150 (headroom for 4+ iterations at ~31 steps each)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…eration counter Root cause (from run logs): the orchestrator dispatched 13 extract_rows calls (hard cap = 5) and the metacritic extract agent made 5 fetch_page calls using all its maxSteps budget before ever calling batch_insert_rows. Both limits were soft LLM instructions — the model ignored them. Single-use fetch_page wrapper (investigate-tool.ts + web-tools.ts + extract.ts): - Extract executeFetchPage() logic into a shared export in web-tools.ts - buildExtractAgent() now receives fetchTool as a parameter (like batchInsertRowsTool) - In extract_rows.execute, create a per-call onceFetchTool that wraps executeFetchPage with a fetchUsed boolean: the second call returns a hard error message telling the agent to call batch_insert_rows immediately with what it already has - This makes the "one fetch per extract agent" limit physically unbypassable Per-iteration extract counter (investigate-tool.ts): - Add iterationExtractCount and MAX_EXTRACT_PER_ITER = ceil(targetRows/4) to closure - Synchronous check+increment before the first await is atomic in JS's event loop (same pattern as pendingInserts) — safe under parallel extract_rows calls - Calls beyond the cap return immediately with the URL as a LEAD for next iteration - list_rows.execute resets the counter (list_rows is called once at Phase 2→3 boundary) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…epseek-v4-pro) Delete backend/src/mastra/tools/model-middleware.ts and remove its startup call from index.ts. The file contained two kimi-k2-specific workarounds: - withToolCallRepair (dead code — no agent ever used it) - patchMastraSanitizeToolCallInput (patched global JSON.parse to recover double-encoded tool-call arguments that kimi-k2 emitted) deepseek/deepseek-v4-pro does not exhibit the double-encoding bug so both repairs are unnecessary. Removing them eliminates the monkey-patch on the global JSON.parse and the now-spurious TypeScript error on the wrapStream middleware type. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Summary
Restructures the AI data-collection pipeline from a two-tier system (orchestrator → investigate) into a three-tier system (orchestrator → triage-extract → investigate), improving data quality, reducing per-agent context load, and creating a structured extension point for future browser/TinyFish agent routing.
Architecture before → after
Before
After
Feature changes
1. Triage-Extract Agent —
agents/triage-extract.ts(new)Replaces the previous multi-URL extract agent. Each call receives exactly one URL. Its first job is triage: after fetching the page it classifies it into one of five statuses before deciding what to do next.
extract_nowneeds_browser_agentneeds_form_filllow_valueblockedWhy the triage step matters beyond skipping bad pages: breaking each URL into its own short-lived agent keeps context windows small (one fetch ≈ 15 K chars instead of accumulating across multiple fetches), which was the primary cause of the previous agent hitting its step cap before writing any rows. More importantly, the five statuses create a structured dispatch table for future agent routing — pages classified as
needs_browser_agentorneeds_form_fillcan be handed off to a TinyFish agent or Playwright browser agent in a future iteration without changing either the orchestrator or the investigate layer. The triage result is returned to the orchestrator as a structured field, giving that routing path a clean hook to attach to.After classifying a page as
extract_now, the agent reads the full page first, identifies all matching entities, then batch-inserts all rows, then callsinvestigate_entityfor each row with missing columns — avoiding interleaved read/write step churn.2. Investigate Agent —
agents/investigate.ts(restored + adapted)Restored to the autonomous single-entity researcher pattern from the main branch. Key adaptation: the row already exists (inserted by the triage-extract agent), so the investigate agent works exclusively with
update_row_by_key, targeting specific missing columns. It runs parallel web searches, fetches the most promising pages, and fills the gaps — the same deep-research quality as the original, now triggered precisely when and where data is missing.3. Populate Orchestrator —
agents/populate.ts(updated)Search-dispatch cadence:
extract_rowscalls →list_rowslist_rows→ repeatQuality threshold replaces fixed count caps: instead of "select top 5" or "select top 10," the orchestrator dispatches every URL that clears a quality bar — relevance, data value, source authority, and novelty against already-complete rows. This prevents valuable URLs from being permanently discarded just because they didn't rank in an arbitrary top-N. The stagnant-batch stop condition (2 consecutive batches with no new complete rows) bounds total work without needing an artificial per-batch limit.
Search steering via
list_rows: after each batch the orchestrator reads the current dataset state and steers subsequent searches toward entity types that are still missing or incomplete, avoiding wasted work on already-complete rows.Other orchestrator improvements:
${currentMonth} ${currentYear}) so queries for "recent" or "current" data use the actual date, not training-data defaultsleadsfield fromextract_rowsreplaces structured JSON arrays, reducing schema brittleness4. Shared rowIndex +
list_rows—tools/investigate-tool.tsbuildExtractToolnow returns{ extractRowsTool, listRowsTool }— both backed by the same in-memoryrowIndexclosure.list_rowstherefore always reflects real-time state (including rows written by concurrentinvestigate_entitysub-calls) without a Convex round-trip. TheBIGSET_POPULATE_TARGET_ROWScap is enforced both in the orchestrator stop condition and as a hard early-exit inextract_rows.5. Post-workflow row pruning —
index.ts,datasetRows.tsAfter the populate workflow finishes, incomplete rows (any column blank) are deleted before the dataset is marked live. The dataset card and table therefore only ever show fully-populated rows. The prune is best-effort — a failure logs a warning but does not block the dataset-ready notification.
6.
POPULATE_TARGET_ROWSenv var —env.tsHard cap on complete rows per populate run (default: 20). Override with
BIGSET_POPULATE_TARGET_ROWS=Nin the root.env.Test plan
make convex-pushto deploydeleteIncompleteandlistInternalto the self-hosted Convex instance:4111) shows all three agent tiers in the workflow trace🤖 Generated with Claude Code