Skip to content

Three-tier populate agent: triage-extract + investigate subagents#91

Open
MMeteorL wants to merge 12 commits into
mainfrom
mastra_agent_system_update
Open

Three-tier populate agent: triage-extract + investigate subagents#91
MMeteorL wants to merge 12 commits into
mainfrom
mastra_agent_system_update

Conversation

@MMeteorL
Copy link
Copy Markdown
Collaborator

@MMeteorL MMeteorL commented May 25, 2026

Summary

Restructures the AI data-collection pipeline from a two-tier system (orchestrator → investigate) into a three-tier system (orchestrator → triage-extract → investigate), improving data quality, reducing per-agent context load, and creating a structured extension point for future browser/TinyFish agent routing.

Architecture before → after

Before

Populate Orchestrator  (search_web · fetch_page · investigate_row)
    └─▶ Investigate Agent × N   (search + fetch + insert — one entity per call)

After

Populate Orchestrator  (search_web · extract_rows · list_rows)
    └─▶ Triage-Extract Agent × N   (fetch_page · insert_row · update_row_by_key · investigate_entity)
             └─▶ Investigate Agent × M   (search_web · fetch_page · update_row_by_key)

Feature changes

1. Triage-Extract Agent — agents/triage-extract.ts (new)

Replaces the previous multi-URL extract agent. Each call receives exactly one URL. Its first job is triage: after fetching the page it classifies it into one of five statuses before deciding what to do next.

Triage status Meaning
extract_now Readable content with matching entities — proceed to extraction
needs_browser_agent JS-rendered SPA, login wall, or requires real browser interaction
needs_form_fill Search form or user input required before data appears
low_value Accessible but no entities relevant to the dataset
blocked 403 / 404 / paywall / CAPTCHA

Why the triage step matters beyond skipping bad pages: breaking each URL into its own short-lived agent keeps context windows small (one fetch ≈ 15 K chars instead of accumulating across multiple fetches), which was the primary cause of the previous agent hitting its step cap before writing any rows. More importantly, the five statuses create a structured dispatch table for future agent routing — pages classified as needs_browser_agent or needs_form_fill can be handed off to a TinyFish agent or Playwright browser agent in a future iteration without changing either the orchestrator or the investigate layer. The triage result is returned to the orchestrator as a structured field, giving that routing path a clean hook to attach to.

After classifying a page as extract_now, the agent reads the full page first, identifies all matching entities, then batch-inserts all rows, then calls investigate_entity for each row with missing columns — avoiding interleaved read/write step churn.

2. Investigate Agent — agents/investigate.ts (restored + adapted)

Restored to the autonomous single-entity researcher pattern from the main branch. Key adaptation: the row already exists (inserted by the triage-extract agent), so the investigate agent works exclusively with update_row_by_key, targeting specific missing columns. It runs parallel web searches, fetches the most promising pages, and fills the gaps — the same deep-research quality as the original, now triggered precisely when and where data is missing.

3. Populate Orchestrator — agents/populate.ts (updated)

Search-dispatch cadence:

  • First batch: 5 parallel searches → dispatch all qualifying URLs as parallel extract_rows calls → list_rows
  • All subsequent batches: up to 20 parallel searches (combining leads from extract agents with new angles) → dispatch all qualifying URLs → list_rows → repeat

Quality threshold replaces fixed count caps: instead of "select top 5" or "select top 10," the orchestrator dispatches every URL that clears a quality bar — relevance, data value, source authority, and novelty against already-complete rows. This prevents valuable URLs from being permanently discarded just because they didn't rank in an arbitrary top-N. The stagnant-batch stop condition (2 consecutive batches with no new complete rows) bounds total work without needing an artificial per-batch limit.

Search steering via list_rows: after each batch the orchestrator reads the current dataset state and steers subsequent searches toward entity types that are still missing or incomplete, avoiding wasted work on already-complete rows.

Other orchestrator improvements:

  • Live date injection (${currentMonth} ${currentYear}) so queries for "recent" or "current" data use the actual date, not training-data defaults
  • Natural language leads field from extract_rows replaces structured JSON arrays, reducing schema brittleness
  • URL deduplication: the orchestrator tracks every dispatched URL and never re-sends

4. Shared rowIndex + list_rowstools/investigate-tool.ts

buildExtractTool now returns { extractRowsTool, listRowsTool } — both backed by the same in-memory rowIndex closure. list_rows therefore always reflects real-time state (including rows written by concurrent investigate_entity sub-calls) without a Convex round-trip. The BIGSET_POPULATE_TARGET_ROWS cap is enforced both in the orchestrator stop condition and as a hard early-exit in extract_rows.

5. Post-workflow row pruning — index.ts, datasetRows.ts

After the populate workflow finishes, incomplete rows (any column blank) are deleted before the dataset is marked live. The dataset card and table therefore only ever show fully-populated rows. The prune is best-effort — a failure logs a warning but does not block the dataset-ready notification.

6. POPULATE_TARGET_ROWS env var — env.ts

Hard cap on complete rows per populate run (default: 20). Override with BIGSET_POPULATE_TARGET_ROWS=N in the root .env.


Test plan

  • Run make convex-push to deploy deleteIncomplete and listInternal to the self-hosted Convex instance
  • Trigger a populate run via the UI and confirm rows appear in real-time via Convex reactive queries
  • Confirm that incomplete rows are absent from the final dataset table (pruned post-workflow)
  • Verify the Mastra Studio (:4111) shows all three agent tiers in the workflow trace
  • Test with a time-sensitive prompt (e.g. "YC W26 companies") and confirm search queries include the current year/month

🤖 Generated with Claude Code

MMeteorL and others added 2 commits May 25, 2026 10:26
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
First batch: 5 searches → top 5 URLs → 5 parallel extract_rows.
Subsequent batches: up to 20 searches (from leads + new angles) →
top 10 URLs → up to 10 parallel extract_rows → list_rows → repeat.
Mirrors the original investigate_row dispatch discipline but scaled
for the triage-extract architecture.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 25, 2026

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

This PR refactors the dataset population pipeline from a single investigate-row flow into a tri-agent orchestration: a populate orchestrator dispatches URL batches to an extract agent that inserts/updates rows, while a list_rows tool tracks completeness and an investigate_entity tool fills missing columns in parallel. The extract tool suite introduces per-run shared row indexing, confidence-aware merging, and deduplication via a pendingInserts set. Two new Convex mutations provide atomic per-field merging (mergeUpdate) and incomplete-row cleanup (deleteIncomplete). A new model-middleware module repairs double-encoded JSON in tool-call arguments via global JSON.parse patching and per-stream interception. The populate agent now accepts a targetRows cap (default 20), enforced in the extract_rows tool, and the background handler prunes incomplete rows after population completes. Search results optionally include site_name, and comprehensive documentation updates describe the new architecture and merge semantics.

Sequence Diagram(s)

sequenceDiagram
  participant Orchestrator
  participant SearchWeb
  participant ExtractRows
  participant ExtractAgent
  participant RowIndex as RowIndex<br/>(in-memory)
  participant Convex
  participant ListRows
  participant InvestigateEntity
  participant InvestigateAgent
  
  loop Per iteration until done
    Orchestrator->>SearchWeb: parallel searches
    SearchWeb-->>Orchestrator: leads with URLs
    Orchestrator->>ExtractRows: batch promising URLs
    ExtractRows->>RowIndex: refresh from Convex
    ExtractRows->>ExtractAgent: extract from batch
    ExtractAgent->>Convex: batch_insert_rows (new/merges)
    Convex->>RowIndex: sync inserted/merged rows
    Orchestrator->>ListRows: report progress
    ListRows-->>Orchestrator: completed vs incomplete count
    par Investigate incomplete rows
      Orchestrator->>InvestigateEntity: research missing columns
      InvestigateEntity->>RowIndex: get row context
      InvestigateEntity->>InvestigateAgent: fill via search
      InvestigateAgent->>Convex: update_row_by_key (confidence merge)
      Convex->>RowIndex: sync updated cells
    end
  end
  Orchestrator-->>Orchestrator: stop (targetRows reached or stagnant)
Loading

Possibly related PRs

  • tinyfish-io/bigset#72: Both PRs modify the post-populate "dataset-ready" notification path in backend/src/index.ts; main PR adds row pruning and zero-row failure gating while #72 implements email and analytics behavior.
  • tinyfish-io/bigset#26: Main PR directly refactors the populate workflow and tooling introduced in #26, replacing the prior investigate-row flow with the new extract/list/investigate_entity toolset and Convex merge/delete mutations.
  • tinyfish-io/bigset#97: Both PRs update the investigate agent and tooling to rely on primary-key identity and primary-key-aware merge contracts, including primary-key-aware instructions and row write/update behavior.

Suggested reviewers

  • Jaredee123
  • simantak-dabhade
🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly summarizes the main architectural restructuring from a two-tier to a three-tier populate agent system with triage-extract and investigate subagents, which is the primary change across the entire changeset.
Description check ✅ Passed The description is detailed, well-structured, and comprehensively covers the changeset including architecture changes, feature updates, file modifications, and testing steps that align with the raw summary.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch mastra_agent_system_update

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
backend/src/mastra/agents/investigate.ts (1)

10-65: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

This refactor breaks the repo’s investigate-subagent contract.

The instructions and tool wiring now enforce an update_row_by_key-only agent, but this file is still documented to expose the four-tool investigate surface (insert_row, list_rows, search_web, fetch_page) from buildInvestigateAgent(authorizedDatasetId, authContext, columns). If this contract is intentionally changing, the repository guideline and any dependent callers/tests need to move in the same PR; otherwise this agent should keep the required interface.
As per coding guidelines, "Use buildInvestigateAgent(authorizedDatasetId, authContext, columns) factory to build per-entity subagents with captured dataset id for capability scoping" and "Investigate subagent must have exactly 4 tools: insert_row, list_rows, search_web, fetch_page and must return structured feedback (INSERTED/SUMMARY/CLUES/REASON)".

Also applies to: 79-93

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/src/mastra/agents/investigate.ts` around lines 10 - 65, The refactor
changed the investigate agent's behavior to only call update_row_by_key,
breaking the required investigate-subagent contract; restore the original 4-tool
interface and structured feedback by updating buildInvestigateInstructions and
the agent factory to document and expose the four tools (insert_row, list_rows,
search_web, fetch_page) and ensure the agent implementation created by
buildInvestigateAgent(authorizedDatasetId, authContext, columns) uses those
tools (not just update_row_by_key) and returns the exact final structured output
(INSERTED/SUMMARY/CLUES/REASON); if the contract is intentionally changing,
instead update all dependent callers/tests and repository guidelines
consistently to reflect the new single-tool behavior but do not leave this file
inconsistent.
🧹 Nitpick comments (1)
backend/package.json (1)

30-30: Confirm typescript@5.8.3 validity/security; flag non-security compiler regressions

  • typescript@5.8.3 is a published npm version; with ^5.8.3 it will stay on the 5.x line (won’t float to 6.x).
  • No known security vulnerabilities/CVEs are reported for typescript@5.8.3 in common vulnerability feeds.
  • Reported issues for TS 5.8.3 include compiler/typecheck crashes/regressions (reliability, not security); if CI/build failures occur, consider upgrading within the 5.x line.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/package.json` at line 30, The package.json currently specifies
"typescript": "^5.8.3"; confirm this version is acceptable by (1) checking
official npm/CVE/advisory feeds for any reported security vulnerabilities for
typescript@5.8.3 and documenting that none were found, (2) running the project
CI/build and full test/typecheck to detect any compiler/regression issues caused
by 5.8.3, and (3) if build/typecheck failures occur, update package.json to a
newer safe 5.x release (and update lockfile) rather than jumping to 6.x; if
everything passes, leave "^5.8.3" but note in the PR that regressions are a
known non-security risk.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@backend/src/env.ts`:
- Around line 28-31: POPULATE_TARGET_ROWS is being set via
Number(process.env.BIGSET_POPULATE_TARGET_ROWS||"20") without validation, which
can yield NaN or non-positive values; update the initialization so you parse and
validate BIGSET_POPULATE_TARGET_ROWS (use parseInt or Number), ensure
Number.isInteger(value) and value > 0, and if invalid either fallback to the
default 20 or throw a clear error/log (modify the POPULATE_TARGET_ROWS
assignment and add validation logic around
process.env.BIGSET_POPULATE_TARGET_ROWS to enforce a positive integer).

In `@backend/src/mastra/agents/populate.ts`:
- Around line 12-72: The current orchestrator text in
buildOrchestratorInstructions changes the agent surface by referencing
extract_rows and list_rows and delegating fetches/writes; revert it to the
repository contract by using buildPopulateAgent(authorizedDatasetId,
authContext, columns) to construct the orchestrator and ensure the instructions
only reference the three authorized tools: search_web, fetch_page, and
investigate_row (remove references to extract_rows, list_rows, and any
delegation that performs fetches or writes); update any logic or stop-condition
wording to rely on list_rows/fetch_page/investigate_row behavior prescribed by
buildPopulateAgent and preserve the search-only boundary.

In `@backend/src/mastra/tools/investigate-tool.ts`:
- Around line 155-183: The insert path can overshoot targetRows because the
extract_rows check runs elsewhere; inside the execute handler for insert_row
(the async execute with params { primary_key, confidence, sources, data }) add a
guard right before calling convex.mutation(internal.datasetRows.insert, ...)
that checks the current completed count (use rowIndex.size or the same
completed-row counter you maintain) against targetRows and aborts the insert if
the cap is reached (log via logCtx and return a failure/skip result). Ensure you
perform this check before the mutation and only call rowIndex.set(primary_key,
...) after a successful insert; reference: execute, primary_key, targetRows,
rowIndex, convex.mutation(internal.datasetRows.insert), authorizedDatasetId.
- Around line 431-452: The investigate subagent is being spawned with
buildInvestigateAgent and limited to maxSteps: 20; update the launch to use
buildInvestigateTool(authorizedDatasetId, authContext, columns) and call its
generate with maxSteps: 25 instead of 20, then pass the resulting text into
parseInvestigateOutput as before (ensure you're replacing buildInvestigateAgent
and the agent.generate(..., { maxSteps: 20 }) call with the
buildInvestigateTool(...) invocation and generate(..., { maxSteps: 25 }) so the
subagent conforms to the required 25-step budget).

In `@frontend/convex/datasetRows.ts`:
- Around line 199-202: The completeness check in isComplete currently tests
every name in args.columnNames against data, but the contract excludes fields
starting with '_' from being required; update the logic in the isComplete
computation to first filter args.columnNames to exclude any column name that
starts with '_' (e.g., col.startsWith('_')) and then run the every(...) check
against data so underscore-prefixed fields are never considered required when
determining completeness.

---

Outside diff comments:
In `@backend/src/mastra/agents/investigate.ts`:
- Around line 10-65: The refactor changed the investigate agent's behavior to
only call update_row_by_key, breaking the required investigate-subagent
contract; restore the original 4-tool interface and structured feedback by
updating buildInvestigateInstructions and the agent factory to document and
expose the four tools (insert_row, list_rows, search_web, fetch_page) and ensure
the agent implementation created by buildInvestigateAgent(authorizedDatasetId,
authContext, columns) uses those tools (not just update_row_by_key) and returns
the exact final structured output (INSERTED/SUMMARY/CLUES/REASON); if the
contract is intentionally changing, instead update all dependent callers/tests
and repository guidelines consistently to reflect the new single-tool behavior
but do not leave this file inconsistent.

---

Nitpick comments:
In `@backend/package.json`:
- Line 30: The package.json currently specifies "typescript": "^5.8.3"; confirm
this version is acceptable by (1) checking official npm/CVE/advisory feeds for
any reported security vulnerabilities for typescript@5.8.3 and documenting that
none were found, (2) running the project CI/build and full test/typecheck to
detect any compiler/regression issues caused by 5.8.3, and (3) if
build/typecheck failures occur, update package.json to a newer safe 5.x release
(and update lockfile) rather than jumping to 6.x; if everything passes, leave
"^5.8.3" but note in the PR that regressions are a known non-security risk.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 32a5a17c-21f2-4a1e-a10e-f543d896dc3b

📥 Commits

Reviewing files that changed from the base of the PR and between 1e9cc8b and 4fac5bc.

⛔ Files ignored due to path filters (1)
  • backend/package-lock.json is excluded by !**/package-lock.json
📒 Files selected for processing (10)
  • backend/package.json
  • backend/src/env.ts
  • backend/src/index.ts
  • backend/src/mastra/agents/investigate.ts
  • backend/src/mastra/agents/populate.ts
  • backend/src/mastra/agents/triage-extract.ts
  • backend/src/mastra/tools/investigate-tool.ts
  • backend/src/mastra/tools/web-tools.ts
  • backend/src/mastra/workflows/populate.ts
  • frontend/convex/datasetRows.ts

Comment thread backend/src/env.ts Outdated
Comment thread backend/src/mastra/agents/populate.ts Outdated
Comment thread backend/src/mastra/tools/investigate-tool.ts Outdated
Comment thread backend/src/mastra/tools/investigate-tool.ts
Comment thread backend/src/mastra/tools/investigate-tool.ts Outdated
Comment thread frontend/convex/datasetRows.ts
Remove the hard top-5 / top-10 count limits per batch. Instead, dispatch
every URL that clears a quality bar (relevance, data value, source
authority, novelty against already-complete rows). Consolidate the
redundant second/subsequent batch sections into a single loop. Steer
searches using list_rows output to avoid re-searching complete entities.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
backend/src/mastra/agents/populate.ts (1)

91-98: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Harden targetRows parsing to prevent disabled caps on invalid env values.

Number(process.env.BIGSET_POPULATE_TARGET_ROWS || "20") can yield NaN/0/negative values, which can break cap/stop behavior and allow runaway extraction.

💡 Minimal fix
+const DEFAULT_TARGET_ROWS = (() => {
+  const parsed = Number(process.env.BIGSET_POPULATE_TARGET_ROWS);
+  return Number.isFinite(parsed) && parsed > 0 ? Math.floor(parsed) : 20;
+})();
+
 export function buildPopulateAgent(
   authorizedDatasetId: string,
   authContext: AuthContext,
   columns: PopulateColumn[],
-  targetRows: number = Number(process.env.BIGSET_POPULATE_TARGET_ROWS || "20"),
+  targetRows: number = DEFAULT_TARGET_ROWS,
 ): Agent {
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/src/mastra/agents/populate.ts` around lines 91 - 98, The targetRows
env parsing is unsafe and can produce NaN/0/negative values which disable
extraction caps; update the parsing for targetRows (the variable passed into
buildExtractTool) to robustly parse and validate the environment value (use
parseInt/Number, ensure it's a finite positive integer, floor it) and fall back
to the default 20 when the parsed value is invalid or <= 0; optionally clamp to
a sensible maximum before passing into buildExtractTool to avoid runaway
extraction.
♻️ Duplicate comments (1)
backend/src/mastra/agents/populate.ts (1)

31-56: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Restore the orchestrator’s required tool surface (search_web, fetch_page, investigate_row).

This change moves orchestration to extract_rows/list_rows, which violates the required populate-agent contract and removes the mandated no-write boundary at the orchestrator layer.

As per coding guidelines, "Populate orchestrator agent must have exactly 3 tools: search_web, fetch_page, investigate_row with no write access".

Also applies to: 105-109

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/src/mastra/agents/populate.ts` around lines 31 - 56, The orchestrator
removed the mandated tool surface and must be restored to expose exactly the
three read-only tools: search_web, fetch_page, and investigate_row and must not
perform any write operations itself; revert or refactor the current
orchestration flow that moved responsibilities into extract_rows/list_rows so
that populate (or the populate-agent orchestrator function) only calls
search_web to get URL candidates, calls fetch_page to retrieve page content, and
calls investigate_row to analyze content and emit leads, and ensure
extract_rows/list_rows remain downstream workers (or are called by
investigate_row) rather than replacing the orchestrator tools; locate usages of
extract_rows, list_rows, and the populate-orchestrator function and change them
so the orchestrator exposes and calls the three tool functions (search_web,
fetch_page, investigate_row) and contains no writes.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@backend/src/mastra/agents/populate.ts`:
- Around line 91-98: The targetRows env parsing is unsafe and can produce
NaN/0/negative values which disable extraction caps; update the parsing for
targetRows (the variable passed into buildExtractTool) to robustly parse and
validate the environment value (use parseInt/Number, ensure it's a finite
positive integer, floor it) and fall back to the default 20 when the parsed
value is invalid or <= 0; optionally clamp to a sensible maximum before passing
into buildExtractTool to avoid runaway extraction.

---

Duplicate comments:
In `@backend/src/mastra/agents/populate.ts`:
- Around line 31-56: The orchestrator removed the mandated tool surface and must
be restored to expose exactly the three read-only tools: search_web, fetch_page,
and investigate_row and must not perform any write operations itself; revert or
refactor the current orchestration flow that moved responsibilities into
extract_rows/list_rows so that populate (or the populate-agent orchestrator
function) only calls search_web to get URL candidates, calls fetch_page to
retrieve page content, and calls investigate_row to analyze content and emit
leads, and ensure extract_rows/list_rows remain downstream workers (or are
called by investigate_row) rather than replacing the orchestrator tools; locate
usages of extract_rows, list_rows, and the populate-orchestrator function and
change them so the orchestrator exposes and calls the three tool functions
(search_web, fetch_page, investigate_row) and contains no writes.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: b0a10655-2fca-4ffc-82f7-b3c3f0950c14

📥 Commits

Reviewing files that changed from the base of the PR and between 4fac5bc and 792737c.

📒 Files selected for processing (1)
  • backend/src/mastra/agents/populate.ts

MMeteorL and others added 2 commits May 25, 2026 13:00
- env.ts: validate BIGSET_POPULATE_TARGET_ROWS — NaN, zero, and
  negative values now fall back to the default of 20 instead of
  silently breaking the agent's stop condition.

- datasetRows.mergeUpdate (new Convex internalMutation): atomic
  per-field blank-aware merge. Blank cells are always filled with any
  non-empty incoming value regardless of confidence; non-blank cells are
  only overwritten when newConfidence > existing row confidence.
  The authoritative check lives in Convex so two concurrent investigate
  agents can't both pass a stale client-side confidence check and race
  to write — the compare-and-merge is serialized inside a single
  transaction. Quota is charged only on actual changes; history entries
  are recorded per changed field.

- investigate-tool.ts / buildUpdateRowByKeyTool: drop the row-wide
  confidence pre-check (which blocked filling blank columns on partial
  high-confidence rows) and call mergeUpdate instead of update. The
  local rowIndex is updated optimistically using the same per-field
  rules so subsequent calls within the run stay consistent without an
  extra Convex round-trip.

- datasetRows.deleteIncomplete: skip _-prefixed keys (_confidence,
  _sources, etc.) in the completeness check — they are internal fields
  that are never required columns.

- backend/CLAUDE.md: rewrite Mastra section to reflect the tri-agent
  architecture and new confidence/merge semantics.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
One textual conflict in investigate-tool.ts: commit 1e9cc8b on main
("Unify local env loading") fixed a redundant `?? false` in the old
parseInvestigateResult return block. Our branch had already replaced
that entire function with parseTriageExtractOutput and
parseInvestigateOutput. Git's heuristic matched the `return { ... }`
structure across the two different functions and flagged a false
conflict. Resolution: keep HEAD (our new parsers are correct; the old
function no longer exists).

All other files merged cleanly:
- backend/src/env.ts: main's dotenv path consolidation + our
  POPULATE_TARGET_ROWS validation both land without conflict.
- frontend/convex/datasetRows.ts: main's rowCount denormalized counter
  (insert/remove/clearByDataset) merges alongside our mergeUpdate
  mutation and deleteIncomplete fix.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
PR #85 ("Make populate status truthful") refactored the /populate route
from synchronous (awaiting the workflow inline) to async: it now claims
the dataset, creates a run, fires runPopulateWorkflowInBackground() as
a background void promise, and returns 202 immediately.

Our branch still had the old synchronous pattern with the post-workflow
notification and status-update logic inline in the route handler —
including our deleteIncomplete pruning addition. The conflict was the
entire old inline block vs main's single 202 return.

Resolution:
- Take main's 202 return — the async pattern is strictly better.
- Move the deleteIncomplete pruning into runPopulateWorkflowInBackground()
  (already introduced by main), placed before the countByDataset call so
  we count only complete rows when deciding whether to set status "live".

All other files (populate.ts workflow, frontend status/schema changes)
merged cleanly.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
backend/src/env.ts (1)

37-40: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fix POPULATE_TARGET_ROWS validation to reject non-integers before flooring.

Current logic accepts fractional values and can produce 0 (e.g., 0.5 -> 0), which can prematurely terminate populate runs. It also contradicts the comment that non-integers fallback to 20.

💡 Suggested patch
   POPULATE_TARGET_ROWS: (() => {
     const parsed = Number(process.env.BIGSET_POPULATE_TARGET_ROWS);
-    return Number.isFinite(parsed) && parsed > 0 ? Math.floor(parsed) : 20;
+    return Number.isInteger(parsed) && parsed > 0 ? parsed : 20;
   })(),
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/src/env.ts` around lines 37 - 40, POPULATE_TARGET_ROWS currently
accepts fractional numbers because it checks Number.isFinite(parsed) then
floors, which lets values like 0.5 become 0; change the validation so only
integer positive values are accepted (use Number.isInteger(parsed) && parsed >
0) and return parsed as-is (or Math.floor only if you want to coerce negatives
handled) otherwise fall back to 20; update the initialization for
POPULATE_TARGET_ROWS in env.ts to perform Number.isInteger(parsed) && parsed > 0
before using the value.
frontend/convex/datasetRows.ts (1)

334-360: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

deleteIncomplete does not update the cached rowCount, causing stale dashboard counts.

insert, remove, and clearByDataset all maintain the denormalized rowCount field on the dataset document. However, deleteIncomplete calls ctx.db.delete(row._id) directly without decrementing the counter. After pruning, the dashboard will display a higher row count than actually exists until the next mutation that self-heals.

🐛 Proposed fix: update rowCount after deleting rows
 export const deleteIncomplete = internalMutation({
   args: {
     datasetId: v.id("datasets"),
     columnNames: v.array(v.string()),
   },
   handler: async (ctx, args) => {
     const rows = await ctx.db
       .query("datasetRows")
       .withIndex("by_dataset", (q) => q.eq("datasetId", args.datasetId))
       .collect();

     let deletedCount = 0;
     for (const row of rows) {
       const data = row.data as Record<string, unknown>;
       const isComplete = args.columnNames.every((col) => {
         if (col.startsWith("_")) return true; // skip internal fields (_confidence, _sources, etc.)
         const val = data[col];
         return val !== null && val !== undefined && val !== "";
       });
       if (!isComplete) {
         await ctx.db.delete(row._id);
         deletedCount++;
       }
     }
+
+    // Update the cached rowCount to stay in sync with actual rows.
+    if (deletedCount > 0) {
+      const dataset = await ctx.db.get(args.datasetId);
+      if (dataset) {
+        const newCount =
+          typeof dataset.rowCount === "number"
+            ? Math.max(0, dataset.rowCount - deletedCount)
+            : rows.length - deletedCount; // self-heal if rowCount was missing
+        await ctx.db.patch(args.datasetId, { rowCount: newCount });
+      }
+    }
+
     return { deletedCount };
   },
 });
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@frontend/convex/datasetRows.ts` around lines 334 - 360, The deleteIncomplete
mutation removes rows but never updates the dataset's denormalized rowCount,
leaving stale counts; modify deleteIncomplete to mirror the behavior of
insert/remove/clearByDataset by decrementing the dataset document's rowCount
after deletions (use the same DB update/patch API used by insert/remove to
subtract deletedCount for args.datasetId), or update the rowCount incrementally
as you delete each row so the dataset's rowCount stays consistent with the
actual rows.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@backend/src/env.ts`:
- Around line 37-40: POPULATE_TARGET_ROWS currently accepts fractional numbers
because it checks Number.isFinite(parsed) then floors, which lets values like
0.5 become 0; change the validation so only integer positive values are accepted
(use Number.isInteger(parsed) && parsed > 0) and return parsed as-is (or
Math.floor only if you want to coerce negatives handled) otherwise fall back to
20; update the initialization for POPULATE_TARGET_ROWS in env.ts to perform
Number.isInteger(parsed) && parsed > 0 before using the value.

In `@frontend/convex/datasetRows.ts`:
- Around line 334-360: The deleteIncomplete mutation removes rows but never
updates the dataset's denormalized rowCount, leaving stale counts; modify
deleteIncomplete to mirror the behavior of insert/remove/clearByDataset by
decrementing the dataset document's rowCount after deletions (use the same DB
update/patch API used by insert/remove to subtract deletedCount for
args.datasetId), or update the rowCount incrementally as you delete each row so
the dataset's rowCount stays consistent with the actual rows.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: fa06ba5d-6848-4387-adb5-d3e8897a9229

📥 Commits

Reviewing files that changed from the base of the PR and between befd505 and 6d9fd2d.

📒 Files selected for processing (5)
  • backend/package.json
  • backend/src/env.ts
  • backend/src/index.ts
  • backend/src/mastra/workflows/populate.ts
  • frontend/convex/datasetRows.ts
✅ Files skipped from review due to trivial changes (1)
  • backend/package.json

…igation

Replace the triage-extract agent with a leaner extract agent, move
investigate_entity to the orchestrator tier, and add pendingInserts
dedup for concurrent extract agents.

Architecture changes:
- triage-extract.ts → extract.ts (buildExtractAgent): no triage step, no
  investigate spawning; receives a batch of 1–5 URLs, fetches all in
  parallel, calls batch_insert_rows once with all entities combined, returns
  LEADS/SOURCE_QUALITY to the orchestrator
- Orchestrator now owns investigate_entity: after all parallel extract_rows
  calls finish, it calls list_rows then emits all investigate_entity calls
  simultaneously for every incomplete row
- buildExtractTool now returns { extractRowsTool, listRowsTool,
  investigateEntityTool } — the orchestrator receives all three
- pendingInserts Set added to closure: prevents two concurrent extract agents
  from double-inserting the same primary key (JS event-loop atomicity
  makes the check+add race-safe without Convex schema changes)
- extract_rows input accepts 1–5 URLs per call (was exactly 1)
- buildInsertRowTool removed (dead code)
- BIGSET_POPULATE_TARGET_ROWS env var wired through env.ts → workflow →
  buildPopulateAgent → buildExtractTool
- patchMastraSanitizeToolCallInput called at startup in index.ts

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (2)
backend/src/mastra/tools/model-middleware.ts (2)

57-88: 💤 Low value

Consider removing or reducing the verbosity of debug logging.

Lines 62 and 74 emit console.log for every stream and every tool-call chunk. In production with many concurrent agent runs, this will generate significant log noise. Consider removing line 62 entirely (it fires on every stream) and gating the log on line 74 behind a debug flag or only logging when repair actually occurs.

♻️ Proposed cleanup
 export function withToolCallRepair(model: any): any {
   return wrapLanguageModel({
     model,
     middleware: {
       wrapStream: async ({ doStream }: any) => {
-        console.log("[model-middleware] wrapStream called");
         const result = await doStream();
         const { stream, ...rest } = result;

         const fixedStream = stream.pipeThrough(
           new TransformStream({
             transform(chunk: any, controller: any) {
               if (
                 chunk != null &&
                 chunk.type === "tool-call" &&
                 typeof chunk.input === "string"
               ) {
-                console.log(`[model-middleware] tool-call chunk: ${chunk.toolName} input starts with: ${chunk.input.slice(0, 30)}`);
                 const fixedInput = tryUnwrapDoubleEncodedInput(chunk.input);
+                if (fixedInput !== chunk.input) {
+                  console.log(`[model-middleware] Repaired tool-call input for ${chunk.toolName}`);
+                }
                 controller.enqueue({ ...chunk, input: fixedInput });
               } else {
                 controller.enqueue(chunk);
               }
             },
           }),
         );
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/src/mastra/tools/model-middleware.ts` around lines 57 - 88, Remove
the noisy unconditional logging in withToolCallRepair: delete the console.log
inside wrapStream that logs every stream, and change the console.log inside the
TransformStream transform to only run when a repair actually occurs or when a
debug flag is enabled; use the existing tryUnwrapDoubleEncodedInput result to
determine if a repair happened (i.e., only log when fixedInput differs from
chunk.input) or check a runtime debug flag before logging, keeping the rest of
the wrapStream/TransformStream logic intact.

103-146: ⚖️ Poor tradeoff

Global JSON.parse monkey-patching is risky and duplicates logic.

Patching the global JSON.parse affects every JSON operation in the application—Convex client, Fastify body parsing, all libraries. While the patch is defensive (only activates on parse failure with "-prefixed strings), this pattern makes debugging harder and could have subtle unintended effects.

Additionally, lines 123–136 duplicate the extraction logic already in tryUnwrapDoubleEncodedInput. Reuse the existing helper to maintain a single source of truth.

♻️ Proposed refactor to reuse helper
 export async function patchMastraSanitizeToolCallInput(): Promise<void> {
   try {
     const originalJsonParse = JSON.parse;
     (JSON as any).parse = function patchedJsonParse(text: string, ...rest: any[]) {
       try {
         return originalJsonParse.call(this, text, ...rest);
       } catch (err) {
-        // If JSON.parse fails on a string that starts with `"`, try the
-        // double-encoding recovery: extract the JSON object between the
-        // first `{` and last `}`.
-        if (typeof text === "string" && text.startsWith('"')) {
-          const firstBrace = text.indexOf("{");
-          const lastBrace = text.lastIndexOf("}");
-          if (firstBrace !== -1 && lastBrace > firstBrace) {
-            const candidate = text.slice(firstBrace, lastBrace + 1);
-            try {
-              const recovered = originalJsonParse.call(this, candidate, ...rest);
-              console.log(
-                `[model-middleware/patch] Recovered double-encoded JSON (${candidate.length} chars): ${candidate.slice(0, 60)}...`,
-              );
-              return recovered;
-            } catch {
-              // Recovery also failed — re-throw the original error
-            }
-          }
+        if (typeof text === "string") {
+          const repaired = tryUnwrapDoubleEncodedInput(text);
+          if (repaired !== text) {
+            return originalJsonParse.call(this, repaired, ...rest);
+          }
         }
         throw err;
       }
     };
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/src/mastra/tools/model-middleware.ts` around lines 103 - 146,
patchMastraSanitizeToolCallInput currently monkey-patches global JSON.parse
(risking wide side-effects) and reimplements double-encoding extraction logic;
instead, remove the global monkey-patch and change
patchMastraSanitizeToolCallInput to call the existing
tryUnwrapDoubleEncodedInput helper when a parse fails in the local Mastra scope
(i.e., detect failure in the context where Mastra processes tool calls), use
originalJsonParse for the primary parse attempt, and on failure pass the input
to tryUnwrapDoubleEncodedInput to attempt recovery and logging; keep references
to originalJsonParse and ensure errors are re-thrown if recovery fails.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@backend/src/mastra/tools/investigate-tool.ts`:
- Around line 236-243: Remove the client-side short-circuit that skips updating
when confidence <= existingEntry.confidence inside the loop in investigate-tool
(the block that pushes skipped.push(primary_key)); instead always call the
server-side atomic mergeUpdate (the function/mutation named mergeUpdate) for
existingEntry cases so blank-field fills and atomic per-field merge rules are
enforced by the Convex mutation; similarly remove the duplicate short-circuit at
the later check (the other confidence comparison around lines referenced) so all
decisioning happens server-side.
- Around line 795-803: The code currently only updates rowIndex when
convexConfidence > existingEntry.confidence, which ignores equal-confidence
updates that may fill blanks; in extract_rows ensure rowIndex is refreshed from
Convex at the start of the function to pick up parallel-agent writes, and change
the update condition inside the loop (where convexConfidence, existingEntry,
rowIndex.set are used) to treat equal-confidence as an update (use >= or
explicitly check for equality and compare cells) so you still overwrite/merge
cells when convexConfidence === existingEntry.confidence.
- Around line 215-227: The intra-batch dedup currently drops later rows when
seenInBatch.has(primary_key) is true; instead, change the logic in
investigate-tool.ts around the loop over rows so duplicates are merged into the
first-seen record: when encountering a row with an existing primary_key, do not
push to skipped, but merge its non-empty fields into the stored record (e.g.,
for scalar fields prefer the non-empty value or choose the max confidence),
merge array/collection fields like sources by deduplicating and concatenating,
and deep-merge the data object fields preferring non-null values; update the
stored entry referenced by primary_key (or maintain a map keyed by primary_key)
so the final collection contains merged entries for persistence.

---

Nitpick comments:
In `@backend/src/mastra/tools/model-middleware.ts`:
- Around line 57-88: Remove the noisy unconditional logging in
withToolCallRepair: delete the console.log inside wrapStream that logs every
stream, and change the console.log inside the TransformStream transform to only
run when a repair actually occurs or when a debug flag is enabled; use the
existing tryUnwrapDoubleEncodedInput result to determine if a repair happened
(i.e., only log when fixedInput differs from chunk.input) or check a runtime
debug flag before logging, keeping the rest of the wrapStream/TransformStream
logic intact.
- Around line 103-146: patchMastraSanitizeToolCallInput currently monkey-patches
global JSON.parse (risking wide side-effects) and reimplements double-encoding
extraction logic; instead, remove the global monkey-patch and change
patchMastraSanitizeToolCallInput to call the existing
tryUnwrapDoubleEncodedInput helper when a parse fails in the local Mastra scope
(i.e., detect failure in the context where Mastra processes tool calls), use
originalJsonParse for the primary parse attempt, and on failure pass the input
to tryUnwrapDoubleEncodedInput to attempt recovery and logging; keep references
to originalJsonParse and ensure errors are re-thrown if recovery fails.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: f7506fc9-a943-4524-a4dc-9ae3c32de468

📥 Commits

Reviewing files that changed from the base of the PR and between 6d9fd2d and bbbb67b.

📒 Files selected for processing (8)
  • .env.example
  • backend/CLAUDE.md
  • backend/src/index.ts
  • backend/src/mastra/agents/extract.ts
  • backend/src/mastra/agents/populate.ts
  • backend/src/mastra/tools/investigate-tool.ts
  • backend/src/mastra/tools/model-middleware.ts
  • backend/src/mastra/workflows/populate.ts
✅ Files skipped from review due to trivial changes (1)
  • backend/CLAUDE.md

Comment thread backend/src/mastra/tools/investigate-tool.ts
Comment on lines +236 to +243
if (existingEntry) {
// ── Update path: row already exists ────────────────────────────────
if (confidence <= existingEntry.confidence) {
// Equal or higher confidence already on record — nothing to do.
skipped.push(primary_key);
continue;
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Remove the local confidence short-circuit before calling mergeUpdate.

Skipping on cached confidence <= existingEntry.confidence prevents valid blank-field fills that the atomic server merge would accept, and makes decisions from stale in-memory state.

Suggested fix
-          if (confidence <= existingEntry.confidence) {
-            // Equal or higher confidence already on record — nothing to do.
-            skipped.push(primary_key);
-            continue;
-          }
-
           console.log(
             `[batch_insert_rows] ${logCtx} pk="${primary_key}" updating ` +
               `(confidence ${existingEntry.confidence.toFixed(2)}→${confidence.toFixed(2)})`,
           );
           try {
-            await convex.mutation(internal.datasetRows.mergeUpdate, {
+            const result = await convex.mutation(internal.datasetRows.mergeUpdate, {
               id: existingEntry.rowId as any,
               expectedDatasetId: authorizedDatasetId,
               newData: cleanedData,
               newConfidence: confidence,
               newSources: sources,
             });
+            if (!result.merged) {
+              skipped.push(primary_key);
+              continue;
+            }

As per coding guidelines, "Enforce per-field blank-aware merge rules atomically in Convex mutations, not in client-side tool code, to prevent race conditions during parallel agent runs".

Also applies to: 249-255

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/src/mastra/tools/investigate-tool.ts` around lines 236 - 243, Remove
the client-side short-circuit that skips updating when confidence <=
existingEntry.confidence inside the loop in investigate-tool (the block that
pushes skipped.push(primary_key)); instead always call the server-side atomic
mergeUpdate (the function/mutation named mergeUpdate) for existingEntry cases so
blank-field fills and atomic per-field merge rules are enforced by the Convex
mutation; similarly remove the duplicate short-circuit at the later check (the
other confidence comparison around lines referenced) so all decisioning happens
server-side.

Comment thread backend/src/mastra/tools/investigate-tool.ts
MMeteorL and others added 5 commits May 27, 2026 01:37
… call

- All three agents (orchestrator, extract, investigate) now use
  deepseek/deepseek-v4-pro via OpenRouter
- extract_rows input reverted to exactly 1 URL (max(1)) — the orchestrator
  dispatches one parallel extract_rows call per URL; no batching needed
  since the orchestrator handles investigate_entity directly
- Extract agent instructions updated accordingly (single fetch, single page)
- Orchestrator instructions updated: "one URL per call" instead of batches

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…iterations

- maxSteps: 40 → 20 for extract agents — prevents paginated browse pages
  (metacritic /browse/) from spiraling to 38-40 steps (~10 min per call);
  they now exit at 20 steps and return LEADS for the next iteration instead
- Replace full buildExistingRowsText() dump with a compact count + 30-key
  sample in each extract agent prompt — row dedup is handled by the tool
  layer (rowIndex + pendingInserts), not by the agent, so the full 300-row
  dump was wasted context that inflated model processing time
- Add URL QUALITY guidance to orchestrator Phase 2 instructions: prefer
  single-page editorial/list sources (Wikipedia, "best of", rankings);
  avoid paginated browse/catalog URLs (/browse/, ?page=, ?sort=, etc.)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ise orchestrator steps

Extract agent (agents/extract.ts):
- Hard budget of 2 tool calls: exactly 1 fetch_page + 1 batch_insert_rows
- Explicit: no pagination, no following links — add pagination URLs to LEADS instead
- HARD BUDGET section at top of instructions so the constraint is unmissable

Investigate agent (agents/investigate.ts):
- Remove second search round ("if first round didn't fill everything, search more")
- Procedure is now: 1-2 parallel searches → 1-2 fetches → 1 update_row_by_key → done
- Borrowed from main branch: shorter, more decisive, no retry loop

investigate-tool.ts:
- Extract agent maxSteps: 20 → 5 (2 tool calls + 3 buffer)
- Investigate agent maxSteps: 20 → 8 (search + fetch + update + buffer)

Orchestrator (agents/populate.ts):
- Phase 2: hard cap of ceil(targetRows/4) extract_rows calls per iteration
  (was soft "up to" guidance — LLM was ignoring it)
- Phase 4: hard batch of 20 investigate_entity calls per iteration, emitted in
  a single parallel response, prioritised by fewest missing columns
- Both caps are called out as HARD LIMITS in the instructions and the RULES block

Workflow (workflows/populate.ts):
- Orchestrator maxSteps: 80 → 150 (headroom for 4+ iterations at ~31 steps each)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…eration counter

Root cause (from run logs): the orchestrator dispatched 13 extract_rows calls
(hard cap = 5) and the metacritic extract agent made 5 fetch_page calls using
all its maxSteps budget before ever calling batch_insert_rows. Both limits were
soft LLM instructions — the model ignored them.

Single-use fetch_page wrapper (investigate-tool.ts + web-tools.ts + extract.ts):
- Extract executeFetchPage() logic into a shared export in web-tools.ts
- buildExtractAgent() now receives fetchTool as a parameter (like batchInsertRowsTool)
- In extract_rows.execute, create a per-call onceFetchTool that wraps executeFetchPage
  with a fetchUsed boolean: the second call returns a hard error message telling
  the agent to call batch_insert_rows immediately with what it already has
- This makes the "one fetch per extract agent" limit physically unbypassable

Per-iteration extract counter (investigate-tool.ts):
- Add iterationExtractCount and MAX_EXTRACT_PER_ITER = ceil(targetRows/4) to closure
- Synchronous check+increment before the first await is atomic in JS's event loop
  (same pattern as pendingInserts) — safe under parallel extract_rows calls
- Calls beyond the cap return immediately with the URL as a LEAD for next iteration
- list_rows.execute resets the counter (list_rows is called once at Phase 2→3 boundary)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…epseek-v4-pro)

Delete backend/src/mastra/tools/model-middleware.ts and remove its startup
call from index.ts. The file contained two kimi-k2-specific workarounds:

- withToolCallRepair (dead code — no agent ever used it)
- patchMastraSanitizeToolCallInput (patched global JSON.parse to recover
  double-encoded tool-call arguments that kimi-k2 emitted)

deepseek/deepseek-v4-pro does not exhibit the double-encoding bug so both
repairs are unnecessary. Removing them eliminates the monkey-patch on the
global JSON.parse and the now-spurious TypeScript error on the wrapStream
middleware type.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants