Roll up collection pipeline through PR 52 #52

Draft
giaphutran12 wants to merge 31 commits into main from codex/collection-official-website-sources

@giaphutran12 (Collaborator) commented May 22, 2026

Summary

This is the stable base PR for the collection stack through PR #52. It keeps the head branch codex/collection-official-website-sources intact so Mengzhe can keep working from that branch.

It includes the collection/source-evidence stack up to the URL-field source evidence fix. Later self-healing, process-trace, browser-action, Playwright draft, and demo-path work is intentionally kept out of this PR.

Why this PR exists

Mengzhe is working from codex/collection-official-website-sources. This PR makes that branch reviewable against main without rewriting or force-pushing the branch he based work on.

Evidence from original stacked PR

  • URL-like dataset cells such as official_website, company_website, careers_page_url, product_url, and docs_url count as source evidence.
  • Benchmark rescoring can resolve saved artifact paths across worktrees.
  • Original checks passed on the stacked PR: make verify-self-healing, benchmark tests, node --check, and git diff --check.
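
As an illustration of the first bullet, here is a minimal sketch of how URL-like cells might be collected as source evidence. The field list comes from the examples above; the function names, the exact-match rule on field names, and the restriction to http/https are assumptions, not the pipeline's actual logic.

```typescript
// Hypothetical helper: field names taken from the examples in this PR description.
const URL_FIELD_NAMES = new Set<string>([
  "official_website",
  "company_website",
  "careers_page_url",
  "product_url",
  "docs_url",
]);

// Returns true only for strings that parse as absolute http(s) URLs.
function isHttpUrl(value: unknown): value is string {
  if (typeof value !== "string") return false;
  try {
    const parsed = new URL(value.trim());
    return parsed.protocol === "http:" || parsed.protocol === "https:";
  } catch {
    // new URL() throws on anything that is not an absolute URL.
    return false;
  }
}

// Collects the URLs found in known URL-like fields of one dataset row.
function sourceEvidenceFromRow(row: Record<string, unknown>): string[] {
  const urls: string[] = [];
  for (const [field, value] of Object.entries(row)) {
    if (URL_FIELD_NAMES.has(field) && isHttpUrl(value)) {
      urls.push(value.trim());
    }
  }
  return urls;
}
```

A row with at least one returned URL would then count as having source evidence; cells in other columns are ignored in this sketch even if they happen to contain a URL.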

Notes

No force-push. No raw benchmark artifacts committed. No auto-merge.

Simantak Dabhade and others added 30 commits May 21, 2026 21:07
Introduces the "Clear & Populate" flow: an AI agent (Claude Sonnet 4.6
via OpenRouter) searches the web using TinyFish APIs, fetches page
content, and inserts real data into datasets row by row.

Backend:
- Mastra populate workflow (clear rows → build prompt → run agent)
- Populate agent with 7 tools: 5 database CRUD (insert, list, get,
  update, delete) + 2 web (search_web via TinyFish Search API,
  fetch_page via TinyFish Fetch API)
- All tools return structured errors so the agent can self-correct
- Data keys are sanitized to strip stray quotes/backticks from LLM output
- Fetch responses capped at 15K chars to protect agent context window
- Convex client uses anyApi to avoid cross-project imports in Docker
- POST /populate route with Clerk JWT auth
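
The key sanitization and the fetch cap described above could look roughly like the sketch below. The helper names are hypothetical, and stripping only leading and trailing quote characters is an assumption; the commit does not say where the stray characters appear.

```typescript
// 15K character cap, as stated in the commit message.
const MAX_FETCH_CHARS = 15_000;

// Strips stray quotes/backticks that an LLM may wrap around a column name.
// Assumption: only leading and trailing characters are removed.
function sanitizeKey(key: string): string {
  return key.trim().replace(/^["'`]+|["'`]+$/g, "").trim();
}

// Applies sanitizeKey to every key of a row payload.
// If two keys collapse to the same name, the later one wins.
function sanitizeKeys(data: Record<string, unknown>): Record<string, unknown> {
  const clean: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(data)) {
    clean[sanitizeKey(key)] = value;
  }
  return clean;
}

// Truncates fetched page content so it cannot flood the agent's context window.
function capContent(text: string, maxChars: number = MAX_FETCH_CHARS): string {
  return text.length > maxChars ? text.slice(0, maxChars) : text;
}
```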

Frontend:
- "Clear & Populate" button on dataset detail page
- API client function in lib/backend.ts
- Rows appear in realtime via Convex reactive queries
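
For orientation, a client function of the kind described above might be sketched as follows. Only the POST /populate route and the Clerk JWT auth come from this commit; the function signature, the `datasetId` body field, the Bearer header format, and the response type are assumptions, and the real code in lib/backend.ts may differ.

```typescript
// Assumed response shape; the actual type lives in the frontend code.
type PopulateResult = { result: unknown };

// Minimal fetch-compatible signature so the sketch can be exercised without a network.
type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string },
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;

// Calls the backend populate route with a JWT (hypothetical signature).
async function populateDataset(
  backendUrl: string,
  token: string,
  datasetId: string,
  fetchImpl: FetchLike,
): Promise<PopulateResult> {
  const res = await fetchImpl(`${backendUrl}/populate`, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${token}`,
    },
    body: JSON.stringify({ datasetId }),
  });
  if (!res.ok) {
    throw new Error(`populate failed with status ${res.status}`);
  }
  return (await res.json()) as PopulateResult;
}
```

Injecting `fetchImpl` keeps the sketch testable; in a browser the global `fetch` could be passed in its place.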

Convex:
- New internal functions: datasetRows.get (query) and datasetRows.remove
  (mutation) for single-row read/delete

Infra:
- TINYFISH_API_KEY wired through docker-compose.dev.yml to backend
  and mastra services

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Enforce dataset ownership on POST /populate by querying Convex for
  the dataset and comparing ownerId to req.auth.userId before running
  the workflow (fixes authz gap)
- Remove raw row payloads from insert_row/update_row logs; log the
  column count instead to avoid PII leakage
- Add 30s AbortController timeouts to both TinyFish fetch calls in
  web-tools.ts so they can't hang indefinitely
- Align PopulateResult type (rows → result) to match actual backend
  response shape
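
The timeout and logging changes above can be sketched as follows. This is a generic pattern under assumed names (`withTimeout`, `rowLogFields`), not the actual code in web-tools.ts; only the 30s value and the use of AbortController come from the commit message.

```typescript
// 30s timeout, as stated in the commit message.
const TINYFISH_TIMEOUT_MS = 30_000;

// A request that cooperates with cancellation through an AbortSignal.
type AbortableRequest<T> = (signal: AbortSignal) => Promise<T>;

// Runs the request and aborts its signal if it takes longer than timeoutMs.
async function withTimeout<T>(
  request: AbortableRequest<T>,
  timeoutMs: number = TINYFISH_TIMEOUT_MS,
): Promise<T> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    return await request(controller.signal);
  } finally {
    // Always clear the timer so a finished request does not keep it pending.
    clearTimeout(timer);
  }
}

// Log-safe summary of a row payload: the column count only, never the values.
function rowLogFields(data: Record<string, unknown>): { columnCount: number } {
  return { columnCount: Object.keys(data).length };
}
```

A call such as `withTimeout((signal) => fetch(url, { signal }))` would make a hanging fetch reject once the timer fires, since `fetch` rejects when its signal is aborted.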

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The Convex query for the dataset lookup can throw on invalid IDs.
Wrapping it in the existing try/catch ensures controlled 400 responses
instead of unhandled 500s.
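
A rough sketch of the combined lookup and ownership check is shown below. The 400 on a throwing lookup and the ownerId comparison follow the commit messages; the function name, the injected `lookup` callback, and the 403/404 status codes are assumptions made for illustration.

```typescript
type Dataset = { _id: string; ownerId: string };

type AccessResult =
  | { status: 200; dataset: Dataset }
  | { status: 400 | 403 | 404; error: string };

// Looks up the dataset and verifies ownership before the workflow may run.
// `lookup` stands in for the Convex query, which can throw on malformed IDs.
async function checkPopulateAccess(
  lookup: (datasetId: string) => Promise<Dataset | null>,
  datasetId: string,
  userId: string,
): Promise<AccessResult> {
  try {
    const dataset = await lookup(datasetId);
    if (dataset === null) {
      return { status: 404, error: "Dataset not found" }; // assumed status code
    }
    if (dataset.ownerId !== userId) {
      return { status: 403, error: "Not the dataset owner" }; // assumed status code
    }
    return { status: 200, dataset };
  } catch {
    // A throwing lookup becomes a controlled 400 rather than an unhandled 500.
    return { status: 400, error: "Invalid dataset ID" };
  }
}
```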

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

coderabbitai Bot commented May 22, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: a702b8a8-2368-436d-9bd6-b3354fd1d382

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

@giaphutran12 self-assigned this May 22, 2026
@giaphutran12 changed the title from "Fix collection URL-field source evidence" to "Roll up collection pipeline through PR 52" May 23, 2026
@giaphutran12 changed the base branch from codex/collection-evidence-support to main May 23, 2026 06:04