Improve scientific RAG citations by Yongzie · Pull Request #9 · aietal/aimengpt

Yongzie · 2026-05-14T14:46:17Z

ISAAC-497 PR Report
Summary
This patch improves the existing uploaded-document RAG path so it behaves more like a scientific/research workflow:

stores stronger citation metadata for uploaded PDF chunks
formats retrieved sources with stable bracketed source IDs like [S1]
includes page, DOI, year, and retrieval distance when available
makes retrieval depth configurable through RAG_TOP_K or request body nResults
updates the RAG prompt so factual claims must cite retrieved scientific sources
adds request validation for document upload and retrieval API routes
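The source formatting described above can be illustrated with a minimal sketch. It is not the code from this PR: the `RetrievedChunk` shape, its field names, and the `formatRetrievedSources` helper are all hypothetical, chosen only to show how bracketed IDs like [S1] can be derived from retrieval order and how optional metadata is included only when available.

```typescript
// Hypothetical shape of one retrieved chunk; the field names are assumptions,
// not the actual metadata schema used in this PR.
interface RetrievedChunk {
  text: string;
  title?: string;
  page?: number;
  doi?: string;
  year?: number;
  distance?: number; // retrieval distance reported by the vector store
}

// Build a context string in which each chunk gets an ID ([S1], [S2], ...)
// derived from its position in the retrieval result.
function formatRetrievedSources(chunks: RetrievedChunk[]): string {
  return chunks
    .map((chunk, index) => {
      const details: string[] = [];
      // Only include metadata fields that are actually present.
      if (chunk.title) details.push(chunk.title);
      if (chunk.page !== undefined) details.push(`p. ${chunk.page}`);
      if (chunk.year !== undefined) details.push(String(chunk.year));
      if (chunk.doi) details.push(`doi:${chunk.doi}`);
      if (chunk.distance !== undefined) {
        details.push(`distance ${chunk.distance.toFixed(3)}`);
      }
      const id = `[S${index + 1}]`;
      const header = details.length > 0 ? `${id} ${details.join(", ")}` : id;
      return `${header}\n${chunk.text.trim()}`;
    })
    .join("\n\n");
}
```

Because the IDs follow retrieval order, the same ordered result set always yields the same labels, which is what lets the prompt ask the model to cite them.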
Files Changed
ui/utils/server/scientific-rag.ts
ui/pages/api/inject-documents.ts
ui/pages/api/fetch-documents.ts
ui/pages/api/rag-chat.ts
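As an illustration of the kind of citation metadata extraction attributed to ui/utils/server/scientific-rag.ts, here is a minimal sketch. The `extractCitationMetadata` name, the `CitationMetadata` shape, and both regular expressions are assumptions made for this example and may differ from what the PR actually implements.

```typescript
// Hypothetical result shape; the PR may store more fields than these two.
interface CitationMetadata {
  doi?: string;
  year?: number;
}

// Best-effort extraction of a DOI and a four-digit publication year from text.
// Both patterns are heuristics and can miss or misread unusual citations.
function extractCitationMetadata(text: string): CitationMetadata {
  const metadata: CitationMetadata = {};

  // DOIs start with "10.", a 4 to 9 digit registrant code, then "/" and a suffix.
  const doiMatch = text.match(/\b10\.\d{4,9}\/[^\s"<>]+/);
  if (doiMatch) {
    // Drop trailing punctuation that usually belongs to the sentence, not the DOI.
    metadata.doi = doiMatch[0].replace(/[.,;:)\]]+$/, "");
  }

  // Search for the year outside the DOI so digits inside it are not misread.
  const remainder = doiMatch ? text.replace(doiMatch[0], " ") : text;
  const yearMatch = remainder.match(/\b(?:19|20)\d{2}\b/);
  if (yearMatch) {
    metadata.year = Number(yearMatch[0]);
  }

  return metadata;
}
```

Fields that cannot be found are simply left undefined, which matches the "when available" wording used for DOI and year in this report.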
Validation
I could not run the full Next.js build in this workspace because dependencies are not installed. Recommended maintainer validation:

cd ui
npm install
npm run lint
npm run build
Manual validation path:

Start Chroma and the UI with the existing Docker flow.
Upload a research PDF.
Ask a question that requires evidence from the PDF.
Confirm the answer cites source IDs like [S1].
Confirm the retrieved context includes page metadata and DOI/year when available.
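The citation check in the manual steps above could also be partly automated. The following is a minimal sketch, not part of this PR: `extractCitedSourceIds` and `findInvalidCitations` are hypothetical helpers that assume source IDs are numbered from 1 up to the number of retrieved chunks.

```typescript
// Collect the distinct source numbers cited in an answer, e.g. "[S2]" -> 2.
function extractCitedSourceIds(answer: string): number[] {
  const ids: number[] = [];
  const pattern = /\[S(\d+)\]/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(answer)) !== null) {
    const id = Number(match[1]);
    if (ids.indexOf(id) === -1) {
      ids.push(id);
    }
  }
  return ids.sort((a, b) => a - b);
}

// Return cited IDs that do not correspond to any retrieved source,
// assuming sources are labeled [S1] through [S<retrievedCount>].
function findInvalidCitations(answer: string, retrievedCount: number): number[] {
  return extractCitedSourceIds(answer).filter(
    (id) => id < 1 || id > retrievedCount
  );
}
```

An answer with no IDs at all, or with IDs outside the retrieved range, would be a signal that the prompt change is not having the intended effect.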
Suggested PR Title
Improve scientific RAG citations and retrieval metadata

Suggested PR Body
This PR addresses part of ISAAC-497 by strengthening the current uploaded-document RAG pipeline for scientific workflows.

Changes included:

Added a small scientific RAG utility for citation metadata extraction and retrieved-source formatting.
Preserved source metadata during PDF ingestion, including title, page, source path, source type, chunk index, citation key, DOI, author, and year when available.
Made Chroma retrieval configurable through RAG_TOP_K or request nResults.
Included retrieval distances in the formatted source context.
Updated the RAG prompt to require bracketed source citations like [S1] for factual claims.
Added basic API method/file validation around upload and retrieval routes.
This is intentionally scoped as an incremental improvement to the existing Chroma/LangChain implementation rather than a full framework rewrite. It should make the current RAG behavior easier to validate and extend toward the broader ISAAC-497 goals around scientific document management, AI access to uploaded documents, performance tuning, and reliable citations.
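For the configurable retrieval depth listed above, one possible resolution order is sketched below. This is an assumption-laden example rather than the PR's code: it treats RAG_TOP_K as an environment variable, gives the request's nResults precedence over it, and uses a hypothetical default of 4 and cap of 20. None of those details are stated in this report.

```typescript
// Resolve how many chunks to retrieve.
// Assumed precedence: request value, then environment value, then fallback.
// The fallback (4) and cap (20) are illustrative, not values from the PR.
function resolveTopK(
  requestValue: unknown,
  envValue: string | undefined,
  fallback: number = 4,
  max: number = 20
): number {
  const candidates: unknown[] = [requestValue, envValue];
  for (const candidate of candidates) {
    let parsed = NaN;
    if (typeof candidate === "number") {
      parsed = candidate;
    } else if (typeof candidate === "string") {
      parsed = Number(candidate); // "abc" and "5abc" both become NaN
    }
    // Accept only positive integers; anything else falls through.
    if (Number.isInteger(parsed) && parsed > 0) {
      return Math.min(parsed, max);
    }
  }
  return fallback;
}

// Possible use inside an API route (names assumed):
//   const topK = resolveTopK(req.body?.nResults, process.env.RAG_TOP_K);
```

Capping the value keeps a malformed or oversized request from pulling an unbounded number of chunks into the prompt.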

Validation note: I was not able to run the full build in my local workspace because dependencies were not installed there. Recommended validation is cd ui && npm install && npm run lint && npm run build.

Suggested Maintainer Comment
Hi Isaac team, I prepared an initial implementation for ISAAC-497 focused on the existing uploaded-document RAG path.

It improves scientific citation handling, stores richer PDF chunk metadata, formats retrieved context with stable source IDs like [S1], exposes configurable retrieval depth, and updates the answer prompt so factual claims must cite retrieved sources.

If this direction matches the bounty expectations, I can continue with the next slice: Semantic Scholar reference ingestion/unification and a small retrieval evaluation harness.

For bounty payout, I understand this should go through Algora after the PR is reviewed and merged.

Improve scientific RAG citations

e635867

Yongzie wants to merge 1 commit into aietal:master from Yongzie:isaac-497-scientific-rag
