
Improve scientific RAG citations #9

Open: Yongzie wants to merge 1 commit into aietal:master from Yongzie:isaac-497-scientific-rag

Yongzie commented May 14, 2026

ISAAC-497 PR Report
Summary
This patch improves the existing uploaded-document RAG path so it behaves more like a scientific/research workflow:

- stores stronger citation metadata for uploaded PDF chunks
- formats retrieved sources with stable bracketed source IDs like [S1]
- includes page, DOI, year, and retrieval distance when available
- makes retrieval depth configurable through RAG_TOP_K or request body nResults
- updates the RAG prompt so factual claims must cite retrieved scientific sources
- adds request validation for document upload and retrieval API routes
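
To illustrate the source formatting described above, here is a minimal sketch of how retrieved chunks could be rendered with bracketed source IDs plus page, DOI, year, and distance when available. The type names, field names, and the `formatRetrievedSources` function are hypothetical and are not taken from the patch itself.

```typescript
// Hypothetical shapes: these field names are assumptions, not the PR's actual types.
interface SourceMetadata {
  title?: string;
  page?: number;
  doi?: string;
  year?: number;
}

interface RetrievedChunk {
  text: string;
  metadata: SourceMetadata;
  distance?: number; // retrieval distance reported by the vector store
}

// Assign [S1], [S2], ... in retrieval order and list only the metadata that is present.
function formatRetrievedSources(chunks: RetrievedChunk[]): string {
  return chunks
    .map((chunk, index) => {
      const { title, page, doi, year } = chunk.metadata;
      const details: string[] = [];
      if (title) details.push(title);
      if (page !== undefined) details.push(`p. ${page}`);
      if (year !== undefined) details.push(String(year));
      if (doi) details.push(`doi:${doi}`);
      if (chunk.distance !== undefined) {
        details.push(`distance ${chunk.distance.toFixed(3)}`);
      }
      const id = `[S${index + 1}]`;
      const header = details.length > 0 ? `${id} ${details.join(", ")}` : id;
      return `${header}\n${chunk.text.trim()}`;
    })
    .join("\n\n");
}
```

Because the ID is derived from the position in the retrieval result, it is stable within a single request; whether the patch keeps IDs stable across requests is not stated here.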
Files Changed
- ui/utils/server/scientific-rag.ts
- ui/pages/api/inject-documents.ts
- ui/pages/api/fetch-documents.ts
- ui/pages/api/rag-chat.ts
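
As a rough idea of the kind of helper the citation utility could contain, the sketch below pulls a DOI and a publication year out of chunk text with regular expressions. The function name `extractCitationMetadata` and the heuristics are assumptions for illustration; the actual contents of scientific-rag.ts may differ.

```typescript
// Hypothetical result shape; only fields that are found get set.
interface CitationMetadata {
  doi?: string;
  year?: number;
}

// Heuristic extraction: takes the first DOI-like token and the first 19xx/20xx year.
// This can misfire (for example on years inside page ranges), so treat it as best effort.
function extractCitationMetadata(text: string): CitationMetadata {
  const result: CitationMetadata = {};

  // DOIs start with "10.", a 4-9 digit registrant code, a slash, then a suffix.
  const doiMatch = /\b10\.\d{4,9}\/[-._;()\/:A-Za-z0-9]+/.exec(text);
  if (doiMatch) {
    // Drop sentence punctuation that the suffix pattern may have swallowed.
    result.doi = doiMatch[0].replace(/[.,;)]+$/, "");
  }

  const yearMatch = /\b(?:19|20)\d{2}\b/.exec(text);
  if (yearMatch) {
    result.year = parseInt(yearMatch[0], 10);
  }

  return result;
}
```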
Validation
I could not run the full Next.js build in this workspace because dependencies are not installed. Recommended maintainer validation:

cd ui
npm install
npm run lint
npm run build
Manual validation path:

1. Start Chroma and the UI with the existing Docker flow.
2. Upload a research PDF.
3. Ask a question that requires evidence from the PDF.
4. Confirm the answer cites source IDs like [S1].
5. Confirm the retrieved context includes page metadata and DOI/year when available.
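
The citation check in the steps above could also be partly automated. The following is a small sketch, not part of this patch, that collects the [S#] IDs cited in an answer and flags any that do not correspond to a retrieved source; `checkCitations` and its return shape are hypothetical.

```typescript
// Collect the distinct source numbers cited in an answer, e.g. "[S1]" -> 1.
function extractCitedSourceNumbers(answer: string): number[] {
  const pattern = /\[S(\d+)\]/g;
  const found: number[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(answer)) !== null) {
    const n = parseInt(match[1], 10);
    if (found.indexOf(n) === -1) found.push(n);
  }
  return found;
}

interface CitationCheck {
  cited: number[]; // distinct IDs in order of first appearance
  unknown: number[]; // cited IDs with no matching retrieved source
  hasCitations: boolean;
}

// retrievedCount is the number of sources that were placed in the prompt context.
function checkCitations(answer: string, retrievedCount: number): CitationCheck {
  const cited = extractCitedSourceNumbers(answer);
  const unknown = cited.filter((n) => n < 1 || n > retrievedCount);
  return { cited, unknown, hasCitations: cited.length > 0 };
}
```

A check like this only confirms that citations are present and point at real sources; it does not verify that the cited passage actually supports the claim.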
Suggested PR Title
Improve scientific RAG citations and retrieval metadata

Suggested PR Body
This PR addresses part of ISAAC-497 by strengthening the current uploaded-document RAG pipeline for scientific workflows.

Changes included:

- Added a small scientific RAG utility for citation metadata extraction and retrieved-source formatting.
- Preserved source metadata during PDF ingestion, including title, page, source path, source type, chunk index, citation key, DOI, author, and year when available.
- Made Chroma retrieval configurable through RAG_TOP_K or request nResults.
- Included retrieval distances in the formatted source context.
- Updated the RAG prompt to require bracketed source citations like [S1] for factual claims.
- Added basic API method/file validation around upload and retrieval routes.
This is intentionally scoped as an incremental improvement to the existing Chroma/LangChain implementation rather than a full framework rewrite. It should make the current RAG behavior easier to validate and extend toward the broader ISAAC-497 goals around scientific document management, AI access to uploaded documents, performance tuning, and reliable citations.
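
For reviewers, here is a minimal sketch of how the configurable retrieval depth could be resolved. It assumes that a valid request nResults takes precedence over RAG_TOP_K, and it uses a default of 4 and a cap of 20; the precedence, both numbers, and the function names are assumptions rather than values taken from the patch.

```typescript
const DEFAULT_TOP_K = 4; // assumed fallback when nothing valid is supplied
const MAX_TOP_K = 20; // assumed cap to keep the prompt context bounded

// Accept numbers or numeric strings; reject non-integers, zero, and negatives.
function parsePositiveInt(value: unknown): number | undefined {
  const n =
    typeof value === "number"
      ? value
      : typeof value === "string"
      ? Number(value)
      : NaN;
  return isFinite(n) && Math.floor(n) === n && n > 0 ? n : undefined;
}

// requested: the nResults field from the request body.
// envValue: the raw RAG_TOP_K value, e.g. process.env.RAG_TOP_K in an API route.
function resolveTopK(requested: unknown, envValue: string | undefined): number {
  const chosen =
    parsePositiveInt(requested) ?? parsePositiveInt(envValue) ?? DEFAULT_TOP_K;
  return Math.min(chosen, MAX_TOP_K);
}
```

Passing the environment value in as a parameter keeps the function pure and easy to test without touching process state.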

Validation note: I was not able to run the full build in my local workspace because dependencies were not installed there. Recommended validation is cd ui && npm install && npm run lint && npm run build.

Suggested Maintainer Comment
Hi Isaac team, I prepared an initial implementation for ISAAC-497 focused on the existing uploaded-document RAG path.

It improves scientific citation handling, stores richer PDF chunk metadata, formats retrieved context with stable source IDs like [S1], exposes configurable retrieval depth, and updates the answer prompt so factual claims must cite retrieved sources.

If this direction matches the bounty expectations, I can continue with the next slice: Semantic Scholar reference ingestion/unification and a small retrieval evaluation harness.

For bounty payout, I understand this should go through Algora after the PR is reviewed and merged.
