Improve scientific RAG citations #9
Open
Yongzie wants to merge 1 commit into
ISAAC-497 PR Report
Summary
This patch improves the existing uploaded-document RAG path so it behaves more like a scientific/research workflow:
stores stronger citation metadata for uploaded PDF chunks
formats retrieved sources with stable bracketed source IDs like [S1]
includes page, DOI, year, and retrieval distance when available
makes retrieval depth configurable through RAG_TOP_K or request body nResults
updates the RAG prompt so factual claims must cite retrieved scientific sources
adds request validation for document upload and retrieval API routes
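As a rough illustration of the source formatting described above, the sketch below labels retrieved chunks with bracketed IDs and appends page, DOI, year, and retrieval distance when present. It is not the code in ui/utils/server/scientific-rag.ts: the `RetrievedChunk` shape, the `formatSources` name, the header layout, and the choice to number sources by retrieval order are all assumptions made for this example.

```typescript
// Hypothetical shape for one retrieved chunk; the field names are assumptions.
interface RetrievedChunk {
  text: string;
  title?: string;
  page?: number;
  doi?: string;
  year?: number;
  distance?: number;
}

// Builds a context block in which each chunk gets a bracketed ID ([S1], [S2], ...)
// assigned in retrieval order, followed by whatever metadata is available.
function formatSources(chunks: RetrievedChunk[]): string {
  return chunks
    .map((chunk, index) => {
      const details: string[] = [];
      if (chunk.title) details.push(chunk.title);
      if (chunk.page !== undefined) details.push(`p. ${chunk.page}`);
      if (chunk.doi) details.push(`doi:${chunk.doi}`);
      if (chunk.year !== undefined) details.push(String(chunk.year));
      if (chunk.distance !== undefined) {
        details.push(`distance ${chunk.distance.toFixed(3)}`);
      }
      const id = `[S${index + 1}]`;
      const header = details.length > 0 ? `${id} ${details.join(", ")}` : id;
      return `${header}\n${chunk.text}`;
    })
    .join("\n\n");
}
```

Missing metadata is simply omitted from the header, so a chunk with no page or DOI still gets a usable source ID.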
Files Changed
ui/utils/server/scientific-rag.ts
ui/pages/api/inject-documents.ts
ui/pages/api/fetch-documents.ts
ui/pages/api/rag-chat.ts
Validation
I could not run the full Next.js build in this workspace because dependencies are not installed. Recommended maintainer validation:
cd ui
npm install
npm run lint
npm run build
Manual validation path:
Start Chroma and the UI with the existing Docker flow.
Upload a research PDF.
Ask a question that requires evidence from the PDF.
Confirm the answer cites source IDs like [S1].
Confirm the retrieved context includes page metadata and DOI/year when available.
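The citation check in the manual steps could also be partly automated. The helper below is a hypothetical sketch, not part of this patch: it scans an answer for IDs of the form [S1] and reports which ones fall outside the number of sources that were actually retrieved. The `checkCitations` name and its return shape are assumptions.

```typescript
// Scans an answer for bracketed source IDs and splits them into IDs that map
// to a retrieved source (1..sourceCount) and IDs that do not.
function checkCitations(
  answer: string,
  sourceCount: number
): { cited: string[]; unknown: string[] } {
  const pattern = /\[S(\d+)\]/g;
  const cited: string[] = [];
  const unknown: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(answer)) !== null) {
    const n = Number(match[1]);
    const id = `S${n}`;
    // Skip IDs that were already recorded so each appears once.
    if (cited.indexOf(id) !== -1 || unknown.indexOf(id) !== -1) continue;
    if (n >= 1 && n <= sourceCount) {
      cited.push(id);
    } else {
      unknown.push(id);
    }
  }
  return { cited, unknown };
}
```

An empty `cited` list or a non-empty `unknown` list would both be worth flagging during validation, since the first means the answer cites nothing and the second means it cites a source that was never retrieved.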
Suggested PR Title
Improve scientific RAG citations and retrieval metadata
Suggested PR Body
This PR addresses part of ISAAC-497 by strengthening the current uploaded-document RAG pipeline for scientific workflows.
Changes included:
Added a small scientific RAG utility for citation metadata extraction and retrieved-source formatting.
Preserved source metadata during PDF ingestion, including title, page, source path, source type, chunk index, citation key, DOI, author, and year when available.
Made Chroma retrieval depth configurable through RAG_TOP_K or the request body's nResults field.
Included retrieval distances in the formatted source context.
Updated the RAG prompt to require bracketed source citations like [S1] for factual claims.
Added basic API method/file validation around upload and retrieval routes.
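One way the configurable retrieval depth could be resolved is sketched below. Several details are assumptions rather than facts from this patch: that RAG_TOP_K is an environment variable whose value the caller passes in (for example `process.env.RAG_TOP_K`), that the request's nResults takes precedence over it, and the default of 4 and cap of 20. The function names are hypothetical.

```typescript
const DEFAULT_TOP_K = 4; // assumed fallback when nothing valid is supplied
const MAX_TOP_K = 20; // assumed upper bound to keep the prompt size manageable

// Accepts a number or numeric string and returns it only if it is a positive integer.
function parsePositiveInt(value: unknown): number | undefined {
  let n = NaN;
  if (typeof value === "number") {
    n = value;
  } else if (typeof value === "string" && value.trim() !== "") {
    n = Number(value);
  }
  return Number.isInteger(n) && n > 0 ? n : undefined;
}

// Picks the retrieval depth: request value first, then the environment value,
// then the default, clamped to MAX_TOP_K.
function resolveTopK(requestValue: unknown, envValue: string | undefined): number {
  const chosen =
    parsePositiveInt(requestValue) ?? parsePositiveInt(envValue) ?? DEFAULT_TOP_K;
  return Math.min(chosen, MAX_TOP_K);
}
```

Invalid input (zero, negative, non-numeric) falls through to the next option instead of raising, which keeps a malformed request from breaking retrieval; a stricter route could reject it with a 400 instead.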
This is intentionally scoped as an incremental improvement to the existing Chroma/LangChain implementation rather than a full framework rewrite. It should make the current RAG behavior easier to validate and extend toward the broader ISAAC-497 goals around scientific document management, AI access to uploaded documents, performance tuning, and reliable citations.
Validation note: I was not able to run the full build in my local workspace because dependencies were not installed there. Recommended validation is cd ui && npm install && npm run lint && npm run build.
Suggested Maintainer Comment
Hi Isaac team, I prepared an initial implementation for ISAAC-497 focused on the existing uploaded-document RAG path.
It improves scientific citation handling, stores richer PDF chunk metadata, formats retrieved context with stable source IDs like [S1], exposes configurable retrieval depth, and updates the answer prompt so factual claims must cite retrieved sources.
If this direction matches the bounty expectations, I can continue with the next slice: Semantic Scholar reference ingestion/unification and a small retrieval evaluation harness.
For bounty payout, I understand this should go through Algora after the PR is reviewed and merged.