
feat: improve scientific RAG citations #4

Open
Vinzz2303 wants to merge 4 commits into aietal:master from Vinzz2303:improve-scientific-rag-citations

@Vinzz2303

Summary

  • Improves the existing RAG pipeline for scientific/document QA
  • Adds scientific-aware PDF chunking, citation metadata, retrieval distances, and stricter citation prompting
  • Adds tests for citation key generation, section detection, metadata construction, and retrieved document formatting

Motivation

The existing document chat flow uploads PDFs into Chroma and answers from retrieved chunks, but citations are hard to trace back to stable document/page/chunk identifiers. This makes scientific QA harder to verify. This PR adds a small citation layer around the existing RAG flow so answers can cite exact retrieved chunks such as paper-title:p3:c2.
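
As a rough illustration of the key format mentioned above, a citation key could be derived from the document title, page number, and chunk index. This is only a sketch: `slugify` and `buildCitationKey` are hypothetical names, and the real helper in ui/utils/server/scientific-rag.ts may normalize titles differently.

```typescript
// Hypothetical sketch; not the PR's actual implementation.
function slugify(title: string): string {
  return title
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-") // collapse each run of non-alphanumerics into one hyphen
    .replace(/^-+|-+$/g, ""); // trim leading and trailing hyphens
}

// Builds a key of the form <title-slug>:p<page>:c<chunk>.
function buildCitationKey(title: string, page: number, chunk: number): string {
  const slug = slugify(title) || "document"; // assumed fallback for empty titles
  return `${slug}:p${page}:c${chunk}`;
}

console.log(buildCitationKey("Paper Title", 3, 2)); // paper-title:p3:c2
```

Because the key depends only on title, page, and chunk index, re-ingesting the same PDF with the same chunking should yield the same keys.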

What changed

  • Added ui/utils/server/scientific-rag.ts with helpers for:
    • scientific section detection
    • scientific text split separators
    • stable citation key generation
    • Chroma metadata construction
    • retrieved document formatting
  • Updated PDF ingestion to:
    • use section-aware separators
    • keep title/page/source/section/chunk/citation metadata
    • remove noisy document logging
  • Updated retrieval to:
    • use CHROMA_PATH consistently
    • request 6 results by default, capped at 10
    • include retrieval distances with documents and metadata
  • Updated RAG chat prompting to:
    • pass formatted citation keys into the context
    • require citation keys for factual claims
    • prefer lower-distance sources when multiple chunks contain similar information
  • Added unit tests for the new scientific RAG helpers
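
To make the retrieval and prompting changes more concrete, here is a minimal sketch of how retrieved chunks might be formatted into prompt context with their citation keys and distances. The `RetrievedChunk` shape and `formatContext` name are assumptions for illustration and are not taken from the PR's code.

```typescript
// Hypothetical shape of one retrieved chunk after metadata lookup.
interface RetrievedChunk {
  citationKey: string;
  text: string;
  distance: number; // assumed: smaller means closer to the query
}

// Orders chunks by ascending distance and prefixes each with its citation key,
// so the model sees the closest sources first and can cite them by key.
function formatContext(chunks: RetrievedChunk[]): string {
  return [...chunks] // copy so the caller's array is not reordered
    .sort((a, b) => a.distance - b.distance)
    .map((c) => `[${c.citationKey}] (distance ${c.distance.toFixed(3)})\n${c.text}`)
    .join("\n\n");
}
```

Listing lower-distance chunks first is one simple way to support the "prefer lower-distance sources" instruction; the prompt itself still has to state that rule.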

Testing

From ui/:

npm test -- scientific-rag.test.ts --run
npx tsc --noEmit
npm run lint

Results:

  • Targeted Vitest suite passed: 4 tests
  • TypeScript check passed
  • Lint passed with pre-existing React hook dependency warnings in unrelated files

@Vinzz2303
Author

Hi, I opened a PR for the ISAAC/AimenGPT RAG bounty: #4

It improves the existing RAG pipeline with scientific-aware PDF chunking, stable citation keys, retrieval distances, stricter citation prompting, and tests. Could you confirm whether this fits the bounty scope?

@Vinzz2303
Author

I pushed an additional hardening commit to PR #4.

New improvements:

  • page-local citation keys so scientific citations stay easier to verify per document page
  • retrieval endpoint validation for method, empty queries, and bounded nResults
  • same-origin RAG document fetch instead of hard-coded localhost, improving deployment/container compatibility
  • added test coverage for page-local citation metadata
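
The endpoint validation described above could look roughly like the following. This is a sketch under assumptions: the function name, the error messages, and the choice of POST as the only allowed method are hypothetical; only the default of 6 results and the cap of 10 come from the PR description.

```typescript
const DEFAULT_N_RESULTS = 6; // default stated in the PR description
const MAX_N_RESULTS = 10; // cap stated in the PR description

type Validation =
  | { ok: true; query: string; nResults: number }
  | { ok: false; status: number; error: string };

// Hypothetical validator: rejects wrong methods and empty queries, bounds nResults.
function validateRetrievalRequest(
  method: string,
  body: { query?: unknown; nResults?: unknown }
): Validation {
  if (method !== "POST") {
    return { ok: false, status: 405, error: "Method not allowed" }; // POST is an assumption
  }
  const query = typeof body.query === "string" ? body.query.trim() : "";
  if (query.length === 0) {
    return { ok: false, status: 400, error: "Query must be a non-empty string" };
  }
  const requested = Number(body.nResults);
  const nResults =
    Number.isInteger(requested) && requested > 0
      ? Math.min(requested, MAX_N_RESULTS)
      : DEFAULT_N_RESULTS; // missing or invalid values fall back to the default
  return { ok: true, query, nResults };
}
```

Returning a tagged result instead of throwing keeps the handler's control flow explicit: the caller can map `status` and `error` directly onto the HTTP response.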

Validation:

  • npm test -- --run passed: 16 tests
  • npm run lint passed with only pre-existing React hook warnings in unrelated files

@cerredz

cerredz commented May 15, 2026

I did an independent local validation of this PR on Windows with Node/npm from the ui/ directory.

Commands/results:

  • npm ci: completed successfully. npm reports 53 existing lockfile vulnerabilities, but install completed.
  • npx vitest run tests/scientific-rag.test.ts --reporter verbose: passed, 5 tests succeeded / 0 failed.
  • npx tsc --noEmit: passed.
  • npm run lint: passed with the same React hook dependency warnings in unrelated files.

One environment note: the broad npm test -- --run command did not return in this Windows shell within 5 minutes, so I stopped that process and reran the changed scientific RAG test file directly. The targeted PR test surface passed cleanly.

@cerredz

cerredz commented May 15, 2026

One small follow-up from reading the scientific section helper: materials and methods was checked after methods, so that compound heading would be classified as methods. I opened a narrow PR against the source branch with a reorder and regression assertion: Vinzz2303#1
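
For readers following along, the ordering issue can be sketched like this, assuming the helper matches labels by substring in list order. The label list and `detectSection` name are illustrative and not copied from the PR.

```typescript
// Illustrative label list: the compound heading must come before its shorter suffix.
const SECTION_LABELS = [
  "abstract",
  "introduction",
  "materials and methods", // checked first so it is not swallowed by "methods"
  "methods",
  "results",
  "discussion",
  "conclusion",
  "references",
] as const;

type SectionLabel = (typeof SECTION_LABELS)[number] | "unknown";

// Returns the first label contained in the normalized heading.
function detectSection(heading: string): SectionLabel {
  const normalized = heading.toLowerCase().replace(/\s+/g, " ").trim();
  for (const label of SECTION_LABELS) {
    if (normalized.includes(label)) return label;
  }
  return "unknown";
}

console.log(detectSection("2. Materials and Methods")); // materials and methods
console.log(detectSection("Methods")); // methods
```

With first-match-wins substring checks, swapping the two entries would classify "Materials and Methods" as methods, which is the behavior the reorder avoids.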

Validation on the follow-up branch:

  • npx vitest run tests/scientific-rag.test.ts --reporter verbose
  • npx tsc --noEmit
  • git diff --check

Merge follow-up regression for Materials and Methods section detection.
@Vinzz2303
Author

Update: I reviewed and merged the follow-up PR from @cerredz into this branch. It tightens scientific section detection so Materials and Methods is classified before the shorter methods label, with a regression assertion added.

Validation after merging:

cd ui
npx vitest run __tests__/scientific-rag.test.ts --reporter verbose
# 5 passed

npx tsc --noEmit
# passed

cd ..
git diff --check
# no output

@Vinzz2303
Author

Pushed one more narrow hardening follow-up to this PR.

New changes:

  • handle empty or unexpectedly shaped Chroma retrieval results without throwing in the RAG chat path
  • return an explicit no-documents context when retrieval produces no usable chunks
  • guard PDF ingestion against missing/non-array upload payloads before constructing the PDF loader
  • add regression coverage for defensive Chroma retrieval formatting

Validation from ui/:

  • npx vitest run tests/scientific-rag.test.ts --reporter verbose -> 6 passed
  • npx tsc --noEmit -> passed
  • npm run lint -> passed with pre-existing React hook dependency warnings in unrelated files
  • git diff --check -> clean
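
A minimal sketch of the defensive handling described in this comment, assuming query results arrive as nested arrays with one inner array per query. The type names, function names, and the wording of the no-documents message are hypothetical.

```typescript
// Loosely typed on purpose: the point is to tolerate unexpected shapes.
interface RawQueryResult {
  documents?: unknown;
  metadatas?: unknown;
  distances?: unknown;
}

interface UsableChunk {
  text: string;
  metadata: Record<string, unknown>;
  distance: number | null;
}

const NO_DOCUMENTS_CONTEXT = "No relevant documents were retrieved."; // hypothetical wording

// Returns the first inner array of a nested result field, or [] for any other shape.
function firstRow(value: unknown): unknown[] {
  return Array.isArray(value) && Array.isArray(value[0]) ? value[0] : [];
}

// Keeps only entries whose document is a non-empty string; never throws.
function extractChunks(result: RawQueryResult | null | undefined): UsableChunk[] {
  const docs = firstRow(result?.documents);
  const metas = firstRow(result?.metadatas);
  const dists = firstRow(result?.distances);
  const chunks: UsableChunk[] = [];
  docs.forEach((doc, i) => {
    if (typeof doc !== "string" || doc.trim() === "") return;
    const meta = metas[i];
    const dist = dists[i];
    chunks.push({
      text: doc,
      metadata: meta !== null && typeof meta === "object" ? (meta as Record<string, unknown>) : {},
      distance: typeof dist === "number" && Number.isFinite(dist) ? dist : null,
    });
  });
  return chunks;
}

// Falls back to an explicit message when nothing usable was retrieved.
function buildContext(result: RawQueryResult | null | undefined): string {
  const chunks = extractChunks(result);
  return chunks.length === 0 ? NO_DOCUMENTS_CONTEXT : chunks.map((c) => c.text).join("\n\n");
}
```

Passing an explicit no-documents message into the prompt, rather than an empty string, gives the model a clear signal that it should not answer from unsupported context.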
