feat: improve scientific RAG citations by Vinzz2303 · Pull Request #4 · aietal/aimengpt

Vinzz2303 · 2026-05-13T14:01:26Z

Summary

- Improves the existing RAG pipeline for scientific/document QA
- Adds scientific-aware PDF chunking, citation metadata, retrieval distances, and stricter citation prompting
- Adds tests for citation key generation, section detection, metadata construction, and retrieved document formatting

Motivation

The existing document chat flow uploads PDFs into Chroma and answers from retrieved chunks, but citations are hard to trace back to stable document/page/chunk identifiers. This makes scientific QA harder to verify. This PR adds a small citation layer around the existing RAG flow so answers can cite exact retrieved chunks such as paper-title:p3:c2.
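To make the key format concrete, here is a minimal sketch of how a key like paper-title:p3:c2 could be derived from a title, page number, and chunk index. The function names `slugifyTitle` and `buildCitationKey` are hypothetical and are not taken from the PR diff; the actual helper in ui/utils/server/scientific-rag.ts may normalize titles differently.

```typescript
// Hypothetical helpers for illustration only; not the PR's implementation.
function slugifyTitle(title: string): string {
  const slug = title
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-") // collapse each run of non-alphanumerics into one hyphen
    .replace(/^-+|-+$/g, ""); // trim leading and trailing hyphens
  return slug || "untitled"; // fall back when nothing usable remains
}

// Builds a key of the form <title-slug>:p<page>:c<chunk>.
function buildCitationKey(title: string, page: number, chunk: number): string {
  return `${slugifyTitle(title)}:p${page}:c${chunk}`;
}

console.log(buildCitationKey("Paper Title", 3, 2)); // paper-title:p3:c2
```

Because the key depends only on the title, page, and chunk index, re-ingesting the same PDF with the same chunking would yield the same keys, which is what makes a citation traceable.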

What changed

Added ui/utils/server/scientific-rag.ts with helpers for:
- scientific section detection
- scientific text split separators
- stable citation key generation
- Chroma metadata construction
- retrieved document formatting
Updated PDF ingestion to:
- use section-aware separators
- keep title/page/source/section/chunk/citation metadata
- remove noisy document logging
Updated retrieval to:
- use CHROMA_PATH consistently
- request 6 results by default, capped at 10
- include retrieval distances with documents and metadata
Updated RAG chat prompting to:
- pass formatted citation keys into the context
- require citation keys for factual claims
- prefer lower-distance sources when multiple chunks contain similar information
Added unit tests for the new scientific RAG helpers
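As an illustration of the retrieval and prompting changes listed above, the following sketch formats retrieved chunks with their citation keys and distances, ordered so lower-distance sources come first. The `QueryResultLike` type assumes the nested-array layout of a single-query Chroma result, and the `citation` metadata field name and the `formatRetrievedDocuments` function are assumptions, not code from the PR.

```typescript
// Assumed metadata shape; the field name `citation` is hypothetical.
interface ChunkMetadata {
  citation?: string;
}

// Assumed layout of a single-query result: one inner array per query.
interface QueryResultLike {
  documents: (string | null)[][];
  metadatas: (ChunkMetadata | null)[][];
  distances: number[][];
}

function formatRetrievedDocuments(result: QueryResultLike): string {
  const docs = result.documents[0] ?? [];
  const metas = result.metadatas[0] ?? [];
  const dists = result.distances[0] ?? [];

  const rows: { key: string; distance: number; text: string }[] = [];
  docs.forEach((text, i) => {
    if (!text) return; // skip null or empty chunks
    rows.push({
      key: metas[i]?.citation ?? `unknown:c${i}`,
      distance: dists[i] ?? Number.POSITIVE_INFINITY,
      text,
    });
  });

  // Lower distance means a closer match, so those chunks are listed first.
  rows.sort((a, b) => (a.distance < b.distance ? -1 : a.distance > b.distance ? 1 : 0));

  return rows
    .map((r) => `[${r.key}] (distance ${r.distance.toFixed(3)})\n${r.text}`)
    .join("\n\n");
}
```

Putting the key in brackets ahead of each chunk gives the model a literal token to copy when the prompt requires a citation for every factual claim.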

Testing

From ui/:

npm test -- scientific-rag.test.ts --run
npx tsc --noEmit
npm run lint

Results:

Targeted Vitest suite passed: 4 tests
TypeScript check passed
Lint passed with pre-existing React hook dependency warnings in unrelated files

Vinzz2303 · 2026-05-13T14:15:51Z

Hi, I opened a PR for the ISAAC/AimenGPT RAG bounty: #4

It improves the existing RAG pipeline with scientific-aware PDF chunking, stable citation keys, retrieval distances, stricter citation prompting, and tests. Could you confirm whether this fits the bounty scope?

Vinzz2303 · 2026-05-14T15:34:31Z

I pushed an additional hardening commit to PR #4.

New improvements:

- page-local citation keys so scientific citations stay easier to verify per document page
- retrieval endpoint validation for method, empty queries, and bounded nResults
- same-origin RAG document fetch instead of hard-coded localhost, improving deployment/container compatibility
- added test coverage for page-local citation metadata
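The endpoint validation described above could look roughly like the sketch below. It is written as a pure function so it does not depend on any framework; the name `validateRetrievalRequest`, the POST-only assumption, and the status codes are illustrative guesses. Only the default of 6 and the cap of 10 come from the PR description.

```typescript
type Validation =
  | { ok: true; query: string; nResults: number }
  | { ok: false; status: number; error: string };

const DEFAULT_RESULTS = 6; // default stated in the PR description
const MAX_RESULTS = 10; // cap stated in the PR description

// Hypothetical validator; the real endpoint may structure this differently.
function validateRetrievalRequest(
  method: string,
  body: { query?: unknown; nResults?: unknown },
): Validation {
  if (method !== "POST") {
    // Assumption: the endpoint accepts POST only.
    return { ok: false, status: 405, error: "Method not allowed" };
  }

  const query = typeof body.query === "string" ? body.query.trim() : "";
  if (query.length === 0) {
    return { ok: false, status: 400, error: "Query must be a non-empty string" };
  }

  // Fall back to the default for missing or invalid values, then apply the cap.
  const requested = Number(body.nResults);
  const nResults =
    Number.isInteger(requested) && requested > 0
      ? Math.min(requested, MAX_RESULTS)
      : DEFAULT_RESULTS;

  return { ok: true, query, nResults };
}
```

Returning a tagged result instead of throwing keeps the handler body to a single branch on `ok`.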

Validation:

npm test -- --run passed: 16 tests
npm run lint passed with only pre-existing React hook warnings in unrelated files

cerredz · 2026-05-15T22:51:36Z

I did an independent local validation of this PR on Windows with Node/npm from the ui/ directory.

Commands/results:

- npm ci: completed successfully. npm reports 53 existing lockfile vulnerabilities, but install completed.
- npx vitest run tests/scientific-rag.test.ts --reporter verbose: passed, 5 tests succeeded / 0 failed.
- npx tsc --noEmit: passed.
- npm run lint: passed with the same React hook dependency warnings in unrelated files.

One environment note: the broad npm test -- --run command did not return in this Windows shell within 5 minutes, so I stopped that process and reran the changed scientific RAG test file directly. The targeted PR test surface passed cleanly.

cerredz · 2026-05-15T22:53:31Z

One small follow-up from reading the scientific section helper: materials and methods was checked after methods, so that compound heading would be classified as methods. I opened a narrow PR against the source branch with a reorder and regression assertion: Vinzz2303#1
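The ordering issue can be reproduced with a small sketch. It assumes the helper matches headings by substring against an ordered list of labels; `SECTION_LABELS` and `detectSection` are hypothetical names, and the real helper may use different labels or matching rules.

```typescript
// Hypothetical label list. The more specific "materials and methods" must
// precede "methods", otherwise the shorter label matches first.
const SECTION_LABELS = [
  "abstract",
  "introduction",
  "materials and methods",
  "methods",
  "results",
  "discussion",
  "conclusion",
  "references",
];

function detectSection(heading: string): string | undefined {
  // Normalize case and whitespace so "Materials  and Methods" still matches.
  const normalized = heading.toLowerCase().replace(/\s+/g, " ").trim();
  // find() returns the first matching label, so list order decides ties.
  return SECTION_LABELS.find((label) => normalized.includes(label));
}

console.log(detectSection("2. Materials and Methods")); // materials and methods
console.log(detectSection("3 Methods")); // methods
```

With "methods" listed before "materials and methods", the first call would return "methods" instead, which is the misclassification described above.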

Validation on the follow-up branch:

npx vitest run tests/scientific-rag.test.ts --reporter verbose
npx tsc --noEmit
git diff --check

Merge follow-up regression for Materials and Methods section detection.

Vinzz2303 · 2026-05-16T06:42:02Z

Update: I reviewed and merged the follow-up PR from @cerredz into this branch. It tightens scientific section detection so Materials and Methods is classified before the shorter methods label, with a regression assertion added.

Validation after merging:

cd ui
npx vitest run __tests__/scientific-rag.test.ts --reporter verbose
# 5 passed

npx tsc --noEmit
# passed

cd ..
git diff --check
# no output

Vinzz2303 · 2026-05-17T15:56:00Z

Pushed one more narrow hardening follow-up to this PR.

New changes:
- handle empty or unexpectedly shaped Chroma retrieval results without throwing in the RAG chat path
- return an explicit no-documents context when retrieval produces no usable chunks
- guard PDF ingestion against missing/non-array upload payloads before constructing the PDF loader
- add regression coverage for defensive Chroma retrieval formatting

Validation from ui/:
- npx vitest run tests/scientific-rag.test.ts --reporter verbose -> 6 passed
- npx tsc --noEmit -> passed
- npm run lint -> passed with pre-existing React hook dependency warnings in unrelated files
- git diff --check -> clean
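A defensive formatter along these lines might treat the retrieval result as `unknown` and fall back to an explicit no-documents context. This is a sketch under assumptions: the names `safeContext` and `firstRow`, the wording of `NO_DOCUMENTS_CONTEXT`, and the `citation` metadata field are all hypothetical rather than taken from the PR.

```typescript
// Hypothetical wording for the explicit no-documents context.
const NO_DOCUMENTS_CONTEXT = "No relevant documents were retrieved.";

// Returns the first inner array of a nested-array field, or [] for any other shape.
function firstRow(value: unknown): unknown[] {
  if (!Array.isArray(value)) return [];
  const inner: unknown = value[0];
  return Array.isArray(inner) ? inner : [];
}

function safeContext(result: unknown): string {
  if (typeof result !== "object" || result === null) return NO_DOCUMENTS_CONTEXT;

  const record = result as Record<string, unknown>;
  const docs = firstRow(record["documents"]);
  const metas = firstRow(record["metadatas"]);

  const lines: string[] = [];
  docs.forEach((doc, i) => {
    if (typeof doc !== "string" || doc.trim() === "") return; // ignore unusable chunks
    const meta = metas[i];
    const citation =
      typeof meta === "object" && meta !== null
        ? (meta as Record<string, unknown>)["citation"]
        : undefined;
    const key = typeof citation === "string" ? citation : `unknown:c${i}`;
    lines.push(`[${key}] ${doc}`);
  });

  return lines.length > 0 ? lines.join("\n\n") : NO_DOCUMENTS_CONTEXT;
}
```

Returning a fixed sentence rather than an empty string lets the prompt state plainly that nothing was retrieved, so the model has no empty context to fill in.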

feat: improve scientific RAG citations

cf379f1

fix: harden scientific rag retrieval citations

9fd0f8b

Fix scientific section specificity

cef4416

Merge follow-up regression for Materials and Methods section detection.

fix: harden scientific rag empty retrieval handling

7bd30ba

nscarjr mentioned this pull request May 18, 2026

feat: add scientific research context to RAG #11

Open

Vinzz2303 wants to merge 4 commits into aietal:master from Vinzz2303:improve-scientific-rag-citations


2 participants
