Clinical-genomic-rag-eval-demo

Clinical + Genomic RAG evaluation on de-identified synthetic data with transparent grounding and reproducible metrics.
Python 3.10+. No runtime deps beyond the standard library (tests use pytest).

Why

Clinical answers must be grounded and auditable. This demo:

enforces citations (doc_id + quote spans),
measures retrieval quality and faithfulness,
produces JSONL / CSV / HTML reports in reports/,
is small, fast, and easy to extend with dense retrievers and real LLMs.

Repository structure

clinical-genomic-rag-eval-demo/
├─ .github/
│  └─ workflows/
│     └─ ci.yml                  # GitHub Actions: tests + smoke eval + artifacts
├─ data/
│  ├─ corpus/
│  │  ├─ clin_note_001.md        # synthetic clinical note (de-identified)
│  │  ├─ genomics_oncokb_excerpt.md  # synthetic genomic knowledge snippet
│  │  ├─ pubmed_abstract_001.md  # synthetic pubmed-like abstract
│  │  └─ schema_examples/
│  │     ├─ answer_template.json # answer JSON schema (template)
│  │     └─ doc_schema.md        # corpus/chunking guidance
│  └─ eval_questions.json        # questions + gold grounding + keywords
├─ reports/
│  └─ .gitkeep                   # keep folder in VCS (artifacts land here)
├─ src/
│  ├─ __init__.py
│  ├─ chunking.py                # section-aware chunking (HPI / A&P / Pathology)
│  ├─ retriever.py               # pure-Python BM25-like retriever
│  ├─ pipeline.py                # retrieve → answer stub (citations)
│  ├─ eval_metrics.py            # hit@k, citation_recall, keyword_coverage, ...
│  ├─ run_eval.py                # run evaluation → JSONL/CSV in reports/
│  └─ report_html.py             # JSONL → HTML summary report
├─ tests/
│  ├─ test_metrics.py            # unit tests for metrics
│  └─ test_pipeline_smoke.py     # smoke: run_eval produces artifacts
├─ .editorconfig
├─ .gitattributes
├─ .gitignore
├─ .pre-commit-config.yaml       # optional local checks (black/ruff/hooks)
├─ CHANGELOG.md
├─ Makefile                      # setup / test / eval / report / clean
├─ requirements.txt              # pytest + pytest-cov (tests only)
├─ requirements-dev.txt          # optional dev tools (black/ruff/pre-commit)
└─ run.sh                        # orchestrate eval + HTML report



Metrics

hit@k — is any gold doc_id in the top-k retrieved?

citation_recall — fraction of gold docs cited in the answer.

keyword_coverage — expected keywords present in the claim (substring, case-insensitive).

context_overlap — token overlap of claim vs concatenated citation quotes (sanity for grounding).

faithfulness_stub — every citation quote shares words with the claim (very rough check).

All metrics are computed per-question and written to reports/eval_report.csv and reports/eval_report.jsonl.


Quick start

Python 3.10+

# 1) Install test deps
pip install -r requirements.txt

# 2) Run tests
pytest -q

# 3) Run evaluation (artifacts → reports/)
python -m src.run_eval --out_dir reports --top_k 3

# 4) Build HTML summary from JSONL
python -m src.report_html --in_jsonl reports/eval_report.jsonl --out_html reports/report.html



Makefile shortcuts

make setup     # pip install -r requirements.txt
make test      # pytest
make eval      # run_eval → JSONL/CSV in reports/
make report    # report_html → reports/report.html
make clean     # remove artifacts


One-liner

./run.sh 3
# JSONL → reports/eval_report.jsonl
# CSV   → reports/eval_report.csv
# HTML  → reports/report.html


Console output example

=== Clinical+Genomic RAG evaluation complete ===
JSONL: /.../reports/eval_report.jsonl
CSV:   /.../reports/eval_report.csv
HTML saved: /.../reports/report.html



Data & safety

Corpus is synthetic and de-identified (for demo only).

Section markers in clinical notes: ## HPI, ## Assessment and Plan, ## Pathology.

See data/corpus/schema_examples/doc_schema.md for formatting guidance.


Extending

Swap BM25-lite for a dense retriever (embeddings) and/or hybrid scoring.

Replace answer_stub with a real LLM client; keep the output JSON schema.

Strengthen faithfulness: claim–evidence alignment checks, factual consistency.

Add guideline-aware boosts (e.g., NCCN/ESMO) and section weighting.

Promote CI to publish report.html as a build artifact or Pages.


## Versioning

We follow Semantic Versioning. See [CHANGELOG.md](./CHANGELOG.md).  
Initial release: **v0.1.0**.


License

If you intend to open-source, add a LICENSE (e.g., MIT).
Synthetic content provided here is for demonstration and carries no clinical guarantees.


Acknowledgements

This repository was designed to help teams prototype grounded clinical RAG with auditable, lightweight metrics before integrating production corpora and models.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Clinical-genomic-rag-eval-demo

Why

Repository structure

About

Uh oh!

Releases 1

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 31 Commits
.github/workflows		.github/workflows
data		data
reports		reports
src		src
tests		tests
.editorconfig		.editorconfig
.gitattributes		.gitattributes
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
CHANGELOG.md		CHANGELOG.md
Makefile		Makefile
README.md		README.md
requirements-dev.txt		requirements-dev.txt
requirements.txt		requirements.txt
run.sh		run.sh

SvetLuna-Lab/Clinical-genomic-rag-eval-demo

Folders and files

Latest commit

History

Repository files navigation

Clinical-genomic-rag-eval-demo

Why

Repository structure

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases 1

Packages 0

Languages

Packages