Merged (changes from 4 commits)
44 changes: 44 additions & 0 deletions .github/workflows/tests.yml
@@ -0,0 +1,44 @@
name: Tests

on:
  push:
    branches: ["**"]
  pull_request:
    branches: ["**"]

jobs:
  tests:
    runs-on: ubuntu-latest
    permissions:
      contents: write
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.12"
          cache: "pip"
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements.txt
          pip install -r requirements-dev.txt
      - name: Add project to PYTHONPATH
        run: echo "PYTHONPATH=$PWD" >> $GITHUB_ENV
      - name: Run tests with coverage
        env:
          MPLCONFIGDIR: ${{ github.workspace }}/.mpl-cache
          XDG_CACHE_HOME: ${{ github.workspace }}/.cache
        run: |
          mkdir -p "$MPLCONFIGDIR" "$XDG_CACHE_HOME"/fontconfig
          pytest --cov=src --cov-report=term-missing --cov-report=xml
      - name: Generate coverage badge
        run: |
          mkdir -p badges
          coverage-badge -o badges/coverage.svg -f
      - name: Commit coverage badge (main branch only)
        if: github.ref == 'refs/heads/main' && github.event_name == 'push'
        uses: stefanzweifel/git-auto-commit-action@v5
        with:
          commit_message: "chore: update coverage badge"
          branch: main
9 changes: 9 additions & 0 deletions Makefile
@@ -6,6 +6,7 @@ XDG_CACHE_HOME ?= $(PWD)/.cache
ENV_VARS = MPLCONFIGDIR=$(MPLCONFIGDIR) XDG_CACHE_HOME=$(XDG_CACHE_HOME)

CACHE_DIRS = $(MPLCONFIGDIR) $(XDG_CACHE_HOME)/fontconfig
TEST_PYTHON ?= $(PYTHON)

all: cache_dirs data comparisons figures heatmaps run_reports master_report

@@ -36,3 +37,11 @@ run_reports: data cache_dirs
.PHONY: master_report
master_report: figures heatmaps run_reports cache_dirs
	$(ENV_VARS) $(PYTHON) -m src.build_master_report --project $(PROJECT)

.PHONY: test
test: cache_dirs
	$(ENV_VARS) $(TEST_PYTHON) -m pytest

.PHONY: coverage
coverage: cache_dirs
	$(ENV_VARS) $(TEST_PYTHON) -m pytest --cov=src --cov-report=term-missing
5 changes: 5 additions & 0 deletions README.md
@@ -2,6 +2,9 @@

Pipeline for parsing Perplexity DeepSearch outputs, comparing pseudo-enrichment programs to GO results, and generating figures/reports per project. The current default project is `glioblastoma_perplexity_manual`, but the layout supports multiple projects via per-project subdirectories.
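
A minimal sketch of the per-project path resolution this layout implies; `resolve_paths` and `ensure_output_dirs` come from `src/project_paths` as exercised by this PR's tests, and the `data_dir` attribute is likewise taken from those tests:

```python
# Sketch of per-project path resolution, assuming the src.project_paths API
# used by the integration tests in this PR.
from src.project_paths import resolve_paths

paths = resolve_paths("glioblastoma_perplexity_manual")  # current default project
paths.ensure_output_dirs()  # creates the per-project output directories
print(paths.data_dir)       # per-project outputs, e.g. component_mapping.csv
```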

[![Tests](https://github.com/Cellular-Semantics/langpa_validation_tools/actions/workflows/tests.yml/badge.svg)](https://github.com/Cellular-Semantics/langpa_validation_tools/actions/workflows/tests.yml)
![coverage](https://img.shields.io/badge/coverage-71%25-orange)

Copilot AI (Nov 25, 2025):


Coverage percentage mismatch: The README shows 71% coverage, but the generated SVG badge in badges/coverage.svg shows 81%. These should be consistent. Consider using the badge file directly or ensuring both are updated together.

Suggested change:
-![coverage](https://img.shields.io/badge/coverage-71%25-orange)
+![coverage](badges/coverage.svg)
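
To keep the two numbers from drifting, a small check along these lines could run in CI. This is a sketch only, assuming the Cobertura-style `coverage.xml` that `--cov-report=xml` writes and the badge path used in the workflow:

```python
# Sketch: fail if the committed badge no longer matches coverage.xml (assumed paths).
import xml.etree.ElementTree as ET
from pathlib import Path

rate = float(ET.parse("coverage.xml").getroot().get("line-rate"))
pct = round(rate * 100)
svg = Path("badges/coverage.svg").read_text()
assert f"{pct}%" in svg, f"badge is stale: coverage.xml says {pct}%"
```

Wiring something like this into the test step would catch the 71% vs 81% drift this comment describes.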


## Layout
- `projects/<project>/`: mapping files (`geneset_folder_mapping.csv`, `run_file_mapping.csv`), source spreadsheet (e.g., `media-3 (2).xlsx`), `description.md`.
- Inputs: `deepsearch/<project>/run_*.md`, `Comparisons/<project>/comparison geneset_*.md`, `schemas/<project>/` (placeholder).
@@ -12,6 +15,8 @@ Pipeline for parsing Perplexity DeepSearch outputs, comparing pseudo-enrichment
```bash
# activate your venv first (requires pandas, numpy, matplotlib, etc.)
PROJECT=glioblastoma_perplexity_manual make master_report
make test # run pytest
make coverage # pytest with coverage report
```
Targets: `data` (parse runs), `comparisons` (parse GO tables), `figures`, `heatmaps`, `run_reports`, `master_report`. Environment variables `MPLCONFIGDIR` and `XDG_CACHE_HOME` default to repo-local caches.
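
The cache defaults matter because matplotlib reads `MPLCONFIGDIR` at import time. A minimal sketch of the same setup in plain Python (the exact cache paths here are illustrative, mirroring the workflow's `.mpl-cache`/`.cache` choices):

```python
# Sketch of the repo-local cache setup the Makefile provides (illustrative paths).
import os
from pathlib import Path

os.environ.setdefault("MPLCONFIGDIR", str(Path.cwd() / ".mpl-cache"))
os.environ.setdefault("XDG_CACHE_HOME", str(Path.cwd() / ".cache"))
Path(os.environ["MPLCONFIGDIR"]).mkdir(parents=True, exist_ok=True)
(Path(os.environ["XDG_CACHE_HOME"]) / "fontconfig").mkdir(parents=True, exist_ok=True)

import matplotlib  # must happen after MPLCONFIGDIR is set
```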

21 changes: 21 additions & 0 deletions badges/coverage.svg
(New SVG badge file; GitHub does not render a diff for it.)
13 changes: 13 additions & 0 deletions pyproject.toml
@@ -10,3 +10,16 @@ dependencies = [
"openai>=1.0.0",
"python-dotenv",
]

[tool.pytest.ini_options]
testpaths = ["tests"]
addopts = "--strict-markers"

[tool.coverage.run]
source = ["src"]
branch = true
omit = ["src/embed_*", "src/build_master_report.py" ]

Copilot AI (Nov 25, 2025):


Trailing whitespace after the closing bracket. Consider removing it for cleaner code.

Suggested change:
-omit = ["src/embed_*", "src/build_master_report.py" ]
+omit = ["src/embed_*", "src/build_master_report.py"]


[tool.coverage.report]
show_missing = true
skip_covered = true
3 changes: 3 additions & 0 deletions requirements-dev.txt
@@ -0,0 +1,3 @@
pytest
pytest-cov
coverage-badge
4 changes: 2 additions & 2 deletions src/build_component_mapping.py
@@ -57,10 +57,10 @@ def tokenize(annotation: str) -> list[str]:
     return cleaned
 
 
-def main() -> None:
+def main(argv: list[str] | None = None) -> None:
     parser = argparse.ArgumentParser(description="Build component token mapping for a project.")
     add_project_argument(parser)
-    args = parser.parse_args()
+    args = parser.parse_args(argv)
     paths = resolve_paths(args.project)
     paths.ensure_output_dirs()
     if not paths.s10_file.exists():
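
The motivation for threading `argv` through `main` is testability: callers can pass arguments directly instead of patching `sys.argv`. A sketch of the calling pattern the new integration tests rely on (the project name is illustrative):

```python
# Sketch: drive the CLI entry point from a test without touching sys.argv.
from src.build_component_mapping import main

main(["--project", "tmp_proj"])  # explicit argv, as the integration tests do
# main()  # with no argument, argparse falls back to sys.argv[1:]
```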
4 changes: 2 additions & 2 deletions src/build_go_terms.py
@@ -26,10 +26,10 @@ def parse_terms(raw: str) -> list[tuple[str, str]]:
     return results
 
 
-def main() -> None:
+def main(argv: list[str] | None = None) -> None:
     parser = argparse.ArgumentParser(description="Extract GO terms from Table S10 for a project.")
     add_project_argument(parser)
-    args = parser.parse_args()
+    args = parser.parse_args(argv)
     paths = resolve_paths(args.project)
     paths.ensure_output_dirs()
     if not paths.s10_file.exists():
6 changes: 2 additions & 4 deletions src/extract_run_payloads.py
@@ -11,7 +11,6 @@
 from .project_paths import add_project_argument, resolve_paths
 
 
-
 def extract_citations(text: str) -> list[dict]:
     citations: list[dict] = []
     pattern = re.compile(r"^\[\^([^\]]+)\]:\s*(\S+)", re.MULTILINE)
@@ -20,7 +19,6 @@ def extract_citations(text: str) -> list[dict]:
     return citations
 
 
-def main() -> None:
+def main(argv: list[str] | None = None) -> None:
     parser = argparse.ArgumentParser(description="Extract DeepSearch payloads and citation footnotes for a project.")
     add_project_argument(parser)
@@ -42,13 +40,13 @@ def main(argv: list[str] | None = None) -> None:
         payload = parse_run(run_file)
 
         rel_folder = payload_dir / folder.name
-        rel_folder.mkdir(exist_ok=True)
+        rel_folder.mkdir(parents=True, exist_ok=True)
         payload_path = rel_folder / f"{run_name}.json"
         payload_path.write_text(json.dumps(payload, indent=2))
 
         citations = extract_citations(text)
         cite_folder = citation_dir / folder.name
-        cite_folder.mkdir(exist_ok=True)
+        cite_folder.mkdir(parents=True, exist_ok=True)
         citation_path = cite_folder / f"{run_name}_citations.json"
         citation_path.write_text(json.dumps(citations, indent=2))
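
The `parents=True` change matters on a fresh checkout: `mkdir(exist_ok=True)` still raises `FileNotFoundError` when an intermediate directory is missing. A minimal illustration (the paths are hypothetical):

```python
# Why parents=True: create nested output folders in one call (hypothetical paths).
from pathlib import Path

nested = Path("payloads") / "geneset_1"
# nested.mkdir(exist_ok=True)              # FileNotFoundError if "payloads" is absent
nested.mkdir(parents=True, exist_ok=True)  # creates "payloads" too, idempotently
```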
4 changes: 2 additions & 2 deletions src/match_components.py
@@ -17,10 +17,10 @@ def normalize(vectors: np.ndarray) -> np.ndarray:
     return vectors / norms
 
 
-def main() -> None:
+def main(argv: list[str] | None = None) -> None:
     parser = argparse.ArgumentParser(description="Match component embeddings to program embeddings for a project.")
     add_project_argument(parser)
-    args = parser.parse_args()
+    args = parser.parse_args(argv)
     paths = resolve_paths(args.project)
     paths.ensure_output_dirs()
     data_dir = paths.data_dir
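
For context on what `match_components` checks, a sketch of normalize-then-dot-product cosine matching, consistent with the `normalize` helper in the hunk above and the `similarity >= 0.99` assertion in the new tests; the zero-norm guard is an assumption, not taken from the source:

```python
# Sketch of cosine-similarity matching between component and program embeddings.
import numpy as np

def normalize(vectors: np.ndarray) -> np.ndarray:
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    norms[norms == 0.0] = 1.0  # assumption: avoid division by zero
    return vectors / norms

components = normalize(np.array([[1.0, 0.0]]))  # one component embedding
programs = normalize(np.array([[1.0, 0.0]]))    # one program embedding
similarity = components @ programs.T            # cosine similarity matrix
print(similarity[0, 0])                         # ~1.0 for identical directions
```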
58 changes: 0 additions & 58 deletions src/rename_runs.py

This file was deleted.

1 change: 1 addition & 0 deletions tests/integration/__init__.py
@@ -0,0 +1 @@
# Package marker for integration tests.
98 changes: 98 additions & 0 deletions tests/integration/test_build_component_and_go_and_match.py
@@ -0,0 +1,98 @@
import json

Copilot AI (Nov 25, 2025):


Import of 'json' is not used.

Suggested change:
-import json

import os
Copy link

Copilot AI (Nov 25, 2025):


Import of 'os' is not used.

Suggested change:
-import os


import numpy as np
import pandas as pd

from src.build_component_mapping import main as build_components_main
from src.build_go_terms import main as build_go_terms_main
from src.match_components import main as match_components_main
from src.project_paths import resolve_paths


def setup_project(monkeypatch, tmp_path):
    monkeypatch.chdir(tmp_path)
    project = "tmp_proj"
    paths = resolve_paths(project)
    paths.ensure_output_dirs()
    return project, paths


def test_build_component_mapping_and_go_terms_main(tmp_path, monkeypatch):
    project, paths = setup_project(monkeypatch, tmp_path)
    # create Table S10
    df = pd.DataFrame(
        {
            "MetaModule": [0],
            "annotation": ["OPC-like 1"],
            "Enriched Pathways": ["Term A (GO:1), Term B (GO:2)"],
        }
    )
    df.to_excel(paths.s10_file, sheet_name="Table S10", index=False)
    # mapping file
    pd.DataFrame(
        {
            "metamodule": [0],
            "annotation": ["OPC-like 1"],
            "original_folder": ["geneset_1"],
            "new_folder": ["00_Test"],
        }
    ).to_csv(paths.mapping_file, index=False)

    build_components_main(["--project", project])
    build_go_terms_main(["--project", project])

    comp_path = paths.data_dir / "component_mapping.csv"
    go_path = paths.data_dir / "go_terms.csv"
    assert comp_path.exists()
    assert go_path.exists()
    comp_df = pd.read_csv(comp_path)
    go_df = pd.read_csv(go_path)
    assert not comp_df.empty
    assert set(go_df["go_term"]) == {"Term A", "Term B"}


def test_match_components_main(tmp_path, monkeypatch):
    project, paths = setup_project(monkeypatch, tmp_path)
    data_dir = paths.data_dir
    data_dir.mkdir(parents=True, exist_ok=True)

    # component mapping
    pd.DataFrame(
        {
            "annotation": ["Test"],
            "folder": ["00_Test"],
            "component_token": ["tok"],
            "component_key": ["tok"],
            "component_order": [1],
            "expanded_name": ["token name"],
            "source_note": ["note"],
        }
    ).to_csv(data_dir / "component_mapping.csv", index=False)

    # component embeddings (single vector)
    np.save(data_dir / "component_embeddings.npy", np.array([[1.0, 0.0]]))
    pd.DataFrame(
        {
            "component_key": ["tok"],
            "component_token": ["tok"],
            "expanded_name": ["token name"],
        }
    ).to_csv(data_dir / "component_embeddings_index.csv", index=False)

    # program embeddings (one program in run 1)
    np.save(data_dir / "embeddings_name.npy", np.array([[1.0, 0.0]]))
    pd.DataFrame(
        {
            "folder": ["00_Test"],
            "run_index": [1],
            "program_index": [0],
            "program_name": ["Prog"],
        }
    ).to_csv(data_dir / "embeddings_index.csv", index=False)

    match_components_main(["--project", project])
    out_path = data_dir / "component_program_matches.csv"
    df = pd.read_csv(out_path)
    assert not df.empty
    assert df.iloc[0]["similarity"] >= 0.99
37 changes: 37 additions & 0 deletions tests/integration/test_generate_heatmaps.py
@@ -0,0 +1,37 @@
import json
import os

Copilot AI (Nov 25, 2025):


Import of 'os' is not used.

Suggested change:
-import os

from pathlib import Path

Copilot AI (Nov 25, 2025):


Import of 'Path' is not used.

Suggested change:
-from pathlib import Path

import tempfile

Copilot AI (Nov 25, 2025):


Import of 'tempfile' is not used.

Suggested change:
-import tempfile


import pandas as pd

from src.generate_heatmaps import generate_heatmaps
from src.project_paths import resolve_paths


def test_generate_heatmaps(tmp_path, monkeypatch):
    monkeypatch.chdir(tmp_path)
    project = "tmp_heatmap"
    paths = resolve_paths(project)
    paths.ensure_output_dirs()

    pd.DataFrame(
        {
            "folder": ["00_Test", "00_Test"],
            "annotation": ["Test", "Test"],
            "run_index": [1, 2],
            "program_index": [0, 0],
            "program_name": ["A", "B"],
            "supporting_genes": [json.dumps(["G1", "G2"]), json.dumps(["G2"])],
        }
    ).to_csv(paths.data_dir / "deepsearch_programs.csv", index=False)

    pd.DataFrame({"folder": ["00_Test"], "annotation": ["Test"], "duplicate": [False]}).to_csv(
        paths.data_dir / "deepsearch_duplicate_runs.csv", index=False
    )

    monkeypatch.setenv("MPLCONFIGDIR", str(tmp_path / ".mpl"))
    monkeypatch.setenv("XDG_CACHE_HOME", str(tmp_path / ".cache"))

    generate_heatmaps(project)
    assert (paths.analysis_dir / "confusion_heatmaps" / "00_Test_bubble.png").exists()