Merged
4 changes: 3 additions & 1 deletion .gitignore
@@ -8,6 +8,8 @@
*.swo

# Python
.venv
venv
__pycache__/
*.py[cod]
*$py.class
@@ -38,4 +40,4 @@ htmlcov/
docs/_build/

# Validation workspace (temporary files and reports)
validation_workspace/
validation_workspace/
33 changes: 32 additions & 1 deletion README.md
@@ -54,6 +54,38 @@ NCBI_API_KEY=your_ncbi_key # Optional but recommended for higher rate lim

## Usage

### Bibliography → CSL-JSON mapping

Take a DeepSearch-style bibliography (URLs, optionally with `source_id`) and return CSL-JSON keyed by the original reference numbers:

```python
from lit_agent.identifiers import resolve_bibliography

bibliography = [
    {"source_id": "1", "url": "https://pubmed.ncbi.nlm.nih.gov/37674083/"},
    {"source_id": "2", "url": "https://pmc.ncbi.nlm.nih.gov/articles/PMC11239014/"},
    {"source_id": "3", "url": "https://doi.org/10.1038/s41586-023-06502-w"},
]

result = resolve_bibliography(
    bibliography,
    validate=True,          # NCBI/metapub validation + metadata fetch
    scrape=False,           # Enable if you want web/PDF scraping for failures
    pdf=False,
    topic_validation=False,
)

print(result.citations["1"]["PMID"])        # "37674083"
print(result.citations["2"]["PMCID"])       # "PMC11239014"
print(result.citations["3"]["DOI"])         # "10.1038/s41586-023-06502-w"
print(result.citations["1"]["resolution"])  # methods, confidence, validation, errors
```

Each citation is CSL-JSON–compatible with a custom `resolution` block:
- `id` is the original `source_id` (or 1-based string if absent)
- `URL`, identifiers (`DOI`/`PMID`/`PMCID`), optional metadata (`title`, `author`, `container-title`, `issued`, etc.)
- `resolution`: `confidence`, `methods`, `validation` statuses, `errors`, `source_url`, optional `canonical_id`
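
Downstream consumers can filter on the `resolution` block directly. A minimal sketch over an illustrative citation map (the data below is made up to show the shape, not real output):

```python
# Illustrative citation map shaped like the resolve_bibliography output above
citations = {
    "1": {
        "id": "1",
        "URL": "https://pubmed.ncbi.nlm.nih.gov/37674083/",
        "PMID": "37674083",
        "resolution": {"confidence": 0.95, "methods": ["url_pattern"], "validation": {"ncbi": "passed"}},
    },
    "2": {
        "id": "2",
        "URL": "https://example.org/unknown-article",
        "resolution": {"confidence": 0.1, "methods": [], "validation": {"ncbi": "skipped"}, "errors": ["no identifier found"]},
    },
}

# Keep only references resolved with reasonable confidence
resolved = {
    ref_id: cit
    for ref_id, cit in citations.items()
    if cit["resolution"]["confidence"] >= 0.5
}
print(sorted(resolved))  # ['1']
```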

### Academic Identifier Extraction

Extract DOI, PMID, and PMC identifiers from academic URLs with comprehensive validation:
@@ -353,4 +385,3 @@ This project is licensed under the MIT License - see the [LICENSE](LICENSE) file
- Built with [LiteLLM](https://github.com/BerriAI/litellm) for unified LLM API access
- Uses [uv](https://github.com/astral-sh/uv) for fast Python package management
- Code quality maintained with [black](https://github.com/psf/black) and [ruff](https://github.com/astral-sh/ruff)

45 changes: 45 additions & 0 deletions plans/aim1-url2ref-functional-plan.md
@@ -0,0 +1,45 @@
# Aim 1 – url2ref functionality plan (standalone)

## Goal
Expand `url2ref` (lit_agent) so it can take a numbered bibliography (URLs) from upstream systems (e.g., DeepSearch) and return a citation map keyed by the original reference numbers. Each entry should be CSL-JSON–compatible, enriched with resolved identifiers and confidence/validation details.

## Chosen citation schema
- **CSL-JSON** as the citation payload: stable, widely supported, flexible for partial metadata.
- Fields we commit to populate when available: `id` (ref_id), `URL`, `type`, `title`, `author` (family/given), `issued` (`date-parts`), `container-title`, `publisher`, `page`, `volume`, `issue`, `DOI`, `PMID`, `PMCID`.
- Add a `resolution` object (custom) with: `confidence` (0–1), `methods` (ordered list of extraction methods), `validation` (e.g., `{"ncbi": "passed" | "failed" | "skipped", "metapub": ...}`), `errors` (optional list), and `source_url` for traceability.
- Numbering: preserve the **original ref number** (stringified) from the input order. Never renumber. If deduplication is applied, keep both the original `id` and a `canonical_id` for grouping.
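
As a concrete illustration of the schema above, a single record might look like this (all values are hypothetical):

```python
# Hypothetical CSL-JSON record following the schema committed to above
citation = {
    "id": "3",  # original ref number, stringified — never renumbered
    "type": "article-journal",
    "URL": "https://doi.org/10.1038/s41586-023-06502-w",
    "title": "Example title",
    "author": [{"family": "Doe", "given": "Jane"}],
    "issued": {"date-parts": [[2023]]},
    "container-title": "Nature",
    "DOI": "10.1038/s41586-023-06502-w",
    "resolution": {
        "confidence": 0.9,
        "methods": ["url_pattern", "doi_lookup"],
        "validation": {"ncbi": "skipped", "metapub": "passed"},
        "source_url": "https://doi.org/10.1038/s41586-023-06502-w",
    },
}

# The id must round-trip as the original reference number
print(citation["id"])  # 3
```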

## New/updated APIs
- High-level function: `resolve_bibliography(urls: list[str], *, validate=True, scrape=True, pdf=True, topic_validation=False) -> CitationResolutionResult`
- Input: ordered list of URLs (implicitly numbered starting at 1).
- Output: `CitationResolutionResult` with:
- `citations: dict[str, CSLJSONCitation]` keyed by ref_id (`"1"`, `"2"`, ...).
- `stats`: counts for resolved/unresolved, by method, average confidence, validation outcomes.
- `failures`: list of ref_ids with reasons.
- Keep existing `extract_identifiers_from_bibliography` public but use it internally.
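
A dataclass sketch of the result container described above (field names as proposed in this plan, not the actual implementation):

```python
from dataclasses import dataclass, field
from typing import Any

# CSL-JSON record plus the custom "resolution" block
CSLJSONCitation = dict[str, Any]

@dataclass
class CitationResolutionResult:
    # Keyed by original ref number as a string: "1", "2", ...
    citations: dict[str, CSLJSONCitation] = field(default_factory=dict)
    # Aggregates: resolved/unresolved counts, per-method success, validation outcomes
    stats: dict[str, Any] = field(default_factory=dict)
    # Ref ids that could not be resolved, with reasons
    failures: list[dict[str, str]] = field(default_factory=list)

result = CitationResolutionResult()
result.citations["1"] = {"id": "1", "URL": "https://doi.org/10.1000/example"}
print(len(result.citations), len(result.failures))  # 1 0
```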

## Processing pipeline
1) **Identifier extraction (existing)**: reuse `JournalURLExtractor` → `CompositeValidator` (NCBI/metapub) with per-identifier confidence.
2) **Phase 2 (optional)**: web scraping/PDF extraction for failed URLs; track methods.
3) **Metadata enrichment**:
- Primary: `NCBIAPIValidator.get_article_metadata` when PMID/PMCID present.
- DOI metadata lookup (CrossRef or similar) if already available in the codebase, or add a light DOI resolver (skip gracefully when network access is not permitted).
- Map metadata to CSL-JSON fields; fill `issued` from year (or full date if present).
4) **Record assembly**:
- For each URL (ref_id), create CSL-JSON object with `id = ref_id`, `URL = url`, identifiers (`DOI`, `PMID`, `PMCID`), and enriched metadata.
- Attach `resolution` with method path and validation outcomes; if unresolved, include `errors` and leave identifiers blank.
5) **Stats/reporting**:
- Aggregate success/failure, per-method success, confidence histograms, validation pass rates.
- Optional: expose `to_json()` for container embedding.
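
Step 5's aggregation could start as a pure function over the citation map; a sketch with illustrative names:

```python
def summarize(citations):
    """Aggregate resolution stats from a citation map (illustrative sketch)."""
    stats = {"total": len(citations), "resolved": 0, "by_method": {}, "confidences": []}
    for cit in citations.values():
        res = cit.get("resolution", {})
        # A record counts as resolved if any identifier was found
        if any(k in cit for k in ("DOI", "PMID", "PMCID")):
            stats["resolved"] += 1
        # Tally which extraction methods contributed
        for method in res.get("methods", []):
            stats["by_method"][method] = stats["by_method"].get(method, 0) + 1
        stats["confidences"].append(res.get("confidence", 0.0))
    stats["unresolved"] = stats["total"] - stats["resolved"]
    return stats

stats = summarize({
    "1": {"PMID": "37674083", "resolution": {"methods": ["url_pattern"], "confidence": 0.95}},
    "2": {"resolution": {"methods": [], "confidence": 0.0, "errors": ["no identifier"]}},
})
print(stats["resolved"], stats["unresolved"])  # 1 1
```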

## Edge cases & rules
- Preserve input order as authoritative numbering; never reshuffle.
- If multiple identifiers per URL, keep all (`DOI`, `PMID`, `PMCID`); prefer PMID/PMCID to fetch metadata, but do not drop DOI.
- If metadata fetch fails, still return identifiers and source URL with low confidence.
- If scraping/PDF disabled or not permitted, mark validation as `skipped` and return partial data.
- Keep network-dependent steps optional via flags; ensure graceful degradation without secrets.
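
The "prefer PMID/PMCID for metadata, but never drop DOI" rule reduces to a small selection helper; a sketch under the preference order stated above:

```python
def pick_metadata_source(identifiers):
    """Choose which identifier drives the metadata fetch, without discarding the others.

    Preference order per the rules above: PMID, then PMCID, then DOI (sketch only).
    """
    for key in ("PMID", "PMCID", "DOI"):
        if identifiers.get(key):
            return key, identifiers[key]
    return None, None  # unresolved: caller keeps URL and low confidence

key, value = pick_metadata_source({"DOI": "10.1038/x", "PMID": "37674083"})
print(key)  # PMID
```

The untouched `identifiers` dict still carries the DOI, so nothing is lost by the choice.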

## Testing
- Unit tests: fixture URLs → expected CSL-JSON snippets; confidence/method tracking; unresolved paths.
- Integration (if allowed): small curated URLs hitting NCBI (and DOI if available) with recorded responses; fallback to mocks when offline.
- Schema checks: validate produced citation map against CSL-JSON structure + custom `resolution` fragment.
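
The schema check could begin as a simple structural validator before reaching for a full CSL-JSON schema; a sketch with the field requirements proposed above:

```python
def check_citation_map(citations):
    """Return a list of problems found in a citation map (structural sketch)."""
    problems = []
    for ref_id, cit in citations.items():
        if cit.get("id") != ref_id:
            problems.append(f"{ref_id}: id mismatch")
        if "URL" not in cit:
            problems.append(f"{ref_id}: missing URL")
        res = cit.get("resolution")
        if not isinstance(res, dict) or "confidence" not in res:
            problems.append(f"{ref_id}: missing resolution.confidence")
    return problems

ok = {"1": {"id": "1", "URL": "https://doi.org/10.1/x", "resolution": {"confidence": 0.9}}}
bad = {"2": {"id": "1"}}
print(check_citation_map(ok))   # []
print(check_citation_map(bad))  # three problems reported
```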
6 changes: 4 additions & 2 deletions pyproject.toml
@@ -3,7 +3,7 @@ requires = ["setuptools>=68", "wheel"]
build-backend = "setuptools.build_meta"

[project]
name = "lit-agent"
name = "url2ref"
version = "0.1.0"
description = "Reference extraction agent for analyzing DeepSearch results"
authors = [{name = "Research Team"}]
@@ -38,11 +38,13 @@ markers = [
    "unit: fast, isolated tests",
    "integration: tests that hit real services or I/O"
]
filterwarnings = [
    "ignore:invalid escape sequence.*docopt.*:SyntaxWarning",
]

[tool.coverage.run]
branch = true
source = ["src/lit_agent"]

[tool.coverage.report]
skip_empty = true

4 changes: 4 additions & 0 deletions src/lit_agent/identifiers/__init__.py
@@ -34,6 +34,8 @@
    extract_identifiers_from_bibliography,
    extract_identifiers_from_url,
    validate_identifier,
    resolve_bibliography,
    CitationResolutionResult,
)

# Demo functionality
@@ -66,6 +68,8 @@
    "extract_identifiers_from_bibliography",
    "extract_identifiers_from_url",
    "validate_identifier",
    "resolve_bibliography",
    "CitationResolutionResult",
    # Demo
    "demo_extraction",
]