Merged
4 changes: 3 additions & 1 deletion .gitignore
@@ -8,6 +8,8 @@
*.swo

# Python
.venv
venv
__pycache__/
*.py[cod]
*$py.class
@@ -38,4 +40,4 @@ htmlcov/
docs/_build/

# Validation workspace (temporary files and reports)
validation_workspace/
validation_workspace/
33 changes: 32 additions & 1 deletion README.md
@@ -54,6 +54,38 @@ NCBI_API_KEY=your_ncbi_key # Optional but recommended for higher rate lim

## Usage

### Bibliography → CSL-JSON mapping

Take a DeepSearch-style bibliography (URLs, optionally with `source_id`) and return CSL-JSON keyed by the original reference numbers:

```python
from lit_agent.identifiers import resolve_bibliography

bibliography = [
    {"source_id": "1", "url": "https://pubmed.ncbi.nlm.nih.gov/37674083/"},
    {"source_id": "2", "url": "https://pmc.ncbi.nlm.nih.gov/articles/PMC11239014/"},
    {"source_id": "3", "url": "https://doi.org/10.1038/s41586-023-06502-w"},
]

result = resolve_bibliography(
    bibliography,
    validate=True,          # NCBI/metapub validation + metadata fetch
    scrape=False,           # Enable if you want web/PDF scraping for failures
    pdf=False,
    topic_validation=False,
)

print(result.citations["1"]["PMID"])        # "37674083"
print(result.citations["2"]["PMCID"])       # "PMC11239014"
print(result.citations["3"]["DOI"])         # "10.1038/s41586-023-06502-w"
print(result.citations["1"]["resolution"])  # methods, confidence, validation, errors
```

Each citation is CSL-JSON–compatible with a custom `resolution` block:
- `id` is the original `source_id` (or 1-based string if absent)
- `URL`, identifiers (`DOI`/`PMID`/`PMCID`), optional metadata (`title`, `author`, `container-title`, `issued`, etc.)
- `resolution`: `confidence`, `methods`, `validation` statuses, `errors`, `source_url`, optional `canonical_id`
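
Downstream consumers can filter on the `resolution` block directly. A minimal sketch over an illustrative citation map (the data below is made up to show the shape, not real output):

```python
# Illustrative citation map shaped like the resolve_bibliography output above
citations = {
    "1": {
        "id": "1",
        "URL": "https://pubmed.ncbi.nlm.nih.gov/37674083/",
        "PMID": "37674083",
        "resolution": {"confidence": 0.95, "methods": ["url_pattern"], "validation": {"ncbi": "passed"}},
    },
    "2": {
        "id": "2",
        "URL": "https://example.org/unknown-article",
        "resolution": {"confidence": 0.1, "methods": [], "validation": {"ncbi": "skipped"}, "errors": ["no identifier found"]},
    },
}

# Keep only references resolved with reasonable confidence
resolved = {
    ref_id: cit
    for ref_id, cit in citations.items()
    if cit["resolution"]["confidence"] >= 0.5
}
print(sorted(resolved))  # ['1']
```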

### Academic Identifier Extraction

Extract DOI, PMID, and PMC identifiers from academic URLs with comprehensive validation:
@@ -353,4 +385,3 @@ This project is licensed under the MIT License - see the [LICENSE](LICENSE) file
- Built with [LiteLLM](https://github.com/BerriAI/litellm) for unified LLM API access
- Uses [uv](https://github.com/astral-sh/uv) for fast Python package management
- Code quality maintained with [black](https://github.com/psf/black) and [ruff](https://github.com/astral-sh/ruff)

45 changes: 45 additions & 0 deletions plans/aim1-url2ref-functional-plan.md
@@ -0,0 +1,45 @@
# Aim 1 – url2ref functionality plan (standalone)

## Goal
Expand `url2ref` (lit_agent) so it can take a numbered bibliography (URLs) from upstream systems (e.g., DeepSearch) and return a citation map keyed by the original reference numbers. Each entry should be CSL-JSON–compatible, enriched with resolved identifiers and confidence/validation details.

## Chosen citation schema
- **CSL-JSON** as the citation payload: stable, widely supported, flexible for partial metadata.
- Fields we commit to populate when available: `id` (ref_id), `URL`, `type`, `title`, `author` (family/given), `issued` (`date-parts`), `container-title`, `publisher`, `page`, `volume`, `issue`, `DOI`, `PMID`, `PMCID`.
- Add a `resolution` object (custom) with: `confidence` (0–1), `methods` (ordered list of extraction methods), `validation` (e.g., `{"ncbi": "passed" | "failed" | "skipped", "metapub": ...}`), `errors` (optional list), and `source_url` for traceability.
- Numbering: preserve the **original ref number** (stringified) from the input order. Never renumber. If deduplication is applied, keep both the original `id` and a `canonical_id` for grouping.
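
As a concrete illustration of the schema above, a single record might look like this (all values are hypothetical):

```python
# Hypothetical CSL-JSON record following the schema committed to above
citation = {
    "id": "3",  # original ref number, stringified — never renumbered
    "type": "article-journal",
    "URL": "https://doi.org/10.1038/s41586-023-06502-w",
    "title": "Example title",
    "author": [{"family": "Doe", "given": "Jane"}],
    "issued": {"date-parts": [[2023]]},
    "container-title": "Nature",
    "DOI": "10.1038/s41586-023-06502-w",
    "resolution": {
        "confidence": 0.9,
        "methods": ["url_pattern", "doi_lookup"],
        "validation": {"ncbi": "skipped", "metapub": "passed"},
        "source_url": "https://doi.org/10.1038/s41586-023-06502-w",
    },
}

# The id must round-trip as the original reference number
print(citation["id"])  # 3
```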

## New/updated APIs
- High-level function: `resolve_bibliography(urls: list[str], *, validate=True, scrape=True, pdf=True, topic_validation=False) -> CitationResolutionResult`
- Input: ordered list of URLs (implicitly numbered starting at 1).
- Output: `CitationResolutionResult` with:
- `citations: dict[str, CSLJSONCitation]` keyed by ref_id (`"1"`, `"2"`, ...).
- `stats`: counts for resolved/unresolved, by method, average confidence, validation outcomes.
- `failures`: list of ref_ids with reasons.
- Keep existing `extract_identifiers_from_bibliography` public but use it internally.
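
A dataclass sketch of the result container described above (field names as proposed in this plan, not the actual implementation):

```python
from dataclasses import dataclass, field
from typing import Any

# CSL-JSON record plus the custom "resolution" block
CSLJSONCitation = dict[str, Any]

@dataclass
class CitationResolutionResult:
    # Keyed by original ref number as a string: "1", "2", ...
    citations: dict[str, CSLJSONCitation] = field(default_factory=dict)
    # Aggregates: resolved/unresolved counts, per-method success, validation outcomes
    stats: dict[str, Any] = field(default_factory=dict)
    # Ref ids that could not be resolved, with reasons
    failures: list[dict[str, str]] = field(default_factory=list)

result = CitationResolutionResult()
result.citations["1"] = {"id": "1", "URL": "https://doi.org/10.1000/example"}
print(len(result.citations), len(result.failures))  # 1 0
```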

## Processing pipeline
1) **Identifier extraction (existing)**: reuse `JournalURLExtractor` → `CompositeValidator` (NCBI/metapub) with per-identifier confidence.
2) **Phase 2 (optional)**: web scraping/PDF extraction for failed URLs; track methods.
3) **Metadata enrichment**:
- Primary: `NCBIAPIValidator.get_article_metadata` when PMID/PMCID present.
- DOI metadata lookup (CrossRef or similar) if already available in the codebase, or add a light DOI resolver (skip gracefully when network access is not permitted).
- Map metadata to CSL-JSON fields; fill `issued` from year (or full date if present).
4) **Record assembly**:
- For each URL (ref_id), create CSL-JSON object with `id = ref_id`, `URL = url`, identifiers (`DOI`, `PMID`, `PMCID`), and enriched metadata.
- Attach `resolution` with method path and validation outcomes; if unresolved, include `errors` and leave identifiers blank.
5) **Stats/reporting**:
- Aggregate success/failure, per-method success, confidence histograms, validation pass rates.
- Optional: expose `to_json()` for container embedding.
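
Step 5's aggregation could start as a pure function over the citation map; a sketch with illustrative names:

```python
def summarize(citations):
    """Aggregate resolution stats from a citation map (illustrative sketch)."""
    stats = {"total": len(citations), "resolved": 0, "by_method": {}, "confidences": []}
    for cit in citations.values():
        res = cit.get("resolution", {})
        # A record counts as resolved if any identifier was found
        if any(k in cit for k in ("DOI", "PMID", "PMCID")):
            stats["resolved"] += 1
        # Tally which extraction methods contributed
        for method in res.get("methods", []):
            stats["by_method"][method] = stats["by_method"].get(method, 0) + 1
        stats["confidences"].append(res.get("confidence", 0.0))
    stats["unresolved"] = stats["total"] - stats["resolved"]
    return stats

stats = summarize({
    "1": {"PMID": "37674083", "resolution": {"methods": ["url_pattern"], "confidence": 0.95}},
    "2": {"resolution": {"methods": [], "confidence": 0.0, "errors": ["no identifier"]}},
})
print(stats["resolved"], stats["unresolved"])  # 1 1
```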

## Edge cases & rules
- Preserve input order as authoritative numbering; never reshuffle.
- If multiple identifiers per URL, keep all (`DOI`, `PMID`, `PMCID`); prefer PMID/PMCID to fetch metadata, but do not drop DOI.
- If metadata fetch fails, still return identifiers and source URL with low confidence.
- If scraping/PDF disabled or not permitted, mark validation as `skipped` and return partial data.
- Keep network-dependent steps optional via flags; ensure graceful degradation without secrets.
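
The "prefer PMID/PMCID for metadata, but never drop DOI" rule reduces to a small selection helper; a sketch under the preference order stated above:

```python
def pick_metadata_source(identifiers):
    """Choose which identifier drives the metadata fetch, without discarding the others.

    Preference order per the rules above: PMID, then PMCID, then DOI (sketch only).
    """
    for key in ("PMID", "PMCID", "DOI"):
        if identifiers.get(key):
            return key, identifiers[key]
    return None, None  # unresolved: caller keeps URL and low confidence

key, value = pick_metadata_source({"DOI": "10.1038/x", "PMID": "37674083"})
print(key)  # PMID
```

The untouched `identifiers` dict still carries the DOI, so nothing is lost by the choice.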

## Testing
- Unit tests: fixture URLs → expected CSL-JSON snippets; confidence/method tracking; unresolved paths.
- Integration (if allowed): small curated URLs hitting NCBI (and DOI if available) with recorded responses; fallback to mocks when offline.
- Schema checks: validate produced citation map against CSL-JSON structure + custom `resolution` fragment.
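
The schema check could begin as a simple structural validator before reaching for a full CSL-JSON schema; a sketch with the field requirements proposed above:

```python
def check_citation_map(citations):
    """Return a list of problems found in a citation map (structural sketch)."""
    problems = []
    for ref_id, cit in citations.items():
        if cit.get("id") != ref_id:
            problems.append(f"{ref_id}: id mismatch")
        if "URL" not in cit:
            problems.append(f"{ref_id}: missing URL")
        res = cit.get("resolution")
        if not isinstance(res, dict) or "confidence" not in res:
            problems.append(f"{ref_id}: missing resolution.confidence")
    return problems

ok = {"1": {"id": "1", "URL": "https://doi.org/10.1/x", "resolution": {"confidence": 0.9}}}
bad = {"2": {"id": "1"}}
print(check_citation_map(ok))   # []
print(check_citation_map(bad))  # three problems reported
```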
6 changes: 4 additions & 2 deletions pyproject.toml
@@ -3,7 +3,7 @@ requires = ["setuptools>=68", "wheel"]
build-backend = "setuptools.build_meta"

[project]
name = "lit-agent"
name = "url2ref"
version = "0.1.0"
description = "Reference extraction agent for analyzing DeepSearch results"
authors = [{name = "Research Team"}]
@@ -38,11 +38,13 @@ markers = [
    "unit: fast, isolated tests",
    "integration: tests that hit real services or I/O"
]
filterwarnings = [
    "ignore:invalid escape sequence.*docopt.*:SyntaxWarning",
]

[tool.coverage.run]
branch = true
source = ["src/lit_agent"]

[tool.coverage.report]
skip_empty = true

4 changes: 4 additions & 0 deletions src/lit_agent/identifiers/__init__.py
@@ -34,6 +34,8 @@
    extract_identifiers_from_bibliography,
    extract_identifiers_from_url,
    validate_identifier,
    resolve_bibliography,
    CitationResolutionResult,
)

# Demo functionality
@@ -66,6 +68,8 @@
    "extract_identifiers_from_bibliography",
    "extract_identifiers_from_url",
    "validate_identifier",
    "resolve_bibliography",
    "CitationResolutionResult",
    # Demo
    "demo_extraction",
]