raphschlatt/ads-bib

ads-bib

Python 3.12 | License: MIT | Docs | Open in Colab

ads-bib takes a NASA ADS search query and produces a normalized, curated dataset with disambiguated author names (author name disambiguation, AND, via ads-and), topic models (via BERTopic or Toponymy), and citation networks ready for tools such as Gephi, CiteSpace, or VOSviewer. Models can run locally or via API.

Installation

Use uv and Python 3.12.

uv pip install ads-bib
# or: pip install ads-bib

Quick Start

Create a .env file in your project root with the relevant API keys.

ADS_TOKEN=your-ads-token           # required
OPENROUTER_API_KEY=your-key        # only for the openrouter road
HF_TOKEN=your-key                  # for hf_api and local_gpu model access
MODAL_TOKEN_ID=your-modal-id       # only for AND with backend=modal
MODAL_TOKEN_SECRET=your-modal-secret

Where to get each key: ADS user token settings | OpenRouter Keys | Hugging Face Access Tokens | Modal.
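
Before a first run it can help to confirm that the keys a given road needs are actually present. The following is a minimal sketch and not part of ads-bib: parse_env and missing_keys are hypothetical helpers, and the road-to-key mapping is only read off the comments in the .env example above (that local_cpu needs nothing beyond ADS_TOKEN is an assumption).

```python
from pathlib import Path

# Keys per road, inferred from the comments in the .env example.
# This mapping is an assumption, not an official ads-bib table.
REQUIRED_KEYS = {
    "openrouter": ["ADS_TOKEN", "OPENROUTER_API_KEY"],
    "hf_api": ["ADS_TOKEN", "HF_TOKEN"],
    "local_cpu": ["ADS_TOKEN"],
    "local_gpu": ["ADS_TOKEN", "HF_TOKEN"],
}
# Only relevant for AND with backend=modal.
MODAL_KEYS = ["MODAL_TOKEN_ID", "MODAL_TOKEN_SECRET"]


def parse_env(text):
    """Parse simple KEY=value lines, ignoring blanks and '#' comment lines."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop a trailing inline comment such as "value   # required".
        value = value.split(" #", 1)[0].strip()
        values[key.strip()] = value
    return values


def missing_keys(env, road, modal_and=False):
    """Return the keys that the chosen road needs but env lacks or leaves empty."""
    needed = list(REQUIRED_KEYS[road])
    if modal_and:
        needed += MODAL_KEYS
    return [key for key in needed if not env.get(key)]


if __name__ == "__main__":
    env_file = Path(".env")
    env = parse_env(env_file.read_text()) if env_file.exists() else {}
    print("missing for openrouter:", missing_keys(env, "openrouter"))
```

The parser handles only the plain KEY=value form shown above; quoted values or multi-line entries would need a real dotenv parser.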

Then run in your terminal:

ads-bib run --preset openrouter --set search.query='author:"Hawking, S*"'

Author name disambiguation is off by default. Enable the local CPU/GPU path with --set author_disambiguation.enabled=true; use --set author_disambiguation.backend=modal only when your Modal credentials are configured.

Full setup details: Get Started | Runtime Roads

Iterate From a Previous Run

Every run writes config_used.yaml and reusable stage artifacts. To try one change without repeating the whole pipeline, start a variant from that run:

ads-bib run --from-run run_20260407_120000_ads_bib_openrouter \
  --set topic_model.embedding_model=google/gemini-embedding-001

ads-bib loads the previous config, applies the override, chooses the earliest stage that needs recomputation, and writes a new run folder with a variant block in run_summary.yaml. Preview the reuse plan first with --dry-run.
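
When variants are launched from a script, building the command as an argument list avoids shell-quoting problems with dotted keys and quoted queries. This is a rough sketch under stated assumptions: build_run_command is a hypothetical helper, it uses only the flags mentioned in this README (--preset, --from-run, --set, --dry-run), and it assumes --dry-run is passed on the same ads-bib run invocation.

```python
import shlex


def build_run_command(preset=None, from_run=None, overrides=None, dry_run=False):
    """Return an argv list for `ads-bib run` (hypothetical helper, not ads-bib API)."""
    cmd = ["ads-bib", "run"]
    if preset is not None:
        cmd += ["--preset", preset]
    if from_run is not None:
        cmd += ["--from-run", from_run]
    # One --set flag per override, in insertion order.
    for key, value in (overrides or {}).items():
        cmd += ["--set", f"{key}={value}"]
    if dry_run:
        cmd.append("--dry-run")
    return cmd


cmd = build_run_command(
    from_run="run_20260407_120000_ads_bib_openrouter",
    overrides={"topic_model.embedding_model": "google/gemini-embedding-001"},
    dry_run=True,
)
# Shell-quoted form, convenient for logging or copy-pasting.
print(shlex.join(cmd))
# To execute instead of printing: subprocess.run(cmd, check=True)
```

Dropping dry_run=True from the call yields the command for the actual variant run once the previewed reuse plan looks right.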

Python API

import ads_bib

ads_bib.run(
    preset="openrouter",
    query='author:"Hawking, S*"',
)

More examples and the NotebookSession interface: Python API docs

Pick a Runtime Road

Road        Hardware       Network               Cost
openrouter  any            API calls             pay-per-token
hf_api      any            API calls             HF-plan-dependent
local_cpu   CPU only       model downloads only  free after setup
local_gpu   NVIDIA + CUDA  model downloads only  free after setup

Full provider matrix and first-run behavior: Runtime Roads

Output

Each project folder keeps shared caches under data/cache/ and writes every run under runs/<run_id>/:

runs/<run_id>/
├── config_used.yaml
├── run_summary.yaml
├── data/
│   ├── search/        # run-local ADS search result used for export variants
│   ├── export/        # pre-translation publications and references
│   ├── translated/    # translated publications and references
│   ├── tokenized/     # tokenized publications and references
│   ├── and/           # disambiguated frames plus optional ads-and diagnostics
│   ├── dataset/       # final publications, references, topic_info, manifest
│   └── citations/     # GEXF/CSV/JSON networks and WOS export
├── plots/topic_map.html
└── logs/runtime.log
  • data/search|export|translated|tokenized|and/ — run-local stage boundaries used by --from-run variants
  • data/dataset/publications.parquet — cleaned, translated, topic-labeled publications, with disambiguated authors when AND is enabled
  • data/dataset/references.parquet — normalized cited-reference metadata, with disambiguated authors when AND is enabled
  • data/dataset/topic_info.parquet — one row per topic with labels, counts, and representation fields
  • plots/topic_map.html — interactive topic visualization (open in any browser), using datamapplot
  • data/citations/*.gexf — direct citation, co-citation, bibliographic coupling, author co-citation
  • data/citations/download_wos_export.txt — Web of Science format for e.g. CiteSpace / VOSviewer
  • run_summary.yaml — full run metadata, stage status, and optional variant provenance
  • data/dataset/dataset_manifest.json — artifact hashes plus bundle-cleaning provenance
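
For downstream scripts, the layout above can be checked with a few lines of pathlib. The sketch below is illustrative only: latest_run and missing_artifacts are hypothetical helpers, the file list is copied from the bullets above, and it assumes run ids start with run_<YYYYMMDD>_<HHMMSS> as in the single example shown in this README, so that sorting by name also sorts by time.

```python
from pathlib import Path

# Files named in the list above; paths are relative to runs/<run_id>/.
EXPECTED = [
    "config_used.yaml",
    "run_summary.yaml",
    "data/dataset/publications.parquet",
    "data/dataset/references.parquet",
    "data/dataset/topic_info.parquet",
    "data/dataset/dataset_manifest.json",
    "plots/topic_map.html",
]


def latest_run(project_root):
    """Return the newest run folder by name, or None if there are no runs yet."""
    runs_dir = Path(project_root) / "runs"
    runs = sorted(p for p in runs_dir.glob("run_*") if p.is_dir())
    return runs[-1] if runs else None


def missing_artifacts(run_dir):
    """List the expected files that are absent from a run folder."""
    return [rel for rel in EXPECTED if not (Path(run_dir) / rel).is_file()]


if __name__ == "__main__":
    run_dir = latest_run(".")
    if run_dir is None:
        print("no runs found")
    else:
        print(run_dir.name, "missing:", missing_artifacts(run_dir))
```

With pandas and a parquet engine installed, pandas.read_parquet(run_dir / "data/dataset/publications.parquet") should load the publications table. Column names are not listed in this README, so inspect the frame's columns before relying on any of them.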

Figure: Interactive topic map output from author:"Hawking, S*" in datamapplot.

Figure: Author co-citation network output from author:"Hawking, S*" in Gephi Lite.
