ghostcite

PyPI · CI · License: MIT · Python 3.9+

Catch ghost citations — right DOI, wrong author.

ghostcite is a deterministic, no-LLM command-line tool that cross-checks a bibliography's claimed author and year against CrossRef's canonical record for each DOI. It catches the dominant ghost-citation failure mode — a reference whose cited authorship doesn't match the paper the DOI actually points to — and flags retracted or expression-of-concern works along the way.

The problem

LLM-assisted writing (and plain copy-paste drift) routinely produces references that look right but attribute the cited DOI to the wrong authors or year. A manuscript cites "Li et al. 2024," but DOI 10.3390/plants13060869 is actually Chen et al. A reviewer catches it; an automated check catches it first.

Does the metadata you wrote for this citation match what CrossRef says the DOI actually is?

No model, no API key, no download — just CrossRef's REST API and a comparison.

Install

pip install ghostcite          # into the current environment
pipx install ghostcite         # isolated CLI install (recommended)
uv tool install ghostcite      # if you use uv

Usage

ghostcite refs.bib                         # check a BibTeX file (or .md / DOI list)
ghostcite refs.bib --cross-check pubmed    # corroborate against PubMed
ghostcite refs.bib --json                  # machine-readable output (for CI)
ghostcite refs.bib --fail-on author,year,retraction   # tune the CI gate
cat refs.bib | ghostcite -                 # read from stdin

Input format is auto-detected (BibTeX, Markdown reference list, or bare DOI list); override with --format {auto,bibtex,markdown,doi}.
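
As a rough illustration of what auto-detection over these three formats could look like, here is a minimal sketch. The regular expressions, the order of checks, and the `detect_format` name are assumptions made for this example; ghostcite's actual detector may work differently.

```python
import re

# Hypothetical heuristics for illustration only.
BIBTEX_ENTRY = re.compile(r"^\s*@\w+\s*\{", re.MULTILINE)    # e.g. "@article{"
MARKDOWN_BULLET = re.compile(r"^\s*[-*]\s+\S", re.MULTILINE)  # "- ..." or "* ..."
DOI = re.compile(r"10\.\d{4,9}/\S+")                          # bare DOI, doi: prefix, or DOI URL


def detect_format(text: str) -> str:
    """Guess the input format: 'bibtex', 'markdown', or 'doi'."""
    if BIBTEX_ENTRY.search(text):
        return "bibtex"
    if MARKDOWN_BULLET.search(text):
        return "markdown"
    if DOI.search(text):
        return "doi"
    raise ValueError("unrecognized input; pass --format explicitly")
```

Checking BibTeX first keeps a `.bib` file that happens to contain dashes from being mistaken for a Markdown list. When a guess like this is wrong, the documented `--format` flag is the override.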

Real example

refs.bib cites "Li (2024)" for a DOI that CrossRef says is Chen:

$ ghostcite refs.bib
ghostcite: 1 entries, 1 with DOIs
  ✗ A  L1  Li (2024)  →  DOI resolves to Chen (2024) — possibly wrong DOI  [10.3390/plants13060869]
  1 A
$ echo $?
1

All flags & the anatomy of a finding

  ✗ A   L1    Li (2024)        →  DOI resolves to Chen (2024)…   [10.3390/plants13060869]
  │ │   │     │                    │                               │
  │ │   │     │                    │                               └─ DOI that was checked
  │ │   │     │                    └─ what CrossRef actually records
  │ │   │     └─ what you cited (claimed first author + year)
  │ │   └─ source line in your bibliography
  │ └─ tier: A author · B year · C cosmetic · R retraction · U unresolvable
  └─ glyph: ✗ fails CI · ⚠ retraction · · informational
  • --cross-check pubmed — adds PubMed/NCBI as a second source of truth. When PubMed backs CrossRef a finding is annotated ↳ corroborated by PubMed; when PubMed instead agrees with what you cited, it's flagged as a CrossRef↔PubMed conflict (the tier is kept so you don't silently trust either source). PubMed can also raise a finding CrossRef missed, or supply a record for a DOI absent from CrossRef. Optional --ncbi-email / --ncbi-api-key (or NCBI_EMAIL / NCBI_API_KEY) follow NCBI E-utilities etiquette and unlock a higher rate limit; neither is required.
  • --max-rps <n> — cap outbound requests per second. ghostcite already self-throttles to CrossRef's advertised rate limit (read from the response headers); --max-rps lets you be more conservative (the stricter of the two wins).
  • --color {auto,always,never} — colorize the tier glyphs. auto (default) colorizes only on a TTY. NO_COLOR is honored and wins even over always. --json output is never colorized.
  • stdin (-) — pass - as the filename to read from stdin, e.g. cat refs.bib | ghostcite - or ghostcite - --format doi < dois.txt.
  • --dry-run — parse + classify + count only, no network.
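
The "stricter of the two wins" rule for --max-rps can be sketched as below. This assumes CrossRef advertises its limit through the `X-Rate-Limit-Limit` and `X-Rate-Limit-Interval` response headers (for example `50` and `1s`). The `effective_rps` helper and its fallback defaults are hypothetical, not ghostcite's code.

```python
from typing import Mapping, Optional


def effective_rps(headers: Mapping[str, str], max_rps: Optional[float] = None) -> float:
    """Requests per second to use: the stricter of the advertised and user caps."""
    # Fallback values are assumptions for when the headers are missing.
    limit = float(headers.get("X-Rate-Limit-Limit", "50"))
    interval = headers.get("X-Rate-Limit-Interval", "1s")
    seconds = float(interval.rstrip("s") or 1)  # "1s" -> 1.0
    advertised = limit / seconds
    return advertised if max_rps is None else min(advertised, max_rps)
```

A client would then wait at least `1 / effective_rps(...)` seconds between requests. The lookup here is case-sensitive, so with a plain `dict` the header names must match exactly.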

See examples/ for ready-to-run sample inputs and captured output.

How it works

flowchart TD
    A["Citation: claimed author + year (+ DOI)"] --> B{"Has DOI?"}
    B -- yes --> C["GET CrossRef /works/{DOI}"]
    B -- no --> D["CrossRef bibliographic search<br/>(low-confidence)"]
    C --> E{"DOI resolves?"}
    E -- no --> U["Tier U — unresolvable"]
    E -- yes --> F["Compare claimed vs. canonical record"]
    D --> F
    F --> G{"First-author surname matches?"}
    G -- no --> TA["Tier A — author mismatch"]
    G -- yes --> H{"Year matches?"}
    H -- no --> TB["Tier B — year mismatch"]
    H -- yes --> OK["OK"]
    C --> R{"Retracted / expression of concern?"}
    R -- yes --> TR["Tier R — retraction (orthogonal)"]
    F -. "--cross-check pubmed" .-> P["PubMed second opinion"]

No language model is involved at any step. ghostcite resolves each DOI at CrossRef (and optionally PubMed), then does a pure, deterministic comparison of the claimed first-author surname (Unicode-folded, punctuation-stripped) and year against the canonical record, plus a retraction / expression-of-concern check. Only the HTTP client touches the network, via CrossRef's polite pool (a descriptive User-Agent with the project URL, never a personal email).
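
The comparison step can be approximated in a few lines. This is a minimal sketch, not ghostcite's implementation: `fold_surname` and `classify` are hypothetical names, and `record` is assumed to be shaped like the `message` object of a CrossRef `/works/{DOI}` response, with `author[].family` and `issued.date-parts` fields. The retraction check and the PubMed cross-check are left out.

```python
import unicodedata


def fold_surname(name: str) -> str:
    """Lowercase, strip diacritics, and drop punctuation and spaces."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return "".join(ch for ch in no_marks.casefold() if ch.isalnum())


def classify(claimed_author: str, claimed_year: int, record: dict) -> str:
    """Return 'A', 'B', 'C', or 'OK' for one claimed citation."""
    first_family = record["author"][0]["family"]
    year = record["issued"]["date-parts"][0][0]
    if fold_surname(claimed_author) != fold_surname(first_family):
        return "A"  # author mismatch
    if claimed_year != year:
        return "B"  # year mismatch
    if claimed_author.strip().casefold() != first_family.strip().casefold():
        return "C"  # cosmetic: equal only after folding
    return "OK"


record = {"author": [{"family": "Chen"}], "issued": {"date-parts": [[2024]]}}
print(classify("Li", 2024, record))    # A
print(classify("Chen", 2024, record))  # OK
```

Because the function is pure (no network, no randomness), the same inputs always yield the same tier, which is the property that makes a check like this safe to run unattended.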

Severity tiers, input formats & exit codes

| Tier | Meaning | Fails CI? |
|------|---------|-----------|
| A | author-mismatch: claimed first author isn't in CrossRef's authors | Yes |
| B | year-mismatch: author matches, claimed year differs | Yes |
| C | cosmetic: matches only after diacritic/initials fold (Bürger≈Burger) | No (info) |
| R | retraction / expression-of-concern per CrossRef | Yes (fires regardless of A/B/C) |
| U | unresolvable: DOI 404s, or no-DOI entry search was inconclusive | No (warn) |
| OK | first author + year match | No |

When the claimed title also diverges strongly from CrossRef's title, a Tier A finding is annotated "possibly wrong DOI entirely" to distinguish a wrong-author citation from a wrong-DOI one.
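
One plausible way to decide that two titles "diverge strongly" is a normalized string-similarity ratio with a cutoff. The sketch below uses the standard library's `difflib.SequenceMatcher`. The 0.5 threshold and the `titles_diverge` helper are assumptions for illustration, since the measure ghostcite uses is not described here.

```python
from difflib import SequenceMatcher

# Hypothetical cutoff chosen for this example only.
DIVERGENCE_THRESHOLD = 0.5


def normalize_title(title: str) -> str:
    """Case-fold and collapse runs of whitespace."""
    return " ".join(title.casefold().split())


def titles_diverge(claimed: str, canonical: str,
                   threshold: float = DIVERGENCE_THRESHOLD) -> bool:
    """True when the similarity ratio of the two titles falls below the threshold."""
    ratio = SequenceMatcher(
        None, normalize_title(claimed), normalize_title(canonical)
    ).ratio()
    return ratio < threshold
```

In a check like this, a Tier A finding whose titles also diverge would get the "possibly wrong DOI entirely" annotation, while one with near-identical titles points to a wrong byline on the right paper.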

| Format | Detection | Yields claimed author/year? |
|--------|-----------|-----------------------------|
| BibTeX | @article{…} / @…{…} entries | Yes (author, year, doi, title) |
| Markdown | bullet refs - **AuthorList (YYYY).** … 10.x … | Yes |
| DOI list | newline-delimited bare DOIs / doi: / DOI URLs | No (lookup + retraction sweep only) |

| Exit code | Meaning |
|-----------|---------|
| 0 | clean: no findings at or above the fail threshold |
| 1 | findings present at/above the threshold |
| 2 | tool error (network down, unparseable input, …) |

--fail-on (default author,year,retraction) selects which tiers force exit 1; --fail-on none runs as a passive reporter. Tiers C and U never force exit 1.
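
The gate logic described here reduces to a set-membership test. The sketch below is an assumption about how it might be expressed: the mapping from --fail-on names to the tier letters A, B, and R, and the `exit_code` helper, are hypothetical. Exit code 2 for tool errors is not modeled.

```python
from typing import Iterable, Set

# Assumed mapping from --fail-on names to tier letters.
FAIL_ON_TIERS = {"author": "A", "year": "B", "retraction": "R"}


def parse_fail_on(value: str) -> Set[str]:
    """Turn 'author,year' into {'A', 'B'}; 'none' yields an empty set."""
    if value.strip() == "none":
        return set()
    return {FAIL_ON_TIERS[name.strip()] for name in value.split(",")}


def exit_code(finding_tiers: Iterable[str],
              fail_on: str = "author,year,retraction") -> int:
    """Return 1 if any finding's tier is selected by fail_on, else 0."""
    failing = parse_fail_on(fail_on)
    return 1 if any(tier in failing for tier in finding_tiers) else 0


print(exit_code(["A"]))                  # 1
print(exit_code(["C", "U"]))             # 0
print(exit_code(["A"], fail_on="none"))  # 0
```

Tiers C and U have no entry in the mapping, so in this sketch they can never be selected, which matches the rule that they never force exit 1.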

Use it in CI

A clean run is quiet and exits 0.

Drop in the composite GitHub Action:

- uses: musharna/ghostcite@v1
  with:
    paths: paper/refs.bib
    fail-on: "author,year,retraction"

…or the pre-commit hook:

repos:
  - repo: https://github.com/musharna/ghostcite
    rev: v0.1.0
    hooks:
      - id: ghostcite
        args: [paper/references.bib, --fail-on, "author,year,retraction"]

Either way, a finding at or above the --fail-on threshold returns a non-zero exit, blocking the merge or commit before submission.

Scope & limitations

ghostcite checks metadata correctness (does the DOI's record match what you wrote), not claim support (does the source actually say what your prose claims — a separate, LLM-based concern). It does no auto-fixing and no citation-style linting. CrossRef is the source of truth; --cross-check pubmed adds PubMed as an optional second opinion.

  • CrossRef stores particle surnames inconsistently (van der Berg vs Berg), so a correctly cited prefixed surname can occasionally produce a Tier A false positive.
  • No-DOI entries are resolved by best-effort bibliographic search and flagged low-confidence — treat those as hints, not verdicts.
  • Some preprints, datasets, and protocols carry no author metadata in CrossRef and surface as Tier U rather than a mismatch.

Related work & FAQ

ghostcite's niche is deterministic, no-LLM, CLI-first checking focused on the byline-mismatch failure mode (right DOI, wrong author/year) plus retraction flagging — built to run unattended in CI.

| Tool | What it does | How ghostcite differs |
|------|--------------|-----------------------|
| RefChecker | LLM-powered web-search reference validator | ghostcite is no-LLM, deterministic, and CI-safe (no model, no API key) |
| claude-skill-citation-checker | A Claude Code skill for an LLM agent | ghostcite is a standalone CLI + Action; no agent or LLM host needed |
| BibTeX Verifier | In-browser BibTeX checker | ghostcite is scriptable from the CLI and also flags retractions |
| CERCA | Java / AGPL citation checker | ghostcite is Python / MIT / pip install-able |
| scite Reference Check | Commercial, PDF-oriented, retraction focus | ghostcite is free / open-source, BibTeX-native, and catches byline mismatch |
| doimgr | Formats and manages DOIs (doesn't validate) | ghostcite verifies byline and retraction status, not just formatting |

Does it call an LLM? No — a deterministic comparison of the metadata you wrote against CrossRef's (and optionally PubMed's) canonical record. No model, no prompt, no API key required.

Will it hit rate limits? It self-throttles to CrossRef's advertised rate limit (read from the live response headers); use --max-rps to be more conservative.

Does it catch fabricated DOIs? Indirectly — a DOI that 404s at CrossRef surfaces as Tier U. The core check is byline-vs-DOI consistency, so it catches the common case of a real DOI attached to the wrong citation.

License

MIT — see LICENSE.
