Reference Collector automates fetching papers from Google Scholar, enriches them with Crossref metadata, and downloads PDFs (arXiv first, then Sci-Hub), producing a metadata spreadsheet and a missing/failed report. It saves time versus manual downloading by auto-grabbing available PDFs and clearly listing any papers it could not fetch for easy follow-up.
Author: Xuewen Zhang ([email protected], 2026).
- I. How to use
- II. What happens under the hood
- III. Source tree
- IV. ArXiv-aware downloading
- V. Troubleshooting
- VI. Note
- Release version (UI installer)
- Acknowledgment
- Citation
- License
## I. How to use

Install the dependencies:

```bash
python -m pip install -r requirements.txt
```

or go to the Releases page and download the software installer.
- From a Scholar profile (auto-paginates, pagesize 100):

  ```bash
  python main.py --profile-url "google scholar profile page url" --workdir _results/demo
  ```

- From a "cited by" search:

  ```bash
  python main.py --cited-url "google scholar paper 'cited by' page url" --workdir _results/cited_demo
  ```

- Skip PDF downloads (metadata only):

  ```bash
  python main.py --profile-url "..." --no-download
  ```
- Helpful flags: `--max-pages`, `--limit`, `--mailto [email protected]`, `--proxy http://user:pass@host:port`.
- Example screenshot:
If you already have titles/DOIs in Excel/CSV:

```bash
python main.py --from-xlsx /path/to/metadata.xlsx --workdir _results/from_sheet
```

Google Scholar may return captchas. Reuse your solved browser session:
- Open Scholar and solve the captcha.
- Install the Get cookies.txt extension (Chrome/Edge).
- With the Scholar tab active, open the extension, click Export, and save the file as `cookies.txt`.
- Run with:

  ```bash
  python main.py --profile-url "..." --cookies-file cookies.txt
  ```

Tips: keep `cookies.txt` private; it is already in Netscape format; ensure the entries include `scholar.google.com`.
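For reference, here is a minimal sketch of how a Netscape-format `cookies.txt` can be loaded into a `requests` session; the function name and the profile URL are illustrative, not the tool's actual internals.

```python
from http.cookiejar import MozillaCookieJar

import requests


def scholar_session(cookies_path: str = "cookies.txt") -> requests.Session:
    """Build a requests session preloaded with the exported Scholar cookies."""
    jar = MozillaCookieJar(cookies_path)
    jar.load(ignore_discard=True, ignore_expires=True)  # keep session cookies too
    session = requests.Session()
    session.cookies = jar  # requests accepts any cookielib-compatible jar
    session.headers["User-Agent"] = "Mozilla/5.0"  # plain browser-like UA
    return session


# e.g. scholar_session().get("https://scholar.google.com/citations?user=...", timeout=30)
```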
## II. What happens under the hood

- Scrape Scholar (profile or cited-by) with delays and cookies/proxy support.
- Crossref enrichment with strict full-title matching (no short-title shortcuts) for DOI/venue/publisher URL; see the sketch after this list.
- Metadata sheet: `<workdir>/metadata.xlsx` with columns `year, title, doi, publisher_link, venue, venue_abbrev, short_title, target_name, source_link`.
- Download PDFs:
  - First try the direct arXiv PDF (`https://arxiv.org/pdf/<id>.pdf`) when the DOI indicates arXiv.
  - Otherwise try Sci-Hub mirrors; if uncertain or blocked, mark as missing (no guessing).
- Report: `<workdir>/missing_or_failed.md` lists missing DOIs and failed downloads with source links.
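The strict-matching idea can be illustrated with a small sketch. Assumptions: the `requests` dependency, the public Crossref `/works` API, and a 0.90 similarity threshold (as mentioned under Troubleshooting); `find_doi` is an illustrative name, not the module's actual function.

```python
from difflib import SequenceMatcher

import requests


def find_doi(title: str, mailto: str) -> str | None:
    """Return a DOI only when a Crossref hit matches the full title closely."""
    resp = requests.get(
        "https://api.crossref.org/works",
        params={"query.title": title, "rows": 3, "mailto": mailto},
        timeout=30,
    )
    resp.raise_for_status()
    for item in resp.json().get("message", {}).get("items", []):
        candidate = (item.get("title") or [""])[0]
        score = SequenceMatcher(None, title.lower(), candidate.lower()).ratio()
        if score >= 0.90:  # strict: near-identical full titles only
            return item["DOI"]
    return None  # unmatched items stay DOI-empty
```

Titles that never reach the threshold simply return `None`, which is how rows end up with an empty `doi` column in `metadata.xlsx`.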
## III. Source tree

```
ref_collector/
├── README.md
├── main.py
├── requirements.txt
├── ref_collector/
│   ├── workflow.py             # Orchestrates scrape → enrich → download → report
│   ├── scholar_scraper.py      # Google Scholar profile / cited-by scraper
│   ├── metadata_enricher.py    # Crossref lookup + strict title matching
│   ├── naming.py               # Safe filenames, venue abbrev, title cleaning
│   └── scihub_pdf_collector/
│       └── download.py         # ArXiv-first, Sci-Hub-fallback PDF downloader + report
└── _results/                   # created at runtime
```
## IV. ArXiv-aware downloading

- DOIs like `10.48xxx/arXiv.1701.03xxx` trigger a direct fetch from `https://arxiv.org/pdf/1701.03xxx.pdf` (see the sketch below).
- If that fails, the task falls back to Sci-Hub; if still unsuccessful, the item is reported as missing (no mismatched PDFs).
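A minimal sketch of that rule, assuming arXiv-style DOIs carry an `arXiv.<id>` suffix; the function name and the example DOI are illustrative, and the real logic lives in `scihub_pdf_collector/download.py`.

```python
import re


def arxiv_pdf_url(doi: str) -> str | None:
    """Map an arXiv-style DOI to its direct PDF URL, or None if it is not arXiv."""
    match = re.match(r"10\.\d{4,9}/arxiv\.(.+)$", doi, flags=re.IGNORECASE)
    if match:
        return f"https://arxiv.org/pdf/{match.group(1)}.pdf"
    return None  # not an arXiv DOI: the caller would fall back to Sci-Hub


# e.g. arxiv_pdf_url("10.48550/arXiv.1701.03400") -> "https://arxiv.org/pdf/1701.03400.pdf"
```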
## V. Troubleshooting

- Scholar captcha: re-export `cookies.txt`, reduce `--pagesize`, increase `--scrape-delay`.
- Wrong DOI risk: strict title matching (≥ 0.90 similarity) prevents near misses; unmatched items stay DOI-empty.
- No PDF: check `missing_or_failed.md` for source links; download manually if needed.
- Proxies: apply to all stages with `--proxy http://...` (a minimal sketch follows this list).
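As a rough illustration of what "all stages" means in practice, a single proxy URL can be attached to every `requests` session the tool opens (Scholar, Crossref, PDF fetches); the helper below is illustrative, not the actual CLI plumbing.

```python
import requests


def apply_proxy(session: requests.Session, proxy_url: str) -> requests.Session:
    """Route both HTTP and HTTPS traffic of this session through one proxy."""
    session.proxies.update({"http": proxy_url, "https": proxy_url})
    return session


# e.g. apply_proxy(requests.Session(), "http://user:pass@host:port")
```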
## VI. Note

Note: metadata and downloads are automatically generated from Google Scholar + Crossref + Sci-Hub/arXiv. Wrong or mismatched papers can still slip through; please review results before critical use (e.g., publications or regulatory submissions).
Note: Downloads rely on Sci-Hub/arXiv. Recent papers (often post-2022) may be unavailable; expect higher failure rates for newly published or paywalled articles.
## Release version (UI installer)

- Download the latest installer from the Releases page (it creates Start Menu/Desktop shortcuts).
- Run the installer, then launch Reference Collector UI. The app uses the Forest ttk theme by RDBende (MIT) for a clean green look.
- First launch will check for Python dependencies; allow the prompt to auto-install from `requirements.txt`.
- Basic UI flow: paste a Scholar Profile/Cited-by URL, optionally tick Load metadata (xlsx/csv) to pick a sheet, choose an Output path and an optional `cookies.txt`, set a limit, and click Run. The colored log panel mirrors the CLI output.
- Placeholder for UI screenshot (replace with your capture):
## Acknowledgment

Developed and maintained by Xuewen Zhang ([email protected], 2026).
UI theme: Forest ttk theme by RDBende (MIT).

## Citation

If this tool helps your work, please cite the project and the associated paper you are using it for.

## License

Apache 2.0; see LICENSE.



