Scholar to PDF: Automated Reference Collector

Architecture

Reference Collector automates fetching papers from a Google Scholar profile or cited-by page, enriches them with Crossref metadata, and downloads PDFs (arXiv first, then Sci-Hub), producing a metadata spreadsheet and a missing/failed report. Compared with manual downloading, it saves time by auto-grabbing every available PDF and clearly listing the papers it couldn't fetch for easy follow-up.

Author: Xuewen Zhang ([email protected], 2026).


Contents


I. How to use

1. Install required packages

python -m pip install -r requirements.txt

Alternatively, go to the Releases page and download the software installer.

2. Run the pipeline (Scholar → Crossref → PDFs)

  • From a Scholar profile (auto-paginates, pagesize 100):
python main.py --profile-url "google scholar profile page url" --workdir _results/demo
  • From a "cited by" search:
python main.py --cited-url "google scholar paper 'cited by' page url" --workdir _results/cited_demo
  • Skip PDF downloads (metadata only):
python main.py --profile-url "..." --no-download
  • Helpful flags: --max-pages, --limit, --mailto [email protected], --proxy http://user:pass@host:port.

  • Example screenshot:


3. Start from an existing metadata sheet

If you already have titles/DOIs in Excel/CSV:

python main.py --from-xlsx /path/to/metadata.xlsx --workdir _results/from_sheet
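
If you need to generate such a sheet programmatically, a minimal sketch using pandas is shown below. The column names (year, title, doi) are assumed from the metadata schema described in section II; check them against what your copy of the pipeline actually expects.

# Hypothetical helper: write a minimal metadata sheet for --from-xlsx.
# Columns are assumed from the schema in section II; a DOI may be left empty.
import pandas as pd

rows = [
    {"year": 2017, "title": "Example Paper Title One", "doi": "10.1000/example.123"},
    {"year": 2021, "title": "Example Paper Title Two", "doi": ""},  # DOI unknown
]

pd.DataFrame(rows).to_excel("metadata.xlsx", index=False)  # requires openpyxl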

4. Export cookies when Scholar blocks

Google Scholar may return captchas. Reuse your solved browser session:

  1. Open Scholar, solve the captcha.
  2. Install the Get cookies.txt extension (Chrome/Edge).
  3. With the Scholar tab active, open the extension → Export → save as cookies.txt.
  4. Run with:
python main.py --profile-url "..." --cookies-file cookies.txt

Tips: keep cookies.txt private; it is already in Netscape format; make sure its entries include scholar.google.com.
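
For reference, the sketch below shows how a Netscape-format cookies.txt is typically loaded into a Python requests session; the project's own scraper may load it differently.

# Minimal sketch: reuse the exported Scholar cookies in a requests session.
# Illustrative only; not the project's actual internals.
from http.cookiejar import MozillaCookieJar
import requests

jar = MozillaCookieJar("cookies.txt")
jar.load(ignore_discard=True, ignore_expires=True)

session = requests.Session()
session.cookies.update(jar)  # copy the solved-captcha cookies into the session

resp = session.get("https://scholar.google.com/citations?user=...")  # placeholder profile URL
print(resp.status_code)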


II. What happens under the hood

  1. Scrape Scholar (profile or cited-by) with delays and cookies/proxy support.
  2. Crossref enrichment with strict full-title matching (no short-title shortcuts) for DOI/venue/publisher URL (a matching sketch follows this list).
  3. Metadata sheet: <workdir>/metadata.xlsx columns
    year, title, doi, publisher_link, venue, venue_abbrev, short_title, target_name, source_link.
  4. Download PDFs
    • First try direct arXiv PDF (https://arxiv.org/pdf/<id>.pdf) when DOI indicates arXiv.
    • Otherwise try Sci-Hub mirrors; if uncertain or blocked, mark as missing (no guessing).
  5. Report: <workdir>/missing_or_failed.md lists missing DOIs and failed downloads with source links.
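
To illustrate the strict title matching from step 2, here is one way it could be implemented with difflib and the 0.90 similarity threshold mentioned under Troubleshooting; the actual metadata_enricher.py may normalise and compare titles differently.

# Illustrative only: accept a Crossref hit only when titles are >= 0.90 similar.
from difflib import SequenceMatcher

def titles_match(scholar_title, crossref_title, threshold=0.90):
    a = " ".join(scholar_title.lower().split())
    b = " ".join(crossref_title.lower().split())
    return SequenceMatcher(None, a, b).ratio() >= threshold

print(titles_match("Deep learning", "Deep learning"))                 # True
print(titles_match("Deep learning", "Deep reinforcement learning"))   # False (near miss rejected)

Anything below the threshold keeps an empty DOI rather than a near-miss guess.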

Report screenshot:

---

III. Source tree

ref_collector/
├── README.md
├── main.py
├── requirements.txt
├── ref_collector/
│   ├── workflow.py              # Orchestrates scrape → enrich → download → report
│   ├── scholar_scraper.py       # Google Scholar profile / cited-by scraper
│   ├── metadata_enricher.py     # Crossref lookup + strict title matching
│   ├── naming.py                # Safe filenames, venue abbrev, title cleaning
│   └── scihub_pdf_collector/
│       └── download.py          # ArXiv-first, Sci-Hub-fallback PDF downloader + report
└── _results/ (created at runtime)

IV. ArXiv-aware downloading

  • DOIs like 10.48xxx/arXiv.1701.03xxx trigger a direct fetch from https://arxiv.org/pdf/1701.03xxx.pdf (see the sketch after this list).
  • If that fails, the task falls back to Sci-Hub; if still unsuccessful, the item is reported as missing (no mismatched PDFs).
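
A rough sketch of that arXiv-first step follows; the DOI pattern and parsing here are illustrative assumptions (modern-style ids only), and the actual download.py may handle more cases.

# Illustrative only: derive a direct arXiv PDF URL from an arXiv-style DOI.
import re
from typing import Optional

def arxiv_pdf_url(doi: str) -> Optional[str]:
    m = re.search(r"arxiv\.(\d{4}\.\d{4,5})(v\d+)?$", doi, flags=re.IGNORECASE)
    return f"https://arxiv.org/pdf/{m.group(1)}.pdf" if m else None

print(arxiv_pdf_url("10.48550/arXiv.1701.03980"))   # https://arxiv.org/pdf/1701.03980.pdf
print(arxiv_pdf_url("10.1038/s41586-020-2649-2"))   # None -> fall back to Sci-Hub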

V. Troubleshooting

  • Scholar captcha: re-export cookies.txt, reduce --pagesize, increase --scrape-delay.
  • Wrong DOI risk: strict title matching (≥0.90 similarity) prevents near misses; unmatched items stay DOI-empty.
  • No PDF: check missing_or_failed.md for source links; manually download if needed.
  • Proxies: apply to all stages with --proxy http://... (see the sketch below).
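
The sketch below shows how a single --proxy value usually maps onto requests' proxies setting, which is presumably how it reaches every stage; the project's own plumbing may differ.

# Illustrative: one proxy URL applied to both HTTP and HTTPS traffic.
import requests

proxy = "http://user:pass@host:port"   # same format as the --proxy flag
session = requests.Session()
session.proxies.update({"http": proxy, "https": proxy})
# Scholar, Crossref and Sci-Hub/arXiv requests made through this session
# would all be routed via the proxy.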

VI. Note

Note: metadata and downloads are automatically generated from Google Scholar + Crossref + Sci-Hub/arXiv. Wrong/mismatched papers can still slip through; please review results before critical use (e.g., publications or regulatory submissions).

Note: Downloads rely on Sci-Hub/arXiv. Recent papers (often post-2022) may be unavailable; expect higher failure rates for newly published or paywalled articles.


Release version (UI installer)

  • Download the latest installer from the Releases page (it creates Start Menu/Desktop shortcuts).
  • Run the installer, then launch Reference Collector UI. The app uses the Forest ttk theme by RDBende (MIT) for a clean green look.
  • First launch will check for Python deps; allow the prompt to auto-install from requirements.txt.
  • Basic UI flow: paste a Scholar Profile/Cited-by URL, optionally tick Load metadata (xlsx/csv) to pick a sheet, choose an Output path, optionally point to a cookies.txt, set a limit, and click Run. A colored log panel mirrors the CLI output.
  • Placeholder for UI screenshot (replace with your capture):



Acknowledgment

Developed and maintained by Xuewen Zhang ([email protected], 2026).
UI theme: Forest ttk theme by RDBende (MIT).

Citation

If this tool helps your work, please cite the project and the associated paper you are using it for.

License

Apache 2.0; see LICENSE.