Scholar to PDF: Automated Reference Collector

Architecture

Reference Collector automates fetching papers from a Google Scholar profile or cited-by page, enriches them with Crossref metadata, and downloads PDFs (arXiv first, then Sci-Hub), producing a metadata spreadsheet and a missing/failed report. Compared with manual downloading, it saves time by auto-grabbing every available PDF and clearly listing the papers it couldn't fetch for easy follow-up.

Author: Xuewen Zhang ([email protected], 2026).


Contents


I. How to use

1. Install required packages

python -m pip install -r requirements.txt

Alternatively, go to the Releases page and download the software installer.

2. Run the pipeline (Scholar → Crossref → PDFs)

  • From a Scholar profile (auto-paginates, pagesize 100):
python main.py --profile-url "google scholar profile page url" --workdir _results/demo
  • From a "cited by" search:
python main.py --cited-url "google scholar paper 'cited by' page url" --workdir _results/cited_demo
  • Skip PDF downloads (metadata only):
python main.py --profile-url "..." --no-download
  • Helpful flags: --max-pages, --limit, --mailto [email protected], --proxy http://user:pass@host:port.

  • Example screenshot:


3. Start from an existing metadata sheet

If you already have titles/DOIs in Excel/CSV:

python main.py --from-xlsx /path/to/metadata.xlsx --workdir _results/from_sheet
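
If you need to generate such a sheet programmatically, a minimal sketch using pandas is shown below. The column names (year, title, doi) are assumed from the metadata schema described in section II; check them against what your copy of the pipeline actually expects.

# Hypothetical helper: write a minimal metadata sheet for --from-xlsx.
# Columns are assumed from the schema in section II; a DOI may be left empty.
import pandas as pd

rows = [
    {"year": 2017, "title": "Example Paper Title One", "doi": "10.1000/example.123"},
    {"year": 2021, "title": "Example Paper Title Two", "doi": ""},  # DOI unknown
]

pd.DataFrame(rows).to_excel("metadata.xlsx", index=False)  # requires openpyxl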

4. Export cookies when Scholar blocks

Google Scholar may return captchas. Reuse your solved browser session:

  1. Open Scholar, solve the captcha.
  2. Install the Get cookies.txt extension (Chrome/Edge).
  3. With the Scholar tab active, open the extension → Export → save as cookies.txt.
  4. Run with:
python main.py --profile-url "..." --cookies-file cookies.txt

Tips: keep cookies.txt private; it is already in Netscape format; make sure its entries include scholar.google.com.
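
For reference, the sketch below shows how a Netscape-format cookies.txt is typically loaded into a Python requests session; the project's own scraper may load it differently.

# Minimal sketch: reuse the exported Scholar cookies in a requests session.
# Illustrative only; not the project's actual internals.
from http.cookiejar import MozillaCookieJar
import requests

jar = MozillaCookieJar("cookies.txt")
jar.load(ignore_discard=True, ignore_expires=True)

session = requests.Session()
session.cookies.update(jar)  # copy the solved-captcha cookies into the session

resp = session.get("https://scholar.google.com/citations?user=...")  # placeholder profile URL
print(resp.status_code)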


II. What happens under the hood

  1. Scrape Scholar (profile or cited-by) with delays and cookies/proxy support.
  2. Crossref enrichment with strict full-title matching (no short-title shortcuts) for DOI/venue/publisher URL (a matching sketch follows this list).
  3. Metadata sheet: <workdir>/metadata.xlsx columns
    year, title, doi, publisher_link, venue, venue_abbrev, short_title, target_name, source_link.
  4. Download PDFs
    • First try direct arXiv PDF (https://arxiv.org/pdf/<id>.pdf) when DOI indicates arXiv.
    • Otherwise try Sci-Hub mirrors; if uncertain or blocked, mark as missing (no guessing).
  5. Report: <workdir>/missing_or_failed.md lists missing DOIs and failed downloads with source links.
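
To illustrate the strict title matching from step 2, here is one way it could be implemented with difflib and the 0.90 similarity threshold mentioned under Troubleshooting; the actual metadata_enricher.py may normalise and compare titles differently.

# Illustrative only: accept a Crossref hit only when titles are >= 0.90 similar.
from difflib import SequenceMatcher

def titles_match(scholar_title, crossref_title, threshold=0.90):
    a = " ".join(scholar_title.lower().split())
    b = " ".join(crossref_title.lower().split())
    return SequenceMatcher(None, a, b).ratio() >= threshold

print(titles_match("Deep learning", "Deep learning"))                 # True
print(titles_match("Deep learning", "Deep reinforcement learning"))   # False (near miss rejected)

Anything below the threshold keeps an empty DOI rather than a near-miss guess.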

Report screenshot:

---

III. Source tree

ref_collector/
├── README.md
├── main.py
├── requirements.txt
├── ref_collector/
│   ├── workflow.py              # Orchestrates scrape → enrich → download → report
│   ├── scholar_scraper.py       # Google Scholar profile / cited-by scraper
│   ├── metadata_enricher.py     # Crossref lookup + strict title matching
│   ├── naming.py                # Safe filenames, venue abbrev, title cleaning
│   └── scihub_pdf_collector/
│       └── download.py          # ArXiv-first, Sci-Hub-fallback PDF downloader + report
└── _results/ (created at runtime)

IV. ArXiv-aware downloading

  • DOIs like 10.48xxx/arXiv.1701.03xxx trigger a direct fetch from https://arxiv.org/pdf/1701.03xxx.pdf (see the sketch after this list).
  • If that fails, the task falls back to Sci-Hub; if still unsuccessful, the item is reported as missing (no mismatched PDFs).
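
A rough sketch of that arXiv-first step follows; the DOI pattern and parsing here are illustrative assumptions (modern-style ids only), and the actual download.py may handle more cases.

# Illustrative only: derive a direct arXiv PDF URL from an arXiv-style DOI.
import re
from typing import Optional

def arxiv_pdf_url(doi: str) -> Optional[str]:
    m = re.search(r"arxiv\.(\d{4}\.\d{4,5})(v\d+)?$", doi, flags=re.IGNORECASE)
    return f"https://arxiv.org/pdf/{m.group(1)}.pdf" if m else None

print(arxiv_pdf_url("10.48550/arXiv.1701.03980"))   # https://arxiv.org/pdf/1701.03980.pdf
print(arxiv_pdf_url("10.1038/s41586-020-2649-2"))   # None -> fall back to Sci-Hub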

V. Troubleshooting

  • Scholar captcha: re-export cookies.txt, reduce --pagesize, increase --scrape-delay.
  • Wrong DOI risk: strict title matching (≥0.90 similarity) prevents near misses; unmatched items stay DOI-empty.
  • No PDF: check missing_or_failed.md for source links; manually download if needed.
  • Proxies: apply to all stages with --proxy http://... (see the sketch below).
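
The sketch below shows how a single --proxy value usually maps onto requests' proxies setting, which is presumably how it reaches every stage; the project's own plumbing may differ.

# Illustrative: one proxy URL applied to both HTTP and HTTPS traffic.
import requests

proxy = "http://user:pass@host:port"   # same format as the --proxy flag
session = requests.Session()
session.proxies.update({"http": proxy, "https": proxy})
# Scholar, Crossref and Sci-Hub/arXiv requests made through this session
# would all be routed via the proxy.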

VI. Note

Note: metadata and downloads are automatically generated from Google Scholar + Crossref + Sci-Hub/arXiv. Wrong/mismatched papers can still slip through; please review results before critical use (e.g., publications or regulatory submissions).

Note: Downloads rely on Sci-Hub/arXiv. Recent papers (often post-2022) may be unavailable; expect higher failure rates for newly published or paywalled articles.


Release version (UI installer)

  • Download the latest installer from the Releases page (it creates Start Menu/Desktop shortcuts).
  • Run the installer, then launch Reference Collector UI. The app uses the Forest ttk theme by RDBende (MIT) for a clean green look.
  • First launch will check for Python deps; allow the prompt to auto-install from requirements.txt.
  • Basic UI flow: paste a Scholar Profile/Cited-by URL, optionally tick Load metadata (xlsx/csv) to pick a sheet, choose an Output path, optionally point to a cookies.txt, set a limit, and click Run. A colored log panel mirrors the CLI output.
  • Placeholder for UI screenshot (replace with your capture):



Acknowledgment

Developed and maintained by Xuewen Zhang ([email protected], 2026).
UI theme: Forest ttk theme by RDBende (MIT).

Citation

If this tool helps your work, please cite the project and the associated paper you are using it for.

License

Apache 2.0; see LICENSE.