An intelligent Python tool to extract and catalog software repositories from JOSS published papers
Features • Quick Start • Output • Usage • Statistics
# 1. Open in Codespaces (click badge above)
# 2. Install uv
pip install uv
# 3. Create the virtual environment
uv venv
# 4. Activate the environment (Linux)
source .venv/bin/activate
# Run only if requirements.txt is present
uv pip install -r requirements.txt
# Run only if requirements.txt is absent
uv pip install requests beautifulsoup4
python joss_extractor.py
python helmholtzRSD_extractor.py
Each script generates a timestamped CSV file of software repositories:
software_repository
"https://github.com/example/awesome-tool"
"https://gitlab.com/research/data-analyzer"
"https://codeberg.org/dev/ml-framework"
joss_repositories_YYYYMMDD_HHMMSS.csv
Helmholtz_software_repositories_YYYYMMDD_HHMMSS.csv
Example: joss_repositories_20250805_143022.csv
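The naming scheme above can be reproduced with the standard library. This is a minimal sketch, not the code from the scripts themselves; the helper name `timestamped_filename` is hypothetical.

```python
from datetime import datetime
from typing import Optional


def timestamped_filename(prefix: str, now: Optional[datetime] = None) -> str:
    """Return '<prefix>_YYYYMMDD_HHMMSS.csv' for the given (or current) time."""
    now = now or datetime.now()
    return f"{prefix}_{now.strftime('%Y%m%d_%H%M%S')}.csv"


# Reproduces the example file name shown above.
print(timestamped_filename("joss_repositories", datetime(2025, 8, 5, 14, 30, 22)))
# joss_repositories_20250805_143022.csv
```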
python joss_extractor.py
python helmholtzRSD_extractor.py
JOSS Papers Data Extractor
==================================================
Started at: 2025-08-05 14:30:15
Fetching page 1/156...
Retrieved 20 papers (Total: 20)
Fetching page 2/156...
Retrieved 20 papers (Total: 40)
...
============================================================
EXTRACTION SUMMARY
============================================================
Total papers processed: 3,111
Records written to CSV: 3,089
Papers without repositories: 22
Repository coverage: 99.3%
Output file: joss_repositories_20250805_143022.csv
Extraction completed at: 2025-08-05 14:32:18
VERIFICATION:
Processed 3,111 papers from API
Wrote 3,089 repository URLs to CSV
Data integrity: 3,089 + 22 = 3,111
Total execution time: 123.4 seconds
Metric | Typical Value |
---|---|
Total Papers | ~3,100+ |
Repository Coverage | ~99% |
Execution Time | 2-5 minutes |
Output Size | ~200KB |
API Pages | ~156 pages |
- Python 3.6+
- requests library
- Internet connection
- Base URL: https://joss.theoj.org/papers/published.json
- Pagination: 20 records per page
- Total Pages: ~156 pages
- Rate Limiting: 100ms delay between requests
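The pagination and rate-limiting parameters above could be combined roughly as follows. This is a sketch under stated assumptions rather than the script's actual code: it assumes the endpoint accepts a `page` query parameter, returns a JSON list per page, and returns an empty list past the last page. The names `fetch_page` and `iter_papers` are hypothetical.

```python
import time
from typing import Callable, Dict, Iterator, List

BASE_URL = "https://joss.theoj.org/papers/published.json"
DELAY_SECONDS = 0.1  # 100 ms pause between requests


def fetch_page(page: int) -> List[Dict]:
    """Fetch one page of papers (assumes a `page` query parameter)."""
    # Imported here so the pagination loop can be exercised without the dependency.
    import requests

    response = requests.get(BASE_URL, params={"page": page}, timeout=30)
    response.raise_for_status()
    return response.json()


def iter_papers(
    fetch: Callable[[int], List[Dict]] = fetch_page,
    delay: float = DELAY_SECONDS,
) -> Iterator[Dict]:
    """Yield papers page by page until a page comes back empty (assumed stop condition)."""
    page = 1
    while True:
        papers = fetch(page)
        if not papers:
            break
        yield from papers
        page += 1
        time.sleep(delay)  # stay polite to the API
```

Passing the fetch function as a parameter keeps the loop testable offline: a stub such as `lambda page: []` can stand in for the network call.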
- Fetch all pages from JOSS API
- Filter papers with valid repository URLs
- Format URLs with explicit quotes
- Export to timestamped CSV file
- Verify data integrity
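The filter, format, export, and verify steps above can be sketched as below. This is an illustration, not the project's implementation: it assumes each paper record carries a `software_repository` field (matching the CSV header) and treats a URL as valid when it starts with `http://` or `https://`. The function names are hypothetical.

```python
import csv
import io
from typing import Dict, Iterable, List, TextIO, Tuple


def extract_repositories(papers: Iterable[Dict]) -> Tuple[List[str], int]:
    """Return (repository URLs, count of papers without a usable URL)."""
    urls: List[str] = []
    missing = 0
    for paper in papers:
        url = (paper.get("software_repository") or "").strip()
        if url.startswith(("http://", "https://")):  # assumed validity rule
            urls.append(url)
        else:
            missing += 1
    return urls, missing


def write_csv(handle: TextIO, urls: Iterable[str]) -> int:
    """Write an unquoted header plus one quoted URL per row; return rows written."""
    handle.write("software_repository\n")
    writer = csv.writer(handle, quoting=csv.QUOTE_ALL, lineterminator="\n")
    written = 0
    for url in urls:
        writer.writerow([url])
        written += 1
    return written


papers = [
    {"software_repository": "https://github.com/example/awesome-tool"},
    {"software_repository": ""},
    {"software_repository": None},
]
urls, missing = extract_repositories(papers)
buffer = io.StringIO()
written = write_csv(buffer, urls)

# Data integrity: rows written plus papers skipped must equal papers processed.
assert written + missing == len(papers)
print(buffer.getvalue(), end="")
```

To write to disk instead of an in-memory buffer, pass a handle opened with `open(path, "w", newline="", encoding="utf-8")`.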
This project was generated with the assistance of Claude AI. Contributions are welcome!
- Fork the repository
- Create your feature branch (git checkout -b feature/AmazingFeature)
- Commit your changes (git commit -m 'Add some AmazingFeature')
- Push to the branch (git push origin feature/AmazingFeature)
- Open a Pull Request
- [TODO: Fix license extraction logic for non-GitHub repos]
This project is open source and available under the MIT License.
- JOSS - For providing the excellent API
- Claude AI - For assisting in code generation
- GitHub Codespaces - For seamless development environment