priya-gitTest/JOSS_SoftwareRepositoryExtractor

🚀 JOSS + Helmholtz (RSD) Software Repository Extractor


A Python tool that extracts and catalogs software repository URLs from papers published in JOSS and from the Helmholtz Research Software Directory (RSD).

🎯 Features β€’ πŸš€ Quick Start β€’ πŸ“Š Output β€’ πŸ› οΈ Usage β€’ πŸ“ˆ Statistics


🎯 Features

⚑ Fast & Efficient

  • Scans all 3,100+ published papers in minutes (2 mins approx)
  • Rate-limited API calls to respect server
  • Progress tracking with real-time updates

🎯 Smart Extraction

  • Filters out papers without repositories
  • Handles edge cases and malformed URLs
  • Comprehensive error handling

📊 Detailed Analytics

  • Processing vs output record counts
  • Repository coverage statistics
  • Data integrity verification

📁 Professional Output

  • Timestamped CSV files
  • Quoted URL format
  • UTF-8 encoding support

🚀 Quick Start

🐙 GitHub Codespaces (Recommended)

Open in GitHub Codespaces

# 1. Open in Codespaces (click the badge above)
# 2. Install uv
pip install uv
# 3. Create the virtual environment
uv venv
# 4. Activate the environment (Linux)
source .venv/bin/activate

5. Install packages

# If requirements.txt is present
uv pip install -r requirements.txt
# If requirements.txt is absent
uv pip install requests beautifulsoup4

6. Run the JOSS extractor

python joss_extractor.py

7. Run the Helmholtz(RSD) extractor

python helmholtzRSD_extractor.py

📊 Output

Each script writes a timestamped CSV file with a single software_repository column and one quoted repository URL per row:

software_repository
"https://github.com/example/awesome-tool"
"https://gitlab.com/research/data-analyzer"
"https://codeberg.org/dev/ml-framework"
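
The quoted-URL layout shown above can be produced with the standard csv module. The following is a minimal sketch, not the project's actual code: the function name write_repository_csv is hypothetical, and it assumes the header is written bare while every URL is wrapped in double quotes, as in the sample.

```python
import csv
import io


def write_repository_csv(urls, handle):
    """Write a bare header line, then one double-quoted URL per row."""
    # Assumption: the header stays unquoted, matching the sample output.
    handle.write("software_repository\n")
    # QUOTE_ALL forces quotes around every field, even when not strictly needed.
    writer = csv.writer(handle, quoting=csv.QUOTE_ALL, lineterminator="\n")
    for url in urls:
        writer.writerow([url])


# Demonstration against an in-memory buffer instead of a real file.
buffer = io.StringIO()
write_repository_csv(
    [
        "https://github.com/example/awesome-tool",
        "https://gitlab.com/research/data-analyzer",
    ],
    buffer,
)
csv_text = buffer.getvalue()
```

When writing to disk, opening the file with open(path, "w", newline="", encoding="utf-8") keeps line endings and UTF-8 encoding consistent across platforms.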

📁 File Naming Convention

joss_repositories_YYYYMMDD_HHMMSS.csv
Helmholtz_software_repositories_YYYYMMDD_HHMMSS.csv

Example: joss_repositories_20250805_143022.csv
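
A file name in this pattern can be built with datetime.strftime. This is a small illustrative sketch; the helper name timestamped_filename is hypothetical and not taken from the scripts.

```python
from datetime import datetime


def timestamped_filename(prefix, now=None):
    """Return '<prefix>_YYYYMMDD_HHMMSS.csv' for the given (or current) time."""
    now = now or datetime.now()
    return f"{prefix}_{now.strftime('%Y%m%d_%H%M%S')}.csv"


# Reproduces the example name above for a fixed point in time.
example_name = timestamped_filename(
    "joss_repositories", datetime(2025, 8, 5, 14, 30, 22)
)
```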

🛠️ Usage

Basic Usage

python joss_extractor.py
python helmholtzRSD_extractor.py

Expected Output

🚀 JOSS Papers Data Extractor
==================================================
🕒 Started at: 2025-08-05 14:30:15

Fetching page 1/156...
  → Retrieved 20 papers (Total: 20)
Fetching page 2/156...
  → Retrieved 20 papers (Total: 40)
...

============================================================
📊 EXTRACTION SUMMARY
============================================================
📥 Total papers processed: 3,111
📝 Records written to CSV: 3,089
❌ Papers without repositories: 22
📈 Repository coverage: 99.3%
📁 Output file: joss_repositories_20250805_143022.csv
🕒 Extraction completed at: 2025-08-05 14:32:18

🔍 VERIFICATION:
✅ Processed 3,111 papers from API
✅ Wrote 3,089 repository URLs to CSV
✅ Data integrity: 3,089 + 22 = 3,111 ✓

⏱️ Total execution time: 123.4 seconds

📈 Statistics

Metric               Typical Value
Total Papers         3,100+
Repository Coverage  ~99%
Execution Time       2-5 minutes
Output Size          ~200 KB
API Pages            ~156

🔧 Technical Details

Requirements

  • Python 3.6+
  • requests and beautifulsoup4 libraries (installed in Quick Start)
  • Internet connection

API Details

  • Base URL: https://joss.theoj.org/papers/published.json
  • Pagination: 20 records per page
  • Total Pages: ~156 pages
  • Rate Limiting: 100ms delay between requests
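
The pagination and rate limiting described above could look roughly like the sketch below. It rests on assumptions that the README does not confirm: that the endpoint accepts a page query parameter, that each page is a JSON list, and that an empty page marks the end. The names fetch_page and fetch_all are hypothetical.

```python
import time

BASE_URL = "https://joss.theoj.org/papers/published.json"
DELAY_SECONDS = 0.1  # 100 ms pause between requests


def fetch_page(page):
    """Fetch one page of published papers (network call)."""
    # Imported here so the pagination loop below can be exercised offline.
    import requests

    # Assumption: the API takes a `page` query parameter and returns a JSON list.
    response = requests.get(BASE_URL, params={"page": page}, timeout=30)
    response.raise_for_status()
    return response.json()


def fetch_all(fetch=fetch_page, delay=DELAY_SECONDS, max_pages=1000):
    """Collect papers page by page until an empty page is returned."""
    papers = []
    for page in range(1, max_pages + 1):
        batch = fetch(page)
        if not batch:  # Assumption: an empty page means there is nothing left.
            break
        papers.extend(batch)
        time.sleep(delay)  # Rate limiting between consecutive requests.
    return papers
```

Passing the fetch function as a parameter keeps the loop testable without network access; calling fetch_all() with no arguments uses the real endpoint.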

Data Processing

  1. Fetch all pages from JOSS API
  2. Filter papers with valid repository URLs
  3. Format URLs with explicit quotes
  4. Export to timestamped CSV file
  5. Verify data integrity
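
The filtering and integrity check in the steps above can be sketched as follows. This is an illustration under assumptions, not the project's implementation: the field name software_repository is borrowed from the CSV header, "valid" is taken to mean an http or https URL with a host, and all function names are hypothetical.

```python
from urllib.parse import urlparse


def is_valid_repository_url(value):
    """Treat a value as valid if it is an http(s) URL with a host."""
    if not isinstance(value, str):
        return False
    parsed = urlparse(value.strip())
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def split_papers(papers, field="software_repository"):
    """Return (valid URLs, count of papers without a usable repository)."""
    urls, missing = [], 0
    for paper in papers:
        value = paper.get(field)
        if is_valid_repository_url(value):
            urls.append(value.strip())
        else:
            missing += 1
    return urls, missing


def coverage_percent(written, total):
    """Share of processed papers that yielded a repository URL."""
    return round(100 * written / total, 1) if total else 0.0


def verify_integrity(total, written, missing):
    """Written and skipped records must add up to the processed total."""
    return written + missing == total
```

With the figures from the sample run, verify_integrity(3111, 3089, 22) holds and coverage_percent(3089, 3111) gives the 99.3% shown in the summary.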

🀝 Contributing

This project was generated with the assistance of Claude AI. Contributions are welcome!

  1. Fork the repository

  2. Create your feature branch (git checkout -b feature/AmazingFeature)

  3. Commit your changes (git commit -m 'Add some AmazingFeature')

  4. Push to the branch (git push origin feature/AmazingFeature)

  5. Open a Pull Request

  6. [TODO: Fix license extraction logic for non-GitHub repositories]

📄 License

This project is open source and available under the MIT License.

🙏 Acknowledgments


Made with ❤️ and AI assistance
