A modular Reddit scraping tool that collects data about subreddits, posts, and users, and exports everything as structured JSON files for easy processing or database import.
- Scrapes:
  - ✔️ Subreddits
  - ✔️ Posts
  - ✔️ Users (kind of)
- Outputs clean, structured JSON data
- Includes tools to:
  - Split large JSON files into smaller chunks
  - Import JSON data into MongoDB
- Fully automated workflow via `run.py`
- Python 3.9+
- Packages listed in `requirements.txt`
The sample JSON files are large (about 16-25 MB each); download them rather than opening them in the browser, since they can take a while to load.
Sample JSON: https://files.catbox.moe/r7a7um.json
Sample JSON: https://files.catbox.moe/5cf2xw.json
Sample JSON: https://files.catbox.moe/yp506n.json
| Script | Description |
|---|---|
| `subreddits.py` | Scrapes subreddit metadata |
| `posts.py` | Scrapes posts from each subreddit |
| `users.py` | Scrapes user information |
| `utils/split.py` | Splits large JSON files into import-friendly chunks |
| `utils/import_data_to_mongodb.sh` | Imports JSON chunks into MongoDB |
| `run.py` | Runs all scrapers sequentially |
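The chunk-splitting step handled by `utils/split.py` can be sketched as follows. This is a minimal illustration assuming each scraper output file holds a single JSON array; the function name `split_json` and the chunk-naming scheme are hypothetical, not taken from the repository:

```python
import json
from pathlib import Path


def split_json(src: str, chunk_size: int = 1000) -> list[Path]:
    """Split one large JSON-array file into smaller chunk files.

    Hypothetical sketch: assumes `src` holds a single JSON array.
    Chunks are written next to the source as <stem>_part<N>.json.
    """
    src_path = Path(src)
    records = json.loads(src_path.read_text())
    chunks = []
    for i in range(0, len(records), chunk_size):
        out = src_path.with_name(f"{src_path.stem}_part{i // chunk_size}.json")
        out.write_text(json.dumps(records[i:i + chunk_size]))
        chunks.append(out)
    return chunks
```

Smaller chunks keep each import run modest in size, which matters for multi-megabyte scraper outputs like the samples above.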
```bash
pip install virtualenv
git clone https://github.com/glowfi/reddit-scraper
cd reddit-scraper
python -m venv env
source env/bin/activate   # Linux / macOS
# or: env\Scripts\activate   # Windows PowerShell
pip install -r requirements.txt
```

Edit the file `env-sample`, then rename it to `.env`:
```ini
username=<RedditUsername>
password=<RedditPassword>
client_id=<Reddit API Client ID>
client_secret=<Reddit API Client Secret>
TOTAL_SUBREDDITS_PER_TOPICS=6
SUBREDDIT_SORT_FILTER="hot"
POSTS_PER_SUBREDDIT=10
POSTS_SORT_FILTER="new"
```

Ensure your Reddit app is created at: https://www.reddit.com/prefs/apps
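The scrapers presumably read these values at startup. A minimal stdlib sketch of parsing such a `.env` file is shown below; the project may instead use a library such as python-dotenv, so treat `load_env` as a hypothetical helper:

```python
def load_env(path: str = ".env") -> dict[str, str]:
    """Parse simple KEY=value lines from a .env-style file.

    Minimal sketch: skips blank lines and comments, strips double quotes.
    """
    env = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            env[key.strip()] = value.strip().strip('"')
    return env
```

Note that all values come back as strings; numeric settings like `POSTS_PER_SUBREDDIT` would need an explicit `int()` conversion.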
```bash
./run.py
```

This will:
- Scrape subreddits
- Scrape posts
- Scrape users
- Save all data in this directory
- (Optional) Split files for MongoDB import
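The sequential driver can be sketched with the standard library. The scraper order below mirrors the table above, but the real `run.py` may be structured differently:

```python
import subprocess
import sys

# Order matters: posts depend on scraped subreddits, users on scraped posts.
SCRAPERS = ["subreddits.py", "posts.py", "users.py"]


def run_all(scrapers=SCRAPERS):
    """Run each scraper script in sequence, stopping on the first failure."""
    for script in scrapers:
        print(f"Running {script} ...")
        # check=True raises CalledProcessError if a scraper exits non-zero.
        subprocess.run([sys.executable, script], check=True)
```

Failing fast is deliberate here: a partial subreddit list would silently skew every downstream post and user scrape.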
After scraping, use the helper script:
```bash
./utils/import_data_to_mongodb.sh
```

Make sure your MongoDB service is running beforehand.
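The core of such an import script is typically a loop over the chunk files. A hedged sketch of what `import_data_to_mongodb.sh` might contain follows; the database name, collection name, and file glob are placeholders, not the project's actual values:

```bash
#!/usr/bin/env bash
set -euo pipefail

# Placeholder names: adjust to match your setup and chunk filenames.
DB="reddit"

for f in ./*.json; do
  # --jsonArray tells mongoimport each file holds a JSON array of documents.
  mongoimport --db "$DB" --collection posts --jsonArray --file "$f"
done
```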
- API limits apply; use reasonable configuration values
- Scraping speed depends on your network & Reddit API rate limiting
- JSON outputs are ready for further processing (ML, analytics, etc.)
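On the rate-limiting point: a common way to stay within API limits is to retry failed requests with exponential backoff. This is a generic illustration, not logic taken from the scrapers:

```python
import random
import time


def with_backoff(fetch, max_retries=5, base_delay=1.0):
    """Call fetch(), retrying with exponential backoff on failure.

    Illustrative sketch for riding out transient rate-limit errors:
    waits base_delay * 2**attempt (plus jitter) between attempts.
    """
    for attempt in range(max_retries):
        try:
            return fetch()
        except Exception:
            if attempt == max_retries - 1:
                raise
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)
            time.sleep(delay)
```

The random jitter prevents many retrying clients from hammering the API in lockstep.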
Pull requests, issue reports, and improvements are welcome!



