Reddit Scraper

A modular Reddit scraping tool that collects data about subreddits, posts, and users, and exports everything as structured JSON files for easy processing or database import.

🚀 Features

Scrapes:
- ✔️ Subreddits
- ✔️ Posts
- ✔️ Users (partial support)
Outputs clean, structured JSON data
Includes tools to:
- Split large JSON files into smaller chunks
- Import JSON data into MongoDB
Fully automated workflow via run.py

📦 Dependencies

Python 3.9+
Packages listed in requirements.txt

📂 Output Data Structure

The sample JSON files are large (about 16-25 MB each). Download them instead of opening them in the browser, which takes a long time to load them.

🛠️ Script Overview

| Script | Description |
| --- | --- |
| `subreddits.py` | Scrapes subreddit metadata |
| `posts.py` | Scrapes posts from each subreddit |
| `users.py` | Scrapes user information |
| `utils/split.py` | Splits large JSON files into import-friendly chunks |
| `utils/import_data_to_mongodb.sh` | Imports JSON chunks into MongoDB |
| `run.py` | Runs all scrapers sequentially |
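
To illustrate the chunking idea behind `utils/split.py`, here is a minimal sketch. It is not the repository's actual implementation: the function name `split_json_array`, the chunk file naming, and the assumption that each output file holds a single top-level JSON array are all hypothetical.

```python
import json
from pathlib import Path


def split_json_array(src: Path, out_dir: Path, chunk_size: int) -> list[Path]:
    """Split a JSON file holding one top-level array into numbered chunk files."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    # Assumption: the scraper output is a single JSON array of documents.
    records = json.loads(src.read_text(encoding="utf-8"))
    if not isinstance(records, list):
        raise ValueError(f"{src} does not contain a top-level JSON array")
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for index, start in enumerate(range(0, len(records), chunk_size), start=1):
        # Hypothetical naming scheme: <original stem>_001.json, _002.json, ...
        chunk_path = out_dir / f"{src.stem}_{index:03d}.json"
        chunk = records[start:start + chunk_size]
        chunk_path.write_text(json.dumps(chunk), encoding="utf-8")
        written.append(chunk_path)
    return written
```

Loading the whole file into memory is acceptable at the 16-25 MB sizes mentioned above; much larger files would call for a streaming parser instead.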

⚙️ Installation & Setup

1. Clone the repository & create a virtual environment

pip install virtualenv
git clone https://github.com/glowfi/reddit-scraper
cd reddit-scraper

python -m venv env
source env/bin/activate   # Linux / macOS
# or: env\Scripts\activate  # Windows PowerShell

pip install -r requirements.txt

2. Configure environment variables

Edit the file env-sample, then rename it to .env:

username=<RedditUsername>
password=<RedditPassword>
client_id=<Reddit API Client ID>
client_secret=<Reddit API Client Secret>

TOTAL_SUBREDDITS_PER_TOPICS=6
SUBREDDIT_SORT_FILTER="hot"
POSTS_PER_SUBREDDIT=10
POSTS_SORT_FILTER="new"

Create your Reddit app at https://www.reddit.com/prefs/apps first; the client ID and client secret come from that app.
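
Before running the scraper, it can help to check that `.env` was filled in correctly. The sketch below is an assumption about how such a check could look, not code from the repository: `load_env_file` and `missing_keys` are hypothetical helpers, and the parser handles only simple `KEY=VALUE` lines like the ones shown above.

```python
from pathlib import Path

# Credential keys listed in env-sample above.
REQUIRED_KEYS = ("username", "password", "client_id", "client_secret")


def load_env_file(path: Path) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    settings = {}
    for raw_line in path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        value = value.strip()
        # Drop one pair of matching quotes, as in SUBREDDIT_SORT_FILTER="hot".
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
            value = value[1:-1]
        settings[key.strip()] = value
    return settings


def missing_keys(settings: dict[str, str]) -> list[str]:
    """Return required keys that are absent, empty, or still a <placeholder>."""
    return [
        key for key in REQUIRED_KEYS
        if not settings.get(key) or settings[key].startswith("<")
    ]
```

For example, `missing_keys(load_env_file(Path(".env")))` returns an empty list once every credential has been replaced with a real value.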

3. Run the scraper

./run.py

This will:

Scrape subreddits
Scrape posts
Scrape users
Save all data in this directory
(Optional) Split files for MongoDB import
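
The sequence above can be sketched as a small driver. This is only an illustration of running the three scrapers in order and stopping at the first failure; the real `run.py` may be organized differently, and `run_all` and its `runner` parameter are hypothetical names.

```python
import subprocess
import sys

# Script order taken from the steps listed above.
SCRIPTS = ("subreddits.py", "posts.py", "users.py")


def run_all(scripts=SCRIPTS, runner=subprocess.run):
    """Run each scraper script in order and return the ones that completed."""
    completed = []
    for script in scripts:
        # check=True makes subprocess.run raise CalledProcessError on a
        # non-zero exit code, so later scripts never run on partial data.
        runner([sys.executable, script], check=True)
        completed.append(script)
    return completed
```

Running the steps sequentially matters here because posts are scraped per subreddit, so the later scripts presumably depend on the output of the earlier ones.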

🗄️ Importing Data Into MongoDB

After scraping, use the helper script:

./utils/import_data_to_mongodb.sh

Make sure your MongoDB service is running beforehand.
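
For readers who prefer to drive the import themselves, the sketch below builds one `mongoimport` command per chunk file. It is an assumption about what the helper script does, not a copy of it: the function name, the database and collection names, and the premise that each chunk is a JSON array are hypothetical. The `--db`, `--collection`, `--jsonArray`, and `--file` options are standard `mongoimport` flags.

```python
from pathlib import Path


def mongoimport_commands(chunk_dir: Path, db: str, collection: str) -> list[list[str]]:
    """Build one mongoimport argument list per *.json chunk, in name order."""
    commands = []
    for chunk in sorted(chunk_dir.glob("*.json")):
        commands.append([
            "mongoimport",
            "--db", db,                  # target database (hypothetical name)
            "--collection", collection,  # target collection (hypothetical name)
            "--jsonArray",               # assumes each chunk is a JSON array
            "--file", str(chunk),
        ])
    return commands
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` once MongoDB is running and `mongoimport` is on the `PATH`.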

💡 Notes

API limits apply; use reasonable configuration values
Scraping speed depends on your network & Reddit API rate limiting
JSON outputs are ready for further processing (ML, analytics, etc.)
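
One common way to cope with rate limiting is to retry a failed request with exponentially growing delays. The sketch below shows that general technique; it is not taken from this repository, and `with_backoff` and its parameters are hypothetical.

```python
import random
import time


def with_backoff(call, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call `call()` and retry on failure, doubling the delay each time."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            # A real scraper should catch only the specific rate-limit or
            # network exception raised by its HTTP or Reddit client.
            if attempt == max_attempts - 1:
                raise
            # Delay grows as base_delay * 1, 2, 4, ... plus a little jitter
            # so that parallel workers do not retry in lockstep.
            sleep(base_delay * 2 ** attempt + random.uniform(0, 0.5))
```

Wrapping each request as `with_backoff(lambda: fetch(url))`, where `fetch` stands in for whatever request function the scraper uses, keeps the retry policy in one place.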

🤝 Contributing

Pull requests, issue reports, and improvements are welcome!
