A Python-based web scraper and crawler designed to extract structured data from various websites, including those that use anti-scraping measures such as request-header checks and CSRF protection.
- Books to Scrape
- Scraping Course – CSRF Protected Login
- Scrape This Site – Advanced Headers Challenge
- ✅ Custom headers to bypass basic anti-bot detection
- ✅ Automatic pagination support
- ✅ CSRF token retrieval and session-based login handling
- ✅ Output data to JSON
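The CSRF feature above can be sketched roughly as follows. This is a minimal illustration, not the code in `Login.py`: it uses only the standard library's `html.parser` instead of BeautifulSoup so it runs without extra installs, and the hidden field name `_token`, the `email`/`password` field names, and the sample HTML are all assumptions that should be checked against the real login form.

```python
from html.parser import HTMLParser
from typing import Optional

# Hypothetical login page used only for illustration.
SAMPLE_LOGIN_PAGE = (
    '<form method="post">'
    '<input type="hidden" name="_token" value="abc123">'
    '<input type="email" name="email">'
    '<input type="password" name="password">'
    "</form>"
)


class CsrfTokenParser(HTMLParser):
    """Records the value of the first <input> whose name matches field_name."""

    def __init__(self, field_name: str = "_token") -> None:
        super().__init__()
        self.field_name = field_name  # assumed name; inspect the real form
        self.token: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        # Skip non-input tags and stop looking once a token was found.
        if tag != "input" or self.token is not None:
            return
        attributes = dict(attrs)
        if attributes.get("name") == self.field_name:
            self.token = attributes.get("value")


def extract_csrf_token(html: str, field_name: str = "_token") -> Optional[str]:
    """Return the CSRF token found in the HTML, or None if it is absent."""
    parser = CsrfTokenParser(field_name)
    parser.feed(html)
    return parser.token


def build_login_payload(html: str, email: str, password: str,
                        field_name: str = "_token") -> dict:
    """Combine the scraped token with credentials into a form payload."""
    token = extract_csrf_token(html, field_name)
    if token is None:
        raise ValueError(f"No hidden input named {field_name!r} found")
    return {field_name: token, "email": email, "password": password}
```

With `requests`, the login page would typically be fetched through a `requests.Session`, its `response.text` passed to `build_login_payload`, and the payload posted through the same session so that the cookies tied to the token are sent back.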
Scraper/
├── Sync/ # Synchronous scraping modules
│ ├── Categories.py # Gets all book categories
│ ├── NamePrice.py # Extracts book names and prices
│ └── Total.py # Extracts book names and prices grouped by category
│
├── async.py # Asynchronous scraping module
├── header.py # Manages headers/user-agents
├── Login.py # Handles login/authentication
└── Scrape.json # Scraping output for "async.py"
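As a rough picture of how an asynchronous module could produce a JSON output file like the one in the layout above, here is a hedged sketch of the general pattern. It is not the contents of `async.py`: `fetch_page` is a hypothetical stand-in that performs no network request (a real version would use an `aiohttp.ClientSession`), and the function names and concurrency limit are assumptions.

```python
import asyncio
import json


async def fetch_page(url: str) -> dict:
    """Hypothetical stand-in for a real HTTP request made with aiohttp."""
    await asyncio.sleep(0)  # yield to the event loop, as real I/O would
    return {"url": url, "status": "ok"}


async def crawl(urls, max_concurrency: int = 5) -> list:
    """Fetch all URLs concurrently, with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded_fetch(url: str) -> dict:
        async with semaphore:
            return await fetch_page(url)

    # gather() returns results in the same order as the input URLs.
    return await asyncio.gather(*(bounded_fetch(url) for url in urls))


def save_results(results, path: str) -> None:
    """Write the scraped records to a JSON file."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(results, handle, indent=2, ensure_ascii=False)


if __name__ == "__main__":
    # Placeholder URLs; replace with the pages to scrape.
    pages = [f"https://example.com/page-{n}.html" for n in range(1, 4)]
    save_results(asyncio.run(crawl(pages)), "Scrape.json")
```

The semaphore keeps the number of simultaneous requests bounded, which is gentler on the target site than firing every request at once.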
- Python 3.11.9
- requests – HTTP requests
- BeautifulSoup – HTML parsing
- asyncio, aiohttp
- json
- Clone the Repository
git clone https://github.com/Argu333/Scraper.git
cd Scraper
- Install the required libraries (if not already installed)
pip install requests
pip install beautifulsoup4
pip install aiohttp
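To check which of these libraries are already installed before running `pip`, a small helper like the one below may be handy. It is an optional sketch, not part of the repository; the function name `missing_packages` is hypothetical, and the mapping simply pairs each import name with the package name used in the commands above.

```python
from importlib.util import find_spec

# Import name -> pip package name (note that beautifulsoup4 is imported as bs4).
REQUIRED_MODULES = {
    "requests": "requests",
    "bs4": "beautifulsoup4",
    "aiohttp": "aiohttp",
}


def missing_packages(required=REQUIRED_MODULES) -> list:
    """Return pip names of packages whose modules cannot be found."""
    return [
        pip_name
        for module_name, pip_name in required.items()
        if find_spec(module_name) is None
    ]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages: pip install " + " ".join(missing))
    else:
        print("All required libraries are installed.")
```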