web_scraping

A collection of scraper pipelines built for different purposes: collect and process data from various sources (static websites, JavaScript-rendered websites, APIs), run the scraping pipeline via Celery and a Travis cron task, and dump the scraped data to Slack.


Quick Start

Quick start via docker

# Run via docker
$ cd ~ && git clone https://github.com/yennanliu/web_scraping
$ cd ~/web_scraping && docker-compose -f docker-compose.yml up

Quick start manually

# Run manually
# dev
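
Once the docker-compose stack is up, scraping jobs are submitted through the Flask API container. A minimal client sketch follows; the route name and port are assumptions for illustration only (the actual routes are defined in api/app.py):

```python
# Hypothetical client call -- the /scrape/indeed route and port 5000 are
# assumptions; check api/app.py for the routes the project actually exposes.
import requests

resp = requests.post(
    "http://localhost:5000/scrape/indeed",
    json={"keyword": "data engineer"},
    timeout=10,
)
print(resp.status_code, resp.json())  # e.g. a Celery task id to poll later
```

Flower's UI (port 5555 by default) can then be used to watch the queued and finished tasks.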

File structure

├── Dockerfile
├── README.md
├── api                   : Celery API (broker + job accepter (Flask))
│   ├── Dockerfile        : Dockerfile that builds the Celery API
│   ├── app.py            : Flask server that accepts job requests (API)
│   ├── requirements.txt
│   └── worker.py         : Celery broker and backend configuration (Redis)
├── celery-queue          : Runs the main web scraping jobs (via Celery)
│   ├── Dockerfile        : Dockerfile that builds celery-queue
│   ├── IndeedScrapper    : Scraper that scrapes Indeed.com
│   ├── requirements.txt
│   └── tasks.py          : Celery tasks that run the scraping jobs
├── cron_indeed_scrapping_test.py
├── cron_test.py
├── docker-compose.yml    : docker-compose file that builds the whole system: api, celery-queue, redis, and flower (Celery job monitor)
├── legacy_project
├── logs                  : Running logs
├── output                : Scraped data
├── requirements.txt
└── travis_push_github.sh : Script that auto-pushes output to GitHub via Travis
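
To make the layout above concrete, here is a minimal sketch of what a celery-queue/tasks.py-style module can look like. The broker/backend URLs assume the redis service name from docker-compose, and the task body is a stand-in for the real IndeedScrapper logic, not the repository's actual code:

```python
# Minimal sketch of a celery-queue/tasks.py-style module -- illustrative only.
# The redis://redis:6379/0 URLs assume the `redis` service name in docker-compose.
import os

import requests
from celery import Celery

BROKER_URL = os.environ.get("CELERY_BROKER_URL", "redis://redis:6379/0")
BACKEND_URL = os.environ.get("CELERY_RESULT_BACKEND", "redis://redis:6379/0")

app = Celery("tasks", broker=BROKER_URL, backend=BACKEND_URL)


@app.task(name="tasks.scrape_indeed")
def scrape_indeed(keyword: str = "python") -> dict:
    """Stand-in for the real IndeedScrapper logic: fetch a search page and
    return a small summary instead of parsed job postings."""
    resp = requests.get("https://www.indeed.com/jobs", params={"q": keyword}, timeout=30)
    return {"keyword": keyword, "status": resp.status_code, "bytes": len(resp.content)}
```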

Tech

  • Celery : task queue used to run the Python scraping jobs, in parallel or single-threaded (broker/worker)
  • Redis : key-value store that holds task data (Celery broker and result backend)
  • Flower : UI for monitoring Celery tasks
  • Flask : lightweight Python web framework, used as the project's backend server
  • Docker : builds the application environment
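
How these pieces connect can be sketched as an api/app.py-style Flask front end that enqueues jobs on the Redis-backed Celery queue. The routes, port, and task name below are illustrative assumptions, not the repository's actual API:

```python
# Sketch of an api/app.py-style Flask front end that enqueues Celery jobs.
# Route names, port, and the task name are assumptions for illustration.
from celery import Celery
from flask import Flask, jsonify, request

celery_client = Celery("api", broker="redis://redis:6379/0", backend="redis://redis:6379/0")
app = Flask(__name__)


@app.route("/scrape/indeed", methods=["POST"])
def submit_job():
    payload = request.get_json(silent=True) or {}
    # send_task enqueues by name, so the API does not import the worker code
    job = celery_client.send_task("tasks.scrape_indeed", kwargs={"keyword": payload.get("keyword", "python")})
    return jsonify({"task_id": job.id}), 202


@app.route("/status/<task_id>")
def job_status(task_id):
    result = celery_client.AsyncResult(task_id)
    return jsonify({"state": result.state, "result": result.result if result.ready() else None})


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```

Using send_task keeps the API container decoupled from the worker code: it only needs the task name and the shared Redis broker.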

Todo

### Project level

1. Deploy to the Heroku cloud and expose the scraper as an API service
2. Dockerize the project
3. Run the scraping (cron/parallel) jobs via Celery
4. Add tests (unit/integration)
5. Design a DB model that saves the scraped data systematically

### Programming level

1. Add a utility script that gets the XPath of every element in an HTML page (see the sketch after this list)
2. Workflow that automates the whole process
3. Job management
	- Multiprocessing
	- Asynchronous execution
	- Queues
4. Scraping tutorial
5. Scrapy, PhantomJS
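
As a starting point for item 1 above, a minimal sketch of such a utility, assuming lxml is available: it walks the parsed tree and records the absolute XPath of every element.

```python
# Sketch of an "XPath of every element" utility, assuming lxml is installed.
from lxml import html


def xpath_of_all_elements(page_source: str) -> dict:
    """Map the absolute XPath of every element in the page to its tag name."""
    tree = html.fromstring(page_source)
    root = tree.getroottree()
    return {
        root.getpath(el): el.tag
        for el in tree.iter()
        if isinstance(el.tag, str)  # skip comments / processing instructions
    }


if __name__ == "__main__":
    sample = "<html><body><div id='a'><p>hi</p><p>bye</p></div></body></html>"
    for xpath, tag in xpath_of_all_elements(sample).items():
        print(tag, xpath)
```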

### Others 

1. Web scraping 101 tutorial

Ref

  • Scraping via Celery : https://www.pythoncircle.com/post/518/scraping-10000-tweets-in-60-seconds-using-celery-rabbitmq-and-docker-cluster-with-rotating-proxy/
