
Commit

update readme
binux committed Nov 14, 2014
1 parent 4b33143 commit 47ba0c8
Showing 2 changed files with 38 additions and 32 deletions.
67 changes: 37 additions & 30 deletions README.md
@@ -1,46 +1,53 @@
pyspider [![Build Status](https://travis-ci.org/binux/pyspider.png?branch=master)](https://travis-ci.org/binux/pyspider) [![Coverage Status](https://coveralls.io/repos/binux/pyspider/badge.png)](https://coveralls.io/r/binux/pyspider)
========

-A spider system in python. [Try It Now!](http://demo.pyspider.org/)
+A Powerful Spider System in Python. [Try It Now!](http://demo.pyspider.org/)

-- Write script with python
-- Web script editor, debugger, task monitor, project manager and result viewer
+- Write script in python with powerful API
+- Powerful WebUI with script editor, task monitor, project manager and result viewer
+- MySQL, MongoDB, SQLite as database backend
+- Javascript pages supported!
 - Task priority, retry, periodic crawl, and recrawl by age or by marks in the index page (like update time)
 - Distributed architecture
-- MySQL, MongoDB and SQLite as database backend
-- Full control of crawl process with powerful API
-- Javascript pages Support! (with phantomjs fetcher)
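To make the feature list concrete, here is a minimal sketch of how these options surface through `self.crawl`, assuming the `priority`, `retries`, `age`, and `fetch_type='js'` parameter names from the pyspider docs and a hypothetical javascript-heavy URL; a running phantomjs fetcher is needed for `fetch_type='js'` to take effect:

```python
from libs.base_handler import *

class Handler(BaseHandler):
    def on_start(self):
        # priority: higher-priority tasks leave the queue first
        # retries: how many times a failed fetch is retried
        # age: a page fetched within this window counts as fresh
        #      and is not refetched when the task is re-submitted
        self.crawl('http://scrapy.org/', callback=self.index_page,
                   priority=2, retries=3, age=10 * 24 * 60 * 60)

    def index_page(self, response):
        # fetch_type='js' routes the request through the phantomjs
        # fetcher, so javascript-rendered markup reaches response.doc
        self.crawl('http://example.com/js-page',  # hypothetical URL
                   callback=self.detail_page, fetch_type='js')

    def detail_page(self, response):
        return {"url": response.url, "title": response.doc('title').text()}
```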


-![debug demo](http://f.binux.me/debug_demo.png)
-demo code: [gist:9424801](https://gist.github.com/binux/9424801)
+Sample Code:

```python
from libs.base_handler import *

class Handler(BaseHandler):
    '''
    this is a sample handler
    '''
    @every(minutes=24*60, seconds=0)
    def on_start(self):
        self.crawl('http://scrapy.org/', callback=self.index_page)

    @config(age=10*24*60*60)
    def index_page(self, response):
        for each in response.doc('a[href^="http://"]').items():
            self.crawl(each.attr.href, callback=self.detail_page)

    def detail_page(self, response):
        return {
            "url": response.url,
            "title": response.doc('title').text(),
        }
```

+[![demo](http://ww1.sinaimg.cn/large/7d46d69fjw1emavy6e9gij21kw0uldvy.jpg)](http://demo.pyspider.org/)
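In the sample, `@every(minutes=24*60)` asks the scheduler to fire `on_start` once a day, while `@config(age=10*24*60*60)` treats pages handled by `index_page` as fresh for ten days, so the daily run only refetches links whose age has expired. The dict returned by `detail_page` becomes the task's result, which is what the result viewer displays.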

Installation
============

* python2.6/2.7
-* `pip install -r requirements.txt`
+* if ubuntu: `apt-get install python python-dev python-distribute python-pip libcurl4-openssl-dev libxml2-dev libxslt1-dev python-lxml`
+* `pip install --allow-all-external -r requirements.txt`
* `./run.py`, visit [http://localhost:5000/](http://localhost:5000/)

-Docker
-======

```
# mysql
docker run -it -d --name mysql dockerfile/mysql
# rabbitmq
docker run -it -d --name rabbitmq dockerfile/rabbitmq
# phantomjs, linked into the fetcher and webui containers
docker run --name phantomjs -it -d -v `pwd`:/mnt/test --expose 25555 cmfatih/phantomjs /usr/bin/phantomjs /mnt/test/fetcher/phantomjs_fetcher.js 25555
# scheduler
docker run -it -d --name scheduler --link mysql:mysql --link rabbitmq:rabbitmq binux/pyspider scheduler
# fetcher, run multiple instances if needed
docker run -it -d -m 64m --link rabbitmq:rabbitmq binux/pyspider fetcher
# processor, run multiple instances if needed
docker run -it -d -m 128m --link mysql:mysql --link rabbitmq:rabbitmq binux/pyspider processor
# webui
docker run -it -d -p 5000:5000 --link mysql:mysql --link rabbitmq:rabbitmq --link scheduler:scheduler binux/pyspider webui
```
+or [Run with Docker](https://github.com/binux/pyspider/wiki/Run-pyspider-with-Docker)
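For context, each `docker run` above maps one container to one pyspider component: the scheduler owns the task queue and de-duplication, fetchers download pages (with phantomjs handling the javascript ones), processors execute the handler script, and the webui serves the editor and monitor; rabbitmq carries the message queues between components and mysql serves as the database backend.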

Documents
=========
@@ -53,8 +60,8 @@ Documents
Contribute
==========

-* Deploy and use it; report bugs and request features via [Issue](https://github.com/binux/pyspider/issues)
-* Join the [feature discussion](https://github.com/binux/pyspider/issues?labels=discussion&state=open) and help [improve the documentation](https://github.com/binux/pyspider/wiki)
+* Use it, open an [Issue](https://github.com/binux/pyspider/issues); PRs are welcome.
+* [Discuss](https://github.com/binux/pyspider/issues?labels=discussion&state=open) features and help with the [Documentation](https://github.com/binux/pyspider/wiki)


License
3 changes: 1 addition & 2 deletions libs/sample_handler.py
@@ -3,7 +3,6 @@
# vim: set et sw=4 ts=4 sts=4 ff=unix fenc=utf8:
# Created on __DATE__

-from libs.pprint import pprint
from libs.base_handler import *

class Handler(BaseHandler):
@@ -12,7 +11,7 @@ class Handler(BaseHandler):
    '''
    @every(minutes=24*60, seconds=0)
    def on_start(self):
-        self.crawl('http://www.baidu.com/', callback=self.index_page)
+        self.crawl('http://scrapy.org/', callback=self.index_page)

    @config(age=10*24*60*60)
    def index_page(self, response):
