rakeshmukundan/pyspider

This branch is 1002 commits behind binux/pyspider:master.

pyspider

A Powerful Spider System in Python. Try It Now!

  • Write scripts in Python with a powerful API
  • Powerful WebUI with script editor, task monitor, project manager and result viewer
  • MySQL, MongoDB, or SQLite as the database backend
  • JavaScript pages supported!
  • Task priority, retry, periodic crawling, and recrawl by age or by marks in the index page (such as update time)
  • Distributed architecture
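As a rough illustration of "recrawl by age", the sketch below treats a page as fresh until a given number of seconds has passed since its last fetch. The helper `needs_recrawl` and its parameters are hypothetical names chosen for this example; this is not pyspider's scheduler code, only a minimal model of the idea.

```python
import time

# Ten days in seconds, the same arithmetic as age=10*24*60*60.
TEN_DAYS = 10 * 24 * 60 * 60


def needs_recrawl(last_crawled_at, age, now=None):
    """Hypothetical helper: decide whether a page should be fetched again.

    last_crawled_at -- UNIX timestamp of the previous fetch, or None if never fetched
    age             -- validity period in seconds
    now             -- current UNIX timestamp (defaults to time.time())
    """
    if now is None:
        now = time.time()
    if last_crawled_at is None:
        # Never fetched before, so it always needs a crawl.
        return True
    # Fetch again only once the validity period has fully elapsed.
    return (now - last_crawled_at) > age
```

Under this model, a page fetched at time 0 with `age=TEN_DAYS` is skipped until more than 864000 seconds have passed.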

Sample Code:

from libs.base_handler import *

class Handler(BaseHandler):
    '''
    This is a sample handler.
    '''
    # Run on_start once every 24 hours.
    @every(minutes=24*60, seconds=0)
    def on_start(self):
        self.crawl('http://scrapy.org/', callback=self.index_page)

    # Treat a fetched index page as valid for 10 days (age is in seconds).
    @config(age=10*24*60*60)
    def index_page(self, response):
        # Follow every link whose href starts with "http://".
        for each in response.doc('a[href^="http://"]').items():
            self.crawl(each.attr.href, callback=self.detail_page)

    def detail_page(self, response):
        # The returned dict is the result stored for this page.
        return {
            "url": response.url,
            "title": response.doc('title').text(),
        }
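The selector `a[href^="http://"]` in `index_page` picks anchors whose `href` begins with `http://`, so relative links and `https://` links are not followed. The following standard-library sketch applies the same filter outside pyspider; `AbsoluteLinkCollector` and `absolute_links` are hypothetical names used only for this illustration, and the sample handler itself relies on `response.doc` instead.

```python
from html.parser import HTMLParser


class AbsoluteLinkCollector(HTMLParser):
    """Collect href values of <a> tags that start with "http://"."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Same condition as the CSS prefix selector a[href^="http://"].
        if href and href.startswith("http://"):
            self.links.append(href)


def absolute_links(html):
    """Return the matching links in document order."""
    collector = AbsoluteLinkCollector()
    collector.feed(html)
    return collector.links
```

If `https://` pages should be followed as well, the prefix check (and the selector in the handler) would need to be widened accordingly.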


Installation

On Ubuntu: apt-get install python python-dev python-distribute python-pip libcurl4-openssl-dev libxml2-dev libxslt1-dev python-lxml

Or run with Docker

Documents

Contribute

License

Licensed under the Apache License, Version 2.0
