ispider_core

ispider is a module to spider websites

  • Multicore and multithreaded
  • Accepts hundreds/thousands of websites/domains as input
  • Spreads ("sparses") requests across domains to avoid repeated calls against the same domain
  • The httpx engine works in asyncio blocks of settings.ASYNC_BLOCK_SIZE requests, so the total number of concurrent requests is ASYNC_BLOCK_SIZE * POOLS (see the example below)
  • Supports retries with different engines (httpx, curl, seleniumbase [testing])
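
For example, with the default settings below (ASYNC_BLOCK_SIZE = 4, POOLS = 4) up to 4 * 4 = 16 requests are in flight at once, while the values in the usage example further down (ASYNC_BLOCK_SIZE = 32, POOLS = 64) allow up to 2048.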

It was designed for maximum speed, so it has some limitations:

  • As of v0.7, it does not support files (pdf, video, images, etc); it only processes HTML

HOW IT WORKS - SIMPLE

-- Crawl - Depth == 0

  • Get all the landing pages for domains in the provided list.
  • If "robots" is selected, download the robots.txt file.
  • If "sitemaps" is selected, parse the robots.txt and retrieve all the sitemaps.
  • All data is saved under USER_FOLDER/data/dumps/dom_tld.

-- Spider - Depth > 0

  • Extract all links from landing pages and sitemaps.
  • Download the HTML pages, extract internal links, and follow them recursively.

HOW IT WORKS - MORE DETAILED

Crawl - Depth == 0

  • Create objects in the form (('https://domain.com', 'landing_page', 'domain.com', depth, retries, engine)) (see the sketch after this list)
  • Add them to the LIFO queue qout
  • A thread retrieves elements from qout in variable-size blocks (depending on QUEUE_MAX_SIZE) and fills a FIFO queue qin
  • Workers (as many as settings.POOLS) get elements from qin and download them to USER_FOLDER/data/dumps/dom_tld
  • Landing pages are saved as _.html
  • Each worker processes the landing page; if the result is OK (status_code == 200), it tries to get robots.txt
  • On failure, it tries the next available engine (fallback)
  • It creates an object (('https://domain.com/robots.txt', 'robots', 'domain.com', depth=1, retries=0, engine))
  • Each worker retrieves the robots.txt; if "sitemaps" is defined in settings.CRAWL_METHODS, it attempts to get all sitemaps from robots.txt and dom_tld/sitemaps.xml
  • It creates objects (('https://domain.com/sitemap.xml', 'sitemaps', 'domain.com', depth=1, retries=0, engine)) for dom_tld/sitemap.xml and for any other sitemaps found in robots.txt
  • Every successful or failed download is logged as a row in USER_FOLDER/data/jsons/crawl_conn_meta*json with all the information available from the engine; these files are useful for statistics/reports from the spider
  • When there are no more elements in qin, after a 90-second timeout, jobs stop.
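
The sketch below illustrates the task tuples and the qout-to-qin hand-off described above. The helper names (landing_task, spread_by_domain) and the block-draining logic are assumptions for illustration, not the actual ispider_core internals.

from collections import deque
from queue import LifoQueue, Queue

# A crawl task mirrors the tuple shape used above:
# (url, stage, domain, depth, retries, engine)
def landing_task(domain, engine="httpx"):
    return (f"https://{domain}", "landing_page", domain, 0, 0, engine)

def spread_by_domain(tasks):
    """Interleave tasks from different domains so that consecutive
    requests do not hit the same host (the "sparse requests" behaviour)."""
    buckets = {}
    for task in tasks:
        buckets.setdefault(task[2], deque()).append(task)  # task[2] is the domain
    interleaved = []
    while buckets:
        for dom in list(buckets):
            interleaved.append(buckets[dom].popleft())
            if not buckets[dom]:
                del buckets[dom]
    return interleaved

qout = LifoQueue()            # tasks waiting to be scheduled
qin = Queue(maxsize=100000)   # tasks ready for the download workers (QUEUE_MAX_SIZE)

for dom in ["domain1.com", "domain2.com"]:
    qout.put(landing_task(dom))

# The scheduler thread drains a block from qout, spreads it, and refills qin.
block = [qout.get() for _ in range(qout.qsize())]
for task in spread_by_domain(block):
    qin.put(task)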

Spider - Depths > 0

  • It reads entries from USER_FOLDER/data/jsons/crawl_conn_meta*json for the domains in the list
  • It retrieves landing pages and sitemaps
  • If sitemaps are compressed, it uncompresses them
  • Extract all links from landing pages and sitemaps
  • Create objects (('https://domain.com/link1', 'internals', 'domain.com', depth=2, retries=0, engine))
  • Use the same engine that was used for the last successful request to the domain TLD
  • Add these objects to qout
  • The qin thread moves blocks from qout to qin, spreading ("sparsing") them across domains
  • Download all links, save them, and save data in JSON
  • Parse the HTML, extract all INTERNAL links, and follow them recursively, increasing the depth (see the sketch below)
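
A minimal sketch of the "extract internal links, follow at depth + 1" step, using only the standard library. The real spider's parsing, filtering (EXCLUDED_EXTENSIONS, EXCLUDED_EXPRESSIONS_URL) and deduplication are more involved, so treat this as an illustration of the task shape rather than the actual implementation.

from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

def internal_links(html, base_url):
    """Return absolute URLs that stay on the same host as base_url."""
    parser = LinkExtractor()
    parser.feed(html)
    base_host = urlparse(base_url).netloc
    return {
        absolute
        for absolute in (urljoin(base_url, href) for href in parser.links)
        if urlparse(absolute).netloc == base_host
    }

# Turn every internal link into the next-depth task tuple described above.
def next_depth_tasks(html, base_url, domain, depth, engine):
    return [(url, "internals", domain, depth + 1, 0, engine)
            for url in internal_links(html, base_url)]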

Schema

This is the architectural schema of the crawler/spider (see the schema image in the repository).

USAGE

Install it

pip install ispider

First use

from ispider_core import ISpider

if __name__ == '__main__':
    # Check the README for the complete list of available parameters
    config_overrides = {
        'USER_FOLDER': '/Your/Dump/Folder',
        'POOLS': 64,
        'ASYNC_BLOCK_SIZE': 32,
        'MAXIMUM_RETRIES': 2,
        'CRAWL_METHODS': [],
        'CODES_TO_RETRY': [430, 503, 500, 429],
        'CURL_INSECURE': True,
        'ENGINES': ['curl']
    }

    # Specify a list of domains
    doms = ['domain1.com', 'domain2.com']  # add all the domains to spider

    # Run
    with ISpider(domains=doms, **config_overrides) as spider:
        spider.run()

TO KNOW

On the first execution:

  • It creates the folder settings.USER_FOLDER
  • It downloads the following file into settings.USER_FOLDER/sources/:

https://raw.githubusercontent.com/danruggi/ispider/dev/static/exclude_domains.csv

This is a list of domains with effectively endless pages that would keep the script running forever (plus other domains that simply were not needed in my project). You can update the file in ~/.ispider/sources.

  • It creates settings.USER_FOLDER/data/ with dumps/ and jsons/
  • settings.USER_FOLDER/data/dumps holds the downloaded websites
  • settings.USER_FOLDER/data/jsons holds the connection results for every request (see the sketch below)
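
Since every request ends up as a row in crawl_conn_meta*json, a quick status report can be built from those files. The sketch below assumes the logs sit under USER_FOLDER/data/jsons/ and that each entry carries a status_code field; adapt the path and keys to the actual file contents.

import glob
import json
import os
from collections import Counter

user_folder = os.path.expanduser("~/.ispider")
status_counts = Counter()

for path in glob.glob(os.path.join(user_folder, "data", "jsons", "crawl_conn_meta*json")):
    with open(path) as fh:
        try:
            entries = json.load(fh)        # whole-file JSON (array or single object)
            if isinstance(entries, dict):
                entries = [entries]
        except json.JSONDecodeError:
            fh.seek(0)                     # fall back to one JSON object per line
            entries = [json.loads(line) for line in fh if line.strip()]
    for entry in entries:
        status_counts[entry.get("status_code")] += 1

print(status_counts.most_common())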

SETTINGS

The current default settings are listed below; an override example follows the listing:

    """
    ## *********************************
    ## GENERIC SETTINGS
    # Output folder for controllers, dumps and jsons
    USER_FOLDER = "~/.ispider/"

    # Log level
    LOG_LEVEL = 'DEBUG'

    ## i.e., status_code = 430
    CODES_TO_RETRY = [430, 503, 500, 429]
    MAXIMUM_RETRIES = 2

    # Delay before retrying after one of the status codes above
    TIME_DELAY_RETRY = 0

    ## Number of concurrent connections in the same process during crawling
    # Concurrent connections per process
    ASYNC_BLOCK_SIZE = 4

    # Concurrent processes (number of cores used, check your CPU spec)
    POOLS = 4

    # Max timeout for connecting
    TIMEOUT = 5

    # This needs to be a list.
    # curl is run as a subprocess, so make sure it is installed on your system.
    # Retries use the next available engine:
    # the script starts with the super-fast httpx,
    # on failure it falls back to curl,
    # and if that fails too it tries seleniumbase (headless, with UC mode activated)
    ENGINES = ['httpx', 'curl', 'seleniumbase']

    CURL_INSECURE = False

    ## *********************************
    # CRAWLER
    # Max file size dumped to disk, in bytes.
    # This avoids saving huge sitemaps with errors.
    MAX_CRAWL_DUMP_SIZE = 52428800

    # Max depth to follow in sitemaps
    SITEMAPS_MAX_DEPTH = 2

    # Crawler will get robots and sitemaps too
    CRAWL_METHODS = ['robots', 'sitemaps']

    ## *********************************
    ## SPIDER
    # Max queue size; up to 1 billion is fine on normal systems
    QUEUE_MAX_SIZE = 100000

    # Max depth to follow in websites
    WEBSITES_MAX_DEPTH = 2

    # This is not implemented yet
    MAX_PAGES_POR_DOMAIN = 1000000

    # These kinds of files are excluded.
    # The first bytes of content of some common file types are also tested,
    # so they are excluded even when the online resource has no extension
    EXCLUDED_EXTENSIONS = [
        "pdf", "csv",
        "mp3", "jpg", "jpeg", "png", "gif", "bmp", "tiff", "webp", "svg", "ico", "tif",
        "jfif", "eps", "raw", "cr2", "nef", "orf", "arw", "rw2", "sr2", "dng", "heif", "avif", "jp2", "jpx",
        "wdp", "hdp", "psd", "ai", "cdr", "ppsx"
        "ics", "ogv",
        "mpg", "mp4", "mov", "m4v",
        "zip", "rar"
    ]

    # Exclude all URLs that match these regexes
    EXCLUDED_EXPRESSIONS_URL = [
        # r'test',
    ]

    """

NOTES

  • Deduplication is not 100% safe: sometimes pages are downloaded multiple times and are only skipped by the file check. With ~10 domains, the duplicate check adds only a small delay, but with 10,000 domains and more than 500k links the seen-URL list grows so large that checking whether a link was already downloaded slowed the spider considerably (from ~30,000 URLs/min to ~300 URLs/min). That's why I preferred to avoid keeping a list and rely only on the file check.
