This is a web crawler built with Scrapy and designed to extract data from the SCP Wiki.
To install:

    make install

Then to run all of the spiders and create a full data dump of the SCP Wiki and SCP International Hub in the data directory:

    make crawl

Individual spiders with custom settings can also be called using the scrapy command line tool.

To show the available spiders:

    scrapy list

To crawl the International Hub for SCP Items and save to a custom location:

    scrapy crawl scp_int -o scp_international_items.json

There are two types of content downloaded: SCP Items and SCP Tales.
All content (both SCP Items and Tales) contains the following:
- URL
- Title
- Rating
- Tags
- History: revision ID, date, author, and comment
- Raw Content (the HTML for the story or item, without the site navigation and other boilerplate)
In addition, SCP Items include:
- SCP Identifier (e.g., SCP-3000)
- SCP Number (if available)
- SCP Series
  - 1-5 (with built-in support for future published series)
  - joke, explained, and decommissioned
  - Generic International (from the main site)
  - Specific Nationality Tag (from the international hub)
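
The field lists above can be pictured as a small set of record types. The following is only a sketch: the class names, attribute names, types, and the example series label are assumptions made for illustration and may not match the keys the crawler actually writes.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Revision:
    # One entry in a page's history (attribute names are assumptions).
    revision_id: str
    date: str
    author: str
    comment: str = ""


@dataclass
class Page:
    # Fields shared by SCP Items and Tales.
    url: str
    title: str
    rating: int
    tags: List[str] = field(default_factory=list)
    history: List[Revision] = field(default_factory=list)
    raw_content: str = ""  # HTML without site navigation


@dataclass
class ScpItem(Page):
    # Extra fields carried only by SCP Items.
    scp: str = ""                     # identifier, e.g. "SCP-3000"
    scp_number: Optional[int] = None  # None when no number is available
    series: str = ""                  # hypothetical label such as "series-3"


# Placeholder values only; this is not real crawled data.
item = ScpItem(
    url="http://example.invalid/scp-3000",
    title="SCP-3000",
    rating=0,
    tags=["scp"],
    scp="SCP-3000",
    scp_number=3000,
    series="series-3",
)
```

Modelling Tales as plain `Page` records and Items as a subclass mirrors the split described above, where Items carry everything a Tale does plus the SCP-specific fields.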
The crawler generates a series of JSON files, each containing an array of objects, one per crawled item.
| File | Source | Type | Target |
|---|---|---|---|
| goi.json | Main | Tale | goi |
| scp_items.json | Main | Item | scp |
| scp_titles.json | Main | Title | scp |
| scp_hubs.json | Main | Hub | scp |
| scp_tales.json | Main | Tale | scp |
| scp_int.json | International | Item | scp_int |
| scp_int_titles.json | International | Title | scp_int |
| scp_int_tales.json | International | Tale | scp_int |
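
Since each file holds a single JSON array, reading a dump back in needs only the standard library. The helper names below (`load_dump`, `load_available`) are hypothetical and not part of this project; the sketch assumes the files are UTF-8 encoded.

```python
import json
from pathlib import Path
from typing import Any, Dict, List


def load_dump(path: Path) -> List[Dict[str, Any]]:
    """Read one crawler output file and return its list of objects."""
    with path.open(encoding="utf-8") as handle:
        records = json.load(handle)
    if not isinstance(records, list):
        raise ValueError(f"{path} does not contain a JSON array")
    return records


def load_available(data_dir: Path, names: List[str]) -> Dict[str, List[Dict[str, Any]]]:
    """Load every listed file that exists in data_dir, keyed by file name."""
    return {
        name: load_dump(data_dir / name)
        for name in names
        if (data_dir / name).exists()
    }
```

For example, `load_available(Path("data"), ["scp_items.json", "scp_tales.json"])` would return whichever of those two files have already been generated, skipping any that are missing.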
Running `make TARGET` (such as `make goi` or `make scp`) will generate the site-specific files. Running `make data` will fill in any missing files.
To regenerate all files, run `make fresh`.
The postproc system takes the Titles, Hubs, Items, and Tales and uses them to generate a comprehensive set of objects. It combines and cross-references the data and expands on what is already there.
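
As a rough illustration of this kind of cross-referencing, the sketch below attaches titles from a Title list to the matching Items. It is not the project's postproc code: the function name `attach_titles` and the record keys `scp` and `title` are assumptions chosen for the example.

```python
from typing import Any, Dict, List


def attach_titles(
    items: List[Dict[str, Any]],
    titles: List[Dict[str, Any]],
    key: str = "scp",
) -> List[Dict[str, Any]]:
    """Return copies of items with a title filled in from the titles list."""
    # Index the title records by their identifier for constant-time lookup.
    title_by_id = {t[key]: t.get("title") for t in titles if key in t}

    merged = []
    for item in items:
        combined = dict(item)  # copy so the input records stay untouched
        title = title_by_id.get(item.get(key))
        if title is not None:
            combined["title"] = title
        merged.append(combined)
    return merged
```

Items with no matching title record pass through unchanged, so the step can run even when a Title file is incomplete.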
Text content on the SCP Wikis is available under the CC BY-SA 3.0 license.
This project does not download images.