scp_crawler

This is a web crawler built with Scrapy and designed to extract data from the SCP Wiki.

Installation

make install

Simple Crawl

To run all of the spiders and create a full data dump of the SCP Wiki and SCP International Hub in the data directory:

make crawl

Custom Crawl with scrapy cli

Individual spiders can also be run with custom settings using the scrapy command-line tool.

To list the available spiders:

scrapy list

To crawl the International Hub for SCP Items and save to a custom location:

scrapy crawl scp_int -o scp_international_items.json
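
Once a crawl finishes, the output file can be read back with the standard library. The sketch below assumes the file was written by Scrapy's JSON feed export, so it holds a single JSON array of objects. The helper name load_items is hypothetical and not part of this project.

```python
import json
from pathlib import Path


def load_items(path):
    """Load a Scrapy JSON feed export (one JSON array of objects)."""
    with Path(path).open(encoding="utf-8") as handle:
        items = json.load(handle)
    # Guard against files that are valid JSON but not an array of items.
    if not isinstance(items, list):
        raise ValueError(f"{path} does not contain a JSON array")
    return items


if __name__ == "__main__":
    # File name taken from the command above; it must exist before running.
    items = load_items("scp_international_items.json")
    print(f"Loaded {len(items)} items")
```

In recent Scrapy versions, -o appends to an existing file while -O overwrites it, so a parse error from this loader may mean the file holds the output of more than one run.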

Raw Content Structure

Two types of content are downloaded: SCP Items and SCP Tales.

All content (both SCP Items and Tales) contains the following:

URL
Title
Rating
Tags
History (revision ID, date, author, and comment for each revision)
Raw Content (the HTML for the story or item, without the site navigation and other boilerplate)

In addition, SCP Items include:

SCP Identifier (e.g., SCP-3000)
SCP Number (if available)
SCP Series
- 1-5 (with built-in support for future published series)
- joke, explained, and decommissioned
- Generic International (from the main site)
- Specific Nationality Tag (from the international hub)
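
The structure described above can be sketched as a pair of data classes. This is an illustration only: the field names (revision_id, raw_content, scp_number, and so on) are assumptions chosen to mirror the list above, and the keys in the actual JSON output may differ.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Revision:
    """One entry in a page's history (hypothetical field names)."""
    revision_id: str
    date: str
    author: str
    comment: str


@dataclass
class Page:
    """Fields shared by SCP Items and Tales (hypothetical field names)."""
    url: str
    title: str
    rating: int
    tags: List[str] = field(default_factory=list)
    history: List[Revision] = field(default_factory=list)
    raw_content: str = ""  # HTML of the story or item, without site boilerplate


@dataclass
class ScpItem(Page):
    """Extra fields carried only by SCP Items (hypothetical field names)."""
    scp: str = ""                     # identifier, e.g. "SCP-3000"
    scp_number: Optional[int] = None  # only set when a number is available
    series: str = ""                  # series number, joke, explained, ...
```

Keeping scp_number optional reflects the "if available" note above: some items have an identifier but no usable number.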

Generated Files

The crawler generates a series of JSON files, each containing an array of objects, one per crawled item.

File	Source	Type	Target
goi.json	Main	Tale	goi
scp_items.json	Main	Item	scp
scp_titles.json	Main	Title	scp
scp_hubs.json	Main	Hub	scp
scp_tales.json	Main	Tale	scp
scp_int.json	International	Item	scp_int
scp_int_titles.json	International	Title	scp_int
scp_int_tales.json	International	Tale	scp_int

Running make TARGET (for example, make goi or make scp) generates the files for that target. Running make data generates any files that are missing.

To regenerate all files, run make fresh.
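
The table above can also be expressed as a small lookup, which is handy for checking which files a target produces or which files are still missing. This is a sketch: the names GENERATED_FILES, files_for_target, and missing_files are hypothetical, and it assumes the files live in the data directory mentioned earlier.

```python
from pathlib import Path

# (file, source, type, make target), transcribed from the table above.
GENERATED_FILES = [
    ("goi.json", "Main", "Tale", "goi"),
    ("scp_items.json", "Main", "Item", "scp"),
    ("scp_titles.json", "Main", "Title", "scp"),
    ("scp_hubs.json", "Main", "Hub", "scp"),
    ("scp_tales.json", "Main", "Tale", "scp"),
    ("scp_int.json", "International", "Item", "scp_int"),
    ("scp_int_titles.json", "International", "Title", "scp_int"),
    ("scp_int_tales.json", "International", "Tale", "scp_int"),
]


def files_for_target(target):
    """Return the file names that a given make target generates."""
    return [name for name, _source, _type, tgt in GENERATED_FILES if tgt == target]


def missing_files(data_dir="data"):
    """Return the generated files that do not yet exist in data_dir."""
    root = Path(data_dir)
    return [name for name, *_rest in GENERATED_FILES if not (root / name).exists()]


if __name__ == "__main__":
    print("make scp generates:", files_for_target("scp"))
    print("missing:", missing_files())
```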

Post Processed Data

The postproc system takes the Titles, Hubs, Items, and Tales files and generates a comprehensive set of objects from them. It combines and cross-references the data and expands on what the raw files already contain.
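
As a rough illustration of this kind of cross-referencing, the sketch below merges titles into items by a shared identifier. It is not the project's postproc code: the function attach_titles and the keys "scp" and "title" are assumptions made for the example.

```python
def attach_titles(items, titles, key="scp"):
    """Return copies of items with a title merged in from the titles list.

    Both inputs are lists of dicts. The "scp" and "title" keys are
    assumptions for this sketch, not the crawler's confirmed schema.
    """
    # Index titles by identifier so each item lookup is a dict access.
    title_by_id = {t[key]: t["title"] for t in titles if key in t and "title" in t}
    merged = []
    for item in items:
        combined = dict(item)  # copy so the caller's data is not mutated
        ident = item.get(key)
        if ident in title_by_id:
            combined["title"] = title_by_id[ident]
        merged.append(combined)
    return merged
```

Items without a matching title are passed through unchanged, so a partial titles file does not drop any items.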

Content Licensing

Text content on the SCP Wikis is available under the CC BY-SA 3.0 license.

This project does not download images.
