The goal of this trial project is to create a spider with Scrapy to scrape artistic work information from a museum website. The specs are detailed below.
Note: The museum website was created for this trial and both it and the spider have no commercial value.
Create a Python 3.8+ spider to scrape all works in the "In Sunsh" and "Summertime" categories on the Scrapinghub Maybe Modern Art Collection website. It should navigate down the work browse tree (e.g.: Summertime / Wrapper From / Barn Owl) to the lowest level and parse on a per-work basis.
If a work does not have information for a specific field, please omit the field.
Fields:
- url: (string) URL of the work being scraped
- artist: (list of strings) List of artists for the work
- title: (string) Title of the work
- image: (string) URL of the image
- height: (float) Physical height in cm, only if available in cm
- width: (float) Physical width in cm, only if available in cm
- description: (string) Description of the work
- categories: (list of strings) Names of the categories visited to reach the item via the browse tree. Ex: ["Summertime", "Wrapper From", "Ao Shu"]
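As a rough illustration, a single yielded item might look like the following (all values are invented placeholders, not taken from the actual site, and a field is included only when the work page provides it):

    {
        "url": "https://<museum_site>/works/12345",
        "artist": ["Jane Doe", "John Roe"],
        "title": "Untitled Study",
        "image": "https://<museum_site>/images/12345.jpg",
        "height": 48.0,
        "width": 32.5,
        "description": "Oil on canvas.",
        "categories": ["Summertime", "Wrapper From", "Ao Shu"]
    }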
Scrapy is an application framework for crawling web sites and extracting structured data. It is the primary library used for data extraction at Zyte, which is the main reason why this trial is focused on Scrapy.
Using a basic spider should be enough to complete the task (we include a spider template at artworks/spiders/trial.py).
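As a rough sketch of the overall shape (not the template's actual contents), a basic spider could recurse down the browse tree and parse each work page along these lines; the start URL, CSS selectors and the cm-parsing helper below are placeholder assumptions you would adapt to the real site markup:

    import re
    import scrapy


    class TrialSpider(scrapy.Spider):
        name = "trial"
        # Placeholder start URL; use the real browse page of the trial site.
        start_urls = ["https://<museum_site>/browse/"]

        def parse(self, response, categories=()):
            # Follow sub-category links, accumulating the category path.
            for link in response.css("a.category"):  # placeholder selector
                name = link.css("::text").get("").strip()
                yield response.follow(
                    link, callback=self.parse,
                    cb_kwargs={"categories": categories + (name,)},
                )
            # At the lowest level, follow links to the individual works.
            for link in response.css("a.work"):  # placeholder selector
                yield response.follow(
                    link, callback=self.parse_work,
                    cb_kwargs={"categories": categories},
                )

        def parse_work(self, response, categories):
            item = {
                "url": response.url,
                "artist": response.css(".artist::text").getall(),
                "title": response.css("h1::text").get(),
                "description": response.css(".description::text").get(),
                "categories": list(categories),
            }
            image = response.css("img.work::attr(src)").get()  # placeholder selector
            if image:
                item["image"] = response.urljoin(image)
            for field, selector in (("height", ".height::text"), ("width", ".width::text")):
                value = self._cm(response.css(selector).get())
                if value is not None:
                    item[field] = value
            # Drop empty values so that missing fields are omitted.
            yield {key: value for key, value in item.items() if value}

        @staticmethod
        def _cm(text):
            # Return a float only when the dimension is explicitly given in cm.
            match = re.search(r"([\d.]+)\s*cm", text or "")
            return float(match.group(1)) if match else None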
However, we also have a comprehensive tutorial which provides deeper insight into Scrapy.
In addition, please see the assumptions.txt, feedback.txt and hours.txt files. The "Deliverable" section contains further information about them.
- Scrapy lets you run any spider locally:
  scrapy crawl trial
- You can also debug with Scrapy. This is really useful if you are getting responses different from the ones you see in the browser, or if you would like to check selectors quickly:
  $ scrapy shell <url_to_explore>
  >>> # experiment with the response
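For instance, in the shell you can quickly try out selectors, and a local crawl can be dumped to a file for inspection (the URL and selectors below are placeholders, not the site's real markup):

  $ scrapy shell https://<museum_site>/some-work/
  >>> response.css("h1::text").get()          # check a title selector
  >>> response.css(".artist::text").getall()  # check an artist selector
  $ scrapy crawl trial -O works.json          # save scraped items to a JSON file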
This trial requires a crawling job on Scrapy Cloud as a deliverable.
Zyte has its own command-line tool for making Scrapy Cloud deployments, called shub.
Important steps to make the deployment the right way:
- Log in to Scrapy Cloud by executing
  shub login
  You'll need to provide your Zyte API key at this step. It can be found at https://app.zyte.com/account/apikey
- Set the project id so that shub knows the target for the deployment. The scrapinghub.yml file needs to be updated with the correct project id. You can take it from the end of the URL you received with the invitation to the trial: https://app.zyte.com/p/<project_id>/
  Attention: this project is configured to use scrapy == 2.5, so if you are using a different version you'll need to update scrapinghub.yml and requirements.txt accordingly. For more information about the available scrapy versions in Scrapy Cloud, please refer to "Changing the Deploy Environment With Scrapy Cloud Stacks". A sketch of a matching scrapinghub.yml is shown after this list.
- Run
  shub deploy
  to upload the code to Scrapy Cloud.
- Run
  shub schedule trial
  to start a spider job on Scrapy Cloud. It is also possible to run the spider from the UI by hitting the Run button.
- Check your project page at https://app.zyte.com/p/<project_id>/ to see the collected data.
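As referenced in the project id step above, a minimal scrapinghub.yml along these lines should work; the project id is a placeholder, and the stack entry assumes you keep scrapy 2.5 (check the Scrapy Cloud stacks documentation for the exact stack names available):

  projects:
    default: <project_id>   # numeric id from https://app.zyte.com/p/<project_id>/
  stacks:
    default: scrapy:2.5
  requirements:
    file: requirements.txt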
- The spider, committed and pushed to this repository
  The spider should be written in Python 3.8+ and follow the PEP 8 style guidelines. Please also commit shub's scrapinghub.yml file (with any sensitive information removed). Please keep the following points in mind when delivering your code:
  - Commit history. Do not worry about delivering a clean commit history; make as many commits as you would while working normally. Please do not squash everything into a single commit, as we use the commit history to review the evolution of your work.
  - Code quality. The final version of your code will be considered finished and production-ready, so please make sure the spider results are complete and look correct upon inspection. You can check the online documentation to learn about Scrapy's features and best practices.
- A complete run of your finished spider in Scrapy Cloud
  You will receive an email with an invitation to a Scrapy Cloud project for this purpose (if you do not, please let us know). Instructions for deploying with Python 3 can be found here: https://helpdesk.zyte.com/support/solutions/articles/22000200387-deploying-python-3-spiders-to-scrapy-cloud
- Spent time report
  Please include a file called "hours.txt" with a list of the tasks you worked on and the amount of time you spent on each. You can be as detailed as you want, but a summary of high-level points is often enough (learning Scrapy, spider design, implementation, testing, etc.). Keep in mind that this is not a time competition; a clean implementation is preferable to a quick one.
- Assumptions and decisions report
  Please document the assumptions you made and the reasons for your decisions in a file called "assumptions.txt" in your repository. We will refer to this if your code contains things we did not expect.
- Feedback
  We would appreciate any feedback on this trial. Please put it into "feedback.txt". Some specific questions are provided, but we are eager to receive any feedback at all, even if you only have one thing to say. Please share!
We expect this project to take around 8-10 hours, depending on your level of experience. We do not want you to over-invest in this project, so if you are still working after spending 16 hours, please stop and submit what you have completed.
This project does not have a deadline. Just submit the results as soon as you have them ready (but no sooner), including the spent time report. The sooner you submit, the sooner we will move on to the next step.