Assignment #4, due Jan 13, 9am: Scrape Wikipedia data for current S&P 500 constituents #16

@joachim-gassen

Description

Your task is to collect and tidy Wikipedia data for the companies that constitute the Standard & Poor’s 500 index. You can find a convenient list here. The idea is to scrape some data from each company’s Wikipedia page and to prepare a tidy dataset containing that data. You can decide for yourself what data you want to collect for each constituent, but things that come to mind are (see the sketch below the list):

• The info in the top right infobox
• The length of the Wikipedia article
• Some info on its revision history
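
To give you a starting point, here is a minimal sketch of the scraping step. It assumes the Wikipedia page “List of S&P 500 companies” as the source and uses pandas, requests, and BeautifulSoup, none of which are mandated; the 3M URL at the end is just an illustrative example.

```python
import pandas as pd
import requests
from bs4 import BeautifulSoup

LIST_URL = "https://en.wikipedia.org/wiki/List_of_S%26P_500_companies"

# read_html() parses every table on the page; at the time of writing the
# first table holds the constituents (symbol, security name, sector, ...).
constituents = pd.read_html(LIST_URL)[0]

def scrape_infobox(article_url):
    """Parse the top-right infobox of a company article into a flat dict."""
    soup = BeautifulSoup(requests.get(article_url).text, "html.parser")
    box = soup.find("table", class_="infobox")
    if box is None:
        return {}
    rows = {}
    for tr in box.find_all("tr"):
        th, td = tr.find("th"), tr.find("td")
        if th and td:
            rows[th.get_text(" ", strip=True)] = td.get_text(" ", strip=True)
    return rows

# Illustrative call only -- in your code, take the article links from the
# constituent table rather than hard-coding them.
print(scrape_infobox("https://en.wikipedia.org/wiki/3M"))
```

Note that the infobox labels differ across companies, so expect to do some cleaning before the data are tidy.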

Clearly, this is not an exhaustive list. Collect whatever data you find interesting and can obtain in a standardized way for a reasonable subset of firms. The tidy datasets should be stored in the “data” directory. If you feel like it, you can also prepare an informative visual based on your scraped data.
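Article length and revision history are easier to get via the MediaWiki API than by scraping HTML. A minimal sketch, assuming the standard api.php endpoint with action=query and prop=revisions; the example titles and the output file name are placeholders, and the “data” directory must already exist.

```python
import pandas as pd
import requests

API = "https://en.wikipedia.org/w/api.php"

def article_stats(title, n_revisions=10):
    """Return article length (bytes) and recent revision info for a title."""
    params = {
        "action": "query",
        "prop": "revisions",
        "titles": title,
        "rvprop": "timestamp|user|size",
        "rvlimit": n_revisions,
        "format": "json",
    }
    pages = requests.get(API, params=params).json()["query"]["pages"]
    revs = next(iter(pages.values())).get("revisions", [])
    return {
        "title": title,
        "length_bytes": revs[0]["size"] if revs else None,  # newest revision
        "last_edited": revs[0]["timestamp"] if revs else None,
        "n_recent_revisions": len(revs),
    }

# Example titles only -- loop over your scraped constituent list instead.
stats = pd.DataFrame([article_stats(t) for t in ["3M", "Apple Inc."]])
stats.to_csv("data/sp500_article_stats.csv", index=False)
```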

You can use whatever packages or resources you find helpful for the task. As always, please reference all resources you used in your code. Ideally, your code runs in the docker container. For Python users: please submit plain Python code, not Jupyter notebooks.

The deadline for this task is Monday, January 13th, 2020, 9am. Feel free to use this issue to discuss things that need clarification or to help each other.

Please note that I will be offline from Friday, 20th Dec, until Sunday, 5th Jan, 2020. Enjoy the break!
