| 1 | + |
| 2 | +# (PART) Web scraping {-} |
| 3 | + |
| 4 | +# rvest |
| 5 | + |
| 6 | +```{r warning=FALSE, message=FALSE} |
| 7 | +library(rvest) |
| 8 | +library(tidyverse) |
| 9 | +``` |
| 10 | + |
| 11 | +The [rvest package](https://rvest.tidyverse.org/index.html) (as in "harvest") allows you to scrape information from a web page and read it into R. In this chapter, we'll explain the basics of rvest and walk you through an example. |
| 12 | + |
| 13 | +## Web page basics |
| 14 | + |
| 15 | +### HTML |
| 16 | + |
| 17 | +HTML (_HyperText Markup Language_) defines the content and structure of a web page. In Chrome, you can view the HTML that generates a given web page by navigating to _View_ > _Developer_ > _Developer tools_. |
| 18 | + |
| 19 | +A series of elements, like paragraphs, headers, and tables, make up every HTML page. Here's a very simple webpage and the HTML that generates it. |
| 20 | + |
| 21 | +```{r echo=FALSE} |
| 22 | +knitr::include_graphics("images/rvest/html-example.png", dpi = image_dpi) |
| 23 | +``` |
| 24 | + |
| 25 | +The words surrounded by `< >` are HTML _tags_. Tags define where an element starts and ends. Elements, like paragraphs (`<p>`), headings (`<h1>`), and tables (`<table>`), start with an opening tag (`<tagname>`) and end with the corresponding closing tag (`</tagname>`). |
| 26 | + |
| 27 | +Elements can be nested inside other elements. For example, notice that the `<tr>` tags, which generate rows of a table, are nested inside the `<table>` tag, and the `<td>` tags, which define the cells, are nested inside `<tr>` tags. |
| 28 | + |
| 29 | +The HTML contains all the information we'd need if we wanted to read the animal data into R, but we'll need rvest to extract the table and turn it into a data frame. |
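As a minimal sketch of this nesting, the code below parses a small HTML string and counts the rows and cells inside the table. The HTML is a hypothetical stand-in for the example page (the animal values are made up), and `html_example`, `rows`, and `cells` are illustrative names.

```r
library(rvest)

# Hypothetical stand-in for the example page: a heading, a paragraph,
# and a small table of made-up animal data.
html_example <- "
<html>
  <body>
    <h1>Animals</h1>
    <p>Some made-up animal data.</p>
    <table>
      <tr><th>animal</th><th>legs</th></tr>
      <tr><td>cat</td><td>4</td></tr>
      <tr><td>bird</td><td>2</td></tr>
    </table>
  </body>
</html>
"

# read_html() also accepts a string of literal HTML
page <- read_html(html_example)

# <tr> elements are nested inside <table>; <td> elements are nested inside <tr>
rows <- html_nodes(page, "table tr")
cells <- html_nodes(page, "table tr td")

length(rows)   # one header row plus two data rows
length(cells)  # only <td> cells; the header uses <th>
```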
| 30 | + |
| 31 | +### CSS |
| 32 | + |
| 33 | +CSS (_Cascading Style Sheets_) defines the appearance of HTML elements. _CSS selectors_ are often used to style particular subsets of elements, but you can also use them to extract elements from a web page. |
| 34 | + |
| 35 | +CSS selectors often reflect the structure of the web page. For example, the CSS selector for the example page's heading is |
| 36 | + |
| 37 | +`body > h1` |
| 38 | + |
| 39 | +and the selector for the entire table is |
| 40 | + |
| 41 | +`body > table` |
| 42 | + |
| 43 | +You don't need to generate CSS selectors yourself. In the next section, we'll show you how to use your browser to figure out the correct selector. |
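To see how these two selectors behave, here is a minimal sketch that applies them to a small, hypothetical page defined inline (not the actual example page). The `>` in each selector means "direct child of".

```r
library(rvest)

# Hypothetical page with a heading and a table directly inside <body>
page <- read_html("
<html>
  <body>
    <h1>Animals</h1>
    <table>
      <tr><th>animal</th><th>legs</th></tr>
      <tr><td>cat</td><td>4</td></tr>
    </table>
  </body>
</html>
")

# `body > h1` matches an <h1> that is a direct child of <body>
heading <- page %>%
  html_node(css = "body > h1") %>%
  html_text()

# `body > table` matches a <table> that is a direct child of <body>
table_node <- page %>%
  html_node(css = "body > table")

heading
html_name(table_node)  # the tag name of the selected element
```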
| 44 | + |
| 45 | +## Scrape data with rvest |
| 46 | + |
| 47 | +[Our World in Data](https://ourworldindata.org) compiled data on world famines and made it available in [a table](https://ourworldindata.org/famines#the-our-world-in-data-dataset-of-famines). |
| 48 | + |
| 49 | +```{r echo=FALSE} |
| 50 | +knitr::include_graphics("images/rvest/famines-data.png", dpi = image_dpi) |
| 51 | +``` |
| 52 | + |
| 53 | +Using this table as an example, we'll show you how to use rvest to scrape a web page's HTML, read in a particular element, and then convert HTML to a data frame. |
| 54 | + |
| 55 | +### Read HTML |
| 56 | + |
| 57 | +First, copy the URL of the web page and store it in a variable. |
| 58 | + |
| 59 | +```{r} |
| 60 | +url_data <- "https://ourworldindata.org/famines" |
| 61 | +``` |
| 62 | + |
| 63 | +Next, use `rvest::read_html()` to read all of the HTML into R. |
| 64 | + |
| 65 | +```{r} |
| 66 | +url_data %>% |
| 67 | + read_html() |
| 68 | +``` |
| 69 | + |
| 70 | +`read_html()` returns the HTML for the entire page, which contains far more information than we need, so next we'll extract just the famines data table. |
| 71 | + |
| 72 | +### Find the CSS selector |
| 73 | + |
| 74 | +We'll find the CSS selector of the famines table and then use that selector to extract the data. |
| 75 | + |
| 76 | +In Chrome, right-click a cell near the top of the table, then click _Inspect_. ^[The process is very similar in Safari.] |
| 77 | + |
| 78 | +```{r echo=FALSE} |
| 79 | +knitr::include_graphics("images/rvest/famines-inspect.png", dpi = image_dpi) |
| 80 | +``` |
| 81 | + |
| 82 | +The developer console will open and highlight the HTML element corresponding to the cell you clicked. |
| 83 | + |
| 84 | +```{r echo=FALSE} |
| 85 | +knitr::include_graphics( |
| 86 | + "images/rvest/famines-inspect-developer.png", |
| 87 | + dpi = image_dpi |
| 88 | +) |
| 89 | +``` |
| 90 | + |
| 91 | +Hovering over different HTML elements in the _Elements_ pane will highlight different parts of the webpage. |
| 92 | + |
| 93 | +```{r echo=FALSE} |
| 94 | +knitr::include_graphics("images/rvest/famines-hover.png", dpi = image_dpi) |
| 95 | +``` |
| 96 | + |
| 97 | +Move your mouse up the HTML document, hovering over different lines until the entire table (and only the table) is highlighted. This will often be a line with a `<table>` tag. |
| 98 | + |
| 99 | +```{r echo=FALSE} |
| 100 | +knitr::include_graphics( |
| 101 | + "images/rvest/famines-highlight-table.png", |
| 102 | + dpi = image_dpi |
| 103 | +) |
| 104 | +``` |
| 105 | + |
| 106 | +Right-click the line, then click _Copy_ > _Copy selector_. |
| 107 | + |
| 108 | +```{r echo=FALSE} |
| 109 | +knitr::include_graphics( |
| 110 | + "images/rvest/famines-copy-selector.png", |
| 111 | + dpi = image_dpi |
| 112 | +) |
| 113 | +``` |
| 114 | + |
| 115 | +Return to RStudio, create a variable for your CSS selector, and paste in the selector you copied. |
| 116 | + |
| 117 | +```{r} |
| 118 | +css_selector <- "#tablepress-73" |
| 119 | +``` |
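The copied selector starts with `#`, which in CSS selects an element by its `id` attribute. As a rough illustration, the sketch below uses a hypothetical page with two tables and shows that the selector picks out only the one whose `id` matches.

```r
library(rvest)

# Hypothetical page with two tables; only one has the id we want
page <- read_html('
<html>
  <body>
    <table id="other"><tr><td>not this one</td></tr></table>
    <table id="tablepress-73"><tr><td>this one</td></tr></table>
  </body>
</html>
')

# `#tablepress-73` matches the element whose id attribute is "tablepress-73"
selected <- html_node(page, css = "#tablepress-73")

html_attr(selected, "id")
html_text(selected)
```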
| 120 | + |
| 121 | +### Extract the table |
| 122 | + |
| 123 | +You already saw how to read HTML into R with `rvest::read_html()`. Next, use `rvest::html_node()` to select just the element identified by your CSS selector. |
| 124 | + |
| 125 | +```{r} |
| 126 | +url_data <- "https://ourworldindata.org/famines" |
| 127 | +css_selector <- "#tablepress-73" |
| 128 | + |
| 129 | +url_data %>% |
| 130 | + read_html() %>% |
| 131 | + html_node(css = css_selector) |
| 132 | +``` |
| 133 | + |
| 134 | +The data is still in HTML. Use `rvest::html_table()` to turn the output into a data frame. |
| 135 | + |
| 136 | +```{r} |
| 137 | +url_data %>% |
| 138 | + read_html() %>% |
| 139 | + html_node(css = css_selector) %>% |
| 140 | + html_table() |
| 141 | +``` |
| 142 | + |
| 143 | +Now, the data is ready for wrangling in R. |
| 144 | + |
| 145 | +Note that `html_table()` will only work if the HTML element you've supplied is a table. If, for example, we wanted to extract a paragraph of text, we'd use `html_text()` instead. |
| 146 | + |
| 147 | +```{r} |
| 148 | +css_selector_paragraph <- |
| 149 | + "body > main > article > div.content-wrapper > div.offset-content > div > div > section:nth-child(1) > div > div:nth-child(1) > p:nth-child(9)" |
| 150 | + |
| 151 | +url_data %>% |
| 152 | + read_html() %>% |
| 153 | + html_node(css = css_selector_paragraph) %>% |
| 154 | + html_text() |
| 155 | +``` |
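Each pipeline above calls `read_html()` again, which downloads the page each time. One possible alternative, sketched here, is to parse the page once, store the result, and then extract several elements from it. The inline HTML, the `#intro` and `#animals` ids, and the animal values are all hypothetical so that the example runs offline.

```r
library(rvest)

# Hypothetical inline page so the example runs offline; with a real
# page, you would call read_html() on the URL here instead.
page <- read_html('
<html>
  <body>
    <p id="intro">Some made-up animal data.</p>
    <table id="animals">
      <tr><th>animal</th><th>legs</th></tr>
      <tr><td>cat</td><td>4</td></tr>
      <tr><td>bird</td><td>2</td></tr>
    </table>
  </body>
</html>
')

# A table element becomes a data frame with html_table()
animals <- page %>%
  html_node(css = "#animals") %>%
  html_table()

# A paragraph element becomes a character string with html_text()
intro <- page %>%
  html_node(css = "#intro") %>%
  html_text()

animals
intro
```

Storing the parsed page in a variable also means later extractions all see the same version of the page, even if the site changes between requests.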
| 156 | + |