# (PART) Web scraping {-}
# rvest
```{r warning=FALSE, message=FALSE}
library(tidyverse)
library(rvest)
```
The [rvest](https://rvest.tidyverse.org/index.html) package (as in "harvest") allows you to scrape information from a web page and read it into R. In this chapter, we'll explain the basics of rvest and walk you through an example.
## Web page basics
### HTML
HTML (_HyperText Markup Language_) defines the content and structure of a web page. In Chrome, you can view the HTML that generates a given web page by navigating to _View_ > _Developer_ > _Developer Tools_.
A series of elements, like paragraphs, headers, and tables, make up every HTML page. Here's a very simple web page and the HTML that generates it.
```{r echo=FALSE}
knitr::include_graphics("images/rvest/html-example.png", dpi = image_dpi)
```
The words surrounded by `< >` are HTML _tags_. Tags define where an element starts and ends. Elements, like paragraphs (`<p>`), headings (`<h1>`), and tables (`<table>`), start with an opening tag (`<tagname>`) and end with the corresponding closing tag (`</tagname>`).
Elements can be nested inside other elements. For example, notice that the `<tr>` tags, which generate rows of a table, are nested inside the `<table>` tag, and the `<td>` tags, which define the cells, are nested inside `<tr>` tags.
The HTML contains all the information we'd need if we wanted to read the animal data into R, but we'll need rvest to extract the table and turn it into a data frame.
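To make the nesting concrete, here is a minimal sketch that parses a small, made-up page from a literal HTML string. The page contents and the names `page_example`, `rows`, and `cell_tags` are hypothetical and are not taken from the example image.

```{r}
library(rvest)

# Hypothetical page: the animals and values are made up for illustration.
page_example <- read_html(
  "<html><body>
     <h1>Animals</h1>
     <table>
       <tr><th>animal</th><th>legs</th></tr>
       <tr><td>cat</td><td>4</td></tr>
       <tr><td>bird</td><td>2</td></tr>
     </table>
   </body></html>"
)

# The <tr> elements are nested inside the <table> element.
rows <- page_example %>% html_elements("table tr")
length(rows)

# The children of a data row are its <td> cells.
cell_tags <- rows[[2]] %>% html_children() %>% html_name()
cell_tags
```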
### CSS
CSS (_Cascading Style Sheets_) defines the appearance of HTML elements. _CSS selectors_ are often used to style particular subsets of elements, but you can also use them to extract elements from a web page.
CSS selectors often reflect the structure of the web page. For example, the CSS selector for the example page's heading is
`body > h1`
and the selector for the entire table is
`body > table`
You don't need to generate CSS selectors yourself. In the next section, we'll show you how to use your browser to figure out the correct selector.
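If you want to check what a selector matches, you can try it on a small document first. The sketch below assumes a made-up stand-in for the example page (`page_css` and its contents are hypothetical) and applies the two selectors shown above.

```{r}
library(rvest)

# Hypothetical stand-in for the example page (contents are made up).
page_css <- read_html(
  "<html><body>
     <h1>Animals</h1>
     <table><tr><td>cat</td></tr></table>
   </body></html>"
)

# `body > h1` matches an <h1> that is a direct child of <body>.
heading <- page_css %>% html_element("body > h1") %>% html_text()
heading

# `body > table` matches a <table> that is a direct child of <body>.
table_tag <- page_css %>% html_element("body > table") %>% html_name()
table_tag
```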
## Scrape data with rvest
[Our World in Data](https://ourworldindata.org) compiled data on world famines and made it available in [a table](https://ourworldindata.org/famines#the-our-world-in-data-dataset-of-famines).
```{r echo=FALSE}
knitr::include_graphics("images/rvest/famines-data.png", dpi = image_dpi)
```
Using this table as an example, we'll show you how to use rvest to scrape a web page's HTML, read in a particular element, and then convert HTML to a data frame.
### Read HTML
First, copy the URL of the web page and store it in a variable.
```{r}
url_data <- "https://ourworldindata.org/famines"
```
Next, use `rvest::read_html()` to read all of the HTML into R.
```{r}
url_data %>%
  read_html()
```
`read_html()` reads in all the HTML for the page. The page contains far more information than we need, so next we'll extract just the famines data table.
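`read_html()` also accepts a literal HTML string, which is handy for experimenting without a network connection. The following is only a sketch: `page_offline` and its contents are hypothetical. It shows that the result is a parsed document object you can store and pass to later steps.

```{r}
library(rvest)

# A literal HTML string stands in for a downloaded page here.
page_offline <- read_html(
  "<html><body><p>A placeholder paragraph.</p></body></html>"
)

# The result is an xml2 document that later steps can reuse.
class(page_offline)

paragraph_text <- page_offline %>% html_element("p") %>% html_text()
paragraph_text
```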
### Find the CSS selector
We'll find the CSS selector of the famines table and then use that selector to extract the data.
In Chrome, right-click on a cell near the top of the table, then click _Inspect_ (or _Inspect element_ in Safari or Firefox).
```{r echo=FALSE}
knitr::include_graphics("images/rvest/famines-inspect.png", dpi = image_dpi)
```
The developer console will open and highlight the HTML element corresponding to the cell you clicked.
```{r echo=FALSE}
knitr::include_graphics(
  "images/rvest/famines-inspect-developer.png",
  dpi = image_dpi
)
```
Hovering over different HTML elements in the _Elements_ pane will highlight different parts of the web page.
```{r echo=FALSE}
knitr::include_graphics("images/rvest/famines-hover.png", dpi = image_dpi)
```
Move your mouse up the HTML document, hovering over different lines until the entire table (and only the table) is highlighted. This will often be a line with a `<table>` tag.
```{r echo=FALSE}
knitr::include_graphics(
  "images/rvest/famines-highlight-table.png",
  dpi = image_dpi
)
```
Right-click on the line, then click _Copy_ > _Copy selector_ (Firefox: _Copy_ > _CSS selector_; Safari: _Copy_ > _Selector Path_).
```{r echo=FALSE}
knitr::include_graphics(
  "images/rvest/famines-copy-selector.png",
  dpi = image_dpi
)
```
Return to RStudio, create a variable for your CSS selector, and paste in the selector you copied.
```{r}
css_selector <- "#tablepress-73"
```
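A selector that starts with `#` matches an element by its `id` attribute. Here is a minimal sketch on a made-up page with two tables. It reuses the id from above, but `page_id`, `selected_id`, and the table contents are hypothetical.

```{r}
library(rvest)

# Hypothetical page with two tables; only one has the id we want.
page_id <- read_html(
  '<html><body>
     <table id="other"><tr><td>not this one</td></tr></table>
     <table id="tablepress-73"><tr><td>this one</td></tr></table>
   </body></html>'
)

# The `#` prefix selects the element whose id is "tablepress-73".
selected_id <- page_id %>%
  html_element(css = "#tablepress-73") %>%
  html_attr("id")
selected_id
```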
### Extract the table
You already saw how to read HTML into R with `rvest::read_html()`. Next, use `rvest::html_element()` to select just the element identified by your CSS selector.
```{r}
url_data %>%
  read_html() %>%
  html_element(css = css_selector)
```
The data is still in HTML. Use `rvest::html_table()` to turn the output into a tibble.
```{r}
url_data %>%
  read_html() %>%
  html_element(css = css_selector) %>%
  html_table()
```
Now, the data is ready for wrangling in R.
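The same conversion can be tried offline on a small table. This is a sketch with made-up data (`table_small` and its values are hypothetical). The first row uses `<th>` tags, which `html_table()` treats as column names by default.

```{r}
library(rvest)

# Hypothetical table: the animals and values are made up for illustration.
table_small <-
  read_html(
    "<html><body><table>
       <tr><th>animal</th><th>legs</th></tr>
       <tr><td>cat</td><td>4</td></tr>
       <tr><td>bird</td><td>2</td></tr>
     </table></body></html>"
  ) %>%
  html_element("table") %>%
  html_table()

table_small
```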
Note that `html_table()` will only work if the HTML element you've supplied is a table. If, for example, we wanted to extract a paragraph of text, we'd use `html_text()` instead.
```{r}
css_selector_paragraph <-
  "body > main > article > div.content-wrapper > div.offset-content > div > div > section:nth-child(1) > div > div:nth-child(1) > p:nth-child(9)"

url_data %>%
  read_html() %>%
  html_element(css = css_selector_paragraph) %>%
  html_text()
```
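`html_element()` returns only the first match for a selector. If you need every matching element, rvest also provides `html_elements()`. Here is a minimal sketch on a made-up page (`page_text` and its contents are hypothetical):

```{r}
library(rvest)

# Hypothetical page with two paragraphs.
page_text <- read_html(
  "<html><body><p>First paragraph.</p><p>Second paragraph.</p></body></html>"
)

# html_element() returns the first matching element only.
first_paragraph <- page_text %>% html_element("p") %>% html_text()
first_paragraph

# html_elements() returns all matching elements.
all_paragraphs <- page_text %>% html_elements("p") %>% html_text()
all_paragraphs
```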