Commit 4aaa488

add webscraping

skaltman committed
1 parent 2957c75 commit 4aaa488

11 files changed, +196 -0 lines changed

_bookdown.yml

+1
@@ -18,6 +18,7 @@ rmd_files: [
   "census.Rmd",
   "googlesheets.Rmd",
   "read-write.Rmd",
+  "web-scraping.Rmd",
   "references.Rmd"
 ]

_common.R

+1
@@ -8,6 +8,7 @@ options(

 knitr::opts_chunk$set(
   comment = "#>",
+  collapse = FALSE,
   fig.align = 'center',
   fig.asp = 0.618, # 1 / phi
   fig.show = "hold"

examples/html-example.html

+38
@@ -0,0 +1,38 @@
<!doctype html>
<html>

<head>
<title>Page Title!</title>
</head>

<body>

<h1>A page heading!</h1>

<p>This is a paragraph full of words.</p>

<p>Another paragraph, full of words.</p>

<h2>A table</h2>

<table>
<thead>
<tr>
<th style="text-align:left;"> animal </th>
<th style="text-align:right;"> n </th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:left;"> alpaca </td>
<td style="text-align:right;"> 3 </td>
</tr>
<tr>
<td style="text-align:left;"> llama </td>
<td style="text-align:right;"> 8 </td>
</tr>
</tbody>
</table>

</body>
</html>
2.79 MB

images/rvest/famines-data.png

1.68 MB
2.65 MB

images/rvest/famines-hover.png

2.49 MB
2.52 MB

images/rvest/famines-inspect.png

1.76 MB

images/rvest/html-example.png

831 KB

web-scraping.Rmd

+156
@@ -0,0 +1,156 @@

# (PART) Web scraping {-}

# rvest

```{r warning=FALSE, message=FALSE}
library(rvest)
library(tidyverse)
```

The [rvest package](https://rvest.tidyverse.org/index.html) (as in "harvest") allows you to scrape information from a web page and read it into R. In this chapter, we'll explain the basics of rvest and walk you through an example.

## Web page basics

### HTML

HTML (_HyperText Markup Language_) defines the content and structure of a web page. In Chrome, you can view the HTML that generates a given web page by navigating to _View_ > _Developer_ > _Developer tools_.

Every HTML page is made up of a series of elements, like paragraphs, headings, and tables. Here's a very simple web page and the HTML that generates it.

```{r echo=FALSE}
knitr::include_graphics("images/rvest/html-example.png", dpi = image_dpi)
```

The words surrounded by `< >` are HTML _tags_. Tags define where an element starts and ends. Elements, like paragraphs (`<p>`), headings (`<h1>`), and tables (`<table>`), start with an opening tag (`<tagname>`) and end with the corresponding closing tag (`</tagname>`).

Elements can be nested inside other elements. For example, notice that the `<tr>` tags, which generate rows of a table, are nested inside the `<table>` tag, and the `<td>` tags, which define the cells, are nested inside `<tr>` tags.

The HTML contains all the information we'd need if we wanted to read the animal data into R, but we'll need rvest to extract the table and turn it into a data frame.
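
As a minimal sketch of where this is headed, the example table can be parsed straight from a string, since `read_html()` accepts literal HTML as well as a URL or file path. The variable names `html_example` and `animals` are illustrative, and the HTML is a shortened copy of the example page.

```r
library(rvest)

# Shortened copy of the example page, stored as a string (illustrative name).
html_example <- "<html>
<body>
<h1>A page heading!</h1>
<table>
<thead>
<tr><th> animal </th><th> n </th></tr>
</thead>
<tbody>
<tr><td> alpaca </td><td> 3 </td></tr>
<tr><td> llama </td><td> 8 </td></tr>
</tbody>
</table>
</body>
</html>"

# Parse the string, select the <table> element, and convert it to a data frame.
animals <-
  html_example %>%
  read_html() %>%
  html_node(css = "table") %>%
  html_table()

animals
```

The result has one row per `<tr>` in the table body and one column per `<th>` in the header.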

### CSS

CSS (_Cascading Style Sheets_) defines the appearance of HTML elements. _CSS selectors_ are often used to style particular subsets of elements, but you can also use them to extract elements from a web page.

CSS selectors often reflect the structure of the web page. For example, the CSS selector for the example page's heading is

`body > h1`

and the selector for the entire table is

`body > table`

You don't need to generate CSS selectors yourself. In the next section, we'll show you how to use your browser to figure out the correct selector.
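
To make the selector syntax concrete, here is a small sketch that applies `body > h1` and a similar selector to a cut-down version of the example page. In CSS, `>` matches direct children, so `body > h1` means "an `<h1>` directly inside `<body>`". The names `page`, `heading`, and `paragraphs` are illustrative.

```r
library(rvest)

# Cut-down version of the example page, parsed from a string.
page <- read_html(
  "<html><body>
  <h1>A page heading!</h1>
  <p>This is a paragraph full of words.</p>
  <p>Another paragraph, full of words.</p>
  </body></html>"
)

# html_node() returns the first element that matches the selector.
heading <-
  page %>%
  html_node(css = "body > h1") %>%
  html_text()

# html_nodes() returns every match, here both paragraphs.
paragraphs <-
  page %>%
  html_nodes(css = "body > p") %>%
  html_text()

heading
paragraphs
```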

## Scrape data with rvest

[Our World in Data](https://ourworldindata.org) compiled data on world famines and made it available in [a table](https://ourworldindata.org/famines#the-our-world-in-data-dataset-of-famines).

```{r echo=FALSE}
knitr::include_graphics("images/rvest/famines-data.png", dpi = image_dpi)
```

Using this table as an example, we'll show you how to use rvest to scrape a web page's HTML, read in a particular element, and then convert the HTML to a data frame.

### Read HTML

First, copy the URL of the web page and store it in a variable.

```{r}
url_data <- "https://ourworldindata.org/famines"
```

Next, use `rvest::read_html()` to read all of the HTML into R.

```{r}
url_data %>%
  read_html()
```

The result is the full HTML document for the page, which contains far more information than we need, so next we'll extract just the famines data table.
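
Each call to `read_html()` on a URL downloads the page again, so it can help to store the parsed result and reuse it. The sketch below shows that pattern under one assumption: a temporary local file stands in for the URL so that the code runs offline. The names `path_example`, `page`, `page_title`, and `page_heading` are illustrative.

```r
library(rvest)

# Write a small HTML file to stand in for a real web page.
path_example <- tempfile(fileext = ".html")
writeLines(
  c(
    "<html>",
    "<head><title>Page Title!</title></head>",
    "<body><h1>A page heading!</h1></body>",
    "</html>"
  ),
  path_example
)

# Parse once and keep the result.
page <- read_html(path_example)

# Reuse the parsed page for several queries without reading the file again.
page_title <- page %>% html_node(css = "title") %>% html_text()
page_heading <- page %>% html_node(css = "h1") %>% html_text()
```

With a real URL, the only change would be passing the URL to `read_html()` in place of `path_example`.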

### Find the CSS selector

We'll find the CSS selector of the famines table and then use that selector to extract the data.

In Chrome, right click on a cell near the top of the table, then click _Inspect_.^[The process is very similar in Safari.]

```{r echo=FALSE}
knitr::include_graphics("images/rvest/famines-inspect.png", dpi = image_dpi)
```

The developer tools will open and highlight the HTML element corresponding to the cell you clicked.

```{r echo=FALSE}
knitr::include_graphics(
  "images/rvest/famines-inspect-developer.png",
  dpi = image_dpi
)
```

Hovering over different HTML elements in the _Elements_ pane will highlight different parts of the web page.

```{r echo=FALSE}
knitr::include_graphics("images/rvest/famines-hover.png", dpi = image_dpi)
```

Move your mouse up the HTML document, hovering over different lines until the entire table (and only the table) is highlighted. This will often be a line with a `<table>` tag.

```{r echo=FALSE}
knitr::include_graphics(
  "images/rvest/famines-highlight-table.png",
  dpi = image_dpi
)
```

Right click on the line, then click _Copy_ > _Copy selector_.

```{r echo=FALSE}
knitr::include_graphics(
  "images/rvest/famines-copy-selector.png",
  dpi = image_dpi
)
```

Return to RStudio, create a variable for your CSS selector, and paste in the selector you copied.

```{r}
css_selector <- "#tablepress-73"
```
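
The copied selector starts with `#`, which in CSS matches an element by its `id` attribute, so `#tablepress-73` refers to the element whose `id` is `tablepress-73`. Here is a small sketch of the same idea on made-up HTML. The `id` values `plants` and `animals` are hypothetical and do not come from the famines page.

```r
library(rvest)

# Two tables that differ only in their id attribute (hypothetical ids).
page <- read_html(
  "<html><body>
  <table id='plants'><tr><th>plant</th></tr><tr><td>fern</td></tr></table>
  <table id='animals'><tr><th>animal</th></tr><tr><td>llama</td></tr></table>
  </body></html>"
)

# `#animals` selects only the element with id = 'animals', skipping the first table.
animals <-
  page %>%
  html_node(css = "#animals") %>%
  html_table()

animals
```

Because an `id` should be unique within a page, an id selector is usually a short and stable way to point at a single table.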

### Extract the table

You already saw how to read HTML into R with `rvest::read_html()`. Next, use `rvest::html_node()` to select just the element identified by your CSS selector.

```{r}
url_data <- "https://ourworldindata.org/famines"
css_selector <- "#tablepress-73"

url_data %>%
  read_html() %>%
  html_node(css = css_selector)
```

The data is still in HTML. Use `rvest::html_table()` to turn the output into a data frame.

```{r}
url_data %>%
  read_html() %>%
  html_node(css = css_selector) %>%
  html_table()
```

Now, the data is ready for wrangling in R.

Note that `html_table()` will only work if the HTML element you've supplied is a table. If, for example, we wanted to extract a paragraph of text, we'd use `html_text()` instead.

```{r}
css_selector_paragraph <-
  "body > main > article > div.content-wrapper > div.offset-content > div > div > section:nth-child(1) > div > div:nth-child(1) > p:nth-child(9)"

url_data %>%
  read_html() %>%
  html_node(css = css_selector_paragraph) %>%
  html_text()
```
