rvest.Rmd


# (PART) Web scraping {-} 

# rvest

```{r warning=FALSE, message=FALSE}
library(tidyverse)
library(rvest)
```

The [rvest](https://rvest.tidyverse.org/index.html) package (as in "harvest") allows you to scrape information from a web page and read it into R. In this chapter, we'll explain the basics of rvest and walk you through an example. 

## Web page basics

### HTML

HTML (_Hyper Text Markup Language_) defines the content and structure of a web page. In Chrome, you can view the HTML that generates a given web page by navigating to _View_ > _Developer_ > _Developer tools_.

A series of elements, like paragraphs, headers, and tables, make up every HTML page. Here's a very simple web page and the HTML that generates it. 

```{r echo=FALSE}
knitr::include_graphics("images/rvest/html-example.png", dpi = image_dpi)
```

The words surrounded by `< >` are HTML _tags_. Tags define where an element starts and ends. Elements, like paragraph (`<p>`), headings (`<h1>`), and tables (`<table>`), start with an opening tag (`<tagname>`) and end with the corresponding closing tag (`</tagname>`).

Elements can be nested inside other elements. For example, notice that the `<tr>` tags,  which generate rows of a table, are nested inside the `<table>` tag, and the `<td>` tags, which define the cells, are nested inside `<tr>` tags. 

The HTML contains all the information we'd need if we wanted to read the animal data into R, but we'll need rvest to extract the table and turn it into a data frame.

### CSS

CSS (_Cascading Style Sheets_) defines the appearance of HTML elements. _CSS selectors_ are often used to style particular subsets of elements, but you can also use them to extract elements from a web page. 

CSS selectors often reflect the structure of the web page. For example, the CSS selector for the example page's heading is

`body > h1`

and the selector for the entire table is 

`body > table`

You don't need to generate CSS selectors yourself. In the next section, we'll show you how to use your browser to figure out the correct selector.

## Scrape data with rvest

[Our World in Data](https://ourworldindata.org) compiled data on world famines and made it available in [a table](https://ourworldindata.org/famines#the-our-world-in-data-dataset-of-famines). 

```{r echo=FALSE}
knitr::include_graphics("images/rvest/famines-data.png", dpi = image_dpi)
```

Using this table as an example, we'll show you how to use rvest to scrape a web page's HTML, read in a particular element, and then convert HTML to a data frame.

### Read HTML

First, copy the url of the web page and store it in a parameter.

```{r}
url_data <- "https://ourworldindata.org/famines"
```

Next, use `rvest::read_html()` to read all of the HTML into R.

```{r}
url_data %>% 
  read_html()
```

`read_html()` reads in all the HTML for the page. The page contains far more information than we need, so next we'll extract just the famines data table. 

### Find the CSS selector

We'll find the CSS selector of the famines table and then use that selector to extract the data. 

In Chrome, right click on a cell near the top of the table, then click _Inspect_ (or _Inspect element_ in Safari or Firefox).

```{r echo=FALSE}
knitr::include_graphics("images/rvest/famines-inspect.png", dpi = image_dpi)
```

The developer console will open and highlight the HTML element corresponding to the cell you clicked.

```{r echo=FALSE}
knitr::include_graphics(
  "images/rvest/famines-inspect-developer.png", 
  dpi = image_dpi
)
```

Hovering over different HTML elements in the _Elements_ pane will highlight different parts of the web page.

```{r echo=FALSE}
knitr::include_graphics("images/rvest/famines-hover.png", dpi = image_dpi)
```

Move your mouse up the HTML document, hovering over different lines until the entire table (and only the table) is highlighted. This will often be a line with a `<table>` tag.

```{r echo=FALSE}
knitr::include_graphics(
  "images/rvest/famines-highlight-table.png", 
  dpi = image_dpi
)
```

Right click on the line, then click _Copy_ > _Copy selector_ (Firefox: _Copy_ > _CSS selector_; Safari: _Copy_ > _Selector Path_).

```{r echo=FALSE}
knitr::include_graphics(
  "images/rvest/famines-copy-selector.png", 
  dpi = image_dpi
)
```

Return to RStudio, create a variable for your CSS selector, and paste in the selector you copied.

```{r}
css_selector <- "#tablepress-73"
```

### Extract the table

You already saw how to read HTML into R with `rvest::read_html()`. Next, use `rvest::html_element()` to select just the element identified by your CSS selector. 

```{r}
url_data %>% 
  read_html() %>% 
  html_element(css = css_selector) 
```

The data is still in HTML. Use `rvest::html_table()` to turn the output into a tibble.

```{r}
url_data %>% 
  read_html() %>% 
  html_element(css = css_selector) %>% 
  html_table()
```

Now, the data is ready for wrangling in R.

Note that `html_table()` will only work if the HTML element you've supplied is a table. If, for example, we wanted to extract a paragraph of text, we'd use `html_text()` instead.

```{r}
css_selector_paragraph <- 
  "body > main > article > div.content-wrapper > div.offset-content > div > div > section:nth-child(1) > div > div:nth-child(1) > p:nth-child(9)"

url_data %>% 
  read_html() %>% 
  html_element(css = css_selector_paragraph) %>% 
  html_text()
```