dcl-docs
diff --git a/_bookdown.yml b/_bookdown.yml (+1)
diff --git a/_common.R b/_common.R (+1)
diff --git a/examples/html-example.html b/examples/html-example.html (+38)
diff --git a/images/rvest/famines-copy-selector.png b/images/rvest/famines-copy-selector.png (2.79 MB)
diff --git a/images/rvest/famines-data.png b/images/rvest/famines-data.png (1.68 MB)
diff --git a/images/rvest/famines-highlight-table.png b/images/rvest/famines-highlight-table.png (2.65 MB)
diff --git a/images/rvest/famines-hover.png b/images/rvest/famines-hover.png (2.49 MB)
diff --git a/images/rvest/famines-inspect-developer.png b/images/rvest/famines-inspect-developer.png (2.52 MB)
diff --git a/images/rvest/famines-inspect.png b/images/rvest/famines-inspect.png (1.76 MB)
diff --git a/images/rvest/html-example.png b/images/rvest/html-example.png (831 KB)
diff --git a/web-scraping.Rmd b/web-scraping.Rmd (+156)
@@ -18,6 +18,7 @@ rmd_files: [
   "census.Rmd",
   "googlesheets.Rmd",
   "read-write.Rmd",
+  "web-scraping.Rmd",
   "references.Rmd"
 ]
 
 
@@ -8,6 +8,7 @@ options(
 
 knitr::opts_chunk$set(
   comment = "#>",
+  collapse = FALSE,
   fig.align = 'center',
   fig.asp = 0.618,  # 1 / phi
   fig.show = "hold"
 
@@ -0,0 +1,38 @@
+<!doctype html>
+<html>
+
+<head>
+  <title>Page Title!</title>
+</head>
+
+<body>
+  
+  <h1>A page heading!</h1>
+
+  <p>This is a paragraph full of words.</p>
+
+  <p>Another paragraph, full of words.</p>
+  
+  <h2>A table</h2>
+    
+  <table>
+   <thead>
+    <tr>
+     <th style="text-align:left;"> animal </th>
+     <th style="text-align:right;"> n </th>
+    </tr>
+   </thead>
+   <tbody>
+    <tr>
+     <td style="text-align:left;"> alpaca </td>
+     <td style="text-align:right;"> 3 </td>
+    </tr>
+    <tr>
+     <td style="text-align:left;"> llama </td>
+     <td style="text-align:right;"> 8 </td>
+    </tr>
+   </tbody>
+  </table>
+
+</body>
+</html>
@@ -0,0 +1,156 @@
+
+# (PART) Web scraping {-} 
+
+# rvest
+
+```{r warning=FALSE, message=FALSE}
+library(rvest)
+library(tidyverse)
+```
+
+The [rvest package](https://rvest.tidyverse.org/index.html) (as in "harvest") allows you to scrape information from a web page and read it into R. In this chapter, we'll explain the basics of rvest and walk you through an example. 
+
+## Web page basics
+
+### HTML
+
+HTML (_HyperText Markup Language_) defines the content and structure of a web page. In Chrome, you can view the HTML that generates a given web page by navigating to _View_ > _Developer_ > _Developer Tools_.
+
+Every HTML page is made up of a series of elements, like paragraphs, headings, and tables. Here's a very simple web page and the HTML that generates it.
+
+```{r echo=FALSE}
+knitr::include_graphics("images/rvest/html-example.png", dpi = image_dpi)
+```
+
+The words surrounded by `< >` are HTML _tags_. Tags define where an element starts and ends. Elements, like paragraphs (`<p>`), headings (`<h1>`), and tables (`<table>`), start with an opening tag (`<tagname>`) and end with the corresponding closing tag (`</tagname>`).
+
+Elements can be nested inside other elements. For example, notice that the `<tr>` tags, which define the rows of a table, are nested inside the `<table>` tag, and the `<td>` tags, which define the cells, are nested inside the `<tr>` tags.
+
+The HTML contains all the information we'd need if we wanted to read the animal data into R, but we'll need rvest to extract the table and turn it into a data frame.
+
+### CSS
+
+CSS (_Cascading Style Sheets_) defines the appearance of HTML elements. _CSS selectors_ are often used to style particular subsets of elements, but you can also use them to extract elements from a web page. 
+
+CSS selectors often reflect the structure of the web page. For example, the CSS selector for the example page's heading is
+
+`body > h1`
+
+and the selector for the entire table is 
+
+`body > table`
+
+You don't need to generate CSS selectors yourself. In the next section, we'll show you how to use your browser to figure out the correct selector.
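+
+To see the two selectors above in action, here is a minimal sketch that applies them to a trimmed copy of the example page. It assumes the page is supplied as a string rather than read from a file, and the names `example_html`, `example_page`, `heading`, and `animals` are purely illustrative. It also borrows `html_node()`, `html_text()`, and `html_table()`, which are explained later in the chapter.
+
+```{r}
+library(rvest)
+
+# A trimmed copy of the example page, supplied as a string (illustrative only)
+example_html <- "<html>
+<body>
+  <h1>A page heading!</h1>
+  <p>This is a paragraph full of words.</p>
+  <table>
+    <thead><tr><th>animal</th><th>n</th></tr></thead>
+    <tbody>
+      <tr><td>alpaca</td><td>3</td></tr>
+      <tr><td>llama</td><td>8</td></tr>
+    </tbody>
+  </table>
+</body>
+</html>"
+
+# Parse the string into an HTML document
+example_page <- read_html(example_html)
+
+# `body > h1` selects the heading, which we convert to text
+heading <- example_page %>%
+  html_node(css = "body > h1") %>%
+  html_text()
+
+# `body > table` selects the table, which we convert to a data frame
+animals <- example_page %>%
+  html_node(css = "body > table") %>%
+  html_table()
+```
+
+If the selectors match, `heading` holds the heading text and `animals` is a data frame with one row per animal.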
+
+## Scrape data with rvest
+
+[Our World in Data](https://ourworldindata.org) compiled data on world famines and made it available in [a table](https://ourworldindata.org/famines#the-our-world-in-data-dataset-of-famines). 
+
+```{r echo=FALSE}
+knitr::include_graphics("images/rvest/famines-data.png", dpi = image_dpi)
+```
+
+Using this table as an example, we'll show you how to use rvest to scrape a web page's HTML, read in a particular element, and then convert HTML to a data frame.
+
+### Read HTML
+
+First, copy the URL of the web page and store it in a variable.
+
+```{r}
+url_data <- "https://ourworldindata.org/famines"
+```
+
+Next, use `rvest::read_html()` to read all of the HTML into R.
+
+```{r}
+url_data %>% 
+  read_html()
+```
+
+The result contains the HTML for the entire page, which is far more information than we need. Next, we'll extract just the famines data table.
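+
+Each call to `read_html()` on a URL downloads the page again. One option, sketched here on the assumption that you will query the same page several times, is to store the parsed page in a variable and reuse it. To keep the sketch runnable offline, it writes a small HTML file to a temporary location and reads that file instead of the famines URL. The names `path_example`, `page_example`, `heading_example`, and `paragraph_example` are hypothetical.
+
+```{r}
+library(rvest)
+
+# Write a small HTML page to a temporary file (a stand-in for a real web page)
+path_example <- tempfile(fileext = ".html")
+writeLines(
+  c(
+    "<html><body>",
+    "<h1>A page heading!</h1>",
+    "<p>This is a paragraph full of words.</p>",
+    "</body></html>"
+  ),
+  path_example
+)
+
+# Parse the page once and store the result
+page_example <- read_html(path_example)
+
+# Reuse the parsed page for several queries without reading it again
+heading_example <- page_example %>%
+  html_node(css = "h1") %>%
+  html_text()
+
+paragraph_example <- page_example %>%
+  html_node(css = "p") %>%
+  html_text()
+```
+
+The same pattern should carry over to a URL: assign the result of `read_html()` to a variable, then pipe that variable into the extraction functions.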
+
+### Find the CSS selector
+
+We'll find the CSS selector of the famines table and then use that selector to extract the data. 
+
+In Chrome, right-click on a cell near the top of the table and choose _Inspect_. ^[The process is very similar in Safari.]
+
+```{r echo=FALSE}
+knitr::include_graphics("images/rvest/famines-inspect.png", dpi = image_dpi)
+```
+
+The developer console will open and highlight the HTML element corresponding to the cell you clicked.
+
+```{r echo=FALSE}
+knitr::include_graphics(
+  "images/rvest/famines-inspect-developer.png", 
+  dpi = image_dpi
+)
+```
+
+Hovering over different HTML elements in the _Elements_ pane will highlight different parts of the webpage.
+
+```{r echo=FALSE}
+knitr::include_graphics("images/rvest/famines-hover.png", dpi = image_dpi)
+```
+
+Move your mouse up the HTML document, hovering over different lines until the entire table (and only the table) is highlighted. This will often be a line with a `<table>` tag.
+
+```{r echo=FALSE}
+knitr::include_graphics(
+  "images/rvest/famines-highlight-table.png", 
+  dpi = image_dpi
+)
+```
+
+Right-click on the line, then click _Copy_ > _Copy selector_.
+
+```{r echo=FALSE}
+knitr::include_graphics(
+  "images/rvest/famines-copy-selector.png", 
+  dpi = image_dpi
+)
+```
+
+Return to RStudio, create a variable for your CSS selector, and paste in the selector you copied.
+
+```{r}
+css_selector <- "#tablepress-73"
+```
+
+### Extract the table
+
+You already saw how to read HTML into R with `rvest::read_html()`. Next, use `rvest::html_node()` to select just the element identified by your CSS selector. 
+
+```{r}
+url_data <- "https://ourworldindata.org/famines"
+css_selector <- "#tablepress-73"
+
+url_data %>% 
+  read_html() %>% 
+  html_node(css = css_selector) 
+```
+
+The data is still in HTML. Use `rvest::html_table()` to turn the output into a data frame.
+
+```{r}
+url_data %>% 
+  read_html() %>% 
+  html_node(css = css_selector) %>% 
+  html_table() 
+```
+
+Now, the data is ready for wrangling in R.
+
+Note that `html_table()` will only work if the HTML element you've supplied is a table. If, for example, we wanted to extract a paragraph of text, we'd use `html_text()` instead.
+
+```{r}
+css_selector_paragraph <- 
+  "body > main > article > div.content-wrapper > div.offset-content > div > div > section:nth-child(1) > div > div:nth-child(1) > p:nth-child(9)"
+
+url_data %>% 
+  read_html() %>% 
+  html_node(css = css_selector_paragraph) %>% 
+  html_text()
+```
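+
+`html_node()` returns only the first element that matches a selector. When a selector matches several elements, rvest's `html_nodes()` returns all of them. Here is a minimal sketch on a made-up page. The HTML string and the names `paragraph_page`, `first_paragraph`, and `all_paragraphs` are illustrative and do not come from the famines site.
+
+```{r}
+library(rvest)
+
+# A small made-up page with two paragraphs
+paragraph_page <- read_html(
+  "<html><body>
+    <p>This is a paragraph full of words.</p>
+    <p>Another paragraph, full of words.</p>
+  </body></html>"
+)
+
+# html_node() keeps only the first matching paragraph
+first_paragraph <- paragraph_page %>%
+  html_node(css = "p") %>%
+  html_text()
+
+# html_nodes() keeps every matching paragraph, so html_text() returns a vector
+all_paragraphs <- paragraph_page %>%
+  html_nodes(css = "p") %>%
+  html_text()
+```
+
+Here `first_paragraph` is a single string, while `all_paragraphs` is a character vector with one entry per paragraph.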
+