forked from jennybc/purrr-tutorial
---
title: "List columns"
comment: "*creating, managing, and eliminating list-columns*"
output:
  html_document:
    toc: true
    toc_float: true
---
```{r, echo = FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
```
Data frames are a fantastic data structure for data analysis. We usually think of them as a data receptacle for several atomic vectors with a common length and with a notion of "observation", i.e. the i-th value of each atomic vector is related to all the other i-th values.
But data frames are not limited to atomic vectors. They can host general vectors, i.e. *lists*, as well. This is what I call a **list-column**.
List-columns and the data frame that hosts them require some special handling. In particular, it is highly advantageous if the data frame is a [tibble](https://github.com/tidyverse/tibble#readme), which anticipates list-columns. To work comfortably with list-columns, you need to develop techniques to:
* **Inspect**. What have I created?
* **Index**. How do I pull out specific bits by name or position?
* **Compute**. How do I operate on my list-column to make another vector or list-column?
* **Simplify**. How do I get rid of this list-column and back to a normal data frame?
The purrr package and all the techniques depicted in the other lessons come into heavy play here. This is a collection of worked examples that show these techniques applied specifically to list-columns.
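Before the worked examples, here is a minimal sketch of the four tasks above on a toy tibble. The names `toy`, `id`, and `words` are made up for illustration and do not appear elsewhere in this lesson.

```r
library(tibble)
library(dplyr)
library(purrr)
library(tidyr)

# A made-up tibble with one atomic column and one list-column
toy <- tibble(
  id    = 1:3,
  words = list(c("a", "b"), character(0), "c")
)

# Inspect: what have I created?
str(toy$words)

# Index: pull out the first element of the list-column
first_words <- toy$words[[1]]

# Compute: make a new atomic column from the list-column
toy <- toy %>% mutate(n = lengths(words))

# Simplify: one row per id * word; rows whose element is empty are dropped
toy_long <- toy %>% unnest(words)
```

Note that `unnest()` drops the row with the empty element by default, so `toy_long` has no row for `id` 2.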
## Regex and Trump tweets
### Load packages
```{r message = FALSE}
library(tidyverse)
library(stringr)
library(lubridate)
library(here) ## install.packages("here")
```
### Bring tweets in
We work with the same 7 tweets as in the [Trump Android words](ls08_trump-tweets.html) lesson. Go there for the rationale for choosing these 7 tweets.
```{r}
tb_raw <- read_csv(here("talks", "trump-tweets.csv"))
```
### Create a list-column of Trump Android words
Clean a variable and create a list-column:
* `source` comes in an unfriendly form. Simplify it to convey whether the tweet came from Android or iPhone.
* `twords` are what we'll call the "Trump Android words". See the [Trump Android words](ls08_trump-tweets.html) lesson for the backstory. **This is a list-column!**
```{r}
source_regex <- "android|iphone"
tword_regex <- "badly|crazy|weak|spent|strong|dumb|joke|guns|funny|dead"
tb <- tb_raw %>%
  mutate(source = str_extract(source, source_regex),
         twords = str_extract_all(tweet, tword_regex))
```
### Derive new variables
Add variables, two of which are based on the `twords` list-column.
* `n`: How many twords are in the tweet?
* `hour`: At which hour of the day was the tweet?
* `start`: Start character of each tword.
```{r}
tb <- tb %>%
  mutate(n = lengths(twords),
         hour = hour(created),
         start = gregexpr(tword_regex, tweet))
```
```{r include = FALSE}
# another possibility that would require more processing
# so less useful for a talk example
# but more useful IRL:
# str_locate_all(tweet, tword_regex))
```
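Since the chunks above depend on the tweets file, here is a small self-contained sketch of the same three building blocks. The strings and the pattern are made up for illustration; only `str_extract_all()`, `lengths()`, and `gregexpr()` come from the lesson.

```r
library(stringr)

# Two made-up strings and a made-up pattern, purely for illustration
toy_text  <- c("so weak and so dumb", "nothing to see here")
toy_regex <- "weak|dumb"

# str_extract_all() returns a list with one character vector per string
toy_words <- str_extract_all(toy_text, toy_regex)

# lengths() counts the matches in each list element
toy_n <- lengths(toy_words)

# gregexpr() returns a list of integer vectors holding match start positions;
# a string with no match gets -1
toy_start <- gregexpr(toy_regex, toy_text)
toy_first <- vapply(toy_start, min, integer(1))
```

The `-1` that `gregexpr()` reports for a string without a match is worth keeping in mind whenever you filter on start positions.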
### Use regular data manipulation toolkit
Let's isolate tweets created before 2pm, containing 1 or 2 twords, in which there's a tword that starts within the first 30 characters.
```{r}
tb %>%
  filter(hour < 14,
         between(n, 1, 2),
         between(map_int(start, min), 0, 30))
```
Let's isolate tweets that contain both the twords "strong" and "weak".
```{r}
tb %>%
  filter(map_lgl(twords, ~ all(c("strong", "weak") %in% .x)))
```
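The pattern in both filters is the same: `filter()` needs one logical value per row, and `map_lgl()` produces it from a per-element test on the list-column. A minimal sketch on made-up data (the tibble `toy_tw` and its contents are hypothetical):

```r
library(tibble)
library(dplyr)
library(purrr)

# Made-up data: a list-column of words per row
toy_tw <- tibble(
  id    = 1:3,
  words = list(c("strong", "weak"), "strong", character(0))
)

# all(): keep rows whose element contains BOTH words
both <- toy_tw %>%
  filter(map_lgl(words, ~ all(c("strong", "weak") %in% .x)))

# any(): keep rows whose element contains AT LEAST ONE of the words
either <- toy_tw %>%
  filter(map_lgl(words, ~ any(c("strong", "weak") %in% .x)))
```

Here `both` keeps only `id` 1, while `either` keeps `id` 1 and 2.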
## JSON from an API and Game of Thrones
### Load packages
```{r}
library(repurrrsive)
library(tidyverse)
library(httr)
library(stringr)
library(here)
```
### Call the API of Ice and Fire
Here's a simplified version of how we obtained the data on the Game of Thrones POV characters. This data appears as a more processed list in the [repurrrsive](https://github.com/jennybc/repurrrsive#readme) package.
* Get character IDs from repurrrsive. *cheating a little, humor me*
* Put IDs and character names in a tibble.
```{r}
pov <- set_names(map_int(got_chars, "id"),
                 map_chr(got_chars, "name"))
tail(pov, 5)
ice <- pov %>%
  enframe(value = "id")
ice
```
Request info for each character and store what comes back -- whatever that may be -- in the list-column `stuff`.
```{r}
ice_and_fire_url <- "https://anapioficeandfire.com/"
if (file.exists(here("talks", "ice.rds"))) {
  ice <- readRDS(here("talks", "ice.rds"))
} else {
  ice <- ice %>%
    mutate(
      response = map(id,
                     ~ GET(ice_and_fire_url,
                           path = c("api", "characters", .x))),
      stuff = map(response, ~ content(.x, as = "parsed",
                                      simplifyVector = TRUE))
    ) %>%
    select(-id, -response)
  saveRDS(ice, here("talks", "ice.rds"))
}
ice
```
Let's switch to a nicer version of `ice`, based on the list in repurrrsive, because it already has books and houses replaced with names instead of URLs.
```{r}
ice2 <- tibble(
  name = map_chr(got_chars, "name"),
  stuff = got_chars
)
ice2
```
Inspect the list-column.
```{r}
str(ice2$stuff[[9]], max.level = 1)
# if (interactive()) {
# listviewer::jsonedit(ice2$stuff[[2]], mode = "view", width = 500, height = 530)
# }
```
### Use regular data manipulation toolkit
Form a sentence of the form "NAME was born AT THIS TIME, IN THIS PLACE" by digging info out of the `stuff` list-column and placing it into a string template. No list-columns left!
```{r}
template <- "${name} was born ${born}."
birth_announcements <- ice2 %>%
  mutate(birth = map_chr(stuff, str_interp, string = template)) %>%
  select(-stuff)
birth_announcements
```
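The `map_chr()` call works because `string = template` is passed by name, so each element of `stuff` lands in the second argument of `str_interp()`, `env`, where the `${...}` placeholders are looked up. A minimal sketch on a single made-up record (the list `one_char` and its values are hypothetical):

```r
library(stringr)

# A made-up record standing in for one element of a list-column
one_char <- list(name = "Some Character", born = "in a made-up year")

# str_interp() looks up ${name} and ${born} in the list passed as env
toy_template <- "${name} was born ${born}."
announcement <- str_interp(toy_template, one_char)
announcement
#> [1] "Some Character was born in a made-up year."
```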
Extract each character's house allegiances. Keep only those with more than one allegiance. Then unnest to explode the `houses` list-column and get a tibble with one row per character * house combination. No list-columns left!
```{r}
allegiances <- ice2 %>%
  transmute(name,
            houses = map(stuff, "allegiances")) %>%
  filter(lengths(houses) > 1) %>%
  unnest(houses)  # name the column; unnest() without one is deprecated in tidyr >= 1.0
allegiances
```
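To see what the filter-then-unnest step does without the Game of Thrones data, here is a sketch on a made-up tibble. The names `toy_houses` and `toy_unnested` and the values are hypothetical.

```r
library(tibble)
library(dplyr)
library(tidyr)

# Made-up data: a character list-column with 2, 1, and 2 entries
toy_houses <- tibble(
  name   = c("A", "B", "C"),
  houses = list(c("House X", "House Y"), "House X", c("House Y", "House Z"))
)

# Keep rows with more than one entry, then explode the list-column
toy_unnested <- toy_houses %>%
  filter(lengths(houses) > 1) %>%
  unnest(houses)
```

After `unnest()`, `houses` is an ordinary character column and each `name` is repeated once per entry.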
## Aliases and allegiances of Game of Thrones characters
### Load packages
```{r}
library(tidyverse)
## install_github("jennybc/repurrrsive")
library(repurrrsive)
library(stringr)
```
## Lists as variables in a data frame
One row per GoT character. List columns for aliases and allegiances.
```{r}
x <- tibble(
  name = got_chars %>% map_chr("name"),
  aliases = got_chars %>% map("aliases"),
  allegiances = got_chars %>% map("allegiances")
)
x
#View(x)
```
What if we only care about characters with a "Lannister" allegiance? Practice operating on a list-column.
```{r}
x %>%
  mutate(lannister = map(allegiances, str_detect, pattern = "Lannister"),
         lannister = map_lgl(lannister, any))
```
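The two-step version above (detect per string, then collapse with `any()`) can also be written as a single `map_lgl()` call. A minimal sketch on a made-up list; `toy_allegiances` is hypothetical and only mimics the shape of the real column:

```r
library(purrr)
library(stringr)

# A made-up list shaped like the allegiances list-column
toy_allegiances <- list(
  c("House Lannister of Casterly Rock", "House Baratheon"),
  "House Stark of Winterfell",
  character(0)
)

# Two steps: detect per string, then collapse each result with any()
two_step <- map_lgl(map(toy_allegiances, str_detect, pattern = "Lannister"), any)

# One step: do both inside a single map_lgl()
one_step <- map_lgl(toy_allegiances, ~ any(str_detect(.x, "Lannister")))
```

Both give the same answer. An empty element yields `FALSE`, because `any()` of a zero-length logical vector is `FALSE`.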
Keep only the characters with a Lannister or Stark allegiance. You can use `filter()` with list-columns, but you need `map()` to apply your operation to each element of the list. Once I've got the characters I want, I drop `allegiances`, drop characters with no aliases, and use `unnest()` to get back to a simple data frame with no list-columns.
```{r}
x %>%
  filter(allegiances %>%
           map(str_detect, "Lannister|Stark") %>%
           map_lgl(any)) %>%
  select(-allegiances) %>%
  filter(lengths(aliases) > 0) %>%
  unnest(aliases) %>%  # name the column; unnest() without one is deprecated in tidyr >= 1.0
  print(n = Inf)
```
```{r eval = FALSE, include = FALSE}
x_base <- data.frame(
  name = vapply(got_chars, `[[`, character(1), "name"),
  aliases = I(lapply(got_chars, `[[`, "aliases")),
  allegiances = I(lapply(got_chars, `[[`, "allegiances"))
)
keep1 <- vapply(x_base$allegiances, function(y) any(grepl("Lannister|Stark", y)), logical(1))
x_base <- x_base[keep1, ]
x_base$allegiances <- NULL
x_base
data.frame(
  name = rep(x_base$name, lengths(x_base$aliases)),
  aliases = unlist(x_base$aliases)
)
```
## Nested data frame, modelling, and Gapminder
Another version of this same example is here:
<http://r4ds.had.co.nz/many-models.html>
*mostly code at this point, more words needed*
### Load packages
```{r}
library(tidyverse)
library(gapminder)
library(broom)
```
### Hello, again, Gapminder
```{r}
gapminder %>%
  ggplot(aes(year, lifeExp, group = country)) +
  geom_line(alpha = 1/3)
```
What if we fit a line to each country?
```{r}
gapminder %>%
  ggplot(aes(year, lifeExp, group = country)) +
  geom_line(stat = "smooth", method = "lm",
            alpha = 1/3, se = FALSE, colour = "black")
```
What if you actually want those fits, e.g. to access estimates, p-values, etc.? In that case, you need to fit the models yourself. How do you do that?
* Put the variables needed for country-specific models into a nested data frame. In a **list-column**!
* Use the usual "map inside mutate", possibly with the broom package, to pull interesting information out of the 142 fitted linear models.
### Nested data frame
Nest the data frames, i.e. get one meta-row per country:
```{r}
gap_nested <- gapminder %>%
  group_by(country) %>%
  nest()
gap_nested
gap_nested$data[[1]]
```
*Compare/contrast to a data frame grouped by country (dplyr-style) or split on country (base)*.
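One way to start on that comparison is to build all three forms from the same small data frame. This is only a sketch on made-up data (`toy_df` and friends are hypothetical), not the author's intended treatment.

```r
library(tibble)
library(dplyr)
library(tidyr)

# Made-up data with two groups
toy_df <- tibble(
  g = c("a", "a", "b"),
  x = c(1, 2, 3)
)

# dplyr-style grouping: same rows as before, plus grouping metadata
toy_grouped <- toy_df %>% group_by(g)

# Base split(): a named list of data frames, living outside any data frame
toy_split <- split(toy_df, toy_df$g)

# Nesting: one row per group, with that group's rows in the list-column `data`
toy_nested <- toy_df %>% group_by(g) %>% nest()
```

The grouped version still has three rows, the split version is a plain list of length two, and the nested version has two rows whose `data` column holds the per-group data frames, so further per-group results can be added as new columns next to it.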
### Fit models, extract results
Fit a model for each country.
```{r}
gap_fits <- gap_nested %>%
  mutate(fit = map(data, ~ lm(lifeExp ~ year, data = .x)))
```
Look at one fitted model, for concreteness.
```{r}
gap_fits %>% tail(3)
canada <- which(gap_fits$country == "Canada")
summary(gap_fits$fit[[canada]])
```
Let's get all the r-squared values!
```{r}
gap_fits %>%
  mutate(rsq = map_dbl(fit, ~ summary(.x)[["r.squared"]])) %>%
  arrange(rsq)
```
Let's use a function from broom to get the usual coefficient table from `summary.lm()` but in a friendlier form for downstream work.
```{r}
library(broom)
gap_fits %>%
  mutate(coef = map(fit, tidy)) %>%
  unnest(coef)
```
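When only one number per model is needed, base R's `coef()` inside `map_dbl()` is a lighter alternative to a full coefficient table. A minimal sketch of the whole nest, fit, extract workflow on made-up data (the tibble `toy_xy` and its column names are hypothetical):

```r
library(tibble)
library(dplyr)
library(purrr)
library(tidyr)

# Made-up data: group "a" follows y = 2x, group "b" follows y = 4 - x
toy_xy <- tibble(
  g = rep(c("a", "b"), each = 3),
  x = rep(1:3, times = 2),
  y = c(2, 4, 6, 3, 2, 1)
)

# Nest by group, fit one linear model per group, pull out each slope
toy_fits <- toy_xy %>%
  group_by(g) %>%
  nest() %>%
  mutate(fit   = map(data, ~ lm(y ~ x, data = .x)),
         slope = map_dbl(fit, ~ coef(.x)[["x"]]))
```

The `slope` column is an ordinary numeric vector (2 for group "a", -1 for group "b", up to floating point error), while `data` and `fit` remain list-columns.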