forked from jennybc/purrr-tutorial
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathex26_ny-food-market-json.Rmd
370 lines (275 loc) · 14.3 KB
/
ex26_ny-food-market-json.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
---
title: "Food Markets in New York"
comment: "*from JSON to data frame*"
output:
html_document:
toc: true
toc_float: true
---
```{r, echo = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
```
Mara Averick [recently tweeted](https://twitter.com/dataandme/status/804385103178825729) about a blog post from February 2015 by Zev Ross, demonstrating the conversion of JSON data into a data frame ([Using R to download and parse JSON: an example using data from an open data portal](http://zevross.com/blog/2015/02/12/using-r-to-download-and-parse-json-an-example-using-data-from-an-open-data-portal/)). I thought it would be interesting to revisit this data wrangling task with purrr, which has only been on CRAN since fall 2015.
Zev is dealing with data on food markets in New York State that, sadly but realistically, cannot be well-represented in simple rectangular form, such as CSV. Therefore the data is available as JSON and it's our job to turn it into a useful data frame. This is a very, very common task and a typical use case for purrr.
*I downloaded and unpacked the zip archive provided by Zev: [foodMarkets.zip](http://zevross.com/blog/wp-content/uploads/2015/02/foodMarkets.zip]). These files are available in this tutorial's repo in an unpacked state: [foodMarkets](https://github.com/jennybc/purrr-tutorial/tree/gh-pages/foodMarkets).*
## Load packages and the data
Packages used below:
* jsonlite for parsing JSON
* purrr for obvious reasons
* listviewer for doing recon on awkard lists
* tibble and dplyr for forming and manipulating data frames, especially the "tibble" variant (class `tbl_df`), which is friendly to list-columns.
```{r}
library(jsonlite)
library(purrr)
library(listviewer)
library(tibble)
library(dplyr)
food_mkts_raw <- fromJSON("foodMarkets/retail_food_markets.json",
simplifyVector = FALSE)
```
## Voyage of discovery: `data`
When you first import JSON, it is common to have no clue what's in there. You must explore.
```{r}
str(food_mkts_raw, max.level = 1)
```
Ooh, a `data` component sounds promising!
```{r}
str(food_mkts_raw$data, max.level = 1, list.len = 5)
```
Working hypothesis: we have data of similar form for each of `r length(food_mkts_raw$data)` food markets. A slightly mysterious, but homogeneous list.
Pull out just the `data` component and store as `food_mkts`.
```{r}
food_mkts <- food_mkts_raw[["data"]]
```
## Voyage of discovery: `meta` and `columns`
Before we move on, what's in the `meta` component of the raw JSON? It turns out we should focus on its sole component, which is named `view`.
```{r}
jsonedit(food_mkts_raw[[c("meta", "view")]])
```
The `columns` component is especially valuable, because it tells us what data we have for each food market over in the `data` component. This is an example of a common but, in my opinion, dangerous practice of storing information with no explanatory names and providing a separate lexicon. This is kind of like a CSV with variables named `X1` through `Xn` or no names at all.
We need to pull out the `name` elements of `columns`. We kick off with `[[` vector indexing into a nested list, to drill down to the `columns` sub-component. Then `map_chr(..., "name")` to grab the `name`s as character vector.
```{r}
(cnames <- food_mkts_raw[[c("meta", "view", "columns")]] %>%
map_chr("name"))
```
## Apply names
Downstream work will be easier if the food market elements bear the names in `cnames`. I believe this is how the JSON should have been to begin with. This makes exploration and troubleshooting easier and, more importantly, it helps in our goal to create a data frame.
```{r}
food_mkts <- food_mkts %>%
map(set_names, cnames)
```
## One variable
As a warm-up activity, pull out and simplify the data for one column, across all food markets. From Zev's post, I decided to look at `DBA Name` ("doing business as").
```{r}
food_mkts %>%
map_chr("DBA Name") %>%
head()
```
I think we'll want to trim whitespace at some point!
## Data frame of (almost) everything
Again, freeriding on Zev's post, I know that 22 of the 23 variables can be extracted in a fairly simple way. Everything but `Location`, which holds some unparsed JSON we must tackle separately.
Create a vector of our targetted variable names:
```{r}
(to_process <- cnames[cnames != "Location"])
```
**Goal: create a data frame with these variables and one row per food market.**
### "Rows first"
Voice from the future: not all of these 22 variables are populated for every food market. There are explicit `NULL`s. Zev handled this with a two-layered approach: catch the `NULL`s and replace with `NA`s, prior to combining with data from other markets via `sapply()` and `data.frame()`.
purrr's handling of void, explicit `NULL`, and implicit `NULL` is quite good, but doesn't address this case either. Therefore I too must fanny around with `NULL`s before I can make my beautiful data frame.
Ideally, I would use `map_df()` like this for food markets instead of just the first 3:
```{r}
food_mkts[1:3] %>%
map_df(`[`, to_process)
```
But I have learned the hard way that the `NULL`s are a dealkiller.
How does `map_df()` actually work? First, it calls `map()` and, in my application, gets the sub-list of non-`Location` variables for each food market. Then a list of these lists is shipped off to `dplyr::bind_rows()` for row-binding and type conversion. But the presence of `NULL`s breaks this workflow when there are markets, like the one below, that lack, say, a street number. Basically most of R's data-frame-making facilities balk at `NULL`s. Here is some exploration of that.
```{r error = TRUE}
food_mkts[[67]][14:16]
data.frame(food_mkts[[67]][14:16])
food_mkts[66:68] %>%
map_df(`[`, to_process)
```
I want to rescue this workflow, because the automatic type conversion from `dplyr::bind_rows()` is very appealing.
Therefore, like Zev, I use a custom function `safe_extract()`. It's a slightly expanded version of `[` that replaces `NULL` with `NA`.
```{r}
safe_extract <- function(l, wut) {
res <- l[wut]
null_here <- map_lgl(res, is.null)
res[null_here] <- NA
res
}
safe_extract(food_mkts[[67]][14:16])
```
Now I can get a data frame easily.
```{r}
(mkts_df <- food_mkts %>%
map_df(safe_extract, to_process))
```
Victory is ours. Well, partial victory. We still need to deal with `Location`.
### "Rows first" Recap
Self-contained code to go from raw JSON to data frame for 22 out of 23 variables, relying on automatic type conversion.
```{r eval = FALSE}
food_mkts_raw <- fromJSON("foodMarkets/retail_food_markets.json",
simplifyVector = FALSE)
cnames <- food_mkts_raw[[c("meta", "view", "columns")]] %>%
map_chr("name")
food_mkts <- food_mkts_raw[["data"]] %>%
map(set_names, cnames)
to_process <- cnames[cnames != "Location"]
safe_extract <- function(l, wut) {
res <- l[wut]
null_here <- map_lgl(res, is.null)
res[null_here] <- NA
res
}
mkts_df <- food_mkts %>%
map_df(safe_extract, to_process)
```
### "Columns first"
The more proper workflow is to build the columns first, with purrr's type-specific mapping functions, then put them into a data frame. This is safer because it forces you to articulate and enforce type expectations for each variable.
* Good news: there's a built-in way to handle explicit `NULL`s.
* Bad news: we need to specify type for each of the 22 variables.
* ¯\\\_(ツ)\_/¯ news: we get to use more exotic features of purrr.
Here's a naïve approach for two variables:
```{r}
food_mkts %>% {
tibble(
dba_name = map_chr(., "DBA Name"),
city = map_chr(., "City")
)
}
```
Doing this for 22 variables is a drag, though it's certainly possible, and even reasonable for the ingest of an important dataset.
However, because I can, I store the type of each variable and do extraction + type conversion programmatically. This lets us explore more of the functional programming side of purrr.
I exploit the variable types learned from the automatic type conversion in the "rows first" approach. Here's a data frame with one row per variable, providing the expected class and the type-appropriate `map_*()` and `NA`. It will soon become clear why I use the specific variable names, `.f` and `.null`.
```{r}
(vdf <- mkts_df %>%
map_chr(class) %>%
enframe(name = ".f", value = "type") %>%
mutate(vname = .f %>% tolower() %>% gsub("\\s+", "_", .),
mapper = c(integer = "map_int", character = "map_chr")[type],
.null = list(integer = NA_integer_,
character = NA_character_)[type]))
```
Here's how I could use `do.call()` to create one variable based on one row of this data frame.
```{r}
i <- 14 # DBA Name
vdf[i, ]
do.call(vdf$mapper[i],
list(.x = food_mkts, .f = vdf$.f[i], .null = vdf$.null[[i]])) %>%
head()
```
`do.call()` is a base R function that constructs and executes a function call from a function (name or actual function) and a list of arguments. It's exactly what we need. But we also want to do this for each of the 22 variables.
How to scale up? We use the `invoke()` family of functions in purrr. `invoke()` itself is a pipe-friendly wrapper around `do.call()`. `invoke_map()` and friends allow you to map over functions and arguments in parallel, analagous to `map2()`. The big difference with `invoke_map()` is that the two primary inputs are a vector of functions and a vector of argument values. We use `invoke_map_df()` here, which is the variant that requests to get a data frame back.
```{r}
(mkts_df <- invoke_map_df(.f = set_names(vdf$mapper, vdf$vname),
.x = transpose(vdf[c(".f", ".null")]),
food_mkts))
```
Partial victory is ours again! A data frame, one row per food market, with 22 variables. How did this work? Line-by-line:
* `.f = set_names(vdf$mapper, vdf$vname)` specifies the list of functions to iterate over: the type-specific mappers we pre-selected above. I bother to apply names here because they propagate to the variable names of the data frame.
* `.x = transpose(vdf[c(".f", ".null")])` gives the parallel list of function arguments, namely the strings specifying named elements to extract and the value to insert in place of explicit `NULL`. Two things that might be confusing:
- This instance of `.f` is at the variable level, i.e. inside our iteration. In the previous line, `.f` refers to the overall operation, i.e. setting up our iteration.
- `transpose()` is needed to repackage the `.f` and `.null` variables. Instead of a list of two same-length vectors, we want a single list holding lists of length two.
* `food_mkts` is the `...` argument. It's a common input across all units of iteration.
### "Columns first" Recap
Self-contained code to go from raw JSON to data frame for 22 out of 23 variables, with explicit type specification.
```{r eval = FALSE}
food_mkts_raw <- fromJSON("foodMarkets/retail_food_markets.json",
simplifyVector = FALSE)
cnames <- food_mkts_raw[[c("meta", "view", "columns")]] %>%
map_chr("name")
food_mkts <- food_mkts_raw[["data"]] %>%
map(set_names, cnames)
vdf <- tribble(
~ .f, ~ type,
"sid", "integer",
"id", "character",
"position", "integer",
"created_at", "integer",
"created_meta", "character",
"updated_at", "integer",
"updated_meta", "character",
"meta", "character",
"County", "character",
"License Number", "character",
"Operation Type", "character",
"Estab Type", "character",
"Entity Name", "character",
"DBA Name", "character",
"Street Number", "character",
"Street Name", "character",
"Address Line 2", "character",
"Address Line 3", "character",
"City", "character",
"State", "character",
"Zip Code", "character",
"Square Footage", "character")
vdf <- vdf %>%
mutate(vname = .f %>% tolower() %>% gsub("\\s+", "_", .),
mapper = c(integer = "map_int", character = "map_chr")[type],
.null = list(integer = NA_integer_,
character = NA_character_)[type])
mkts_df <- invoke_map_df(.f = set_names(vdf$mapper, vdf$vname),
.x = transpose(vdf[c(".f", ".null")]),
food_mkts)
```
## Rescue `Location`
We need to process the 23rd variable, `Location`. There are no big new principles here, so this will heavy on the code and light on explanation.
### Prepare `Location`
Pull `Location` into its own list and inspect one element.
```{r}
loc_raw <- food_mkts %>%
map("Location")
loc_raw[[345]]
```
Ah, more key-less JSON, giving rise to name-less lists. Before we do anything else, extract names from the `meta` component and apply them. Re-inspect an element to verify we've succeeded.
```{r}
i <- which(cnames == "Location")
location_meta <- food_mkts_raw[[c("meta", "view", "columns")]][[i]]
(lnames <- location_meta[["subColumnTypes"]] %>% flatten_chr())
loc_raw <- loc_raw %>%
map(set_names, lnames)
loc_raw[[345]]
```
### Parse the JSON string holding human address
The `human_address` element of each location holds a JSON string:
* Parse it.
* Rowbind elements into a data frame.
* Prepend `ha_` to the variable names to avoid name conflicts downstream.
```{r}
ha <- loc_raw %>%
map("human_address") %>%
map_df(fromJSON) %>%
set_names(paste0("ha_", names(.)))
head(ha)
```
### Convert remaining variables into a data frame
Convert everything else left in the location data to a data frame. Convert `latitude` and `longitude` into numeric variables and make sure they don't look insane. [This New York latitude and longitude map](http://www.mapsofworld.com/usa/states/new-york/lat-long.html) indicates that latitude should range between 40° and 50° and longitude between -80° and -72°.
```{r}
ee <- loc_raw %>%
map_df(safe_extract, lnames[lnames != "human_address"]) %>%
mutate_at(vars(latitude, longitude), as.numeric)
head(ee)
ee %>%
select(latitude, longitude) %>%
map(summary)
```
## Put everything together
Column bind the main data frame `mkts_df` with the human address data in `ha` and everything else from `Location`.
```{r}
(mkts_df <- bind_cols(mkts_df, ha, ee))
```
Trim any leading/trailing whitespace from the character variables.
```{r}
mkts_df <- mkts_df %>%
mutate_if(is.character, trimws)
```
Take a look at some of the most interpretable variables.
```{r}
mkts_df %>%
select(dba_name, city, latitude, longitude, square_footage)
```