ex26_ny-food-market-json.Rmd

---
title: "Food Markets in New York"
comment: "*from JSON to data frame*"
output:
  html_document:
    toc: true
    toc_float: true
---

```{r, echo = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

Mara Averick [recently tweeted](https://twitter.com/dataandme/status/804385103178825729) about a blog post from February 2015 by Zev Ross, demonstrating the conversion of JSON data into a data frame ([Using R to download and parse JSON: an example using data from an open data portal](http://zevross.com/blog/2015/02/12/using-r-to-download-and-parse-json-an-example-using-data-from-an-open-data-portal/)). I thought it would be interesting to revisit this data wrangling task with purrr, which has only been on CRAN since fall 2015.

Zev is dealing with data on food markets in New York State that, sadly but realistically, cannot be well-represented in simple rectangular form, such as CSV. Therefore the data is available as JSON and it's our job to turn it into a useful data frame. This is a very, very common task and a typical use case for purrr.

*I downloaded and unpacked the zip archive provided by Zev: [foodMarkets.zip](http://zevross.com/blog/wp-content/uploads/2015/02/foodMarkets.zip]). These files are available in this tutorial's repo in an unpacked state: [foodMarkets](https://github.com/jennybc/purrr-tutorial/tree/gh-pages/foodMarkets).*

## Load packages and the data

Packages used below:

  * jsonlite for parsing JSON
  * purrr for obvious reasons
  * listviewer for doing recon on awkard lists
  * tibble and dplyr for forming and manipulating data frames, especially the "tibble" variant (class `tbl_df`), which is friendly to list-columns.

```{r}
library(jsonlite)
library(purrr)
library(listviewer)
library(tibble)
library(dplyr)
food_mkts_raw <- fromJSON("foodMarkets/retail_food_markets.json",
                          simplifyVector = FALSE)
```

## Voyage of discovery: `data`

When you first import JSON, it is common to have no clue what's in there. You must explore.

```{r}
str(food_mkts_raw, max.level = 1)
```

Ooh, a `data` component sounds promising!

```{r}
str(food_mkts_raw$data, max.level = 1, list.len = 5)
```

Working hypothesis: we have data of similar form for each of `r length(food_mkts_raw$data)` food markets. A slightly mysterious, but homogeneous list.

Pull out just the `data` component and store as `food_mkts`.

```{r}
food_mkts <- food_mkts_raw[["data"]]
```

## Voyage of discovery: `meta` and `columns`

Before we move on, what's in the `meta` component of the raw JSON? It turns out we should focus on its sole component, which is named `view`.

```{r}
jsonedit(food_mkts_raw[[c("meta", "view")]])
```

The `columns` component is especially valuable, because it tells us what data we have for each food market over in the `data` component. This is an example of a common but, in my opinion, dangerous practice of storing information with no explanatory names and providing a separate lexicon. This is kind of like a CSV with variables named `X1` through `Xn` or no names at all.

We need to pull out the `name` elements of `columns`. We kick off with `[[` vector indexing into a nested list, to drill down to the `columns` sub-component. Then `map_chr(..., "name")` to grab the `name`s as character vector.

```{r}
(cnames <- food_mkts_raw[[c("meta", "view", "columns")]] %>% 
   map_chr("name"))
```

## Apply names

Downstream work will be easier if the food market elements bear the names in `cnames`. I believe this is how the JSON should have been to begin with. This makes exploration and troubleshooting easier and, more importantly, it helps in our goal to create a data frame.

```{r}
food_mkts <- food_mkts %>% 
  map(set_names, cnames)
```

## One variable

As a warm-up activity, pull out and simplify the data for one column, across all food markets. From Zev's post, I decided to look at `DBA Name` ("doing business as").

```{r}
food_mkts %>% 
  map_chr("DBA Name") %>% 
  head()
```

I think we'll want to trim whitespace at some point!

## Data frame of (almost) everything

Again, freeriding on Zev's post, I know that 22 of the 23 variables can be extracted in a fairly simple way. Everything but `Location`, which holds some unparsed JSON we must tackle separately. 

Create a vector of our targetted variable names:

```{r}
(to_process <- cnames[cnames != "Location"])
```

**Goal: create a data frame with these variables and one row per food market.**

### "Rows first"

Voice from the future: not all of these 22 variables are populated for every food market. There are explicit `NULL`s. Zev handled this with a two-layered approach: catch the `NULL`s and replace with `NA`s, prior to combining with data from other markets via `sapply()` and `data.frame()`.

purrr's handling of void, explicit `NULL`, and implicit `NULL` is quite good, but doesn't address this case either. Therefore I too must fanny around with `NULL`s before I can make my beautiful data frame.

Ideally, I would use `map_df()` like this for food markets instead of just the first 3:

```{r}
food_mkts[1:3] %>% 
  map_df(`[`, to_process)
```

But I have learned the hard way that the `NULL`s are a dealkiller.

How does `map_df()` actually work? First, it calls `map()` and, in my application,  gets the sub-list of non-`Location` variables for each food market. Then a list of these lists is shipped off to `dplyr::bind_rows()` for row-binding and type conversion. But the presence of `NULL`s breaks this workflow when there are markets, like the one below, that lack, say, a street number. Basically most of R's data-frame-making facilities balk at `NULL`s. Here is some exploration of that.

```{r error = TRUE}
food_mkts[[67]][14:16]
data.frame(food_mkts[[67]][14:16])
food_mkts[66:68] %>% 
  map_df(`[`, to_process)
```

I want to rescue this workflow, because the automatic type conversion from `dplyr::bind_rows()` is very appealing.

Therefore, like Zev, I use a custom function `safe_extract()`. It's a slightly expanded version of `[` that replaces `NULL` with `NA`.

```{r}
safe_extract <- function(l, wut) {
  res <- l[wut]
  null_here <- map_lgl(res, is.null)
  res[null_here] <- NA
  res
}
safe_extract(food_mkts[[67]][14:16])
```

Now I can get a data frame easily.

```{r}
(mkts_df <- food_mkts %>% 
    map_df(safe_extract, to_process))
```

Victory is ours. Well, partial victory. We still need to deal with `Location`.

### "Rows first" Recap

Self-contained code to go from raw JSON to data frame for 22 out of 23 variables, relying on automatic type conversion.

```{r eval = FALSE}
food_mkts_raw <- fromJSON("foodMarkets/retail_food_markets.json",
                          simplifyVector = FALSE)
cnames <- food_mkts_raw[[c("meta", "view", "columns")]] %>% 
  map_chr("name")

food_mkts <- food_mkts_raw[["data"]] %>% 
  map(set_names, cnames)

to_process <- cnames[cnames != "Location"]
safe_extract <- function(l, wut) {
  res <- l[wut]
  null_here <- map_lgl(res, is.null)
  res[null_here] <- NA
  res
}
mkts_df <- food_mkts %>% 
  map_df(safe_extract, to_process)
```

### "Columns first"

The more proper workflow is to build the columns first, with purrr's type-specific mapping functions, then put them into a data frame. This is safer because it forces you to articulate and enforce type expectations for each variable.

  * Good news: there's a built-in way to handle explicit `NULL`s.
  * Bad news: we need to specify type for each of the 22 variables.
  * ¯\\\_(ツ)\_/¯ news: we get to use more exotic features of purrr.

Here's a naïve approach for two variables:

```{r}
food_mkts %>% {
  tibble(
    dba_name = map_chr(., "DBA Name"),
    city = map_chr(., "City")
  )
}
```

Doing this for 22 variables is a drag, though it's certainly possible, and even reasonable for the ingest of an important dataset.

However, because I can, I store the type of each variable and do extraction + type conversion programmatically. This lets us explore more of the functional programming side of purrr.

I exploit the variable types learned from the automatic type conversion in the "rows first" approach. Here's a data frame with one row per variable, providing the expected class and the type-appropriate `map_*()` and `NA`. It will soon become clear why I use the specific variable names, `.f` and `.null`.

```{r}
(vdf <- mkts_df %>% 
   map_chr(class) %>%
   enframe(name = ".f", value = "type") %>% 
   mutate(vname = .f %>% tolower() %>% gsub("\\s+", "_", .),
          mapper = c(integer = "map_int", character = "map_chr")[type],
          .null = list(integer = NA_integer_,
                       character = NA_character_)[type]))
```

Here's how I could use `do.call()` to create one variable based on one row of this data frame.

```{r}
i <- 14 # DBA Name
vdf[i, ]
do.call(vdf$mapper[i],
        list(.x = food_mkts, .f = vdf$.f[i], .null = vdf$.null[[i]])) %>% 
  head()
```

`do.call()` is a base R function that constructs and executes a function call from a function (name or actual function) and a list of arguments. It's exactly what we need. But we also want to do this for each of the 22 variables.

How to scale up? We use the `invoke()` family of functions in purrr. `invoke()` itself is a pipe-friendly wrapper around `do.call()`. `invoke_map()` and friends allow you to map over functions and arguments in parallel, analagous to `map2()`. The big difference with `invoke_map()` is that the two primary inputs are a vector of functions and a vector of argument values. We use `invoke_map_df()` here, which is the variant that requests to get a data frame back.

```{r}
(mkts_df <- invoke_map_df(.f = set_names(vdf$mapper, vdf$vname),
                          .x = transpose(vdf[c(".f", ".null")]),
                          food_mkts))
```

Partial victory is ours again! A data frame, one row per food market, with 22 variables. How did this work? Line-by-line:

  * `.f = set_names(vdf$mapper, vdf$vname)` specifies the list of functions to iterate over: the type-specific mappers we pre-selected above. I bother to apply names here because they propagate to the variable names of the data frame.
  * `.x = transpose(vdf[c(".f", ".null")])` gives the parallel list of function arguments, namely the strings specifying named elements to extract and the value to insert in place of explicit `NULL`. Two things that might be confusing:
    - This instance of `.f` is at the variable level, i.e. inside our iteration. In the previous line, `.f` refers to the overall operation, i.e. setting up our iteration.
    - `transpose()` is needed to repackage the `.f` and `.null` variables. Instead of a list of two same-length vectors, we want a single list holding lists of length two.
  * `food_mkts` is the `...` argument. It's a common input across all units of iteration.

### "Columns first" Recap

Self-contained code to go from raw JSON to data frame for 22 out of 23 variables, with explicit type specification.

```{r eval = FALSE}
food_mkts_raw <- fromJSON("foodMarkets/retail_food_markets.json",
                          simplifyVector = FALSE)
cnames <- food_mkts_raw[[c("meta", "view", "columns")]] %>% 
  map_chr("name")
food_mkts <- food_mkts_raw[["data"]] %>% 
  map(set_names, cnames)

vdf <- tribble(
             ~ .f,      ~ type,
            "sid",   "integer",
             "id", "character",
       "position",   "integer",
     "created_at",   "integer",
   "created_meta", "character",
     "updated_at",   "integer",
   "updated_meta", "character",
           "meta", "character",
         "County", "character",
 "License Number", "character",
 "Operation Type", "character",
     "Estab Type", "character",
    "Entity Name", "character",
       "DBA Name", "character",
  "Street Number", "character",
    "Street Name", "character",
 "Address Line 2", "character",
 "Address Line 3", "character",
           "City", "character",
          "State", "character",
       "Zip Code", "character",
 "Square Footage", "character")
vdf <- vdf %>% 
  mutate(vname = .f %>% tolower() %>% gsub("\\s+", "_", .),
         mapper = c(integer = "map_int", character = "map_chr")[type],
         .null = list(integer = NA_integer_,
                      character = NA_character_)[type])
mkts_df <- invoke_map_df(.f = set_names(vdf$mapper, vdf$vname),
                         .x = transpose(vdf[c(".f", ".null")]),
                         food_mkts)
```

## Rescue `Location`

We need to process the 23rd variable, `Location`. There are no big new principles here, so this will heavy on the code and light on explanation.

### Prepare `Location`

Pull `Location` into its own list and inspect one element.

```{r}
loc_raw <- food_mkts %>%
  map("Location")
loc_raw[[345]]
```

Ah, more key-less JSON, giving rise to name-less lists. Before we do anything else, extract names from the `meta` component and apply them. Re-inspect an element to verify we've succeeded.

```{r}
i <- which(cnames == "Location")
location_meta <- food_mkts_raw[[c("meta", "view", "columns")]][[i]]
(lnames <- location_meta[["subColumnTypes"]] %>% flatten_chr())
loc_raw <- loc_raw %>% 
  map(set_names, lnames)
loc_raw[[345]]
```

### Parse the JSON string holding human address

The `human_address` element of each location holds a JSON string:

  * Parse it.
  * Rowbind elements into a data frame.
  * Prepend `ha_` to the variable names to avoid name conflicts downstream.

```{r}
ha <- loc_raw %>% 
  map("human_address") %>%
  map_df(fromJSON) %>% 
  set_names(paste0("ha_", names(.)))
head(ha)
```

### Convert remaining variables into a data frame

Convert everything else left in the location data to a data frame. Convert `latitude` and `longitude` into numeric variables and make sure they don't look insane. [This New York latitude and longitude map](http://www.mapsofworld.com/usa/states/new-york/lat-long.html) indicates that latitude should range between 40° and 50° and longitude between -80° and -72°.


```{r}
ee <- loc_raw %>% 
  map_df(safe_extract, lnames[lnames != "human_address"]) %>% 
  mutate_at(vars(latitude, longitude), as.numeric)
head(ee)
ee %>% 
  select(latitude, longitude) %>% 
  map(summary)
```

## Put everything together

Column bind the main data frame `mkts_df` with the human address data in `ha` and everything else from `Location`.
```{r}
(mkts_df <- bind_cols(mkts_df, ha, ee))
```

Trim any leading/trailing whitespace from the character variables.

```{r}
mkts_df <- mkts_df %>% 
  mutate_if(is.character, trimws)
```

Take a look at some of the most interpretable variables.

```{r}
mkts_df %>% 
  select(dba_name, city, latitude, longitude, square_footage)
```