# Data Wrangling {#datawrangline}
This chapter focuses on some of the most frequently used data manipulations and shows how to implement them in R and Python. Before analysis, it is critical to explore the data with descriptive statistics (mean, standard deviation, etc.) and data visualization, and to transform the data so that its structure meets the requirements of the model. You also need to summarize the results after analysis.
When the data is too large to fit in a computer's memory, we can use a big data analytics engine such as Spark on a cloud platform (see Chapter \@ref(bigdatacloudplatform)). Even though the user interfaces of many data platforms are much friendlier now, it is still easier to manipulate the data as a local data frame. Spark's R and Python interfaces aim to keep the data manipulation syntax consistent with popular packages for local data frames. As shown in Section \@ref(leveragesparkr), we can run nearly all of the `dplyr` functions on a Spark data frame once the Spark environment is set up. And the Python interface `pyspark` uses a syntax similar to `pandas`. This chapter focuses on data manipulations on standard data frames, which is also the foundation of big data manipulation.
Even when the data fits in memory, reading and manipulating it can still be slow if it is relatively large. Some R packages can make the process faster at the cost of familiarity, especially for data wrangling, while avoiding the hurdle of setting up a Spark cluster and working in an unfamiliar environment. That is not a topic in this chapter, but Appendix \@ref(largelocaldata) briefly introduces some of the alternative R packages to read, write and wrangle a data set that is relatively large but still fits in memory.
There are many fundamental data processing functions in base R, but they lack consistent coding conventions and don't flow together easily. Learning all of them is a daunting task and unnecessary. RStudio developed a collection of packages and bundled them as `tidyverse` to systemize data wrangling and analysis tasks. You can see the package list in `tidyverse` on the [website](https://www.tidyverse.org/packages/). This chapter focuses on some of the `tidyverse` packages for data wrangling, for the following reasons:
- Those packages are widely used among R users in data science.
- The code is more efficient.
- The code syntax is consistent, which makes it easier to remember and read.
Section \@ref(applyfamilyinbaser) introduces some base R functions outside the `tidyverse` universe, such as `apply()`, `lapply()` and `sapply()`. They are complementary functions when you are working with a data frame.
Load the R packages first:
```{r, message=FALSE,results="hide"}
# install packages from CRAN
p_needed <- c('dplyr','tidyr')
packages <- rownames(installed.packages())
p_to_install <- p_needed[!(p_needed %in% packages)]
if (length(p_to_install) > 0) {
  install.packages(p_to_install)
}
lapply(p_needed, require, character.only = TRUE)
```
## Summarize Data
### `dplyr` package
`dplyr` provides a flexible grammar of data manipulation focused on tools for working with data frames (hence the `d` in the name). It is faster and more user-friendly than base R:
- It identifies the most important data manipulations and makes them easy to use from R.
- It performs faster for in-memory data by writing key pieces in C++ using `Rcpp`.
- The interface is the same for data frames, data tables, and databases.
We will illustrate the following functions in order using the clothing company data:
1. Display
1. Subset
1. Summarize
1. Create new variable
1. Merge
```{r}
# Read data
sim.dat <- read.csv("http://bit.ly/2P5gTw4")
```
#### Display
- `tbl_df()`: Convert the data to a `tibble`, which offers better checking and printing capabilities than a traditional data frame. It adjusts the output width to fit the current window.
```r
tbl_df(sim.dat)
```
- `glimpse()`: This is like a transposed version of `tbl_df()`
```r
glimpse(sim.dat)
```
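Note that in newer versions of `dplyr`, `tbl_df()` is deprecated in favor of `tibble::as_tibble()`, which does the same conversion:
```r
# modern equivalent of tbl_df()
tibble::as_tibble(sim.dat)
```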
#### Subset
Get rows with `income` more than 300000:
```r
filter(sim.dat, income > 300000) %>%
  tbl_df()
```
Here we use the operator `%>%`. It is called the "pipe operator", which pipes a value forward into an expression or function call. What you get from the left-hand operation becomes the first argument (or the only argument) of the right-hand operation:
```r
x %>% f(y) = f(x, y)
y %>% f(x, ., z) = f(x, y, z)
```
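For instance, here is a minimal sketch using the `income` column of `sim.dat`, showing two equivalent ways to compute the mean income:
```r
# without the pipe: function calls nest inside out
mean(na.omit(sim.dat$income))
# with the pipe: each result is passed on as the first argument
sim.dat$income %>% na.omit() %>% mean()
```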
It is an operator from `magrittr` and can be really beneficial. The following R code is difficult to read and understand without the pipe operator:
```r
ave_exp <- filter(
  summarise(
    group_by(
      filter(
        sim.dat,
        !is.na(income)
      ),
      segment
    ),
    ave_online_exp = mean(online_exp),
    n = n()
  ),
  n > 200
)
```
The same function with pipe operator "`%>%`":
```r
ave_exp <- sim.dat %>%
  filter(!is.na(income)) %>%
  group_by(segment) %>%
  summarise(
    ave_online_exp = mean(online_exp),
    n = n()
  ) %>%
  filter(n > 200)
```
It is much easier to read:
1. Delete observations from `sim.dat` with missing income values
2. Group the data from step 1 by variable `segment`
3. Calculate the mean online expense for each segment and save the result as a new variable named `ave_online_exp`
4. Calculate the size of each segment and save it as a new variable named `n`
5. Get segments with size larger than 200
You can use `distinct()` to delete duplicated rows.
```r
dplyr::distinct(sim.dat)
```
`sample_frac()` randomly selects a specified fraction of the rows; `sample_n()` randomly selects a specified number of rows.
```r
dplyr::sample_frac(sim.dat, 0.5, replace = TRUE)
dplyr::sample_n(sim.dat, 10, replace = TRUE)
```
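In `dplyr` 1.0 and later, both functions are superseded by `slice_sample()`; the equivalent calls would look like this sketch:
```r
# slice_sample() supersedes sample_frac() and sample_n()
dplyr::slice_sample(sim.dat, prop = 0.5, replace = TRUE)
dplyr::slice_sample(sim.dat, n = 10, replace = TRUE)
```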
`slice()` will select rows by position:
```r
dplyr::slice(sim.dat, 10:15)
```
It is equivalent to `sim.dat[10:15,]`.
`top_n()` will select the top n rows ordered by a variable (here, the two rows with the highest `income`):
```r
dplyr::top_n(sim.dat, 2, income)
```
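In `dplyr` 1.0 and later, `top_n()` is superseded by `slice_max()` (and `slice_min()`); a sketch of the equivalent call:
```r
# slice_max() supersedes top_n()
dplyr::slice_max(sim.dat, order_by = income, n = 2)
```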
If you want to select columns instead of rows, you can use `select()`. Here are some examples:
```r
# select by column name
dplyr::select(sim.dat,income,age,store_exp)
# select columns whose name contains a character string
dplyr::select(sim.dat, contains("_"))
# select columns whose name ends with a character string
# similarly, there is "starts_with"
dplyr::select(sim.dat, ends_with("e"))
# select columns Q1,Q2,Q3,Q4 and Q5
dplyr::select(sim.dat, num_range("Q", 1:5))
# select columns whose names are in a group of names
dplyr::select(sim.dat, one_of(c("age", "income")))
# select columns between age and online_exp
dplyr::select(sim.dat, age:online_exp)
# select all columns except for age
dplyr::select(sim.dat, -age)
```
#### Summarize
Let us use a standard marketing problem, customer segmentation, to illustrate how to summarize data. The process usually starts with designing a survey and collecting data; a cluster analysis on the data then yields customer segments. Once we have the segments, the next step is to understand what each group of customers looks like by summarizing some key metrics. For example, we can do the following data aggregation for different segments of clothing customers.
```{r, message = FALSE}
dat_summary <- sim.dat %>%
  dplyr::group_by(segment) %>%
  dplyr::summarise(Age = round(mean(na.omit(age)), 0),
                   FemalePct = round(mean(gender == "Female"), 2),
                   HouseYes = round(mean(house == "Yes"), 2),
                   store_exp = round(mean(na.omit(store_exp),
                                          trim = 0.1), 0),
                   online_exp = round(mean(online_exp), 0),
                   store_trans = round(mean(store_trans), 1),
                   online_trans = round(mean(online_trans), 1))
# transpose the data frame for presentation
# because of the limited output width
cnames <- dat_summary$segment
dat_summary <- dplyr::select(dat_summary, -segment)
tdat_summary <- t(dat_summary) %>% data.frame()
names(tdat_summary) <- cnames
tdat_summary
```
Now, let's look at the code in more detail.

The first line, `sim.dat`, is easy: it is the data you want to work on. The second line, `group_by(segment)`, tells R that in the following steps you want to summarize by the variable `segment`. Here we only group by one categorical variable, but you can group by multiple variables, such as `group_by(segment, house)`. The third line, `summarise()`, tells R the manipulation(s) to do; you list the exact actions inside `summarise()`. For example, `Age = round(mean(na.omit(age)), 0)` tells R the following things:
1. Calculate the mean of column `age`, ignoring missing values, for each customer segment
2. Round the result to the specified number of decimal places
3. Store the result in a new variable named `Age`
The rest of the command above is similar. In the end, we calculate the following for each segment:
1. `Age`: average age
2. `FemalePct`: percentage of female customers
3. `HouseYes`: percentage of customers who own a house
4. `store_exp`: average expense in store
5. `online_exp`: average expense online
6. `store_trans`: average number of in-store transactions
7. `online_trans`: average number of online transactions
There is a lot of information you can extract from those simple averages.
- Conspicuous: The average age is about 40. It is a group of middle-aged wealthy people; 1/3 of them are female and 2/3 are male. They buy regardless of price, and almost all of them own a house (0.86).
- Price: They are older people with an average age of 60. Nearly all of them own a house (0.94). They are less likely to purchase online (`store_trans = 6` while `online_trans = 3`); it is the only group that is less likely to buy online.
- Quality: The average age is 35. They are not very different from the Conspicuous group in age, but they spend much less. The percentages of male and female are similar. They prefer online shopping. More than half of them don't own a house (0.66).
- Style: They are young people with an average age of 24. The majority of them are female (0.81), and most of them don't own a house (0.73). They are very likely to be digital natives who prefer online shopping.
You may notice that the Style group purchases more frequently online (`online_trans`), but their expense (`online_exp`) is not higher. That makes us wonder what the average expense per transaction is, to get a better idea of the group's price range.
The analytical process is cumulative rather than a sequence of independent steps: the current step sheds new light on what to do next, and sometimes you need to go back to fix something in the previous steps. Let's check the average one-time online and in-store purchase amounts:
```{r}
sim.dat %>%
  group_by(segment) %>%
  summarise(avg_online = round(sum(online_exp) / sum(online_trans), 2),
            avg_store = round(sum(store_exp) / sum(store_trans), 2))
```
The Price group has the lowest average one-time purchase amount, and the Conspicuous group pays the highest. When we build customer profiles in real life, we also need to look at the survey summarization. You may be surprised how much information simple data manipulations can provide.
Another common task is to check which columns have missing values. It requires the program to look at each column in the data; in this case you can use `summarise_all()`:
```{r}
# apply the function anyNA() to each column
# (the older funs_() helper is deprecated in current dplyr;
#  to apply several functions, pass a named list instead,
#  e.g. list(any_na = anyNA, is_factor = is.factor))
dplyr::summarise_all(sim.dat, anyNA)
```
The above code returns a one-row data frame indicating whether each column contains any missing values.
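In `dplyr` 1.0 and later, the recommended idiom is `across()` inside `summarise()`; a sketch of the equivalent call:
```r
# across() supersedes summarise_all() in dplyr >= 1.0
sim.dat %>%
  dplyr::summarise(dplyr::across(everything(), anyNA))
```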
#### Create new variable
There are often situations where you need to create new variables, for example, adding online and store expenses to get the total expense. In this case, you apply a function to columns and return a new column of the same length. `mutate()` can do it for you and appends one or more new columns:
```r
dplyr::mutate(sim.dat, total_exp = store_exp + online_exp)
```
The above code sums the two columns and appends the result (`total_exp`) to `sim.dat`. Another similar function is `transmute()`. The difference is that `transmute()` drops the original columns and only keeps the new ones.
```r
dplyr::transmute(sim.dat, total_exp = store_exp + online_exp)
```
#### Merge
Similar to SQL, there are different joins in `dplyr`. We create two baby data sets to show how the functions work.
```{r}
# note: avoid data.frame(cbind(...)), which would coerce all
# columns to character via the intermediate matrix
(x <- data.frame(ID = c("A", "B", "C"), x1 = c(1, 2, 3)))
(y <- data.frame(ID = c("B", "C", "D"), y1 = c(TRUE, TRUE, FALSE)))
```
```{r}
# join to the left
# keep all rows in x
left_join(x, y, by = "ID")
```
```{r}
# get rows matched in both data sets
inner_join(x, y, by = "ID")
```
```{r}
# get rows in either data set
full_join(x, y, by = "ID")
```
```r
# filter out rows in x that can be matched in y
# it doesn't bring in any values from y
semi_join(x, y, by = "ID")
```
```r
# the opposite of semi_join()
# it gets rows in x that cannot be matched in y
# it doesn't bring in any values from y
anti_join(x, y, by = "ID")
```
There are also set-operation functions (`intersect()`, `union()` and `setdiff()`), as well as the data frame versions of `rbind` and `cbind`, which are `bind_rows()` and `bind_cols()`. We are not going to go through them all; you can try them yourself. If you understand the functions introduced so far, it should be easy to figure out the rest.
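For example, a quick sketch of the two binding functions using the baby data sets above:
```r
# stack rows: the data frame version of rbind()
dplyr::bind_rows(x, x)
# combine columns side by side: the data frame version of cbind()
dplyr::bind_cols(x, data.frame(y1 = y$y1))
```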
### `apply()`, `lapply()` and `sapply()` in base R {#applyfamilyinbaser}
There are some powerful functions in base R to summarize data, such as `apply()`, `lapply()` and `sapply()`. They do the same basic thing and are all from the "apply" family: apply functions over parts of data. They differ in two important respects:
1. the type of object they apply to
1. the type of result they will return
When do we use `apply()`? When we want to apply a function to the margins of an array or matrix, which means the data needs to have a matrix-like structure. The operations can be very flexible. It returns a vector, array, or list of values obtained by applying a function to the margins of the array or matrix.
For example, you can compute row and column sums for a matrix:
```{r}
## simulate a matrix
x <- cbind(x1 = 1:8, x2 = c(4:1, 2:5))
dimnames(x)[[1]] <- letters[1:8]
apply(x, 2, mean)
col.sums <- apply(x, 2, sum)
row.sums <- apply(x, 1, sum)
```
You can also apply other functions:
```{r}
ma <- matrix(c(1:4, 1, 6:8), nrow = 2)
ma
apply(ma, 1, table) #--> a list of length 2
apply(ma, 1, stats::quantile) # 5 x n matrix with rownames
```
Results can have different lengths for each call. This is a trickier example. What will you get?
```r
## Example with different lengths for each call
z <- array(1:24, dim = 2:4)
zseq <- apply(z, 1:2, function(x) seq_len(max(x)))
zseq ## a 2 x 3 matrix
typeof(zseq) ## list
dim(zseq) ## 2 3
zseq[1,]
apply(z, 3, function(x) seq_len(max(x)))
```
- `lapply()` applies a function over a list, data.frame or vector and returns a list of the same length.
- `sapply()` is a user-friendly version and wrapper of `lapply()`. By default it returns a vector or matrix, or, if `simplify = "array"`, an array when appropriate. `sapply(x, f, simplify = FALSE, USE.NAMES = FALSE)` is the same as `lapply(x, f)`. With the default `simplify = TRUE`, it returns a vector or matrix instead of a list whenever the result can be simplified.
Let's use some data with context to help you better understand the functions.
- Get the mean and standard deviation of all numerical variables in the dataset.
```{r}
# Get numerical variables
sdat <- sim.dat[, lapply(sim.dat, class) %in% c("integer", "numeric")]
## Try the following code with the apply() function: apply(sim.dat, 2, class)
## What is the problem?
```
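An equivalent and arguably more readable way to keep only the numeric columns is `dplyr::select_if()` (or `select(where(is.numeric))` in dplyr 1.0+); a sketch:
```r
# keep only the integer/numeric columns
sdat <- dplyr::select_if(sim.dat, is.numeric)
```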
The data frame `sdat` only includes numeric columns. Now we can go ahead and use `apply()` to get the mean and standard deviation of each column:
```{r}
apply(sdat, MARGIN = 2, function(x) mean(na.omit(x)))
```
Here we define a function using `function(x) mean(na.omit(x))`. It is a very simple function that tells R to ignore missing values when calculating the mean. `MARGIN = 2` tells R to apply the function to each column; it is not hard to guess what `MARGIN = 1` means. The result shows that the average online expense is much higher than the store expense. You can also compare the average scores across different survey questions. The command to calculate the standard deviation is very similar; the only difference is to change `mean()` to `sd()`:
```{r}
apply(sdat, MARGIN = 2, function(x) sd(na.omit(x)))
```
Even though the average online expense is higher than the store expense, the standard deviation of the store expense is much larger than that of the online expense, which indicates there are likely some unusually large or small in-store purchases. We can check quickly:
```{r}
summary(sdat$store_exp)
summary(sdat$online_exp)
```
There are some odd values in the store expense: the minimum is -500, which is clearly an error and indicates that you should preprocess the data before analyzing it. Checking such simple statistics helps you understand your data and gives you some idea of how to preprocess and analyze it. How about using `lapply()` and `sapply()`?
Run the following code and compare the results:
```r
lapply(sdat, function(x) sd(na.omit(x)))
sapply(sdat, function(x) sd(na.omit(x)))
sapply(sdat, function(x) sd(na.omit(x)), simplify = FALSE)
```
## Tidy and Reshape Data
"Tidy data" represents the information from a dataset as data frames where each row is an observation, and each column contains the values of a variable (i.e., an attribute of what we are observing). Depending on the situation, the requirements on what to present as rows and columns may change. To make data easy to work with the problem at hand. In practice, we often need to convert data between the "wide" and the "long" format. The process feels like kneading the dough.
In this section, we will show how to tidy and reshape data using the `tidyr` package, which is built to simplify the process of creating tidy data. We will go through four fundamental functions:
- `gather()`: reshape data from wide to long
- `spread()`: reshape data from long to wide
- `separate()`: split a column into multiple columns
- `unite()`: combine multiple columns to one column
Take a baby subset of our example clothing consumer data to illustrate:
```{r}
sdat <- sim.dat[1:5, 1:6]
sdat
```
For the above data `sdat`, what if we want to reshape the data to have a column indicating the purchasing channel (i.e., `store_exp` or `online_exp`) and a second column with the corresponding expense amount, while keeping the rest of the columns the same? This is a task of changing data from "wide" to "long".
```{r}
dat_long <- tidyr::gather(sdat, "Channel", "Expense",
                          store_exp, online_exp)
dat_long
```
The above code gathers two variables (`store_exp` and `online_exp`), and collapses them into key-value pairs (`Channel` and `Expense`), duplicating all other columns as needed.
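Note that in `tidyr` 1.0 and later, `gather()` and `spread()` are superseded by `pivot_longer()` and `pivot_wider()`. A sketch of the equivalent `pivot_longer()` call:
```r
# pivot_longer() supersedes gather() in tidyr >= 1.0
tidyr::pivot_longer(sdat, cols = c(store_exp, online_exp),
                    names_to = "Channel", values_to = "Expense")
```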
You can run a regression to study the effect of purchasing channel as follows:
```r
# Here we use all observations from sim.dat
# Don't show result here
msdat <- tidyr::gather(sim.dat[, 1:6], "Channel", "Expense",
                       store_exp, online_exp)
fit <- lm(Expense ~ gender + house + income + Channel + age,
          data = msdat)
summary(fit)
```
Sometimes we want to reshape the data from "long" to "wide". For example, suppose you want to compare the online and in-store expenses between male and female customers by house ownership.
We need to reshape the long data frame `dat_long` back to a wide format by spreading the key-value pairs across two columns, and then summarize the wide data frame `dat_wide`, grouping by `house` and `gender`.
```{r, message = FALSE}
dat_wide = tidyr::spread(dat_long, Channel, Expense)
# you can check what dat_wide looks like
dat_wide %>%
  dplyr::group_by(house, gender) %>%
  dplyr::summarise(total_online_exp = sum(online_exp),
                   total_store_exp = sum(store_exp))
```
The above code also uses the functions in the `dplyr` package introduced in the previous section. Here we use `package::function` to make clear the package name. It is not necessary if the package is already loaded.
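As with `gather()`, `spread()` is superseded in `tidyr` 1.0+; a sketch of the equivalent `pivot_wider()` call for the reshaping step above:
```r
# pivot_wider() supersedes spread() in tidyr >= 1.0
dat_wide <- tidyr::pivot_wider(dat_long,
                               names_from = Channel,
                               values_from = Expense)
```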
Another pair of functions that do opposite manipulations are `separate()` and `unite()`.
```{r}
sepdat <- dat_long %>%
  separate(Channel, c("Source", "Type"))
sepdat
```
You can see that the function separates the original column "`Channel`" into two new columns, "`Source`" and "`Type`". You can use `sep =` to set the string or regular expression used to separate the column; by default, it splits at any non-alphanumeric character, such as the "`_`" here.
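For example, the same split with the separator given explicitly:
```r
# equivalent call with an explicit separator
dat_long %>%
  tidyr::separate(Channel, c("Source", "Type"), sep = "_")
```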
The `unite()` function does the opposite: it combines two columns into one. It is the generalization of `paste()` to a data frame.
```{r}
sepdat %>%
  unite("Channel", Source, Type, sep = "_")
```
The reshaping manipulations may be the trickiest part. You have to practice a lot to get familiar with those functions. Unfortunately, there is no shortcut.