generated from dcl-docs/book
-
Notifications
You must be signed in to change notification settings - Fork 3
/
Copy pathmanip-one-table.Rmd
155 lines (108 loc) · 4.46 KB
/
manip-one-table.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
# Other single-table verbs
You've learned the most important verbs for data analysis: `filter()`, `mutate()`, `group_by()` and `summarize()`. There are a number of other verbs that are not quite as important but still come in handy from time-to-time. The goal of this section is to familiarize you with their purpose and basic operation.
```{r setup, message = FALSE}
library(tidyverse)
library(nycflights13)
```
## Select
Most of the datasets you'll work with in this class only have a relatively small number of variables, and generally you don't need to reduce further. In real life, you'll sometimes encounter datasets with hundreds or even thousands of variables, and the first challenge is just to narrow down to a useful subset. Solving that problem is the job of `select()`.
`select()` allows you to work with column names using a handful of helper functions:
* `starts_with("x")` and `ends_with("x")` select variables that start with a
common prefix or end with a common suffix.
* `contains("x")` selects variables that contain a phrase. `matches("x.y")`
select all variables that match a given regular expression (which you'll
learn about later in the course).
* `a:e` selects all variables from variable `a` to variable `e` inclusive.
You can also select a single variable just by using its name directly.
```{r}
flights %>%
select(year:day, ends_with("delay"))
```
To remove variables from selection, put an exclamation point `!` in front of the expression.
```{r}
flights %>%
select(!starts_with("dep"))
```
## Rename
To change the name of a variable use `df %>% rename(new_name = old_name)`. If you have trouble remembering which sides old and new go on, remember it's the same order as `mutate()`.
```{r}
flights %>%
rename(tail_num = tailnum)
```
If you're selecting and renaming, note that you can also use `select()` to rename. This sometimes allows you to save a step.
```{r}
flights %>%
select(year:day, tail_num = tailnum)
```
## Change column order
Use `relocate()` to change column positions. It uses the same syntax as `select()`. The arguments `.before` and `.after` indicate where to move columns. If neither argument is supplied, columns are moved before the first column.
```{r}
flights %>%
relocate(carrier, flight, origin, dest, .after = dep_time)
```
## Transmute
Transmute is a minor variation of `mutate()`. The main difference is that it drops any variables that you didn't explicitly mention. It's a useful shortcut for `mutate()` + `select()`.
```{r}
df <- tibble(x = 1:3, y = 3:1)
# mutate() keeps x and y
df %>%
mutate(z = x + y)
# transmute() drops x and y
df %>%
transmute(z = x + y)
```
## Arrange
`arrange()` lets you change the order of the rows. To put a column in descending order, use `desc()`.
```{r}
flights %>%
arrange(desc(dep_delay))
flights %>%
arrange(year, month, day)
```
## Distinct
`distinct()` removes duplicates from a dataset. The result is ordered by first occurence in original dataset.
```{r}
flights %>%
distinct(carrier, flight)
```
## Slice rows
`slice()` allows you to pick rows by position, by group.
```{r}
# Rows 10 - 14
flights %>%
slice(10:14)
```
`slice_head()` and `slice_tail()` show the first (or last) rows of the data frame or its groups.
```{r}
# First 5 flights to each destination
flights %>%
select(year:dep_time, carrier, flight, origin, dest) %>%
group_by(dest) %>%
slice_head(n = 5)
# Last 5 flights overall
flights %>%
select(year:dep_time, carrier, flight, origin, dest) %>%
slice_tail(n = 5)
```
`slice_min()` and `slice_max()` select rows by the highest or lowest values of a variable. By default all ties are returned, in which case you may receive more rows than you ask for.
```{r}
# Flights with the last time_hour
flights %>%
select(time_hour, carrier, flight, origin, dest) %>%
slice_max(n = 1, time_hour)
```
## Sample
When working with very large datasets, sometimes it's convenient to reduce to a smaller dataset, just by taking a random sample. That's the job of `slice_sample()`. You use the `n` argument to select the same number of observations from each group. You use the `prop` argument to select the same proportion.
```{r}
# Popular destinations
popular_dest <-
flights %>%
group_by(dest) %>%
filter(n() >= 1000)
# Dataset with a random sample of 100 flights to each destination
popular_dest %>%
slice_sample(n = 100)
# Dataset with a random sample of 1% of the flights to each destination
popular_dest %>%
slice_sample(prop = 0.01)
```