manip-one-table.Rmd


# Other single-table verbs

You've learned the most important verbs for data analysis: `filter()`, `mutate()`, `group_by()` and `summarize()`. There are a number of other verbs that are not quite as important but still come in handy from time-to-time. The goal of this section is to familiarize you with their purpose and basic operation.

```{r setup, message = FALSE}
library(tidyverse)
library(nycflights13)
```

## Select

Most of the datasets you'll work with in this class only have a relatively small number of variables, and generally you don't need to reduce further. In real life, you'll sometimes encounter datasets with hundreds or even thousands of variables, and the first challenge is just to narrow down to a useful subset. Solving that problem is the job of `select()`.

`select()` allows you to work with column names using a handful of helper functions:

* `starts_with("x")` and `ends_with("x")` select variables that start with a
  common prefix or end with a common suffix.
  
* `contains("x")` selects variables that contain a phrase. `matches("x.y")`
  select all variables that match a given regular expression (which you'll 
  learn about later in the course).
  
* `a:e` selects all variables from variable `a` to variable `e` inclusive.

You can also select a single variable just by using its name directly.

```{r}
flights %>% 
  select(year:day, ends_with("delay"))
```

To remove variables from selection, put an exclamation point `!` in front of the expression.

```{r}
flights %>% 
  select(!starts_with("dep"))
```

## Rename

To change the name of a variable use `df %>% rename(new_name = old_name)`. If you have trouble remembering which sides old and new go on, remember it's the same order as `mutate()`.

```{r}
flights %>% 
  rename(tail_num = tailnum)
```

If you're selecting and renaming, note that you can also use `select()` to rename. This sometimes allows you to save a step.

```{r}
flights %>% 
  select(year:day, tail_num = tailnum)
```

## Change column order

Use `relocate()` to change column positions. It uses the same syntax as `select()`. The arguments `.before` and `.after` indicate where to move columns. If neither argument is supplied, columns are moved before the first column.

```{r}
flights %>% 
  relocate(carrier, flight, origin, dest, .after = dep_time)
```

## Transmute

Transmute is a minor variation of `mutate()`. The main difference is that it drops any variables that you didn't explicitly mention. It's a useful shortcut for `mutate()` + `select()`.

```{r}
df <- tibble(x = 1:3, y = 3:1)

# mutate() keeps x and y
df %>% 
  mutate(z = x + y)

# transmute() drops x and y
df %>% 
  transmute(z = x + y)
```

## Arrange

`arrange()` lets you change the order of the rows. To put a column in descending order, use `desc()`.

```{r}
flights %>% 
  arrange(desc(dep_delay))

flights %>% 
  arrange(year, month, day)
```

## Distinct

`distinct()` removes duplicates from a dataset. The result is ordered by first occurence in original dataset.

```{r}
flights %>% 
  distinct(carrier, flight)
```

## Slice rows

`slice()` allows you to pick rows by position, by group.

```{r}
# Rows 10 - 14
flights %>% 
  slice(10:14)
```

`slice_head()` and `slice_tail()` show the first (or last) rows of the data frame or its groups. 

```{r}
# First 5 flights to each destination
flights %>% 
  select(year:dep_time, carrier, flight, origin, dest) %>% 
  group_by(dest) %>%
  slice_head(n = 5)

# Last 5 flights overall
flights %>% 
  select(year:dep_time, carrier, flight, origin, dest) %>% 
  slice_tail(n = 5)
```

`slice_min()` and `slice_max()` select rows by the highest or lowest values of a variable. By default all ties are returned, in which case you may receive more rows than you ask for.

```{r}
# Flights with the last time_hour
flights %>% 
  select(time_hour, carrier, flight, origin, dest) %>%
  slice_max(n = 1, time_hour)
```

## Sample

When working with very large datasets, sometimes it's convenient to reduce to a smaller dataset, just by taking a random sample. That's the job of `slice_sample()`. You use the `n` argument to select the same number of observations from each group. You use the `prop` argument to select the same proportion.

```{r}
# Popular destinations
popular_dest <- 
  flights %>%
  group_by(dest) %>%
  filter(n() >= 1000)

# Dataset with a random sample of 100 flights to each destination
popular_dest %>% 
  slice_sample(n = 100)

# Dataset with a random sample of 1% of the flights to each destination
popular_dest %>% 
  slice_sample(prop = 0.01)
```