forked from Tazinho/Advanced-R-Solutions
```{r, include=FALSE}
source("common.R")
```
# Subsetting
## Selecting multiple elements
1. __[Q]{.Q}__: Fix each of the following common data frame subsetting errors:
```{r, eval = FALSE}
mtcars[mtcars$cyl = 4, ] # use `==` (instead of `=`)
mtcars[-1:4, ] # use `-(1:4)` (instead of `-1:4`)
mtcars[mtcars$cyl <= 5] # `,` is missing
mtcars[mtcars$cyl == 4 | 6, ] # use `mtcars$cyl == 6` (instead of `6`)
```
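As a quick check, the corrected calls can be run directly. This is a minimal sketch: the result names (`four_cyl`, `without_first_4`, `small_cyl`, `four_or_six`) are only illustrative, and for the third call we assume that row filtering was intended, which is why the comma goes after the condition.

```{r}
# Corrected versions of the four calls above
four_cyl        <- mtcars[mtcars$cyl == 4, ]   # `==` tests for equality
without_first_4 <- mtcars[-(1:4), ]            # drop rows 1 to 4
small_cyl       <- mtcars[mtcars$cyl <= 5, ]   # comma added: filter rows
four_or_six     <- mtcars[mtcars$cyl == 4 | mtcars$cyl == 6, ]

# Number of rows kept by each call
c(nrow(four_cyl), nrow(without_first_4), nrow(small_cyl), nrow(four_or_six))
```

Two of the original mistakes are worth spelling out: `-1:4` is parsed as `(-1):4`, which mixes negative and positive indices, and in `mtcars$cyl == 4 | 6` the `6` is coerced to `TRUE`, so the condition holds for every row.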
2. __[Q]{.Q}__: Why does the following code yield five missing values? (Hint: why is it different from `x[NA_real_]`?)
```{r}
x <- 1:5
x[NA]
```
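__[A]{.solved}__: `NA` has logical type, and logical index vectors are recycled to the length of the vector being subset, i.e. `x[NA]` is treated like `x[c(NA, NA, NA, NA, NA)]`. A missing value in the index always yields a missing value in the output, hence the five `NA`s. In contrast, `NA_real_` is a double, so it acts as a single numeric index and `x[NA_real_]` returns just one `NA`.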
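The difference between the two kinds of `NA` can be checked directly. The following is a small sketch that only uses base R:

```{r}
x <- 1:5

typeof(NA)        # "logical"
typeof(NA_real_)  # "double"

# Logical index: recycled to the length of `x`, so we get five NAs
length(x[NA])

# Numeric index of length one: a single NA
length(x[NA_real_])

# Writing out the recycled logical index gives the same result as `x[NA]`
identical(x[NA], x[c(NA, NA, NA, NA, NA)])
```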
3. __[Q]{.Q}__: What does `upper.tri()` return? How does subsetting a matrix with it work? Do we need any additional subsetting rules to describe its behaviour?
```{r, eval = FALSE}
x <- outer(1:5, 1:5, FUN = "*")
x[upper.tri(x)]
```
__[A]{.solved}__: `upper.tri()` returns a logical matrix that is `TRUE` for all elements above the diagonal and `FALSE` otherwise (pass `diag = TRUE` to include the diagonal itself). The implementation is short but quite interesting: it compares row and column indices via `.row(dim(x)) < .col(dim(x))` (or `<=` when `diag = TRUE`) to create the logical matrix. Subsetting with it follows the usual rules for logical subsetting: a matrix is also a vector, so every element whose position corresponds to `TRUE` is selected. The result is therefore a vector (in column-major order), not a matrix. We don't need to treat this form of subsetting in a special way.
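A smaller matrix makes the behaviour easier to follow. This sketch uses a 3 x 3 version of the matrix from the question; `idx` is just an illustrative name.

```{r}
x <- outer(1:3, 1:3, FUN = "*")

idx <- upper.tri(x)  # logical matrix, TRUE strictly above the diagonal
idx

# Subsetting with a logical matrix returns a plain vector,
# with the selected elements read column by column
x[idx]
dim(x[idx])          # NULL, i.e. not a matrix

# With `diag = TRUE` the diagonal is included as well
x[upper.tri(x, diag = TRUE)]
```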
<!-- The text doesn't mention subsetting with logical matrices so I think this deserves a little more text. I'd make sure to point out that this returns a vector, not a matrix -->
4. __[Q]{.Q}__: Why does `mtcars[1:20]` return an error? How does it differ from the similar `mtcars[1:20, ]`?
__[A]{.solved}__: When a data frame is subset with a single vector, it behaves like a list of its columns, so `mtcars[1:20]` would return a data frame containing the first 20 columns of the dataset. But `mtcars` has only 11 columns, so the index is out of bounds and an error is thrown. `mtcars[1:20, ]` supplies two indices (the second one is empty), so 2d subsetting kicks in: the first index refers to rows and the empty second index keeps all columns.
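Both behaviours can be verified in a few lines. This is a minimal sketch; the error is caught with `tryCatch()` so that the chunk still runs, and `res` is only an illustrative name.

```{r}
ncol(mtcars)  # 11 columns, so column indices 12 to 20 do not exist

# One index: list-style subsetting of columns, out-of-range indices fail
res <- tryCatch(mtcars[1:20], error = function(e) "error")
res

# Two indices: matrix-style subsetting, the first index selects rows
dim(mtcars[1:20, ])

# A valid single index returns a data frame with the selected columns
dim(mtcars[1:5])
```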
5. __[Q]{.Q}__: Implement your own function that extracts the diagonal entries from a matrix (it should behave like `diag(x)` where `x` is a matrix).
__[A]{.solved}__: The elements on the diagonal of a matrix have identical row and column indices. We can use this to build a two-column numeric matrix of (row, column) index pairs and subset the matrix with it.
```{r}
diag2 <- function(x) {
  n <- min(nrow(x), ncol(x))
  idx <- cbind(seq_len(n), seq_len(n))
  x[idx]
}
# Let's check if it works
(x <- matrix(1:30, 5))
diag(x)
diag2(x)
```
6. __[Q]{.Q}__: What does `df[is.na(df)] <- 0` do? How does it work?
__[A]{.solved}__: This expression replaces the `NA`s in `df` with `0`. Here `is.na(df)` returns a logical matrix that encodes the position of the missing values in `df`. Subsetting and assignment are then combined to replace only the missing values.
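A small example shows the two steps separately. The data frame below is hypothetical and only serves as an illustration.

```{r}
# Hypothetical example data with a few missing values
df <- data.frame(a = c(1, NA, 3), b = c(NA, 5, 6))

# Step 1: a logical matrix with the same dimensions as `df`
is.na(df)

# Step 2: assign 0 at every position where that matrix is TRUE
df[is.na(df)] <- 0
df
```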
## Selecting a single element
1. __[Q]{.Q}__: Brainstorm as many ways as possible to extract the third value from the `cyl` variable in the `mtcars` dataset.
__[A]{.solved}__: Base R already provides an abundance of possibilities:
```{r}
# Select column first
mtcars$cyl[[3]]
mtcars[ , "cyl"][[3]]
mtcars[["cyl"]][[3]]
with(mtcars, cyl[[3]])
# Select row first
mtcars[3, ]$cyl
mtcars[3, "cyl"]
mtcars[3, ][ , "cyl"]
mtcars[3, ][["cyl"]]
# Select simultaneously
mtcars[3, 2]
mtcars[[c(2, 3)]]
```
2. __[Q]{.Q}__: Given a linear model, e.g., `mod <- lm(mpg ~ wt, data = mtcars)`, extract the residual degrees of freedom. Extract the R squared from the model summary (`summary(mod)`).
__[A]{.solved}__: `mod` has the type list, which opens up several possibilities:
```{r}
mod <- lm(mpg ~ wt, data = mtcars)
mod$df.residual       # returns the value itself
mod$df.res            # `$` allows partial matching
mod["df.residual"]    # returns a list of length one
mod[["df.residual"]]  # returns the value itself
```
The same also applies to `summary(mod)`, so we could use, e.g.:
```{r, eval = FALSE}
summary(mod)$r.squared
```
(Tip: The `broom` package provides a very useful approach to working with models in a tidy way.)
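The partial matching of `$` mentioned in the comments above has no counterpart in `[[` by default. The following sketch contrasts the two and applies the same idea to the model summary; `mod_summary` is just an illustrative name.

```{r}
mod <- lm(mpg ~ wt, data = mtcars)

mod$df.res                      # `$` matches partially
mod[["df.res"]]                 # NULL: `[[` matches names exactly by default
mod[["df.res", exact = FALSE]]  # partial matching on request

# The summary is a list as well, so `[[` works there too
mod_summary <- summary(mod)
mod_summary[["r.squared"]]
```

In scripts, the exact matching of `[[` is usually the safer choice, since a mistyped or abbreviated name then fails visibly instead of silently matching another element.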
## Applications
1. __[Q]{.Q}__: How would you randomly permute the columns of a data frame? (This is an important technique in random forests). Can you simultaneously permute the rows and columns in one step?
__[A]{.solved}__: This can be achieved by combining `` `[` `` and `sample()`:
```{r, eval = FALSE}
# Permute columns
iris[sample(ncol(iris))]
# Permute columns and rows in one step
iris[sample(nrow(iris)), sample(ncol(iris)), drop = FALSE]
```
2. __[Q]{.Q}__: How would you select a random sample of `m` rows from a data frame? What if the sample had to be contiguous (i.e., with an initial row, a final row, and every row in between)?
__[A]{.solved}__: Selecting `m` random rows from a data frame can be achieved through subsetting.
```{r, eval = FALSE}
m <- 10
iris[sample(nrow(iris), m), , drop = FALSE]
```
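Keeping consecutive rows together as a "[blocked sample](https://mlr.mlr-org.com/articles/tutorial/resample.html#stratification-blocking-and-grouping)" only requires some care to get the start and end indices right: the start index may be at most `nrow(iris) - m + 1`, so that the block of `m` rows never runs past the last row.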
```{r, eval = FALSE}
start <- sample(nrow(iris) - m + 1, 1)
end <- start + m - 1
iris[start:end, , drop = FALSE]
```
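The index arithmetic can also be wrapped in a small function. This is only a sketch: `sample_contiguous()` is a hypothetical helper, not part of base R, and the seed is set purely to make the example reproducible.

```{r}
# Hypothetical helper: draw `m` consecutive rows from a data frame
sample_contiguous <- function(df, m) {
  n <- nrow(df)
  stopifnot(m >= 1, m <= n)
  # Valid start positions are 1, ..., n - m + 1, so the block
  # never runs past the last row
  start <- sample.int(n - m + 1, 1)
  df[seq(start, length.out = m), , drop = FALSE]
}

set.seed(1)
block <- sample_contiguous(iris, 10)
rownames(block)  # ten consecutive row names
```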
3. __[Q]{.Q}__: How could you put the columns in a data frame in alphabetical order?
__[A]{.solved}__: We combine `order()` with `[`:
```{r, eval = FALSE}
iris[order(names(iris))]
```