2-04-Subsetting.Rmd

```{r, include=FALSE}
source("common.R")
```

# Subsetting 

## Selecting multiple elements

1. __[Q]{.Q}__: Fix each of the following common data frame subsetting errors:

    ```{r, eval = FALSE}
    mtcars[mtcars$cyl = 4, ]       # use `==`              (instead of `=`)
    mtcars[-1:4, ]                 # use `-(1:4)`          (instead of `-1:4`)
    mtcars[mtcars$cyl <= 5]        # `,` is missing
    mtcars[mtcars$cyl == 4 | 6, ]  # use `mtcars$cyl == 6` (instead of `6`)
    ```  

2. __[Q]{.Q}__: Why does the following code yield five missing values? (Hint: why is it different from `x[NA_real_]`?)

    ```{r}
    x <- 1:5
    x[NA]
    ```
   
   __[A]{.solved}__: `NA` has logical type and logical vectors are recycled to the same length as the vector being subset, i.e. `x[NA]` is recycled to `x[NA, NA, NA, NA, NA]`.
    
3. __[Q]{.Q}__: What does `upper.tri()` return? How does subsetting a matrix with it work? Do we need any additional subsetting rules to describe its behaviour?

    ```{r, eval = FALSE}
    x <- outer(1:5, 1:5, FUN = "*")
    x[upper.tri(x)]
    ```  
    
   __[A]{.solved}__: `upper.tri()` returns a logical matrix containing `TRUE` for all upper diagonal elements and `FALSE` otherwise. The implementation of `upper.tri()` is straightforward, but quite interesting as it uses `.row(dim(x)) <= .col(dim(x))` to create the logical matrix. Its subsetting-behaviour will be identical to subsetting with logical matrices, where all elements that correspond to `TRUE` will be selected. We don't need to treat this form of subsetting in a special way.
   
   <!-- The text doesn't mention subsetting with logical matrices so I think this deserves a little more text. I'd make sure to point out that this returns a vector, not a matrix -->

4. __[Q]{.Q}__: Why does `mtcars[1:20]` return an error? How does it differ from the similar `mtcars[1:20, ]`?
   
   __[A]{.solved}__: When subsetting a data frame  with a single vector, it behaves the same way as subsetting a list of the columns, so `mtcars[1:20]` would return a data frame of the first 20 columns of the dataset. But `mtcars` has only 11 columns, so the index will be out of bounds and an error is thrown. `mtcars[1:20, ]` is subsetted with two vectors, so 2d subsetting kicks in, and the first index refers to rows.

5. __[Q]{.Q}__: Implement your own function that extracts the diagonal entries from a matrix (it should behave like `diag(x)` where `x` is a matrix).

   __[A]{.solved}__: The elements in the diagonal of a matrix have the same row- and column indices. This characteristic can be used to create a suitable numeric matrix used for subsetting.

    ```{r}
    diag2 <- function(x){
      n <- min(nrow(x), ncol(x))
      idx <- cbind(seq_len(n), seq_len(n))
  
      x[idx]
    }

    # Let's check if it works
    (x <- matrix(1:30, 5))

    diag(x)
    diag2(x)
    ```


6. __[Q]{.Q}__: What does `df[is.na(df)] <- 0` do? How does it work?
   
   __[A]{.solved}__: This expression replaces the `NA`s in `df` with `0`. Here `is.na(df)` returns a logical matrix that encodes the position of the missing values in `df`. Subsetting and assignment are then combined to replace only the missing values.
   
## Selecting a single element

1. __[Q]{.Q}__: Brainstorm as many ways as possible to extract the third value from the `cyl` variable in the `mtcars` dataset.

   __[A]{.solved}__: Base R already provides an abundance of possibilities:
    
    ```{r}
    # Select column first
    mtcars$cyl[[3]]
    mtcars[ , "cyl"][[3]]
    mtcars[["cyl"]][[3]]
    with(mtcars, cyl[[3]])
    
    # Select row first
    mtcars[3, ]$cyl
    mtcars[3, "cyl"]
    mtcars[3, ][ , "cyl"]
    mtcars[3, ][["cyl"]]

    # Select simultaneously
    mtcars[3, 2]
    mtcars[[c(2, 3)]]
    ```

2. __[Q]{.Q}__: Given a linear model, e.g., `mod <- lm(mpg ~ wt, data = mtcars)`, extract the residual degrees of freedom. Extract the R squared from the model summary (`summary(mod)`).
   
   __[A]{.solved}__: `mod` has the type list, which opens up several possibilities:
    
    ```{r}
    mod <- lm(mpg ~ wt, data = mtcars)
    
    mod$df.residual       # output preserved
    mod$df.res            # `$` allows partial matching
    mod["df.residual"]    # list output
    mod[["df.residual"]]  # output preserved
    ```
    
   The same also applies to `summary(mod)`, so we could use i.e.:
    
    ```{r, eval = FALSE}
    summary(mod)$r.squared
    ```
    
   (Tip: The `broom`-package provides a very useful approach to work with models in a tidy way).
    
## Applications

1. __[Q]{.Q}__: How would you randomly permute the columns of a data frame? (This is an important technique in random forests). Can you simultaneously permute the rows and columns in one step?
   
   __[A]{.solved}__: This can be achieved by combining `` `[` `` and `sample()`:
    
    ```{r,eval = FALSE}
    # Permute columns
    iris[sample(ncol(iris))]
    
    # Permute columns and rows in one step
    iris[sample(nrow(iris)), sample(ncol(iris)), drop = FALSE]
    ```

2. __[Q]{.Q}__: How would you select a random sample of `m` rows from a data frame? What if the sample had to be contiguous (i.e., with an initial row, a final row, and every row in between)?
   
   __[A]{.solved}__: Selecting `m` random rows from a data frame can be achieved through subsetting.
    
    ```{r, eval = FALSE}
    m <- 10
    iris[sample(nrow(iris), m), , drop = FALSE]
    ```

   Keeping subsequent rows together as a "[blocked sample](https://mlr.mlr-org.com/articles/tutorial/resample.html#stratification-blocking-and-grouping)" requires only some caution to get the start- and end-index correct.

    ```{r, eval = FALSE}
    start <- sample(nrow(iris) - m + 1, 1)
    end <- start + m - 1
    iris[start:end, , drop = FALSE]
    ```
    
3. __[Q]{.Q}__: How could you put the columns in a data frame in alphabetical order?
   
   __[A]{.solved}__: We combine `order()` with `[`:

    ```{r, eval = FALSE}
    iris[order(names(iris))]
    ```