forked from Tazinho/Advanced-R-Solutions
```{r, include=FALSE}
source("common.R")
```
# Vectors
## Atomic vectors
1. __[Q]{.Q}__: How do you create scalars of type raw and complex? (See `?raw` and `?complex`)
<!-- I'd say there's no way to create a literal raw scalar; you have to create a vector. You can't create complex scalars, but you can create imaginary scalars with the i prefix: i, 5i etc. You could mention the complex() and raw() constructors in the next paragraph. -->
__[A]{.solved}__: In R, scalars are represented as vectors of length one. For the raw and complex types, these can be created via `raw()` and `complex()`:
```{r}
raw(1)
complex(1)
```
Raw vectors can easily be created from numeric or character values.
```{r}
as.raw(42)
charToRaw("A")
```
For complex numbers, the real and imaginary parts may be provided directly.
```{r}
complex(length.out = 1, real = 1, imaginary = 1)
```
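As a small complement to the constructors above, the following sketch shows the literal syntax for imaginary numbers (the `i` suffix) and confirms that the results are length-one vectors. The variable names `z` and `r` are only illustrative.

```{r}
# An imaginary literal uses the `i` suffix; adding a real part gives a complex value
z <- 1 + 1i
typeof(z)
length(z)
identical(z, complex(real = 1, imaginary = 1))

# Raw values have no literal syntax, so they are created by conversion
r <- as.raw(255)
typeof(r)
```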
2. __[Q]{.Q}__: Test your knowledge of vector coercion rules by predicting the output of the following uses of `c()`:
```{r, eval=FALSE}
c(1, FALSE) # will be coerced to numeric -> 1 0
c("a", 1) # will be coerced to character -> "a" "1"
c(TRUE, 1L) # will be coerced to integer -> 1 1
```
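The predictions in the comments above can be checked with `typeof()`. This is only a quick verification, not part of the original exercise.

```{r}
typeof(c(1, FALSE))   # "double"
typeof(c("a", 1))     # "character"
typeof(c(TRUE, 1L))   # "integer"
```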
3. __[Q]{.Q}__: Why is `1 == "1"` true? Why is `-1 < FALSE` true? Why is `"one" < 2` false?
__[A]{.solved}__: These comparisons are carried out by operator functions, which coerce their arguments to a common type. In the examples above, the common types are character, double and character: `1` is coerced to `"1"`, `FALSE` is represented as `0`, and `2` turns into `"2"`. The last comparison is false because numerals precede letters in lexicographic order (although this may depend on the locale).
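A minimal sketch that spells out the implicit coercion explicitly; each line mirrors one of the three comparisons in the question. The result of the string comparison may depend on the locale, so no output is assumed for it here.

```{r}
# 1 == "1": the number is converted to a string first
as.character(1) == "1"

# -1 < FALSE: the logical value is converted to a number first
-1 < as.numeric(FALSE)

# "one" < 2: the number is converted to a string, then strings are compared
"one" < as.character(2)
```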
4. __[Q]{.Q}__: Why is the default missing value, `NA`, a logical vector? What's special about logical vectors? (Hint: think about `c(FALSE, NA_character_)`.)
__[A]{.solved}__: The presence of missing values shouldn't affect the type of an object. Recall that there is a type hierarchy for coercion: character >> double >> integer >> logical. When combining `NA`s with other atomic types, the `NA`s will be coerced to integer (`NA_integer_`), double (`NA_real_`) or character (`NA_character_`), and not the other way round. If `NA` were a character and added to a set of other values, all of these would be coerced to character as well.
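The following sketch illustrates the point: a logical `NA` adapts to the type of its neighbours, whereas a character `NA` forces everything to character.

```{r}
typeof(NA)                        # "logical"
typeof(c(NA, 1L))                 # "integer"
typeof(c(NA, 1.5))                # "double"
typeof(c(NA, "a"))                # "character"

# The hint from the question: a character NA changes the type of FALSE
typeof(c(FALSE, NA_character_))   # "character"
```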
5. __[Q]{.Q}__: Precisely what do `is.atomic()`, `is.numeric()`, and `is.vector()` test for?
__[A]{.solved}__: The documentation states that:
- `is.atomic()` tests if an object is an atomic vector (as defined in Advanced R) or is `NULL` (!). Note that since R 4.4.0, `is.atomic(NULL)` returns `FALSE`.
- `is.numeric()` tests if an object has type integer or double and is not of `"factor"`, `"Date"`, `"POSIXt"` or `"difftime"` class.
- `is.vector()` tests if an object is a vector (as defined in Advanced R) and has no attributes, apart from names.
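A few quick checks of these definitions. The attribute name `my_attr` is made up for illustration.

```{r}
x <- structure(1:3, my_attr = "a")   # integer vector with an extra attribute

is.atomic(x)                  # TRUE: still an atomic vector
is.vector(x)                  # FALSE: has an attribute other than names
is.vector(c(a = 1, b = 2))    # TRUE: names are allowed
is.vector(list(1, "a"))       # TRUE: lists are vectors too

f <- factor("a")
typeof(f)                     # "integer"
is.numeric(f)                 # FALSE: factors are excluded
```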
## Attributes
1. __[Q]{.Q}__: How is `setNames()` implemented? How is `unname()` implemented? Read the source code.
__[A]{.solved}__: `setNames()` is implemented as:
```{r, eval = FALSE}
setNames <- function(object = nm, nm) {
  names(object) <- nm
  object
}
```
Because the data argument comes first, `setNames()` also works well with the magrittr pipe operator. When no first argument is given, `object` defaults to `nm`, so the result is a vector whose values and names are identical. This is rather atypical, because required arguments usually come first:
```{r}
setNames( , c("a", "b", "c"))
```
`unname()` is implemented in the following way:
```{r, eval = FALSE}
unname <- function(obj, force = FALSE) {
  if (!is.null(names(obj)))
    names(obj) <- NULL
  if (!is.null(dimnames(obj)) && (force || !is.data.frame(obj)))
    dimnames(obj) <- NULL
  obj
}
```
`unname()` removes existing names (or dimnames) by setting them to `NULL`.
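A short usage sketch of both functions on a named vector and on a matrix with dimnames (the object names `x` and `m` are illustrative):

```{r}
x <- setNames(1:3, c("a", "b", "c"))
names(x)
names(unname(x))      # NULL: names were removed

m <- matrix(1:4, nrow = 2, dimnames = list(c("r1", "r2"), c("c1", "c2")))
dimnames(unname(m))   # NULL: dimnames were removed
```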
2. __[Q]{.Q}__: What does `dim()` return when applied to a 1d vector? When might you use `NROW()` or `NCOL()`?
__[A]{.solved}__: From `?nrow`:
> `dim()` will return `NULL` when applied to a 1d vector.
One may want to use `NROW()` or `NCOL()` to handle atomic vectors, lists and `NULL` values in the same way as one-column matrices or data frames. For these objects, `nrow()` and `ncol()` return `NULL`.
```{r}
x <- 1:10
# return NULL
nrow(x)
ncol(x)
# Pretend it's a column-vector
NROW(x)
NCOL(x)
```
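One situation where this matters is a helper that should accept vectors, lists and data frames alike. The function `count_rows()` below is a hypothetical example, not something from the original text.

```{r}
# Hypothetical helper: number of "rows" for any vector-like input
count_rows <- function(x) NROW(x)

count_rows(1:10)          # 10
count_rows(list(1, "a"))  # 2
count_rows(NULL)          # 0
count_rows(mtcars)        # 32

NCOL(NULL)                # 1: treated as a single column
```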
3. __[Q]{.Q}__: How would you describe the following three objects? What makes them different to `1:5`?
```{r}
x1 <- array(1:5, c(1, 1, 5)) # 1 row, 1 column, 5 in third dimension
x2 <- array(1:5, c(1, 5, 1)) # 1 row, 5 columns, 1 in third dimension
x3 <- array(1:5, c(5, 1, 1)) # 5 rows, 1 column, 1 in third dimension
```
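__[A]{.solved}__: These are all "one dimensional" in the sense that each stores its five elements along a single dimension of a 3d array. If you imagine a 3d cube, `x1` extends along the third dimension, `x2` along the second dimension (columns), and `x3` along the first dimension (rows). Unlike `1:5`, all three objects have a `dim` attribute.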
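The difference to `1:5` can be made visible by inspecting the attributes; the underlying values are the same. A brief check using `x1`:

```{r}
x1 <- array(1:5, c(1, 1, 5))

dim(x1)                          # 1 1 5
attributes(1:5)                  # NULL: a plain vector has no dim attribute
identical(as.vector(x1), 1:5)    # TRUE: same data once dim is dropped
x1[1, 1, 3]                      # elements are indexed along the third dimension
```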
4. __[Q]{.Q}__: An early draft used this code to illustrate `structure()`:
```{r}
structure(1:5, comment = "my attribute")
```
But when you print that object you don't see the comment attribute. Why? Is the attribute missing, or is there something else special about it? (Hint: try using help.) \index{attributes!comment}
__[A]{.solved}__: The documentation states (see `?comment`):
> Contrary to other attributes, the comment is not printed (by print or print.default).
Also, from `?attributes`:
> Note that some attributes (namely class, comment, dim, dimnames, names, row.names and tsp) are treated specially and have restrictions on the values which can be set.
We can retrieve comment attributes by calling them explicitly:
```{r}
foo <- structure(1:5, comment = "my attribute")
attributes(foo)
attr(foo, which = "comment")
```
## S3 atomic vectors
1. __[Q]{.Q}__: What sort of object does `table()` return? What is its type? What attributes does it have? How does the dimensionality change as you tabulate more variables?
__[A]{.solved}__: `table()` returns a contingency table of its input variables, which has the class `"table"`. Internally it is represented as an array (implicit class) of integers (type) with the attributes `dim` (dimension of the underlying array) and `dimnames` (one name for each input column). Each tabulated variable adds one dimension to the array, and the extent of each dimension corresponds to the number of unique values (factor levels) in that variable.
```{r}
x <- table(mtcars[c("vs", "cyl", "am")])
typeof(x)
attributes(x)
```
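To see how the dimensionality grows, one can tabulate one, two and three variables of `mtcars` and compare the results of `dim()`. The names `t1`, `t2` and `t3` are illustrative.

```{r}
t1 <- table(mtcars["cyl"])
t2 <- table(mtcars[c("cyl", "am")])
t3 <- table(mtcars[c("vs", "cyl", "am")])

dim(t1)   # 3
dim(t2)   # 3 2
dim(t3)   # 2 3 2
class(t3)
```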
2. __[Q]{.Q}__: What happens to a factor when you modify its levels?
```{r, eval = FALSE}
f1 <- factor(letters)
levels(f1) <- rev(levels(f1))
```
__[A]{.solved}__: The underlying integer values stay the same, but the levels are changed, making it look like the data has changed.
```{r}
f1 <- factor(letters[1:10])
levels(f1)
f1
as.integer(f1)
levels(f1) <- rev(levels(f1))
levels(f1)
f1
as.integer(f1)
```
3. __[Q]{.Q}__: What does this code do? How do `f2` and `f3` differ from `f1`?
```{r, results = "hide"}
f2 <- rev(factor(letters)) # reverses element order (only)
f3 <- factor(letters, levels = rev(letters)) # reverses factor level order (only)
```
__[A]{.solved}__: For `f2` and `f3`, either the order of the factor elements *or* the order of its levels is reversed. For `f1`, both transformations occur.
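The three factors can be compared directly by looking at their elements (via `as.character()`) and their levels. This sketch recreates `f1` from the previous exercise so that it is self-contained.

```{r}
f1 <- factor(letters)
levels(f1) <- rev(levels(f1))                  # relabels the levels
f2 <- rev(factor(letters))                     # reverses the elements
f3 <- factor(letters, levels = rev(letters))   # reverses the level order

# First element and first level of each factor
c(as.character(f1)[1], levels(f1)[1])   # "z" "z"
c(as.character(f2)[1], levels(f2)[1])   # "z" "a"
c(as.character(f3)[1], levels(f3)[1])   # "a" "z"
```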
## Lists
1. __[Q]{.Q}__: List all the ways that a list differs from an atomic vector.
__[A]{.solved}__: To summarise:
<!-- Would be good to link these to sections in Advanced R -->
- Atomic vectors are always homogeneous (all elements must be of the same type). Lists may be heterogeneous (the elements can be of different types).
- Atomic vectors point to one address in memory, while lists contain a separate reference for each element.
```{r}
lobstr::ref(1:2)
lobstr::ref(list(1:2, 2))
```
- Subsetting with out of bound values or `NA`s leads to `NA`s for atomics and `NULL` values for lists.
```{r}
# Subsetting atomic vectors
(1:2)[3]
(1:2)[NA]
# Subsetting lists
as.list(1:2)[3]
as.list(1:2)[NA]
```
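The first point in the list above (homogeneous versus heterogeneous elements) can be illustrated in the same way. In this sketch, `c()` coerces everything to a common type, while `list()` keeps each element's type.

```{r}
# Atomic vector: all elements are coerced to character
typeof(c(1, "a", TRUE))

# List: each element keeps its own type
sapply(list(1, "a", TRUE), typeof)
```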
2. __[Q]{.Q}__: Why do you need to use `unlist()` to convert a list to an atomic vector? Why doesn't `as.vector()` work?
__[A]{.solved}__: A list is already a vector, though not an atomic one, so `as.vector()` has nothing to convert.
Note that `as.vector()` and `is.vector()` use different definitions of
"vector"!
```{r}
is.vector(as.vector(mtcars))
```
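A minimal comparison on a plain list shows the difference: `as.vector()` hands the list back unchanged, while `unlist()` flattens it into an atomic vector. The names `l` and `u` are illustrative.

```{r}
l <- list(1:2, 3)

identical(as.vector(l), l)   # TRUE: still the same list
is.list(as.vector(l))        # TRUE

u <- unlist(l)
is.atomic(u)                 # TRUE
typeof(u)                    # "double": integer and double were combined
u
```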
3. __[Q]{.Q}__: Compare and contrast `c()` and `unlist()` when combining a date and date-time into a single vector.
__[A]{.solved}__: Date and date-time objects are built upon doubles. Dates are represented as days, while date-time objects (POSIXct) represent seconds (counted with respect to the reference date 1970-01-01, also known as "The Epoch").
Combining these objects leads to surprising output because `c()` does not consider the classes of both inputs:
```{r}
date <- as.Date("1970-01-02")
dttm_ct <- as.POSIXct("1970-01-01 01:00", tz = "UTC")
c(date, dttm_ct) # equal to c.Date(date, dttm_ct)
c(dttm_ct, date) # equal to c.POSIXct(dttm_ct, date)
```
The generic function dispatches based on the class of its first argument. When `c.Date()` is executed, `dttm_ct` is converted to a date, but the 3600 seconds are mistaken for 3600 days! When `c.POSIXct()` is called on `date`, one day counts as one second only, as illustrated by the following line:
```{r}
unclass(c(date, dttm_ct)) # internal representation
date + 3599
```
Some of these problems may be avoided via explicit conversion of the classes:
```{r}
c(as.Date(dttm_ct, tz = "UTC"), date)
```
Let's look at `unlist()`, which operates on list input.
```{r}
# attributes are stripped
unlist(list(date, dttm_ct))
```
We see that internally dates and date-times are stored as doubles. Unfortunately, this is all we are left with when `unlist()` strips the attributes of the list.
To summarise: `c()` coerces types and errors may occur because of inappropriate method dispatch. `unlist()` strips attributes.
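If the stripped values need to be interpreted again, the classes have to be restored by hand, and one has to know which element was a date and which was a date-time. The following is a sketch under that assumption, using the same `date` and `dttm_ct` objects as above.

```{r}
date <- as.Date("1970-01-02")
dttm_ct <- as.POSIXct("1970-01-01 01:00", tz = "UTC")

x <- unlist(list(date, dttm_ct))
x   # plain doubles: 1 (day) and 3600 (seconds)

# Restore each value with the unit it was originally stored in
as.Date(x[1], origin = "1970-01-01")
as.POSIXct(x[2], origin = "1970-01-01", tz = "UTC")
```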
<!-- link to vctrs package for resolution of this problem? -->
## Data frames and tibbles
1. __[Q]{.Q}__: Can you have a data frame with 0 rows? What about 0 columns?
__[A]{.solved}__: Yes, you can create these data frames easily and in many ways. Both dimensions can even be 0 at the same time. For example, you might subset the respective dimension with either `0`, `NULL` or a valid 0-length atomic (`logical(0)`, `character(0)`, `integer(0)`, `double(0)`). Negative integer sequences that exclude every row or column would also work. The following example uses a zero:
```{r}
iris[0, ]
iris[, 0] # or iris[0]
iris[0, 0]
```
Empty data frames can also be created directly (without subsetting):
```{r}
data.frame()
```
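The dimensions of these empty data frames can be confirmed with `dim()`:

```{r}
dim(iris[0, ])      # 0 5
dim(iris[, 0])      # 150 0
dim(iris[0, 0])     # 0 0
dim(data.frame())   # 0 0
```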
2. __[Q]{.Q}__: What happens if you attempt to set rownames that are not unique?
__[A]{.solved}__: Matrices can have duplicated row names, so this does not cause problems.
Data frames, however, require unique rownames, and you get different results depending on how you attempt to set them. If you use `row.names()` directly, you
get an error:
```{r, error = TRUE}
df <- data.frame(x = 1:3)
row.names(df) <- c("x", "y", "y")
```
If you use subsetting, `[` automatically deduplicates:
```{r}
row.names(df) <- c("x", "y", "z")
df[c(1, 1, 1), , drop = FALSE]
```
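Both behaviours can be checked programmatically. In this sketch, `tryCatch()` is used only to record whether the assignment fails, and `res` is an illustrative name.

```{r}
df <- data.frame(x = 1:3)

# Direct assignment of duplicated row names fails
res <- tryCatch({
  row.names(df) <- c("x", "y", "y")
  "ok"
}, error = function(e) "error")
res

# Subsetting the same row repeatedly generates unique row names
row.names(df) <- c("x", "y", "z")
row.names(df[c(1, 1, 1), , drop = FALSE])   # "x" "x.1" "x.2"
```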
<!-- I think discussing `.rowNamesDF<-` is going too deep -->
3. __[Q]{.Q}__: If `df` is a data frame, what can you say about `t(df)`, and `t(t(df))`? Perform some experiments, making sure to try different column types.
__[A]{.solved}__: Both will return matrices:
```{r}
df <- data.frame(x = 1:5, y = 5:1)
is.matrix(df)
is.matrix(t(df))
is.matrix(t(t(df)))
```
Their dimensions follow the usual transposition rules:
```{r}
dim(df)
dim(t(df))
dim(t(t(df)))
```
Because the output is a matrix, every column is coerced to the same type by `as.matrix()`, as described below.
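To follow the exercise's suggestion of trying different column types, this sketch compares an all-integer data frame with one that mixes integer and character columns. The names `df_num` and `df_mix` are illustrative.

```{r}
df_num <- data.frame(x = 1:2, y = 2:1)
df_mix <- data.frame(x = 1:2, y = c("a", "b"))

typeof(t(df_num))      # "integer"
typeof(t(df_mix))      # "character": everything was coerced
typeof(t(t(df_mix)))   # "character": the original types are not recovered
```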
4. __[Q]{.Q}__: What does `as.matrix()` do when applied to a data frame with columns of different types? How does it differ from `data.matrix()`?
__[A]{.solved}__: From `?as.matrix`:
> The method for data frames will return a character matrix if there is only atomic columns and any non-(numeric/logical/complex) column, applying as.vector to factors and format to other non-character columns. Otherwise the usual coercion hierarchy (logical < integer < double < complex) will be used, e.g., all-logical data frames will be coerced to a logical matrix, mixed logical-integer will give a integer matrix, etc.
Let's transform a dummy data frame into a character matrix. Note that `format()` is applied to the non-character columns, which gives surprising results: `TRUE` is transformed to `" TRUE"` (starting with a space!).
```{r}
df_coltypes <- data.frame(
a = c("a", "b"),
b = c(TRUE, FALSE),
c = c(1L, 0L),
d = c(1.5, 2),
e = c("one" = 1, "two" = 2),
g = factor(c("f1", "f2")),
stringsAsFactors = FALSE
)
as.matrix(df_coltypes)
```
From `?data.matrix`:
> Return the matrix obtained by converting all the variables in a data frame to numeric mode and then binding them together as the columns of a matrix. Factors and ordered factors are replaced by their internal codes.
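`data.matrix()` returns a numeric matrix, where characters are replaced by missing values (in R versions before 4.0.0; newer versions first convert character columns to factors and then to their integer codes):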
```{r}
data.matrix(df_coltypes)
```
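The treatment of factors is where the two functions differ most visibly. This small sketch uses a hypothetical data frame `df_fct` with one factor and one double column.

```{r}
df_fct <- data.frame(g = factor(c("f2", "f1", "f2")), n = c(1.5, 2, 3))

# as.matrix() keeps the factor labels and returns a character matrix
typeof(as.matrix(df_fct))

# data.matrix() replaces the factor by its internal integer codes
m <- data.matrix(df_fct)
typeof(m)
unname(m[, "g"])   # 2 1 2
```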