forked from Tazinho/Advanced-R-Solutions
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path2-10-Function_factories.Rmd
executable file
·410 lines (299 loc) · 14.8 KB
/
2-10-Function_factories.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
```{r, include=FALSE}
source("common.R")
```
# Function factories
## Prerequisites
For most of this chapter base R is sufficient. Just a few exercises require **rlang** and **ggplot2** packages.
```{r}
library(rlang)
library(ggplot2)
```
## Factory fundamentals
1. __[Q]{.Q}__: The definition of `force()` is simple:
```{r}
force
```
Why is it better to `force(x)` instead of just `x`?
__[A]{.solved}__: As you can see `force(x)` is just syntactic sugar for `x`. We prefer this explicit form, because
> using this function clearly indicates that you’re forcing evaluation, not that you’ve accidentally typed `x`. (Quote from the textbook)
2. __[Q]{.Q}__: Base R contains two function factories, `approxfun()` and `ecdf()`. Read their documentation and experiment to figure out what the functions do and what they return.
__[A]{.solved}__: Let's begin with `approxfun()` as it is used within `ecdf()` also:
- `approxfun()` takes a 2-dimensional combination of data points (`x` and `y`) as input and returns a *stepwise interpolation function*, which transforms new `x` values. Additional arguments control how the created function should behave. (The interpolation `method` may be linear or constant. `yleft`, `yright` and `rule` specify how the newly created function should map new values which are outside of `range(x)`. `f` controls the degree of right-left-continuity via a numeric value from `0` to `1` and `ties` expects a function name like min, mean etc. which defines how non-unique x-y-combinations should be handled when interpolating the data points.)
- `ecdf()` is an acronym for empirical cumulative distribution function. For a numeric vector, `ecdf()` returns the appropriate distribution function (of class “ecdf”, which is inheriting from class “stepfun”). Initially the (x, y) pairs for the nodes of the density function are calculated. Afterwards these pairs are passed to `approxfun()`, which then returns the desired function.
<!-- HW: I think a couple of examples here would help -->
3. __[Q]{.Q}__: Create a function `pick()` that takes an index, `i`, as an argument and returns a function with an argument `x` that subsets `x` with `i`.
```{r, eval = FALSE}
pick(1)(x)
# should be equivalent to
x[[1]]
lapply(mtcars, pick(5))
# should be equivalent to
lapply(mtcars, function(x) x[[5]])
```
__[A]{.solved}__: In this exercise `pick(i)` acts as a function factory, which returns the required subsetting function.
```{r}
pick <- function(i) {
force(i)
function(x) x[[i]]
}
x <- 1:3
identical(x[[1]], pick(1)(x))
identical(lapply(mtcars, function(x) x[[5]]),
lapply(mtcars, pick(5)))
```
4. __[Q]{.Q}__: Create a function that creates functions that compute the i^th^ [central moment](http://en.wikipedia.org/wiki/Central_moment) of a numeric vector. You can test it by running the following code:
```{r, eval = FALSE}
m1 <- moment(1)
m2 <- moment(2)
x <- runif(100)
stopifnot(all.equal(m1(x), 0))
stopifnot(all.equal(m2(x), var(x) * 99 / 100))
```
__[A]{.solved}__: The first moment is closely related to the mean and describes the average deviation from the mean, which is 0 (within numerical margin of error). The second moment describes the variance of the input data. If we want compare it to `var`, we need to undo [Bessel's correction}(https://en.wikipedia.org/wiki/Bessel%27s_correction) correction by multiplying with $\frac{N-1}{N}$.
```{r}
moment <- function(i){
force(i)
function(x) sum((x - mean(x)) ^ i) / length(x)
}
m1 <- moment(1)
m2 <- moment(2)
x <- runif(100)
all.equal(m1(x), 0) # removed stopifnot() for clarity
all.equal(m2(x), var(x) * 99 / 100)
```
5. __[Q]{.Q}__: What happens if you don't use a closure? Make predictions, then verify with the code below.
```{r}
i <- 0
new_counter2 <- function() {
i <<- i + 1
i
}
```
__[A]{.solved}__: Without the captured and encapsulated environment of a closure the counts will be stored in the global environment. Here they can be overwritten or deleted as well as interfere with other counters.
```{r, error = TRUE}
new_counter2()
i
new_counter2()
i
i <- 0
new_counter2()
i
```
6. __[Q]{.Q}__: What happens if you use `<-` instead of `<<-`? Make predictions, then verify with the code below.
```{r}
new_counter3 <- function() {
i <- 0
function() {
i <- i + 1
i
}
}
```
__[A]{.solved}__: Without the super assignment `<<-`, the counter will always return 1. The counter always starts in a new execution environment within the same enclosing environment, which contains an unchanged value for `i` (in this case it remains 0).
```{r}
new_counter_3 <- new_counter3()
new_counter_3()
new_counter_3()
```
## Graphical factories
1. __[Q]{.Q}__: Compare and contrast `ggplot2::label_bquote()` with `scales::number_format()`.
__[A]{.solved}__:
<!-- HW: not sure what I was thinking here, but I'd focus more on the results than the implementation. i.e. they're both function factories designed to create nice labels; `label_bquote()` is used with facets, and adds mathematical notation; `number_format()` is used with axes/legends, and adds commas etc -->
## Statistical factories
1. __[Q]{.Q}__: In `boot_model()`, why don't I need to force the evaluation of `df` or `model`?
__[A]{.solved}__: `boot_model()` ultimately returns a function, and whenever you return a function you need to make sure all the inputs are explicitly evaluated. Here that happens automatically because we use `df` and `formula` in `lm()`.
```{r}
boot_model <- function(df, formula) {
mod <- lm(formula, data = df)
fitted <- unname(fitted(mod))
resid <- unname(resid(mod))
rm(mod)
function() {
fitted + sample(resid)
}
}
```
2. __[Q]{.Q}__: Why might you formulate the Box-Cox transformation like this?
```{r}
boxcox3 <- function(x) {
function(lambda) {
if (lambda == 0) {
log(x)
} else {
(x ^ lambda - 1) / lambda
}
}
}
```
__[A]{.started}__: `boxcox3` returns a function where `x` is fixed (though it is not forced, so it may manipulated later). This allows us to apply and test different transformations for different inputs and give them a descriptive name.
```{r, out.width = "45%", fig.show = "hold", fig.align = "none"}
# initial example (should be improved)
boxcox_airpassengers <- boxcox3(AirPassengers)
plot(boxcox_airpassengers(0))
plot(boxcox_airpassengers(1))
plot(boxcox_airpassengers(2))
```
3. __[Q]{.Q}__: Why don't you need to worry that `boot_permute()` stores a copy of the data inside the function that it generates?
__[A]{.solved}__: Because it doesn't actually store a copy; it's just a name that points to the same underlying object in memory.
```{r}
boot_permute <- function(df, var) {
n <- nrow(df)
force(var)
function() {
col <- df[[var]]
col[sample(n, replace = TRUE)]
}
}
boot_mtcars1 <- boot_permute(mtcars, "mpg")
lobstr::obj_size(mtcars)
lobstr::obj_size(boot_mtcars1)
lobstr::obj_sizes(mtcars, boot_mtcars1)
```
4. __[Q]{.Q}__: How much time does `ll_poisson2()` save compared to `ll_poisson1()`? Use `bench::mark()` to see how much faster the optimisation occurs. How does changing the length of `x` change the results?
__[A]{.solved}__: Let us recall the definitions of `ll_poisson1()` and `ll_poisson2()` and the test data `x1`:
```{r}
ll_poisson1 <- function(x) {
n <- length(x)
function(lambda) {
log(lambda) * sum(x) - n * lambda - sum(lfactorial(x))
}
}
ll_poisson2 <- function(x) {
n <- length(x)
sum_x <- sum(x)
c <- sum(lfactorial(x))
function(lambda) {
log(lambda) * sum_x - n * lambda - c
}
}
# provided test data
x1 <- c(41, 30, 31, 38, 29, 24, 30, 29, 31, 38)
```
A benchmark with this data reveals a performance improvement of factor 2 for `ll_poisson2()` over `ll_poisson1()`
```{r}
bench::mark(
llp1 = optimise(ll_poisson1(x1), c(0, 100), maximum = TRUE),
llp2 = optimise(ll_poisson2(x1), c(0, 100), maximum = TRUE)
)
```
Regarding differing lengths of `x1`, we expect even further performance improvements of `ll_poisson2()` compared to `ll_poisson1()`, as the redundant calculations within `ll_poisson1()`, become more expensive with growing length of `x1`. The following results imply that for a length of `x1` of 100000, `ll_poisson2()` is about 20+ times as fast as `ll_poisson1()`:
```{r, message = FALSE, warning = FALSE}
library(purrr)
library(dplyr)
bench_poisson <- function(i){
x_i_length <- 10L ^ i
x_i <- rpois(x_i_length, 100L)
rel_advantage <- bench::mark(llp1 = optimise(ll_poisson1(x_i), c(0, 100), maximum = TRUE),
llp2 = optimise(ll_poisson2(x_i), c(0, 100), maximum = TRUE),
relative = TRUE)$median %>%
max()
rel_advantage
}
bench_df <- map_dbl(1:5, bench_poisson) %>%
tibble(i = 1:5,
rel_advantage = .,
x_length = 10 ^ i)
bench_df %>%
ggplot(aes(x_length, rel_advantage)) +
geom_point() +
geom_line() +
ggtitle("Rel. Speed of ll_poisson2() increases with vector length") +
scale_x_log10()
```
## Function factories + functionals
1. __[Q]{.Q}__: Which of the following commands is equivalent to `with(x, f(z))`?
(a) `x$f(x$z)`.
(b) `f(x$z)`.
(c) `x$f(z)`.
(d) `f(z)`.
(e) It depends.
__[A]{.solved}__: (e) "It depends" is the correct answer. Usually `with()` is used with a data frame, so you'd usually expect (b), but if `x` is a list, it could be any of the options.
```{r}
f <- mean
z <- 1
x <- list(f = mean, z = 1)
identical(with(x, f(z)), x$f(x$z))
identical(with(x, f(z)), f(x$z))
identical(with(x, f(z)), x$f(z))
identical(with(x, f(z)), f(z))
```
2. __[Q]{.Q}__: Compare and contrast the effects of `env_bind()` vs. `attach()` for the following code.
```{r}
funs <- list(
mean = function(x) mean(x, na.rm = TRUE),
sum = function(x) sum(x, na.rm = TRUE)
)
attach(funs)
mean <- function(x) stop("Hi!")
detach(funs)
env_bind(globalenv(), !!!funs)
mean <- function(x) stop("Hi!")
env_unbind(globalenv(), names(funs))
```
__[A]{.solved}__: `attach()` adds `funs` to the search path. Therefore, the provided functions are found before their respective versions from the base package. Further, they can not get accidently overwritten by similar named functions in the global environment. One annoying downsinde of using `attach()` is the possibility to attach the same object multiple times, making it necessary to call `detach()` equally often.
In contrast `rlang::env_bind()` just adds the functions in `fun` to the global environment. No further side effects are introduced and the functions are overwritten when similarly named functions are defined.
## Old exercises
### Closures
1. __[Q]{.Q}__: Why are functions created by other functions called closures?
__[A]{.solved}__: As stated in the book:
> because they enclose the environment of the parent function and can access all its variables.
2. __[Q]{.Q}__: What does the following statistical function do? What would be a better
name for it? (The existing name is a bit of a hint.)
```{r}
bc <- function(lambda) {
if (lambda == 0) {
function(x) log(x)
} else {
function(x) (x ^ lambda - 1) / lambda
}
}
```
__[A]{.solved}__: It is the logarithm, when lambda equals zero and `x ^ lambda - 1 / lambda` otherwise. A better name might be `box_cox_transformation` (one parametric), you can read about it (here)[https://en.wikipedia.org/wiki/Power_transform].
3. __[Q]{.Q}__: What does `approxfun()` do? What does it return?
__[A]{.solved}__: `approxfun` basically takes a combination of 2-dimensional data points + some extra specifications as arguments and returns a stepwise linear or constant interpolation function (defined on the range of given x-values, by default).
4. __[Q]{.Q}__: What does `ecdf()` do? What does it return?
__[A]{.solved}__: "ecdf" means empirical density function. For a numeric vector, `ecdf()` returns the appropriate density function (of class "ecdf", which is inheriting from class "stepfun"). You can describe it's behaviour in 2 steps. In the first part of it's body, the `(x,y)` pairs for the nodes of the density function are calculated. In the second part these pairs are given to `approxfun`.
5. __[Q]{.Q}__: Create a function that creates functions that compute the ith
[central moment](http://en.wikipedia.org/wiki/Central_moment) of a numeric
vector. You can test it by running the following code:
```{r, eval = FALSE}
m1 <- moment(1)
m2 <- moment(2)
x <- runif(100)
stopifnot(all.equal(m1(x), 0))
stopifnot(all.equal(m2(x), var(x) * 99 / 100))
```
__[A]{.solved}__: For a discrete formulation look [here](http://www.r-tutor.com/elementary-statistics/numerical-measures/moment)
```{r, eval = FALSE}
moment <- function(i){
function(x) sum((x - mean(x)) ^ i) / length(x)
}
```
6. __[Q]{.Q}__: Create a function `pick()` that takes an index, `i`, as an argument and
returns a function with an argument `x` that subsets `x` with `i`.
```{r, eval = FALSE}
lapply(mtcars, pick(5))
# should do the same as this
lapply(mtcars, function(x) x[[5]])
```
__[A]{.solved}__:
```{r, eval = FALSE}
pick <- function(i){
function(x) x[[i]]
}
stopifnot(identical(lapply(mtcars, pick(5)),
lapply(mtcars, function(x) x[[5]]))
)
```
### Case study: numerical integration
1. __[Q]{.Q}__: Instead of creating individual functions (e.g., `midpoint()`,
`trapezoid()`, `simpson()`, etc.), we could store them in a list. If we
did that, how would that change the code? Can you create the list of
functions from a list of coefficients for the Newton-Cotes formulae?
__[A]{.solved}__:
2. __[Q]{.Q}__: The trade-off between integration rules is that more complex rules are
slower to compute, but need fewer pieces. For `sin()` in the range
[0, $\pi$], determine the number of pieces needed so that each rule will
be equally accurate. Illustrate your results with a graph. How do they
change for different functions? `sin(1 / x^2)` is particularly challenging.
__[A]{.solved}__: