---
title: "Chapter 9: ARIMA Models"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(fpp3)
```
__Basic Concept__: ARIMA models aim to describe the autocorrelations in the data.
## 9.1 Stationarity and Differencing
__Stationarity__: a time series is considered stationary when its statistical
properties do not depend on the time at which it is observed. So any time series
with seasonality or trend is not stationary. White noise is stationary.
__Differencing__: computing the differences between consecutive observations. This helps
stabilize the mean of the series. An ACF plot is useful for identifying a non-stationary
time series, and the `unitroot_ndiffs()` and `unitroot_nsdiffs()` (for seasonal differencing) features work too.
__Second Order Differencing__: if a series is not stationary after 1 difference,
it may need a second difference.
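In symbols, with $y'_t = y_t - y_{t-1}$, the second difference is
$$y''_t = y'_t - y'_{t-1} = y_t - 2y_{t-1} + y_{t-2}.$$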
__Seasonal Differencing__: the difference between an observation and the previous
observation from the same season. Also termed "lag-_m_ differences", where _m_ is the
seasonal period.
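In symbols, for seasonal period $m$:
$$y'_t = y_t - y_{t-m}.$$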
__Example__
```{r, fig.height=9, fig.width=8}
PBS %>%
filter(ATC2 == "H02") %>%
summarise(Cost = sum(Cost)/1e6) %>%
transmute(
`Sales ($million)` = Cost,
`Log sales` = log(Cost),
`Annual change in log sales` = difference(log(Cost), 12),
`Doubly differenced log sales` =
difference(difference(log(Cost), 12), 1)
) %>%
pivot_longer(-Month, names_to="Type", values_to="Sales") %>%
mutate(
Type = factor(Type, levels = c(
"Sales ($million)",
"Log sales",
"Annual change in log sales",
"Doubly differenced log sales"))
) %>%
ggplot(aes(x = Month, y = Sales)) +
geom_line() +
facet_grid(vars(Type), scales = "free_y") +
labs(title = "Corticosteroid drug sales", y = NULL)
```
1. Selecting which differences to apply is somewhat subjective.
2. If both seasonal and non-seasonal differencing are suggested and the seasonal
strength is high, perform the seasonal difference first.
3. Applying more differences than needed will introduce false dynamics. Perform
as few as necessary.
### Unit Root Tests
The objective way to determine whether differencing is required. They can be performed
with `unitroot_kpss`, `unitroot_ndiffs`, or `unitroot_nsdiffs` in the `features()`
function. To compute all of them at once, use `features(Your_Data_Here, feature_set(pkgs = "feasts"))`.
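As a quick sketch, applying the unit root features to the Egyptian exports series used later in these notes:
```{r}
global_economy %>%
  filter(Code == "EGY") %>%
  features(Exports, list(unitroot_kpss, unitroot_ndiffs))
```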
***
## 9.3 Autoregressive Models
Forecasting the variable of interest using a linear combination of past values
of the variable.
__autoregression__: a regression of the variable against itself
AR models are restricted to stationary data.
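For reference, an AR($p$) model can be written as
$$y_t = c + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \cdots + \phi_p y_{t-p} + \varepsilon_t,$$
where $\varepsilon_t$ is white noise.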
***
## 9.4 Moving Average Models
Instead of using previous values of the variable in a regression, a MA model
uses past forecast errors in a regression-like model.
Each observation of the data can be thought of as the weighted moving average of
the past few forecast errors.
Not to be confused with moving average smoothing from previous chapters. MA models
forecast future values while MA smoothing is for estimating trends of past values.
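An MA($q$) model can be written as
$$y_t = c + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \theta_2 \varepsilon_{t-2} + \cdots + \theta_q \varepsilon_{t-q},$$
where $\varepsilon_t$ is white noise.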
***
## 9.5 Non-Seasonal ARIMA Models
ARIMA combines autoregression, differencing, and moving average models. It stands
for AutoRegressive Integrated Moving Average. This combination gives the ability to
model the lagged values and lagged errors as well as make the series stationary.
ARIMA(p,d,q)
* p: order of the autoregression
* d: degree of differencing
* q: order of moving average
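Written out, with $y'_t$ denoting the differenced series, the full model is
$$y'_t = c + \phi_1 y'_{t-1} + \cdots + \phi_p y'_{t-p} + \theta_1 \varepsilon_{t-1} + \cdots + \theta_q \varepsilon_{t-q} + \varepsilon_t.$$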
`ARIMA()` from the `fable` package can select these parameters for you.
__Example__
```{r}
global_economy %>%
filter(Code == "EGY") %>%
autoplot(Exports) +
labs(y = "% of GDP", title = "Egyptian Exports")
```
```{r}
fit <- global_economy %>%
filter(Code == "EGY") %>%
model(ARIMA(Exports))
report(fit)
```
```{r}
fit %>% forecast(h=10) %>%
autoplot(global_economy) +
labs(y = "% of GDP", title = "Egyptian Exports")
```
Above we see that an ARIMA(2,0,1) model captures the cyclic pattern seen over the decades.
### ACF and PACF plots
To manually determine starting values for p and q (with d set by differencing), use the
ACF plot (for q) and the PACF plot (for p), choosing orders from the significant lags (those extending beyond the blue dashed bounds).
__ACF__ measures the relationship between $y_t$ and $y_{t-k}$ (lagged values)
__PACF__ measures the relationship between $y_t$ and $y_{t-k}$ after removing the
effects of the lags in between.
__Example__
```{r}
global_economy %>%
filter(Code == "EGY") %>%
gg_tsdisplay(Exports, plot_type = 'partial')
```
Going purely by significant lags, a (4,0,9) model might look like the
recommendation here (last significant lag at 4 in the PACF and at 9 in the ACF), but keep
this in mind:
ARIMA(p,d,0) may be warranted if:
* ACF is exponentially decaying or sinusoidal
* significant spike at lag _p_ in the PACF, but none beyond
ARIMA(0,d,q) may be warranted if:
* PACF is exponentially decaying or sinusoidal
* significant spike at lag _q_ in the ACF, but none beyond
Above we see a decaying sinusoidal pattern in the ACF, and the PACF shows its last
significant spike at lag 4, so ARIMA(4,0,0) is the suggested starting point. To restrict
the values that `ARIMA()` searches over, use:
`ARIMA(y ~ pdq(p=1:4, d=1, q=0:2))`
The (4,0,0) model does only slightly worse than the automatically selected model
(with an AICc of 294.70 compared to 294.29).
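As a sketch, that comparison could be reproduced as follows (the model labels `arima400` and `auto` are just illustrative names):
```{r}
global_economy %>%
  filter(Code == "EGY") %>%
  model(
    arima400 = ARIMA(Exports ~ pdq(4, 0, 0)),
    auto = ARIMA(Exports)
  ) %>%
  glance() %>%
  arrange(AICc) %>%
  select(.model, AICc)
```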
***
## 9.6 Estimation and Order Selection
__MLE__: Maximum Likelihood Estimation finds the values of the parameters which
maximize the probability of obtaining the observed series.
ARIMA models are much more complicated to estimate than regression models, and
estimation can vary based on the algorithm used.
`fable` will maximize the log likelihood of the data; the log of the probability of the
observed data coming from the model.
__Information Criteria__
AIC, AICc, and BIC are also used to determine the best orders of p and q (the
textbook prefers AICc). However, because differencing changes the data on which the log
likelihood is computed, information criteria cannot be compared between models
with different values of d.
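For reference, the AIC the book uses for ARIMA models is
$$\text{AIC} = -2\log(L) + 2(p + q + k + 1),$$
where $L$ is the likelihood of the data and $k = 1$ if $c \neq 0$ (0 otherwise); AICc and BIC are the usual corrected variants of this quantity.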
***
## 9.7 ARIMA Modeling in `fable`
#### How Does `ARIMA()` Work?
1. The number of differences is determined by repeated KPSS tests.
2. The values of p and q are then chosen by minimizing the AICc after differencing the data.
The algorithm fits four initial models and uses a step-wise search, minimizing the AICc at each step.
***
## 9.8 Forecasting
#### Prediction Intervals
ARIMA forecasting assumes the residuals are uncorrelated and normally distributed;
the prediction intervals are unreliable if either assumption fails. Always plot the ACF and a
histogram of the residuals.
If the residuals are uncorrelated but not normally distributed, obtain bootstrapped intervals
by using `bootstrap = TRUE` in `forecast()`.
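As a sketch, reusing the Egyptian exports `fit` from section 9.5:
```{r}
fit %>%
  forecast(h = 10, bootstrap = TRUE) %>%
  autoplot(global_economy) +
  labs(y = "% of GDP", title = "Egyptian exports (bootstrapped intervals)")
```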
***
## 9.9 Seasonal ARIMA Models
For seasonal ARIMA models, additional seasonal terms $(P,D,Q)_m$ are included, where
$m$ is the seasonal period.
#### Seasonal ARIMA Modeling Procedure
__Example__
Data:
```{r}
leisure <- us_employment %>%
filter(Title == "Leisure and Hospitality",
year(Month) > 2000) %>%
mutate(Employed = Employed/1000) %>%
select(Month, Employed)
autoplot(leisure, Employed) +
labs(title = "US employment: leisure and hospitality",
y="Number of people (millions)")
```
Non-stationary data with clear seasonality and trend.
1. Take a seasonal difference. The seasonality is annual (monthly data), so we take `difference(Employed, 12)` below.
```{r}
leisure %>%
gg_tsdisplay(difference(Employed, 12),
plot_type='partial', lag=36) +
labs(title="Seasonally differenced", y="")
```
The seasonally differenced series is still clearly non-stationary, with a wandering mean
and many significant lags.
2. Further (first) differencing is required.
```{r}
leisure %>%
gg_tsdisplay(difference(Employed, 12) %>% difference(),
plot_type='partial', lag=36) +
labs(title = "Double differenced", y="")
```
3. Now that the differenced series is centered around zero, we look to the ACF and PACF to
define our terms.
* ACF: significant spike at lag 2 suggests a non-seasonal MA(2) component
* ACF: significant spike at lag 12 suggests a seasonal MA(1) component
This leads to an ARIMA(0,1,2)(0,1,1)$_{12}$ model.
* PACF: significant spike at lag 2 suggests a non-seasonal AR(2) component
* The ACF is still used to determine the seasonal component.
This leads to an ARIMA(2,1,0)(0,1,1)$_{12}$ model.
We also include an automatically selected model, setting `stepwise = FALSE` and `approximation = FALSE`
to make R search harder for the best model.
```{r}
fit <- leisure %>%
model(
arima012011 = ARIMA(Employed ~ pdq(0,1,2) + PDQ(0,1,1)),
arima210011 = ARIMA(Employed ~ pdq(2,1,0) + PDQ(0,1,1)),
auto = ARIMA(Employed, stepwise = FALSE, approximation = FALSE)
)
```
```{r}
glance(fit) %>% arrange(AICc) %>% select(.model:BIC)
```
We see here that the automatically selected model performs best by AICc.
Finding the order and residuals of the auto ARIMA:
```{r}
fit %>%
select(auto) %>%
report()
fit %>%
select(auto) %>%
gg_tsresiduals(lag=36)
```
One small but significant spike (lag 11) is still consistent with white noise.
4. Verify residuals are white noise
```{r}
augment(fit) %>% filter(.model == "auto") %>% features(.innov, ljung_box, lag=24, dof=4)
```
The model residuals are not distinguishable from white noise.
5. Forecast
```{r}
forecast(fit, h=36) %>%
filter(.model=='auto') %>%
autoplot(leisure) +
labs(title = "US employment: leisure and hospitality",
y="Number of people (millions)")
```
***
## 9.10 ARIMA vs ETS
To compare ARIMA and ETS models use time series cross-validation or train/test split.
Using CV:
__Example__
```{r}
aus_economy <- global_economy %>%
filter(Code == "AUS") %>%
mutate(Population = Population/1e6)
aus_economy %>%
slice(-n()) %>%
stretch_tsibble(.init = 10) %>%
model(
ETS(Population),
ARIMA(Population)
) %>%
forecast(h = 1) %>%
accuracy(aus_economy) %>%
select(.model, RMSE:MAPE)
```
```{r}
aus_economy %>%
model(ETS(Population)) %>%
forecast(h = "5 years") %>%
autoplot(aus_economy %>% filter(Year >= 2000)) +
labs(title = "Australian population",
y = "People (millions)")
```