vignettes/backtesting.Rmd (96 additions, 93 deletions)
@@ -11,46 +11,43 @@ vignette: >
source(here::here("vignettes/_common.R"))
```

Backtesting is a crucial step in the development of forecasting models. It
involves testing the model on historical time periods to see how well it generalizes to new
data.

In the context of epidemiological forecasting, to do backtesting accurately, we need to account
for the fact that the data available at _the time of the forecast_ would have been
different from the data available at the time of the _backtest_.
This is because new data is constantly being collected and added to the dataset, and old data potentially revised.
Training and making predictions only on finalized data can lead to overly optimistic estimates of accuracy
(see, for example, [McDonald et al.
(2021)](https://www.pnas.org/content/118/51/e2111453118/) and the references
therein).
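
An `epi_archive` makes this concrete: it can be queried for the data exactly as it appeared on any given date. As a minimal sketch (not evaluated here), assuming the `doctor_visits` archive we construct later in this vignette:

```{r as-of-sketch, eval = FALSE}
# Illustrative sketch (not evaluated): two snapshots of the same archive can
# report different values for the same time_values because of later revisions.
snapshot_june <- doctor_visits |> epix_as_of(as.Date("2021-06-01"))
snapshot_dec <- doctor_visits |> epix_as_of(as.Date("2021-12-01"))
```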

In the `{epiprocess}` package, we provide the function `epix_slide()` to help conveniently perform
version-faithful forecasting by only using the data as
it would have been available at forecast reference time.
In this vignette, we will demonstrate how to use `epix_slide()` to backtest an
auto-regressive forecaster constructed using `arx_forecaster()` on historical
COVID-19 case data from the US and Canada.
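
The overall pattern looks roughly like the sketch below (not evaluated here). The `forecast_dates` vector, the outcome and predictor column names, and the `.versions` argument name are illustrative assumptions, since the exact `epix_slide()` interface has varied across `{epiprocess}` releases, and `doctor_visits` stands in for the archive we build in the next section:

```{r slide-sketch, eval = FALSE}
# Illustrative sketch (not evaluated): for each date in forecast_dates,
# epix_slide() hands the forecaster only the data available as of that date.
forecast_dates <- seq(as.Date("2020-10-01"), as.Date("2021-10-01"), by = "month")

backtest_results <- doctor_visits |>
  epix_slide(
    ~ arx_forecaster(
      .x,
      outcome = "case_rate",
      predictors = c("case_rate", "percent_cli")
    )$predictions,
    .versions = forecast_dates
  )
```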

# Getting case data from US states into an `epi_archive`

```{r pkgs, message=FALSE}
# Setup
library(epipredict)
library(epiprocess)
library(epidatr)
library(data.table)
library(dplyr)
library(tidyr)
library(ggplot2)
library(magrittr)
library(purrr)
```

First, we create an `epi_archive()` to store the version history of the
percentage of doctor's visits with CLI (COVID-like illness) computed from
medical insurance claims and the number of new confirmed COVID-19 cases per

`arx_forecaster()` does all the heavy lifting.
It creates and lags copies of the features (here, the response and doctor's visits),
creates and leads copies of the target (respecting time stamps and locations), fits a
forecasting model using the specified engine, creates predictions, and
creates non-parametric confidence bands.
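
For orientation, a call to `arx_forecaster()` typically looks something like the following sketch (not evaluated here); the snapshot object, lags, horizon, and engine shown are illustrative assumptions rather than the exact settings used in this vignette:

```{r arx-sketch, eval = FALSE}
# Illustrative sketch (not evaluated): epi_snapshot, the lags, the horizon, and
# the engine are assumptions, not the exact settings used in this vignette.
arx_forecaster(
  epi_snapshot,                                # an epi_df as of the forecast date
  outcome = "case_rate",                       # column that gets led (the target)
  predictors = c("case_rate", "percent_cli"),  # columns that get lagged (the features)
  trainer = quantile_reg(),                    # any parsnip-style engine
  args_list = arx_args_list(
    lags = c(0L, 7L, 14L),                     # lagged copies of each predictor
    ahead = 7L                                 # lead (forecast horizon) for the target
  )
)
```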

To see how the version faithful and un-faithful predictions compare, let's plot them on top of the latest case
rates, using the same versioned plotting method as above.
Note that even though we fit the model on four states (California, Texas, Florida, and
New York), we'll just display the results for two states, California (CA) and Florida
(FL), to get a sense of the model performance while keeping the graphic simple.

<details>
@@ -372,30 +368,37 @@ p2 <-
p1
```

The version faithful and un-faithful forecasts look moderately similar except at the 1-day-ahead horizon
(although neither approach produces amazingly accurate forecasts).

In the version faithful case for California, the March 2021 forecast (turquoise)
starts at a value just above 10, which lines up well with the values reported leading up to that forecast.
The measured and forecasted trends are also concordant (both increasing moderately fast).

Because the data for this time period was later adjusted downward, with a decreasing trend, the March 2021 forecast looks quite bad compared to finalized data.

The equivalent version un-faithful forecast starts at a value of 5, which is in line with the finalized data but would have been out of place compared to the versioned data.

```{r show-plot2, warning = FALSE, echo=FALSE}
p2
```

Now let's look at Florida.
In the version faithful case, the three late-2021 forecasts (purples and pinks) starting in September predict very low values, near 0.
The trend leading up to each forecast shows a substantial decrease, so these forecasts seem appropriate, and we would expect them to score fairly well on various performance metrics when compared to the versioned data.

In hindsight, we know that early versions of the data systematically under-reported COVID-related doctor visits, so these forecasts don't actually perform well compared to _finalized_ data.
In this example, the version faithful forecasts predicted values at or near 0, while the finalized data shows values in the 5-10 range.
As a result, the version un-faithful forecasts for these same dates are quite a bit higher, and would perform well when scored against the finalized data but poorly against the versioned data.

In general, the longer ago a forecast was made, the worse its performance looks compared to finalized data: finalized data accumulates revisions over time that make it deviate more and more from the non-finalized data the model was trained on.
Forecasts trained solely on finalized data will of course appear to perform better when scored on finalized data, but will have unknown performance on the non-finalized data we need to use if we want timely predictions.
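
One way to quantify this (again only a sketch, not evaluated here) is to score the same backtest predictions against the finalized snapshot; repeating the join against `epix_as_of(forecast_date)` snapshots would give the corresponding versioned scores. The `backtest_results` object and its column names are hypothetical:

```{r score-sketch, eval = FALSE}
# Illustrative sketch (not evaluated): backtest_results is the hypothetical
# predictions tibble from the earlier epix_slide() sketch; column names are
# assumptions.
finalized <- doctor_visits |> epix_as_of(doctor_visits$versions_end)

backtest_results |>
  left_join(finalized, by = c("geo_value", "target_date" = "time_value")) |>
  group_by(geo_value) |>
  summarise(mae_vs_finalized = mean(abs(.pred - case_rate), na.rm = TRUE))
```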

Without using data that would have been available on the actual forecast date,
you have little insight into what level of performance you
can expect in practice.

[^1]: For forecasting a single day like this, we could have actually just used
`doctor_visits |> epix_as_of(forecast_date)` to get the relevant snapshot, and then fed that into `arx_forecaster()` as we did in the [landing