Commit af932cf (parent b786750)

Commit message: backtesting.rmd

1 file changed: vignettes/backtesting.Rmd (96 additions, 93 deletions)
@@ -11,46 +11,43 @@ vignette: >
 source(here::here("vignettes/_common.R"))
 ```
 
-```{r pkgs, message=FALSE}
-library(epipredict)
-library(epiprocess)
-library(epidatr)
-library(data.table)
-library(dplyr)
-library(tidyr)
-library(ggplot2)
-library(magrittr)
-library(purrr)
-```
-
 Backtesting is a crucial step in the development of forecasting models. It
-involves testing the model on historical data to see how well it performs. This
-is important because it allows us to see how well the model generalizes to new
-data and to identify any potential issues with the model. In the context of
+involves testing the model on historical time periods to see how well it generalizes to new
+data.
+
+In the context of
 epidemiological forecasting, to do backtesting accurately, we need to account
-for the fact that the data available at the time of the forecast would have been
-different from the data available at the time of the backtest. This is because
-new data is constantly being collected and added to the dataset, which can
-affect the accuracy of the forecast.
-
-For this reason, it is important to use version-faithful forecasting, where the
-model is trained on data that would have been available at the time of the
-forecast. This ensures that the model is tested on data that is as close as
-possible to what would have been available in real-time; training and making
-predictions on finalized data can lead to an overly optimistic sense of accuracy
+for the fact that the data available at _the time of the forecast_ would have been
+different from the data available at the time of the _backtest_.
+This is because
+new data is constantly being collected and added to the dataset, and old data potentially revised.
+Training and making
+predictions only on finalized data can lead to overly optimistic estimates of accuracy
 (see, for example, [McDonald et al.
 (2021)](https://www.pnas.org/content/118/51/e2111453118/) and the references
 therein).
 
-In the `{epiprocess}` package, we provide `epix_slide()`, a function that allows
-a convenient way to perform version-faithful forecasting by only using the data as
+In the `{epiprocess}` package, we provide the function `epix_slide()` to help conveniently perform version-faithful forecasting by only using the data as
 it would have been available at forecast reference time.
 In this vignette, we will demonstrate how to use `epix_slide()` to backtest an
 auto-regressive forecaster constructed using `arx_forecaster()` on historical
 COVID-19 case data from the US and Canada.
 
 # Getting case data from US states into an `epi_archive`
 
+```{r pkgs, message=FALSE}
+# Setup
+library(epipredict)
+library(epiprocess)
+library(epidatr)
+library(data.table)
+library(dplyr)
+library(tidyr)
+library(ggplot2)
+library(magrittr)
+library(purrr)
+```
+
 First, we create an `epi_archive()` to store the version history of the
 percentage of doctor's visits with CLI (COVID-like illness) computed from
 medical insurance claims and the number of new confirmed COVID-19 cases per
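The mechanism the rewritten intro leans on is that an `epi_archive` can be queried for data exactly as it stood on a given date. A minimal sketch of that idea, not part of the commit, using the `doctor_visits` archive built below:

```r
# Version-faithful data access in miniature: epix_as_of() returns the archive
# as it looked on the given version date, so later revisions are invisible.
library(epiprocess)

as_of_sept <- doctor_visits |> epix_as_of(as.Date("2020-09-01"))
finalized <- doctor_visits |> epix_as_of(doctor_visits$versions_end)
# The two snapshots generally disagree on overlapping dates due to revisions.
```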
@@ -78,13 +75,13 @@ doctor_visits <- pub_covidcast(
   time_values = epirange(20200601, 20211201),
   issues = epirange(20200601, 20211201)
 ) |>
+  # The version date column is called `issue` in the Epidata API. Rename it.
   rename(version = issue, percent_cli = value) |>
   as_epi_archive(compactify = TRUE)
 ```
 
-`issues` is the name for `version` in the Epidata API.
-In the interest of computational speed, we only use the 4 state dataset limited
-to 2020-2021, but the full archive can be used in the same way and has performed
+In the interest of computational speed, we limit the dataset to 4 states and
+2020-2021, but the full archive can be used in the same way and has performed
 well in the past.
 
 We choose this dataset in particular partly because it is revision heavy; for
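One way to see that revision-heaviness directly; a sketch relying on the documented `$DT` field of an `epi_archive`, which stores one row per `(geo_value, time_value, version)` triple. The state and date chosen are illustrative:

```r
# Every value ever reported for one state and one date, across versions.
doctor_visits$DT |>
  dplyr::filter(geo_value == "ca", time_value == as.Date("2020-08-01")) |>
  dplyr::arrange(version) |>
  dplyr::select(time_value, version, percent_cli)
```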
@@ -104,11 +101,11 @@ percent_cli_data <- bind_rows(
       mutate(version = .x)
   ) |>
     bind_rows() |>
-    mutate(version_faithful = TRUE),
-  # Latest data for the version-faithless forecasts
+    mutate(version_faithful = "Version faithful"),
+  # Latest data for the version-un-faithful forecasts
   doctor_visits |>
     epix_as_of(doctor_visits$versions_end) |>
-    mutate(version_faithful = FALSE)
+    mutate(version_faithful = "Version un-faithful")
 )
 
 p0 <-
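The hunk above shows only the tail of the snapshot-building pipeline. A hedged sketch of what the elided monthly-snapshot construction plausibly looks like, given the `mutate(version = .x)` fragment (the date range and object names are assumptions, not from the commit):

```r
# Snapshot the archive on the first of each month and tag each snapshot with
# the date it was taken, for the versioned plot below.
snapshot_dates <- seq(as.Date("2020-08-01"), as.Date("2021-11-01"), by = "month")
monthly_snapshots <- purrr::map(
  snapshot_dates,
  ~ doctor_visits |>
    epix_as_of(.x) |>
    mutate(version = .x)
) |>
  bind_rows()
```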
@@ -125,18 +122,18 @@
 ```
 </details>
 
-```{r plot_just_revisioning, warn = FALSE}
+```{r plot_just_revisioning, warn = FALSE, message = FALSE}
 p0
 ```
 
 The snapshots are taken on the first of each month, with the vertical dashed
-line representing the sample date for the time series of the corresponding
+line representing the issue date for the time series of the corresponding
 color.
 For example, the snapshot on March 1st, 2021 is aquamarine, and increases to
 slightly over 10.
 Every series is necessarily to the left of the snapshot date (since all known
 values must happen before the snapshot is taken[^4]).
-The grey line at the center of all the various snapshots represents the "final
+The grey line overlaying the various snapshots represents the "final
 value", which is just the snapshot at the last version in the archive (the
 `versions_end`).
 
@@ -146,12 +143,12 @@ The drop in January 2021 in the snapshot on `2021-02-01` was initially reported
 as much steeper than it eventually turned out to be, while in the period after
 that the values were initially reported as higher than they actually were.
 
-A feature that is common in both real-time forecasting and retrospective
-forecasting is data latency.
+Handling data latency is important in both real-time forecasting and retrospective
+forecasting.
 Looking at the very first snapshot, `2020-08-01` (the red dotted
 vertical line), there is a noticeable gap between the forecast date and the end
 of the red time-series to its left.
-In fact, if we take a snapshot and get the last `time_value`:
+In fact, if we take a snapshot and get the last `time_value`,
 
 ```{r}
 doctor_visits |>
@@ -160,7 +157,7 @@ doctor_visits |>
   max()
 ```
 
-The last day of data is the 25th, a entire week before `2020-08-01`.
+the last day of data is the 25th, an entire week before `2020-08-01`.
 This can require some effort to work around, especially if the latency is
 variable; see `step_adjust_latency()` for some methods included in this package.
 Much of that functionality is built into `arx_forecaster()` using the parameter
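(The sentence above continues past the hunk boundary to name that parameter, `adjust_latency`.) A sketch of the compensation being described; the `"extend_ahead"` value is an assumption, as accepted values can differ across `{epipredict}` versions:

```r
# Measure the latency at a forecast date, then let arx_forecaster()
# compensate rather than silently forecasting from week-old data.
forecast_date <- as.Date("2020-08-01")
snapshot <- doctor_visits |> epix_as_of(forecast_date)
forecast_date - max(snapshot$time_value) # 7 days of latency here

fc <- arx_forecaster(
  snapshot,
  outcome = "percent_cli",
  predictors = "percent_cli",
  args_list = arx_args_list(ahead = 14, adjust_latency = "extend_ahead")
)
```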
@@ -172,16 +169,16 @@ Much of that functionality is built into `arx_forecaster()` using the parameter
 One of the most common use cases of the `epiprocess::epi_archive()` object is for
 accurate model back-testing.
 
-To start, let's use a simple autoregressive forecaster to predict the percentage
-of doctor's hospital visits with CLI (COVID-like illness) (`percent_cli`) 14
+To start, let's use a simple autoregressive forecaster to predict `percent_cli`, the percentage
+of doctor's hospital visits associated with COVID-like illness, 14
 days in the future.
 For increased accuracy we will use quantile regression.
 
 ## Comparing a single day and ahead
 
-As a sanity check before we backtest the entire dataset, let's look at
-forecasting a single day in the middle of the dataset.
-One way to do this is by setting the `.version` argument for `epix_slide()`:
+As a sanity check before we backtest the _entire_ dataset, let's
+forecast a single day in the middle of the dataset.
+We can do this by setting the `.versions` argument in `epix_slide()`:
 
 ```{r single_version, warn = FALSE}
 forecast_date <- as.Date("2021-04-06")
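The diff truncates this chunk. A hedged guess at the shape of the elided call, extrapolated from the `generate_forecasts` chunk later in the file (the `$predictions` accessor and the `adjust_latency` value are assumptions):

```r
# Slide the forecaster at exactly one version, using up to 120 days of data
# as it stood on 2021-04-06.
forecasts <- doctor_visits |>
  epix_slide(
    ~ arx_forecaster(
      .x,
      outcome = "percent_cli",
      predictors = "percent_cli",
      args_list = arx_args_list(ahead = 14, adjust_latency = "extend_ahead")
    )$predictions,
    .before = 120,
    .versions = forecast_date
  )
```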
@@ -198,8 +195,8 @@ forecasts <- doctor_visits |>
   )
 ```
 
-As truth data, we'll compare with the `epix_as_of()` to generate a snapshot of
-the archive at the last date[^1].
+We need truth data to compare our forecast against. We can construct it by using `epix_as_of()` to snapshot
+the archive at the last available date[^1].
 
 ```{r compare_single_with_result}
 forecasts |>
@@ -211,20 +208,20 @@ forecasts |>
   select(geo_value, forecast_date, .pred, `0.05`, `0.95`, percent_cli)
 ```
 
-`.pred` corresponds to the point forecast (median) and the `0.05`, `0.95`
-correspond to those respective quantiles.
-The forecasts fall within the prediction interval, so our
+`.pred` corresponds to the point forecast (median), and `0.05` and `0.95`
+correspond to the 5th and 95th percentiles.
+The `percent_cli` truth data falls within the prediction intervals, so our
 implementation passes a simple validation.
 
-## Comparing version faithful and version faithless forecasts
+## Comparing version faithful and version un-faithful forecasts
+
+Now let's compare the behavior of this forecaster, both properly considering data versioning
+("version faithful") and ignoring data versions ("version un-faithful").
 
-Now let's go ahead and slide this forecaster in a version faithless way and a
-version faithful way.
-For the version faithless way, to still use `epix_slide` to do the backtesting
-we need to snapshot the latest version of the data, and then make a faux archive
-by setting `version = time_value`[^2].
-This has the effect of simulating a data set that receives the final version
-updates every day.
+For the version un-faithful approach, we need to do some setup if we want to use `epix_slide` for backtesting.
+We want to simulate a data set that receives finalized updates every day, that is, a data set with no revisions.
+To do this, we will snapshot the latest version of the data to create a synthetic data set, and convert it into an archive
+where `version = time_value`[^2].
 
 ```{r}
 archive_cases_dv_subset_faux <- doctor_visits |>
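The middle of this chunk is elided between hunks. Going strictly by the surrounding prose, the complete construction plausibly reads (a hedged reconstruction, not the commit's verbatim code):

```r
# Snapshot the finalized data, then pretend each observation was published on
# the day it occurred, yielding a revision-free faux archive.
archive_cases_dv_subset_faux <- doctor_visits |>
  epix_as_of(doctor_visits$versions_end) |>
  mutate(version = time_value) |>
  as_epi_archive()
```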
@@ -233,9 +230,9 @@ archive_cases_dv_subset_faux <- doctor_visits |>
   as_epi_archive()
 ```
 
-For the version faithful way, we will simply use the true `epi_archive` object.
-To reduce typing, we'll create `forecast_wrapper()` to cover mapping across
-aheads and some variations we will use later.
+For the version faithful approach, we will continue using the original `epi_archive` object containing all version updates.
+
+We will also create the helper function `forecast_wrapper()` to let us easily map across aheads.
 
 ```{r arx-kweek-preliminaries, warning = FALSE}
 forecast_wrapper <- function(
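The function body is elided here as well. A hedged sketch of a wrapper consistent with the call sites below, `forecast_wrapper(.x, aheads, "percent_cli", "percent_cli")`, and with the `adjust_latency` discussion that follows (the body is an assumption):

```r
# Fit one arx_forecaster() per ahead and stack the prediction tibbles.
forecast_wrapper <- function(epi_data, aheads, outcome, predictors) {
  purrr::map(aheads, function(ahead) {
    arx_forecaster(
      epi_data,
      outcome = outcome,
      predictors = predictors,
      args_list = arx_args_list(ahead = ahead, adjust_latency = "extend_ahead")
    )$predictions
  }) |>
    bind_rows()
}
```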
@@ -259,13 +256,12 @@ forecast_wrapper <- function(
 }
 ```
 
-Note that we have used the parameter`adjust_latency` mentioned above, which
-comes up because on any given forecast date, such as `2020-08-01` the latest
-data to be released may be several days old.
+Note that in the helper function, we're using the parameter `adjust_latency`.
+We need to use it because the most recently released data may still be several days old on any given forecast date;
 `adjust_latency` will modify the forecaster to compensate[^5].
 See the function `step_adjust_latency()` for more details and examples.
 
-And now we generate the forecasts for both the version faithful and faithless
+Now that we're set up, we can generate forecasts for both the version faithful and un-faithful
 archives, and bind the results together.
 
 ```{r generate_forecasts, warning = FALSE}
@@ -276,39 +272,39 @@ forecast_dates <- seq(
 )
 aheads <- c(1, 7, 14, 21, 28)
 
-version_faithless <- archive_cases_dv_subset_faux |>
+version_unfaithful <- archive_cases_dv_subset_faux |>
   epix_slide(
     ~ forecast_wrapper(.x, aheads, "percent_cli", "percent_cli"),
     .before = 120,
     .versions = forecast_dates
   ) |>
-  mutate(version_faithful = FALSE)
+  mutate(version_faithful = "Version un-faithful")
 
 version_faithful <- doctor_visits |>
   epix_slide(
     ~ forecast_wrapper(.x, aheads, "percent_cli", "percent_cli"),
     .before = 120,
     .versions = forecast_dates
   ) |>
-  mutate(version_faithful = TRUE)
+  mutate(version_faithful = "Version faithful")
 
 forecasts <-
   bind_rows(
-    version_faithless,
+    version_unfaithful,
     version_faithful
   )
 ```
 
-Here, `arx_forecaster()` does all the heavy lifting.
-It creates leads of the target (respecting time stamps and locations) along with
-lags of the features (here, the response and doctors visits), estimates a
+`arx_forecaster()` does all the heavy lifting.
+It creates lagged copies of the features (here, the response and doctor's visits),
+creates leading copies of the target (respecting time stamps and locations), fits a
 forecasting model using the specified engine, creates predictions, and
-non-parametric confidence bands.
+creates non-parametric confidence bands.
 
-To see how the predictions compare, we plot them on top of the latest case
-rates, using the same versioning plotting method as above.
-Note that even though we've fitted the model on four states (ca, fl, tx, and
-ny), we'll just display the results for two states, California (CA) and Florida
+To see how the version faithful and un-faithful predictions compare, let's plot them on top of the latest case
+rates, using the same versioned plotting method as above.
+Note that even though we fit the model on four states (California, Texas, Florida, and
+New York), we'll just display the results for two states, California (CA) and Florida
 (FL), to get a sense of the model performance while keeping the graphic simple.
 
 <details>
@@ -372,30 +368,37 @@ p2 <-
 p1
 ```
 
-For California above and Florida below, neither approach produces amazingly
-accurate forecasts.
-The difference between the forecasts produced is quite striking however,
-especially in the single day ahead forecasts.
+The version faithful and un-faithful forecasts look moderately similar except for the 1-day horizons
+(although neither approach produces amazingly accurate forecasts).
 
-In the version faithful case for California above, the December 2020 forecast
-starts at nearly 30, whereas the equivalent version faithless forecast at least
-starts at the correct value, even if it is overly pessimistic about the eventual
-peak that will occur.
+In the version faithful case for California, the March 2021 forecast (turquoise)
+starts at a value just above 10, which lines up well with the reported values leading up to that forecast.
+The measured and forecasted trends are also concordant (both increasing moderately fast).
 
-In the version faithful case for Florida below, the late 2021 forecasts, such as
-September are wildly off base, routinely estimating zero cases, thanks to the
-versioned data systematically under-reporting.
-There is a similar if somewhat less extreme effect on December 2020 as in
-California.
+Because the data for this time period was later adjusted down with a decreasing trend, the March 2021 forecast looks quite bad compared to finalized data.
 
-Without using the data as you would have seen it at the time as in the version
-faithful case, you potentially have no insight into what kind of performance you
-can expect in practice.
+The equivalent version un-faithful forecast starts at a value of 5, which is in line with the finalized data but would have been out of place compared to the version data.
 
 ```{r show-plot2, warning = FALSE, echo=FALSE}
 p2
 ```
 
+Now let's look at Florida.
+In the version faithful case, the three late-2021 forecasts (purples and pinks) starting in September predict very low values, near 0.
+The trend leading up to each forecast shows a substantial decrease, so these forecasts seem appropriate and we would expect them to score fairly well on various performance metrics when compared to the versioned data.
+
+In hindsight, we know that early versions of the data systematically under-reported COVID-related doctor visits, such that these forecasts don't actually perform well compared to _finalized_ data.
+In this example, version faithful forecasts predicted values at or near 0 while finalized data shows values in the 5-10 range.
+As a result, the version un-faithful forecasts for these same dates are quite a bit higher, and would perform well when scored using the finalized data and poorly with versioned data.
+
+In general, the longer ago a forecast was made, the worse its performance is compared to finalized data. Finalized data accumulates revisions over time that make it deviate more and more from the non-finalized data a model was trained on.
+Forecasts trained solely on finalized data will of course appear to perform better when scored on finalized data, but will have unknown performance on the non-finalized data we need to use if we want timely predictions.
+
+Without using data that would have been available on the actual forecast date,
+you have little insight into what level of performance you
+can expect in practice.
+
+
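A sketch of the scoring comparison this discussion implies: mean absolute error of each variant against finalized truth. Column names follow typical `arx_forecaster()` prediction output (`target_date`, `.pred`) and should be treated as assumptions:

```r
# Score both forecast variants against finalized truth with MAE.
finalized_truth <- doctor_visits |>
  epix_as_of(doctor_visits$versions_end) |>
  select(geo_value, target_date = time_value, truth = percent_cli)

forecasts |>
  inner_join(finalized_truth, by = c("geo_value", "target_date")) |>
  group_by(version_faithful) |>
  summarize(mae = mean(abs(.pred - truth), na.rm = TRUE))
```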
 
 [^1]: For forecasting a single day like this, we could have actually just used
     `doctor_visits |> epix_as_of(forecast_date)` to get the relevant snapshot, and then fed that into `arx_forecaster()` as we did in the [landing
