Commit b786750
committed epipredict.Rmd
1 parent d6edae8

File tree: 1 file changed, +74 -64 lines

vignettes/epipredict.Rmd

Lines changed: 74 additions & 64 deletions
@@ -1,5 +1,5 @@
 ---
-title: "Get started with `{epipredict}`"
+title: "Get started with epipredict"
 output: rmarkdown::html_vignette
 vignette: >
   %\VignetteIndexEntry{Get started with `{epipredict}`}
@@ -42,8 +42,8 @@ Towards that end, `{epipredict}` provides two main classes of tools:
 
 A set of basic, easy-to-use "canned" forecasters that work out of the box.
 We currently provide the following basic forecasters:
-
-* _Flatline forecaster_: predicts as the median the most recently seen value
+
+* _Flatline forecaster_: predicts as the median the most recently seen value
   with increasingly wide quantiles.
 * _Climatological forecaster_: predicts the median and quantiles based on the historical values around the same date in previous years.
 * _Autoregressive forecaster_: fits a model (e.g. linear regression) on
@@ -58,7 +58,7 @@ We currently provide the following basic forecasters:
 A framework for creating custom forecasters out of modular components, from
 which the canned forecasters were created. There are three types of
 components:
-
+
 * _Preprocessor_: transform the data before model training, such as converting
   counts to rates, creating smoothed columns, or [any `{recipes}`
   `step`](https://recipes.tidymodels.org/reference/index.html)
@@ -194,12 +194,12 @@ If you want to make further modifications, you will need a custom
 workflow; see the [Custom Epiworkflows vignette](custom_epiworkflows) for details.
 
 ## Generating multiple aheads
-Frequently, one doesn't want just a forecast for a single day, but a trajectory
-of forecasts for several weeks.
-We can do this with `arx_forecaster()` by looping over aheads; for
-example, to predict every day over a 4-week time period:
+We often want to generate a trajectory
+of forecasts over a range of dates, rather than for a single day.
+We can do this with `arx_forecaster()` by looping over aheads.
+For example, to predict every day over a 4-week time period:
 
-```{r temp-thing}
+```{r aheads-loop}
 all_canned_results <- lapply(
   seq(0, 28),
   \(days_ahead) {
@@ -261,7 +261,7 @@ autoplot(
 ### `cdc_baseline_forecaster()`
 
 This is a different method of generating a flatline forecast, used as a baseline
-for [COVID19ForecastHub](https://covid19forecasthub.org).
+for [the CDC COVID-19 Forecasting Hub](https://covid19forecasthub.org).
 
 ```{r make-cdc-forecast, warning=FALSE}
 all_cdc_flatline <-
@@ -284,17 +284,18 @@ autoplot(
 )
 ```
 
-The median is the same, but the quantiles are generated using
+`cdc_baseline_forecaster()` and `flatline_forecaster()` generate medians in the same way,
+but `cdc_baseline_forecaster()`'s quantiles are generated using
 `layer_cdc_flatline_quantiles()` instead of `layer_residual_quantiles()`.
-Both rely on the computing the quantiles of the residuals, but this model
-extrapolates the quantiles by repeatedly sampling the initial quantiles to
-generate the next quantiles.
+Both quantile-generating methods use the residuals to compute quantiles, but
+`layer_cdc_flatline_quantiles()` extrapolates the quantiles by repeatedly
+sampling the initial quantiles to generate the next set.
 This results in much smoother quantiles, but ones that only capture the
 one-ahead uncertainty.
 
 ### `climatological_forecaster()`
-A different kind of baseline, the `climatological_forecaster()` forecasts the
-point forecast and quantiles based on the historical values for this time of
+The `climatological_forecaster()` is a different kind of baseline. It produces a
+point forecast and quantiles based on the historical values for a given time of
 year, rather than extrapolating from recent values.
 For example, on the same dataset as above:
 ```{r make-climatological-forecast, warning=FALSE}
@@ -318,11 +319,12 @@ autoplot(
 )
 ```
 
-Note that we're using `covid_case_death_rates_extended` rather than
-`covid_case_death_rates`, since it starts in March of 2020 rather than December.
+Note that to have enough training data for this method, we're using
+`covid_case_death_rates_extended`, which starts in March 2020, rather than
+`covid_case_death_rates`, which starts in December.
 Without at least a year's worth of historical data, it is impossible to do a
 climatological model.
-Even with only one year as we have here the resulting forecasts are unreliable.
+Even with one year of data, as we have here, the resulting forecasts are unreliable.
 
 One feature of the climatological baseline is that it forecasts multiple aheads
 simultaneously.
@@ -331,10 +333,9 @@ smooth_quantile_reg()`, which is built to handle multiple aheads simultaneously.
 
 ### `arx_classifier()`
 
-The most complicated of the canned forecasters, `arx_classifier` first
-translates the outcome into a growth rate, and then classifies that growth rate
-into bins.
-For example, on the same dataset and `forecast_date` as above, we get:
+Unlike the other canned forecasters, `arx_classifier` predicts a binned growth rate.
+The forecaster converts the raw outcome variable into a growth rate, which it then bins and predicts, using bin thresholds provided by the user.
+For example, on the same dataset and `forecast_date` as above, this model outputs:
 
 ```{r discrete-rt}
 classifier <- arx_classifier(
@@ -352,14 +353,18 @@ classifier <- arx_classifier(
 classifier$predictions
 ```
 
-The prediction splits into 4 cases: `(-∞, -0.01)`, `(-0.01, 0.01)`, `(0.01,
-0.1)`, and `(0.1, ∞)`.
-In this case, the classifier put all 4 of the states in the same category,
-`(0.01, 0.1)`. **TODO** _effected by the old data._
-The number and size of the categories is controlled by `breaks`, which gives the
-boundary values.
+The number and size of the growth rate categories are controlled by `breaks`, which define the
+bin boundaries.
+
+In this example, the custom `breaks` passed to `arx_class_args_list()` correspond to 4 bins:
+`(-∞, -0.01)`, `(-0.01, 0.01)`, `(0.01, 0.1)`, and `(0.1, ∞)`.
+The bins can be interpreted as: the outcome variable is decreasing, approximately stable, slightly increasing, or increasing quickly.
 
-For comparison, the growth rates for the `target_date`, as computed using
+The returned `predictions` table assigns each state to one of the growth rate bins.
+In this case, the classifier expects the growth rate for all 4 of the states to fall into the same category,
+`(-0.01, 0.01]`.
+
+To see how this model performed, let's compare to the actual growth rates for the `target_date`, as computed using
 `{epiprocess}`:
 
 ```{r growth_rate_results}
@@ -373,24 +378,29 @@ growth_rates <- covid_case_death_rates |>
 growth_rates |> filter(time_value == "2021-08-14")
 ```
 
-Unfortunately, this forecast was not particularly accurate, since for example
-`-1.39` is not remotely in the interval `(-0.01, 0.01]`.
+Unfortunately, this forecast was not particularly accurate. All of the real growth rates fell outside the predicted bin, with California (real growth rate `-1.39`) not remotely in the interval `(-0.01, 0.01]`.
 
 
 ## Fitting multi-key panel data
 
-If you have multiple keys that are set in the `epi_df` as `other_keys`,
-`arx_forecaster` will automatically group by those as well.
-For example, predicting the number of graduates in each of the categories in `grad_employ` from above:
+If multiple keys are set in the `epi_df` as `other_keys`,
+`arx_forecaster` will automatically group by those in addition to the required geographic key.
+For example, predicting the number of graduates in each of the categories in `grad_employ_subset` from above:
 
 ```{r multi_key_forecast, warning=FALSE}
 # only fitting a subset, otherwise there are ~550 distinct pairs, which is bad for plotting
 edu_quals <- c("Undergraduate degree", "Professional degree")
 geo_values <- c("Quebec", "British Columbia")
-grad_forecast <- arx_forecaster(
-  grad_employ_subset |>
+
+grad_employ <- grad_employ_subset |>
   filter(time_value < 2017) |>
-  filter(edu_qual %in% edu_quals, geo_value %in% geo_values),
+  filter(edu_qual %in% edu_quals, geo_value %in% geo_values)
+
+grad_employ
+
+grad_forecast <- arx_forecaster(
+  grad_employ |>
+    filter(time_value < 2017),
   outcome = "num_graduates",
   predictors = c("num_graduates"),
   args_list = arx_args_list(
@@ -402,19 +412,20 @@ grad_forecast <- arx_forecaster(
 autoplot(
   grad_forecast$epi_workflow,
   grad_forecast$predictions,
-  grad_employ_subset |>
-    filter(edu_qual %in% edu_quals, geo_value %in% geo_values),
+  grad_employ,
 )
 ```
 
-The 8 graphs are all pairs of the `geo_values` (`"Quebec"` and `"British Columbia"`), `edu_quals` (`"Undergraduate degree"` and `"Professional degree"`), and age brackets (`"15 to 34 years"` and `"35 to 64 years"`).
+The 8 graphs represent all combinations of the `geo_values` (`"Quebec"` and `"British Columbia"`), `edu_quals` (`"Undergraduate degree"` and `"Professional degree"`), and age brackets (`"15 to 34 years"` and `"35 to 64 years"`).
 
 ## Fitting a non-geo-pooled model
 
-Because our internal methods fit a single model, to fit a non-geo-pooled model
-that has a different fit for each geography, one either needs a multi-level
-engine (which at the moment parsnip doesn't support), or one needs to map over
+The methods shown so far fit a single model across all geographic regions.
+This is called "geo-pooling".
+To fit a non-geo-pooled model, with a separate fit for each geography, one either needs a multi-level
+engine (which at the moment `{parsnip}` doesn't support), or one needs to loop over
 geographies.
+Here, we're using `purrr::map` to perform the loop.
 
 ```{r fit_non_geo_pooled, warning=FALSE}
 geo_values <- covid_case_death_rates |>
@@ -441,12 +452,11 @@ all_fits <-
 map_df(all_fits, ~ pluck(., "predictions"))
 ```
 
-This is both 56 times slower[^7], and uses far less data to fit each model.
-If the geographies are at all comparable, for example by normalization, we would
-get much better results by pooling.
+Fitting separate models for each geography is both 56 times slower[^7] than geo-pooling, and fits each model on far less data.
+If a dataset contains relatively few observations for each geography, fitting a geo-pooled model is likely to produce better, more stable results.
+However, geo-pooling can only be used if values are comparable in meaning and scale across geographies, or can be made comparable, for example by normalization.
 
-If we wanted to build a geo-aware model, such as one that sets the constant in a
-linear regression fit to be different for each geography, we would need to build a [Custom workflow](custom_epiworkflows) with geography as a factor.
+If we wanted to build a geo-aware model, such as a linear regression with a different intercept for each geography, we would need to build a [custom workflow](custom_epiworkflows) with geography as a factor.
 
 # Anatomy of a canned forecaster
 ## Code object
@@ -468,21 +478,21 @@ four_week_ahead <- arx_forecaster(
 
 `four_week_ahead` has two components: an `epi_workflow`, and a table of
 `predictions`.
-The table of predictions is simply a tibble of the predictions,
+The table of predictions is a simple tibble,
 
 ```{r show_predictions}
 four_week_ahead$predictions
 ```
 
-`.pred` gives the point/median prediction, while `.pred_distn` is a
+where `.pred` gives the point/median prediction, and `.pred_distn` is a
 `dist_quantiles()` object representing a distribution through various quantile
 levels.
 The `[6]` in the name refers to the number of quantiles that have been
-explicitly created[^4]; by default, this covers a 90% prediction interval, or 5%
-and 95%.
+explicitly created[^4].
+By default, `.pred_distn` covers a 90% prediction interval, reporting the 5% and 95% quantiles.
 
 The `epi_workflow` is a significantly more complicated object, extending a
-`workflows::workflow()` to include post-processing:
+`workflows::workflow()` to include post-processing steps:
 
 ```{r show_workflow}
 four_week_ahead$epi_workflow
@@ -491,17 +501,17 @@ four_week_ahead$epi_workflow
 An `epi_workflow()` consists of 3 parts:
 
 - `Preprocessor`: a collection of steps that transform the data to be ready for
-  modelling. They come from this package or [any of the recipes
-  steps](https://recipes.tidymodels.org/reference/index.html);
-  `four_week_ahead` has 5 of these, and you can inspect them more closely by
+  modelling. Steps can be custom, as are those included in this package,
+  or [be defined in `{recipes}`](https://recipes.tidymodels.org/reference/index.html).
+  `four_week_ahead` has 5 steps; you can inspect them more closely by
   running `hardhat::extract_recipe(four_week_ahead$epi_workflow)`.[^6]
 - `Model`: the actual model that does the fitting, given by a
-  `parsnip::model_spec`; `four_week_ahead` has the default of
-  `parsnip::linear_reg()`, which is a wrapper from `{parsnip}` for
+  `parsnip::model_spec`. `four_week_ahead` uses the default of
+  `parsnip::linear_reg()`, which is a `{parsnip}` wrapper for
   `stats::lm()`. You can inspect the model more closely by running
   `hardhat::extract_fit_recipe(four_week_ahead$epi_workflow)`.
 - `Postprocessor`: a collection of layers to be applied to the resulting
-  forecast, internal to this package. `four_week_ahead` just so happens to have
+  forecast. Layers are internal to this package. `four_week_ahead` just so happens to have
   5 of these as well. You can inspect the layers more closely by running
   `epipredict::extract_layers(four_week_ahead$epi_workflow)`.
 
@@ -510,7 +520,7 @@ extending `four_week_ahead` using the custom forecaster framework.
 
 ## Mathematical description
 
-Let's describe in more detail the actual fit model for a more minimal version of
+Let's look at the mathematical details of the model in more detail, using a minimal version of
 `four_week_ahead`:
 
 ```{r, four_week_again}
@@ -537,9 +547,9 @@ $$
 For example, $a_1$ is `lag_0_death_rate` above, with a value of `r hardhat::extract_fit_engine(four_week_small$epi_workflow)$coefficients["lag_0_death_rate"] `,
 while $a_5$ is `r hardhat::extract_fit_engine(four_week_small$epi_workflow)$coefficients["lag_7_case_rate"] `.
 
-The training data for fitting this linear model is created by creating a series
-of columns shifted by the appropriate amount; this makes it so that each row
-without `NA` values is a training point to fit the coefficients $a_0,\ldots, a_6$.
+The training data for fitting this linear model is constructed within the `arx_forecaster()` function by shifting a series
+of columns by the appropriate amount, based on the requested `lags`.
+Each row containing no `NA` values is used as a training observation to fit the coefficients $a_0,\ldots, a_6$.
 
 [^4]: in the case of a `{parsnip}` engine which doesn't explicitly predict
   quantiles, these quantiles are created using `layer_residual_quantiles()`,
