Commit 5cf2339 (parent b630f19)

fix partial #8

File tree

14 files changed: +93 −94 lines changed


- `_freeze/epipredict/execute-results/html.json` (+2 −2; large diff not rendered)
- `_freeze/flatline-forecaster/execute-results/html.json` (+2 −2; large diff not rendered)
- `_freeze/forecast-framework/execute-results/html.json` (+2 −2; large diff not rendered)
- `_freeze/preprocessing-and-models/execute-results/html.json` (+2 −2; large diff not rendered)
- `_freeze/sliding-forecasters/execute-results/html.json` (+2 −2; large diff not rendered)
- `_freeze/tidymodels-intro/execute-results/html.json` (+2 −2; large diff not rendered)
- `_freeze/tidymodels-regression/execute-results/html.json` (+2 −2; large diff not rendered)

epipredict.qmd (+16 −16)
@@ -10,21 +10,22 @@ At a high level, our goal with `{epipredict}` is to make running simple machine
 Serving both populations is the main motivation for our efforts, but at the same time, we have tried hard to make it useful.


-## Baseline models
+## Canned forecasters

-We provide a set of basic, easy-to-use forecasters that work out of the box.
-You should be able to do a reasonably limited amount of customization on them. Any serious customization happens with the framework discussed below.
+We provide a set of basic, easy-to-use forecasters that work out of the box:

-For the basic forecasters, we provide:
-
 * Flatline (basic) forecaster
 * Autoregressive forecaster
 * Autoregressive classifier
 * Smooth autoregressive(AR) forecaster

-All the forcasters we provide are built on our framework. So we will use these basic models to illustrate its flexibility.
+These forecasters encapsulate a series of operations (data preprocessing, model fitting, etc.) in instant one-liners.
+They are essentially alternatives to one another; the main difference is the model each uses. Three use regression models and one uses a classification model.
+
+The operations within the canned forecasters all follow our uniform **framework**.
+While these one-liners allow a limited amount of customization, any serious customization requires the framework explained in @sec-framework.

-## Forecasting framework
+## Forecasting framework {#sec-framework}

 At its core, `{epipredict}` is a **framework** for creating custom forecasters.
 By that we mean that we view the process of creating custom forecasters as
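For orientation (an editor's sketch, not part of this commit): the canned forecasters really are one-liners. Assuming `{epipredict}` is installed and `jhu` is the `epi_df` used later in this chapter, an autoregressive forecast looks roughly like:

```r
library(epipredict)

# Sketch: forecast death_rate from lagged case_rate and death_rate.
# Defaults (lags, ahead, quantile levels) are assumed to come from
# the forecaster's args_list; adjust to taste.
out <- arx_forecaster(
  jhu,
  outcome = "death_rate",
  predictors = c("case_rate", "death_rate")
)

out$predictions  # point forecasts plus a prediction interval per geo_value
```

The single call wraps preprocessing, model fitting, prediction, and postprocessing, which is the "series of operations" the revised text above refers to.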
@@ -47,8 +48,7 @@ Therefore, if you want something from this -verse, it should "just work" (we hope).
 The reason for the overlap is that `{workflows}` _already implements_ the first
 three steps. And it does this very well. However, it is missing the
 postprocessing stage and currently has no plans for such an implementation.
-And this feature is important. The baseline forecaster we provide _requires_
-postprocessing. Anything more complicated (which is nearly everything)
+And this feature is important. All forecasters need post-processing. Anything more complicated (which is nearly everything)
 needs this as well.

 The second omission from `{tidymodels}` is support for panel data. Besides
@@ -64,14 +64,14 @@ into an `epi_df` as described in @sec-additional-keys.

 ## Why doesn't this package already exist?

-- Parts of it actually DO exist. There's a universe called `tidymodels`. It
+- Parts of it actually DO exist. There's a universe called `{tidymodels}`. It
   handles pre-processing, training, and prediction, bound together, through a
-  package called workflows. We built `epipredict` on top of that setup. In this
+  package called workflows. We built `{epipredict}` on top of that setup. In this
   way, you CAN use almost everything they provide.
 - However, workflows doesn't do post-processing to the extent envisioned here.
-  And nothing in `tidymodels` handles panel data.
+  And nothing in `{tidymodels}` handles panel data.
 - The tidy-team doesn't have plans to do either of these things. (We checked).
-- There are two packages that do time series built on `tidymodels`, but it's
+- There are two packages that do time series built on `{tidymodels}`, but it's
   "basic" time series: 1-step AR models, exponential smoothing, STL decomposition,
   etc.[^1]

@@ -101,7 +101,7 @@ out <- arx_forecaster(
 )
 ```

-This call produces a warning, which we'll ignore for now. But essentially, it's telling us that our data comes from May 2022 but we're trying to do a forecast for January 2022. The result is likely not an accurate measure of real-time forecast performance, because the data have been revised over time.
+This call produces a warning, which we'll ignore for now. But essentially, it's telling us that our data comes from May 2022 but we're trying to do a forecast for January 2022. The result is likely not an accurate measure of real-time forecast performance, because the data has been revised over time.

 ```{r}
 out
@@ -115,7 +115,7 @@ of what the predictions are for. It contains three main components:
 ```{r}
 str(out$metadata)
 ```
-2. The predictions in a tibble. The columns give the predictions for each location along with additional columns. By default, these are a 90% predictive interval, the `forecast_date` (the date on which the forecast was putatively made) and the `target_date` (the date for which the forecast is being made).
+2. The predictions in a tibble. The columns give the predictions for each location along with additional columns. By default, these are a 90% prediction interval, the `forecast_date` (the date on which the forecast was putatively made) and the `target_date` (the date for which the forecast is being made).
 ```{r}
 out$predictions
 ```
@@ -159,7 +159,7 @@ likely increase the variance of the model, and therefore, may lead to less
 accurate forecasts for the variable of interest.


-Another property of the basic model is the predictive interval. We describe this in more detail in a coming chapter, but it is easy to request multiple quantiles.
+Another property of the basic model is the prediction interval. We describe this in more detail in a coming chapter, but it is easy to request multiple quantiles.

 ```{r differential-levels}
 out_q <- arx_forecaster(jhu, "death_rate", c("case_rate", "death_rate"),

flatline-forecaster.qmd (+5 −5)
@@ -1,6 +1,6 @@
 # Introducing the flatline forecaster

-The flatline forecaster is a very simple forecasting model intended for `epi_df` data, where the most recent observation is used as the forecast for any future date. In other words, the last observation is propagated forward. Hence, a flat line phenomenon is observed for the point predictions. The predictive intervals are produced from the quantiles of the residuals of such a forecast over all of the training data. By default, these intervals will be obtained separately for each combination of keys (`geo_value` and any additional keys) in the `epi_df`. Thus, the output is a data frame of point (and optionally interval) forecasts at a single unique horizon (`ahead`) for each unique combination of key variables. This forecaster is comparable to the baseline used by the [COVID Forecast Hub](https://covid19forecasthub.org).
+The flatline forecaster is a very simple forecasting model intended for `epi_df` data, where the most recent observation is used as the forecast for any future date. In other words, the last observation is propagated forward. Hence, a flat line phenomenon is observed for the point predictions. The prediction intervals are produced from the quantiles of the residuals of such a forecast over all of the training data. By default, these intervals will be obtained separately for each combination of keys (`geo_value` and any additional keys) in the `epi_df`. Thus, the output is a data frame of point (and optionally interval) forecasts at a single unique horizon (`ahead`) for each unique combination of key variables. This forecaster is comparable to the baseline used by the [COVID Forecast Hub](https://covid19forecasthub.org).

 ## Example of using the flatline forecaster

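As a sketch for orientation (editor-added; `flatline_args_list()` and its `ahead` argument are assumed from `{epipredict}`'s documented interface), the `five_days_ahead` object referenced in the hunks below can be produced along these lines:

```r
library(epipredict)

# Propagate the last observation forward five days; prediction intervals
# come from quantiles of the training residuals (assumed defaults).
five_days_ahead <- flatline_forecaster(
  jhu,
  outcome = "death_rate",
  args_list = flatline_args_list(ahead = 5)
)

five_days_ahead$predictions
```

Because the point forecast is just the last observation, all the modeling effort is in the residual-quantile intervals, which are computed per key combination as described above.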
@@ -55,8 +55,8 @@ five_days_ahead <- flatline_forecaster(
 five_days_ahead
 ```

-We could also specify that we want a 80% predictive interval by changing the
-levels. The default 0.05 and 0.95 levels/quantiles give us 90% predictive
+We could also specify that we want an 80% prediction interval by changing the
+levels. The default 0.05 and 0.95 levels/quantiles give us a 90% prediction
 interval.

 ```{r}
@@ -117,15 +117,15 @@ extract_frosting(five_days_ahead$epi_workflow)
 ```


-The post-processing operations in the order that were performed were to create the predictions and the predictive intervals, add the forecast and target dates and bound the predictions at zero.
+The post-processing operations, in the order they were performed, were to create the predictions and the prediction intervals, add the forecast and target dates, and bound the predictions at zero.

 We can also easily examine the predictions themselves.

 ```{r}
 five_days_ahead$predictions
 ```

-The results above show a distributional forecast produced using data through the end of 2021 for the January 5, 2022. A prediction for the death rate per 100K inhabitants along with a 95% predictive interval is available for every state (`geo_value`).
+The results above show a distributional forecast, produced using data through the end of 2021, for January 5, 2022. A prediction for the death rate per 100K inhabitants along with a 95% prediction interval is available for every state (`geo_value`).

 The figure below displays the prediction and prediction interval for three sample states: Arizona, New York, and Florida.

forecast-framework.qmd (+1 −1)
@@ -89,7 +89,7 @@ er <- epi_recipe(jhu) %>%
 ```

 While `{recipes}` provides a function `step_lag()`, it assumes that the data
-have no breaks in the sequence of `time_values`. This is a bit dangerous, so
+has no breaks in the sequence of `time_values`. This is a bit dangerous, so
 we avoid that behaviour. Our `lag/ahead` functions also appropriately adjust the
 amount of data to avoid accidentally dropping recent predictors from the test
 data.
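To make that contrast concrete, here is an editor's sketch (assuming the `jhu` `epi_df` from this chapter; lag and ahead values are illustrative) of the gap-aware lag/ahead steps in an `epi_recipe`:

```r
library(epipredict)

# step_epi_lag()/step_epi_ahead() respect breaks in time_value, unlike
# recipes::step_lag(); step_epi_naomit() then drops rows made NA by lagging.
r <- epi_recipe(jhu) %>%
  step_epi_lag(case_rate, death_rate, lag = c(0, 7, 14)) %>%
  step_epi_ahead(death_rate, ahead = 7) %>%
  step_epi_naomit()
```

The same recipe also adjusts how much recent data is retained, so test-time predictors are not silently dropped.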

preprocessing-and-models.qmd (+16 −15)
@@ -2,26 +2,27 @@

 ```{r}
 #| echo: false
+#| warning: false
 source("_common.R")
 ```


 ## Introduction

-The `epipredict` package uses the `tidymodels` framework, namely
+The `{epipredict}` package uses the `{tidymodels}` framework, namely
 [`{recipes}`](https://recipes.tidymodels.org/) for
 [dplyr](https://dplyr.tidyverse.org/)-like pipeable sequences
 of feature engineering and [`{parsnip}`](https://parsnip.tidymodels.org/)
 for a unified interface to a range of models.

-`epipredict` has additional customized feature engineering and preprocessing
+`{epipredict}` has additional customized feature engineering and preprocessing
 steps that specifically work with panel data in this context, for example,
 `step_epi_lag()`, `step_population_scaling()`,
 `step_epi_naomit()`. They can be used along with most
 steps from the `{recipes}` package for more feature engineering.

-In this vignette, we will illustrate some examples of how to use `epipredict`
-with `recipes` and `parsnip` for different purposes of
+In this vignette, we will illustrate some examples of how to use `{epipredict}`
+with `{recipes}` and `{parsnip}` for different purposes of
 epidemiological forecasting.
 We will focus on basic autoregressive models, in which COVID cases and
 deaths in the near future are predicted using a linear combination of cases
@@ -52,7 +53,7 @@ deploying control measures.
 One of the outcomes that the CDC forecasts is [death counts from COVID-19](https://www.cdc.gov/coronavirus/2019-ncov/science/forecasting/forecasting-us.html).
 Although there are many state-of-the-art models, we choose to use Poisson
 regression, the textbook example for modeling count data, as an illustration
-for using the `epipredict` package with other existing `{tidymodels}` packages.
+for using the `{epipredict}` package with other existing `{tidymodels}` packages.

 The (folded) code below gives the necessary commands to download this data
 from the Delphi Epidata API, but it is also built into the
@@ -112,13 +113,13 @@ $s_{\text{state}}$ are dummy variables for each state and take values of either
 0 or 1.

 Preprocessing steps will be performed to prepare the
-data for model fitting. But before diving into them, it will be helpful to understand what `roles` are in the `recipes` framework.
+data for model fitting. But before diving into them, it will be helpful to understand what `roles` are in the `{recipes}` framework.

 ---

-#### Aside on `recipes` {.unnumbered}
+#### Aside on `{recipes}` {.unnumbered}

-`recipes` can assign one or more roles to each column in the data. The roles
+`{recipes}` can assign one or more roles to each column in the data. The roles
 are not restricted to a predefined set; they can be anything.
 For most conventional situations, they are typically “predictor” and/or
 "outcome". Additional roles enable targeted `step_*()` operations on specific
@@ -132,7 +133,7 @@ that are unique to the `epipredict` package. Since we work with `epi_df`
 objects, all datasets should have `geo_value` and `time_value` passed through
 automatically with these two roles assigned to the appropriate columns in the data.

-The `recipes` package also allows [manual alterations of roles](https://recipes.tidymodels.org/reference/roles.html)
+The `{recipes}` package also allows [manual alterations of roles](https://recipes.tidymodels.org/reference/roles.html)
 in bulk. There are a few handy functions that can be used together to help us
 manipulate variable roles easily.

@@ -170,7 +171,7 @@ r <- epi_recipe(counts_subset) %>%
   step_epi_naomit()
 ```

-After specifying the preprocessing steps, we will use the `parsnip` package for
+After specifying the preprocessing steps, we will use the `{parsnip}` package for
 modeling and producing the prediction for death count, 7 days after the
 latest available date in the dataset.

@@ -206,8 +207,8 @@ However, the Delphi Group preferred to train on rate data instead, because it
 puts different locations on a similar scale (eliminating the need for location-specific intercepts).
 We can use a linear regression to predict the death rates and use state
 population data to scale the rates to counts.[^pois] We will do so using
-`layer_population_scaling()` from the `epipredict` package. (We could also use
-`step_population_scaling()` from the `epipredict` package to prepare rate data
+`layer_population_scaling()` from the `{epipredict}` package. (We could also use
+`step_population_scaling()` from the `{epipredict}` package to prepare rate data
 from count data in the preprocessing recipe.)

 [^pois]: We could continue with the Poisson model, but we'll switch to the Gaussian likelihood just for simplicity.
@@ -295,9 +296,9 @@ jhu <- filter(
 )
 ```

-Preprocessing steps will again rely on functions from the `epipredict` package
-as well as the `recipes` package.
-There are also many functions in the `recipes` package that allow for
+Preprocessing steps will again rely on functions from the `{epipredict}` package
+as well as the `{recipes}` package.
+There are also many functions in the `{recipes}` package that allow for
 [scalar transformations](https://recipes.tidymodels.org/reference/#step-functions-individual-transformations),
 such as log transformations and data centering. In our case, we will
 center the numerical predictors to allow for a more meaningful interpretation of

sliding-forecasters.qmd (+4 −3)
@@ -2,13 +2,14 @@

 ```{r}
 #| echo: false
+#| warning: false
 source("_common.R")
 ```


 A key function from the epiprocess package is `epi_slide()`, which allows the
 user to apply a function or formula-based computation over variables in an
-`epi_df` over a running window of `n` time steps (see the following `epiprocess`
+`epi_df` over a running window of `n` time steps (see the following `{epiprocess}`
 vignette to go over the basics of the function: ["Slide a computation over
 signal values"](https://cmu-delphi.github.io/epiprocess/articles/slide.html)).
 The equivalent sliding method for an `epi_archive` object can be called by using
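For orientation, a minimal `epi_slide()` sketch (editor-added; the window argument's name has varied across `{epiprocess}` releases, e.g. `before` in older versions vs. `.window_size` more recently, so adjust for your installed version):

```r
library(epiprocess)
library(dplyr)

# Trailing 7-day average of case_rate, computed separately per geo_value.
jhu %>%
  group_by(geo_value) %>%
  epi_slide(case_rate_7dav = mean(case_rate), .window_size = 7) %>%
  ungroup()
```

Sliding a full forecaster instead of a simple mean is the same pattern: the computation passed to `epi_slide()` just returns forecasts rather than a summary statistic.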
@@ -149,13 +150,13 @@ model.[^1]

 ### Example using case data from Canada

-By leveraging the flexibility of `epiprocess`, we can apply the same techniques
+By leveraging the flexibility of `{epiprocess}`, we can apply the same techniques
 to data from other sources. Since some collaborators are in British Columbia,
 Canada, we'll do essentially the same thing for Canada as we did above.

 The [COVID-19 Canada Open Data Working Group](https://opencovid.ca/) collects
 daily time series data on COVID-19 cases, deaths, recoveries, testing and
-vaccinations at the health region and province levels. Data are collected from
+vaccinations at the health region and province levels. Data is collected from
 publicly available sources such as government datasets and news releases.
 Unfortunately, there is no simple versioned source, so we have created our own
 from the Github commit history.
