Commit 76ec2f4: use function auto-links

1 parent 5cf2339

File tree

15 files changed: +100 −92 lines changed


- _freeze/epipredict/execute-results/html.json (+2 −2, large diff not rendered)
- _freeze/flatline-forecaster/execute-results/html.json (+2 −2, large diff not rendered)
- _freeze/forecast-framework/execute-results/html.json (+2 −2, large diff not rendered)
- _freeze/preprocessing-and-models/execute-results/html.json (+2 −2, large diff not rendered)
- _freeze/sliding-forecasters/execute-results/html.json (+2 −2, large diff not rendered)
- _freeze/tidymodels-intro/execute-results/html.json (+2 −2, large diff not rendered)
- _freeze/tidymodels-regression/execute-results/html.json (+2 −2, large diff not rendered)

_quarto.yml (+1 −1)

```diff
@@ -54,4 +54,4 @@ format:
     sidebar-width: 400px
     body-width: 600px
     theme: [cosmo, delphi-epitools.scss]
-
+    code-link: true
```
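For context, the single line this commit adds is Quarto's `code-link` HTML option, which (via the downlit package) automatically turns function calls in rendered code blocks into links to their online documentation; this is why the other changed files can drop explicit markdown links around package and function names. A minimal sketch of the option in place, assuming the usual `format: html:` nesting (indentation and surrounding keys are illustrative, not copied from the repo):

```yaml
format:
  html:
    theme: [cosmo, delphi-epitools.scss]
    code-link: true   # auto-link functions in code via downlit
```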

epipredict.qmd (+13 −12)

````diff
@@ -39,39 +39,39 @@ There are four types of components:
 3. Predictor: make predictions, using a fitted model object and processed test data
 4. Postprocessor: manipulate or transform the predictions before returning
 
-Users familiar with [`{tidymodels}`](https://www.tidymodels.org) and especially
-the [`{workflows}`](https://workflows.tidymodels.org) package will notice a lot
+Users familiar with `{tidymodels}` and especially
+the `{workflows}` package will notice a lot
 of overlap. This is by design, and is in fact a feature. The truth is that
 `{epipredict}` is a wrapper around much that is contained in these packages.
 Therefore, if you want something from this -verse, it should "just work" (we hope).
 
-The reason for the overlap is that `{workflows}` _already implements_ the first
+The reason for the overlap is that `workflows` _already implements_ the first
 three steps. And it does this very well. However, it is missing the
 postprocessing stage and currently has no plans for such an implementation.
 And this feature is important. All forecasters need post-processing. Anything more complicated (which is nearly everything)
 needs this as well.
 
-The second omission from `{tidymodels}` is support for panel data. Besides
+The second omission from `tidymodels` is support for panel data. Besides
 epidemiological data, economics, psychology, sociology, and many other areas
-frequently deal with data of this type. So the framework of behind `{epipredict}`
+frequently deal with data of this type. So the framework of behind `epipredict`
 implements this. In principle, this has nothing to do with epidemiology, and
 one could simply use this package as a solution for the missing functionality in
-`{tidymodels}`. Again, this should "just work" (we hope).
+`tidymodels`. Again, this should "just work" (we hope).
 
 All of the _panel data_ functionality is implemented through the `epi_df` data type
 described in the previous part. If you have different panel data, just force it
 into an `epi_df` as described in @sec-additional-keys.
 
 ## Why doesn't this package already exist?
 
-- Parts of it actually DO exist. There's a universe called `{tidymodels}`. It
+- Parts of it actually DO exist. There's a universe called `tidymodels`. It
 handles pre-processing, training, and prediction, bound together, through a
-package called workflows. We built `{epipredict}` on top of that setup. In this
+package called workflows. We built `epipredict` on top of that setup. In this
 way, you CAN use almost everything they provide.
 - However, workflows doesn't do post-processing to the extent envisioned here.
-And nothing in `{tidymodels}` handles panel data.
+And nothing in `tidymodels` handles panel data.
 - The tidy-team doesn't have plans to do either of these things. (We checked).
-- There are two packages that do time series built on `{tidymodels}`, but it's
+- There are two packages that do time series built on `tidymodels`, but it's
 "basic" time series: 1-step AR models, exponential smoothing, STL decomposition,
 etc.[^1]
 
@@ -94,6 +94,7 @@ in the built-in data frame).
 jhu <- case_death_rate_subset %>%
   filter(time_value >= max(time_value) - 30)
 
+library(epipredict)
 out <- arx_forecaster(
   jhu,
   outcome = "death_rate",
@@ -128,7 +129,7 @@ By default, the forecaster predicts the outcome (`death_rate`) 1-week ahead,
 using 3 lags of each predictor (`case_rate` and `death_rate`) at 0 (today),
 1 week back and 2 weeks back. The predictors and outcome can be changed
 directly. The rest of the defaults are encapsulated into a list of arguments.
-This list is produced by `arx_args_list()`.
+This list is produced by `arx_args_list()`.
 
 ## Simple adjustments
 
@@ -197,7 +198,7 @@ arx_args_list(
 
 So far, our forecasts have been produced using simple linear regression. But this is not the only way to estimate such a model.
 The `trainer` argument determines the type of model we want.
-This takes a [`{parsnip}`](https://parsnip.tidymodels.org) model. The default is linear regression, but we could instead use a random forest with the `{ranger}` package:
+This takes a `{parsnip}` model. The default is linear regression, but we could instead use a random forest with the `{ranger}` package:
 
 ```{r ranger, warning = FALSE}
 out_rf <- arx_forecaster(jhu, "death_rate", c("case_rate", "death_rate"),
````

flatline-forecaster.qmd (+3 −2)

````diff
@@ -13,14 +13,14 @@ source("_common.R")
 
 
 We will continue to use the `case_death_rate_subset` dataset that comes with the
-`epipredict` package. In brief, this is a subset of the JHU daily COVID-19 cases
+`{epipredict}` package. In brief, this is a subset of the JHU daily COVID-19 cases
 and deaths by state. While this dataset ranges from Dec 31, 2020 to Dec 31,
 2021, we will only consider a small subset at the end of that range to keep our
 example relatively simple.
 
 ```{r}
 jhu <- case_death_rate_subset %>%
-  dplyr::filter(time_value >= as.Date("2021-09-01"))
+  filter(time_value >= as.Date("2021-09-01"))
 
 jhu
 ```
@@ -32,6 +32,7 @@ eath rate one week into the future, is to input the `epi_df` and the name of
 the column from it that we want to predict in the `flatline_forecaster` function.
 
 ```{r}
+library(epipredict)
 one_week_ahead <- flatline_forecaster(jhu, outcome = "death_rate")
 one_week_ahead
 ```
````

forecast-framework.qmd (+8 −7)

````diff
@@ -21,6 +21,7 @@ to examine the data and an estimated canned corecaster.
 
 
 ```{r demo-workflow}
+library(epipredict)
 jhu <- case_death_rate_subset %>%
   filter(time_value >= max(time_value) - 30)
 
@@ -31,18 +32,18 @@ out_gb <- arx_forecaster(jhu, "death_rate", c("case_rate", "death_rate"),
 ## Preprocessing
 
 Preprocessing is accomplished through a `recipe` (imagine baking a cake) as
-provided in the [`{recipes}`](https://recipes.tidymodels.org) package.
+provided in the `{recipes}` package.
 We've made a few modifications (to handle
 panel data) as well as added some additional options. The recipe gives a
 specification of how to handle training data. Think of it like a fancified
 `formula` that you would pass to `lm()`: `y ~ x1 + log(x2)`. In general,
-there are 2 extensions to the `formula` that `{recipes}` handles:
+there are 2 extensions to the `formula` that `recipes` handles:
 
 1. Doing transformations of both training and test data that can always be
 applied. These are things like taking the log of a variable, leading or
 lagging, filtering out rows, handling dummy variables, etc.
 2. Using statistics from the training data to eventually process test data.
-This is a major benefit of `{recipes}`. It prevents what the tidy team calls
+This is a major benefit of `recipes`. It prevents what the tidy team calls
 "data leakage". A simple example is centering a predictor by its mean. We
 need to store the mean of the predictor from the training data and use that
 value on the test data rather than accidentally calculating the mean of
@@ -88,7 +89,7 @@ er <- epi_recipe(jhu) %>%
   step_epi_naomit()
 ```
 
-While `{recipes}` provides a function `step_lag()`, it assumes that the data
+While `recipes` provides a function `step_lag()`, it assumes that the data
 has no breaks in the sequence of `time_values`. This is a bit dangerous, so
 we avoid that behaviour. Our `lag/ahead` functions also appropriately adjust the
 amount of data to avoid accidentally dropping recent predictors from the test
@@ -97,9 +98,9 @@ data.
 ## The model specification
 
 Users familiar with the `{parsnip}` package will have no trouble here.
-Basically, `{parsnip}` unifies the function signature across statistical models.
+Basically, `parsnip` unifies the function signature across statistical models.
 For example, `lm()` "likes" to work with formulas, but `glmnet::glmnet()` uses
-`x` and `y` for predictors and response. `{parsnip}` is agnostic. Both of these
+`x` and `y` for predictors and response. `parsnip` is agnostic. Both of these
 do "linear regression". Above we switched from `lm()` to `xgboost()` without
 any issue despite the fact that these functions couldn't be more different.
 
@@ -109,7 +110,7 @@ lm(
   model = TRUE, x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE,
   contrasts = NULL, offset, ...)
 
-xgboost(
+xgboost::xgboost(
   data = NULL, label = NULL, missing = NA, weight = NULL,
   params = list(), nrounds, verbose = 1, print_every_n = 1L,
   early_stopping_rounds = NULL, maximize = NULL, save_period = NULL,
````

preprocessing-and-models.qmd (+21 −25)

````diff
@@ -10,19 +10,16 @@ source("_common.R")
 ## Introduction
 
 The `{epipredict}` package uses the `{tidymodels}` framework, namely
-[`{recipes}`](https://recipes.tidymodels.org/) for
-[dplyr](https://dplyr.tidyverse.org/)-like pipeable sequences
-of feature engineering and [`{parsnip}`](https://parsnip.tidymodels.org/)
-for a unified interface to a range of models.
+`{recipes}` for `{dplyr}`-like pipeable sequences of feature engineering and `{parsnip}` for a unified interface to a range of models.
 
-`{epipredict}` has additional customized feature engineering and preprocessing
+`epipredict` has additional customized feature engineering and preprocessing
 steps that specifically work with panel data in this context, for example,
 `step_epi_lag()`, `step_population_scaling()`,
 `step_epi_naomit()`. They can be used along with most
-steps from the `{recipes}` package for more feature engineering.
+steps from the `recipes` package for more feature engineering.
 
-In this vignette, we will illustrate some examples of how to use `{epipredict}`
-with `{recipes}` and `{parsnip}` for different purposes of
+In this vignette, we will illustrate some examples of how to use `epipredict`
+with `recipes` and `parsnip` for different purposes of
 epidemiological forecasting.
 We will focus on basic autoregressive models, in which COVID cases and
 deaths in the near future are predicted using a linear combination of cases
@@ -40,6 +37,7 @@ library(epipredict)
 library(recipes)
 library(workflows)
 library(poissonreg)
+library(epidatasets)
 ```
 
 ## Poisson Regression
@@ -53,12 +51,10 @@ deploying control measures.
 One of the outcomes that the CDC forecasts is [death counts from COVID-19](https://www.cdc.gov/coronavirus/2019-ncov/science/forecasting/forecasting-us.html).
 Although there are many state-of-the-art models, we choose to use Poisson
 regression, the textbook example for modeling count data, as an illustration
-for using the `{epipredict}` package with other existing `{tidymodels}` packages.
+for using the `epipredict` package with other existing `tidymodels` packages.
 
 The (folded) code below gives the necessary commands to download this data
-from the Delphi Epidata API, but it is also built into the
-[`{epidatasets}`](https://cmu-delphi.github.io/epidatasets/reference/counts_subset.html)
-package.
+`counts_subset` from the Delphi Epidata API, but it is also built into the `{epidatasets}` package.
 
 ```{r poisson-reg-data}
 #| eval: false
@@ -92,7 +88,7 @@ counts_subset <- full_join(x, y, by = c("geo_value", "time_value")) %>%
 data(counts_subset, package = "epidatasets")
 ```
 
-The `counts_subset` dataset
+The `epidatasets::counts_subset` dataset
 contains the number of confirmed cases and deaths from June 4, 2021 to
 Dec 31, 2021 in some U.S. states.
 
@@ -113,11 +109,11 @@ $s_{\text{state}}$ are dummy variables for each state and take values of either
 0 or 1.
 
 Preprocessing steps will be performed to prepare the
-data for model fitting. But before diving into them, it will be helpful to understand what `roles` are in the `{recipes}` framework.
+data for model fitting. But before diving into them, it will be helpful to understand what `roles` are in the `recipes` framework.
 
 ---
 
-#### Aside on `{recipes}` {.unnumbered}
+#### Aside on `recipes` {.unnumbered}
 
 `{recipes}` can assign one or more roles to each column in the data. The roles
 are not restricted to a predefined set; they can be anything.
@@ -133,7 +129,7 @@ that are unique to the `epipredict` package. Since we work with `epi_df`
 objects, all datasets should have `geo_value` and `time_value` passed through
 automatically with these two roles assigned to the appropriate columns in the data.
 
-The `{recipes}` package also allows [manual alterations of roles](https://recipes.tidymodels.org/reference/roles.html)
+The `recipes` package also allows [manual alterations of roles](https://recipes.tidymodels.org/reference/roles.html)
 in bulk. There are a few handy functions that can be used together to help us
 manipulate variable roles easily.
 
@@ -194,8 +190,8 @@ extract_fit_engine(wf)
 ```
 
 Alternative forms of Poisson regression or particular computational approaches
-can be applied via arguments to `parsnip::poisson_reg()` for some common
-settings, and by using `parsnip::set_engine()` to use a specific Poisson
+can be applied via arguments to `poisson_reg()` for some common
+settings, and by using `set_engine()` to use a specific Poisson
 regression engine and to provide additional engine-specific customization.
 
 
@@ -207,8 +203,8 @@ However, the Delphi Group preferred to train on rate data instead, because it
 puts different locations on a similar scale (eliminating the need for location-specific intercepts).
 We can use a linear regression to predict the death rates and use state
 population data to scale the rates to counts.[^pois] We will do so using
-`layer_population_scaling()` from the `{epipredict}` package. (We could also use
-`step_population_scaling()` from the `{epipredict}` package to prepare rate data
+`layer_population_scaling()` from the `epipredict` package. (We could also use
+`step_population_scaling()` to prepare rate data
 from count data in the preprocessing recipe.)
 
 [^pois]: We could continue with the Poisson model, but we'll switch to the Gaussian likelihood just for simplicity.
@@ -263,7 +259,7 @@ pop_dat <- state_census %>% select(abbr, pop)
 ```
 
 State-wise population data from the 2019 U.S. Census is
-available from `{epipredict}` and will be used in `layer_population_scaling()`.
+available from `epipredict` and will be used in `layer_population_scaling()`.
 
 
 
@@ -296,9 +292,9 @@ jhu <- filter(
 )
 ```
 
-Preprocessing steps will again rely on functions from the `{epipredict}` package
-as well as the `{recipes}` package.
-There are also many functions in the `{recipes}` package that allow for
+Preprocessing steps will again rely on functions from the `epipredict` package
+as well as the `recipes` package.
+There are also many functions in the `recipes` package that allow for
 [scalar transformations](https://recipes.tidymodels.org/reference/#step-functions-individual-transformations),
 such as log transformations and data centering. In our case, we will
 center the numerical predictors to allow for a more meaningful interpretation of
@@ -437,7 +433,7 @@ $$
 
 Preprocessing steps are similar to the previous models with an additional step
 of categorizing the response variables. Again, we will use a subset of death rate and case rate data from our built-in dataset
-`case_death_rate_subset`.
+`epipredict::case_death_rate_subset`.
 ```{r}
 jhu_rates <- case_death_rate_subset %>%
   dplyr::filter(
````

sliding-forecasters.qmd (+3 −2)

````diff
@@ -9,7 +9,7 @@ source("_common.R")
 
 A key function from the epiprocess package is `epi_slide()`, which allows the
 user to apply a function or formula-based computation over variables in an
-`epi_df` over a running window of `n` time steps (see the following `{epiprocess}`
+`epi_df` over a running window of `n` time steps (see the following `epiprocess`
 vignette to go over the basics of the function: ["Slide a computation over
 signal values"](https://cmu-delphi.github.io/epiprocess/articles/slide.html)).
 The equivalent sliding method for an `epi_archive` object can be called by using
@@ -41,6 +41,7 @@ version of each observation can be carried forward to extrapolate unavailable
 versions for the less up-to-date input archive.
 
 ```{r grab-epi-data}
+library(epipredict)
 us_raw_history_dfs <-
   readRDS(system.file("extdata", "all_states_covidcast_signals.rds",
     package = "epipredict", mustWork = TRUE))
@@ -242,7 +243,7 @@ ggplot(can_fc %>% filter(engine_type == "xgboost"),
 Both approaches tend to produce quite volatile forecasts (point predictions)
 and/or are overly confident (very narrow bands), particularly when boosted
 regression trees are used. But as this is meant to be a simple demonstration of
-sliding with different engines in `arx_forecaster`, we may devote another
+sliding with different engines in `arx_forecaster()`, we may devote another
 vignette to work on improving the predictive modelling using the suite of tools
 available in epipredict.
````