88[ ![ R-CMD-check] ( https://github.com/cmu-delphi/epipredict/actions/workflows/R-CMD-check.yaml/badge.svg )] ( https://github.com/cmu-delphi/epipredict/actions/workflows/R-CMD-check.yaml )
99<!-- badges: end -->
1010
11- Epipredict is a framework for building transformation and forecasting
11+ `{epipredict}` is a framework for building transformation and forecasting
1212pipelines for epidemiological and other panel time-series datasets. In
1313addition to tools for building forecasting pipelines, it contains a
1414number of “canned” forecasters meant to run with little modification as
1515an easy way to get started forecasting.
1616
1717It is designed to work well with
18- [ ` epiprocess ` ] ( https://cmu-delphi.github.io/epiprocess/ ) , a utility for
19- handling various time series and geographic processing tools in an
18+ [`{epiprocess}`](https://cmu-delphi.github.io/epiprocess/), a utility for
19+ time series handling and geographic processing in an
2020epidemiological context. Both of the packages are meant to work well
2121with the panel data provided by
22- [ ` epidatr ` ] ( https://cmu-delphi.github.io/epidatr/ ) .
22+ [`{epidatr}`](https://cmu-delphi.github.io/epidatr/).
23+ Pre-compiled example datasets are also available in [`{epidatasets}`](https://cmu-delphi.github.io/epidatasets/).
2324
24- If you are looking for more detail beyond the package documentation, see
25- our [ forecasting
26- book] ( https://cmu-delphi.github.io/delphi-tooling-book/ ) .
25+ If you are looking for detail beyond the package documentation, see
26+ our [forecasting book](https://cmu-delphi.github.io/delphi-tooling-book/).
2727
2828## Installation
2929
30- To install (unless you’re planning on contributing to package
31- development, we suggest using the stable version):
30+ Unless you’re planning on contributing to package
31+ development, we suggest using the stable version.
32+ To install, run:
3233
3334``` r
3435# Stable version
@@ -44,25 +45,32 @@ is at <https://cmu-delphi.github.io/epipredict/dev>.
4445
4546## Motivating example
4647
47- To demonstrate the kind of forecast epipredict can make, say we’re
48- predicting COVID deaths per 100k for each state on
48+ To demonstrate the kind of forecast `{epipredict}` can make, say we want to
49+ predict COVID-19 deaths per 100k people for each state on 2021-08-01.
4950
5051``` r
52+ library(epipredict)
53+ library(epidatr)
54+ library(epiprocess)
55+ library(dplyr)
56+ library(ggplot2)
57+
5158forecast_date <- as.Date("2021-08-01")
5259```
5360
5461Below the fold, we construct this dataset as an ` epiprocess::epi_df `
55- from JHU data.
62+ from [Johns Hopkins Center for Systems Science and Engineering deaths data](https://cmu-delphi.github.io/delphi-epidata/api/covidcast-signals/jhu-csse.html).
5663
5764<details >
5865<summary >
5966Creating the dataset using ` {epidatr} ` and ` {epiprocess} `
6067</summary >
6168
62- This dataset can be found in the package as ` covid_case_death_rates ` ; we
63- demonstrate some of the typically ubiquitous cleaning operations needed
64- to be able to forecast. First we pull both jhu-csse cases and deaths
65- from [ ` {epidatr} ` ] ( https://cmu-delphi.github.io/epidatr/ ) package:
69+ This section demonstrates some of the common cleaning operations needed
70+ before forecasting.
71+ The dataset prepared here is also included ready-to-go in `{epipredict}` as `covid_case_death_rates`.
72+
73+ First we pull both `jhu-csse` cases and deaths data from the [Delphi API](https://cmu-delphi.github.io/delphi-epidata/api/covidcast.html) using the [`{epidatr}`](https://cmu-delphi.github.io/epidatr/) package:
6674
6775``` r
6876cases <- pub_covidcast(
@@ -87,7 +95,7 @@ deaths <- pub_covidcast(
8795```
8896
8997Since visualizing the results on every geography is somewhat
90- overwhelming, we’ll only train on a subset of 5 .
98+ overwhelming, we’ll only train on a subset of locations.
9199
92100``` r
93101used_locations <- c("ca", "ma", "ny", "tx")
@@ -113,12 +121,11 @@ cases_deaths |>
113121
114122<img src =" man/figures/README-date-1.png " width =" 90% " style =" display : block ; margin : auto ;" />
115123
116- As with basically any dataset, there is some cleaning that we will need
117- to do to make it actually usable; we’ll use some utilities from
124+ As with most datasets, we need to do some cleaning to make it usable; we’ll use some utilities from
118125[ ` {epiprocess} ` ] ( https://cmu-delphi.github.io/epiprocess/ ) for this.
119126
120- First, to eliminate some of the noise coming from daily reporting, we do
121- 7 day averaging over a trailing window[ ^ 1 ] :
127+ First, to reduce the noise from daily reporting, we will compute a
128+ 7-day average over a trailing window[^1]:
122129
123130``` r
124131cases_deaths <-
@@ -134,7 +141,7 @@ cases_deaths <-
134141 rename(case_rate = cases_7dav , death_rate = death_rate_7dav )
135142```
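
To make the trailing window concrete, here is a minimal base R sketch, separate from the `{epiprocess}` utilities used above. The `trailing_mean()` helper and the `daily` values are hypothetical and serve only as illustration; note that this sketch averages over a shorter window for the first six days.

``` r
# Hypothetical daily counts for one location (not real data).
daily <- c(10, 12, 9, 30, 11, 10, 13, 12, 8, 11)

# Trailing mean: the value on day i averages days max(1, i - 6) through i,
# so it never uses observations from after day i.
trailing_mean <- function(x, window = 7) {
  vapply(seq_along(x), function(i) {
    mean(x[max(1, i - window + 1):i])
  }, numeric(1))
}

smoothed <- trailing_mean(daily)
```

Because each smoothed value only looks backwards, a forecast made on a given day never depends on data reported after that day.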
136143
137- Then trimming outliers, most especially negative values:
144+ Then we'll trim outliers, especially negative values:
138145
139146``` r
140147cases_deaths <-
@@ -161,24 +168,25 @@ cases_deaths <-
161168
162169</details >
163170
164- After having downloaded and cleaned the data in ` cases_deaths ` , we plot
165- a subset of the states, noting the actual forecast date:
171+ After downloading and cleaning the cases and deaths data, we can plot
172+ a subset of the states, marking the desired forecast date:
166173
167174<details >
168175<summary >
169176Plot
170177</summary >
171178
172179``` r
180+ used_locations <- c("ca", "ma", "ny", "tx")
173181forecast_date_label <-
174182 tibble(
175183 geo_value = rep(used_locations, 2),
176184 .response_name = c(rep("case_rate", 4), rep("death_rate", 4)),
177185 dates = rep(forecast_date - 7 * 2, 2 * length(used_locations)),
178186 heights = c(rep(150, 4), rep(0.75, 4))
179187 )
180- processed_data_plot <-
181- covid_case_death_rates | >
188+
189+ covid_case_death_rates |>
182190 filter(geo_value %in% used_locations) |>
183191 autoplot(
184192 case_rate,
@@ -204,13 +212,13 @@ processed_data_plot <-
204212
205213<img src =" man/figures/README-show-processed-data-1.png " width =" 90% " style =" display : block ; margin : auto ;" />
206214
207- To make a forecast, we will use a “canned” simple auto-regressive
215+ To make a forecast, we will use a simple “canned” auto-regressive
208216forecaster to predict the death rate four weeks into the future using
209- lagged[ ^ 2 ] deaths and cases
217+ lagged[^2] deaths and cases.
210218
211219``` r
212220four_week_ahead <- arx_forecaster(
213- cases_deaths |> filter(time_value <= forecast_date),
221+ covid_case_death_rates |> filter(time_value <= forecast_date),
214222 outcome = "death_rate",
215223 predictors = c("case_rate", "death_rate"),
216224 args_list = arx_args_list(
@@ -221,31 +229,31 @@ four_week_ahead <- arx_forecaster(
221229)
222230four_week_ahead
223231# > ══ A basic forecaster of type ARX Forecaster ════════════════════════════════
224- # >
232+ # >
225233# > This forecaster was fit on 2025-02-10 12:09:58.
226- # >
234+ # >
227235# > Training data was an <epi_df> with:
228236# > • Geography: state,
229237# > • Time type: day,
230238# > • Using data up-to-date as of: 2022-01-01.
231239# > • With the last data available on 2021-08-01
232- # >
240+ # >
233241# > ── Predictions ──────────────────────────────────────────────────────────────
234- # >
242+ # >
235243# > A total of 4 predictions are available for
236244# > • 4 unique geographic regions,
237245# > • At forecast date: 2021-08-01,
238246# > • For target date: 2021-08-29,
239- # >
247+ # >
240248```
241249
242- In this case, we have used 0-3 days, a week, and two week lags for the
243- case rate, while using only zero, one and two weekly lags for the death
244- rate (as predictors). The result ` four_week_ahead ` is both a fitted
250+ In our model setup, the predictors are the case rate lagged 0-3 days, one week, and two weeks, and the death rate lagged 0, 1, and 2 weeks.
251+ The result ` four_week_ahead ` is both a fitted
245252model object which could be used any time in the future to create
246- different forecasts, as well as a set of predicted values (and
253+ different forecasts, and a set of predicted values (and
247254prediction intervals) for each location 28 days after the forecast date.
248- Plotting the prediction intervals on our subset above[ ^ 3 ] :
255+
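As a rough illustration of what these lags mean, the base R sketch below builds the lagged predictor columns and the 28-day-ahead target by hand. It is not how `arx_forecaster()` is implemented; the two series and the `lag_by()` and `lead_by()` helpers are hypothetical.

``` r
# Hypothetical daily series for a single location (not real data).
n <- 60
case_rate <- seq(20, by = 0.5, length.out = n)
death_rate <- seq(0.2, by = 0.01, length.out = n)

# Value of x from k days earlier; the first k entries are NA.
lag_by <- function(x, k) c(rep(NA_real_, k), head(x, length(x) - k))

# Value of x from k days later; the last k entries are NA.
lead_by <- function(x, k) c(tail(x, length(x) - k), rep(NA_real_, k))

case_lags <- c(0, 1, 2, 3, 7, 14) # 0-3 days, one week, two weeks
death_lags <- c(0, 7, 14)         # 0, 1, and 2 weeks

# One column per lagged predictor, then the target 28 days ahead.
design <- data.frame(
  sapply(case_lags, function(k) lag_by(case_rate, k)),
  sapply(death_lags, function(k) lag_by(death_rate, k))
)
names(design) <- c(
  paste0("case_lag_", case_lags),
  paste0("death_lag_", death_lags)
)
design$death_ahead_28 <- lead_by(death_rate, 28)

# Only rows with every lag and an observed target can be used for fitting.
train <- design[complete.cases(design), ]
```

Regressing `death_ahead_28` on the lagged columns is what makes this a direct forecast: the model targets the value 28 days out rather than stepping forward one day at a time.
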
256+ Plotting the prediction intervals over the true values for our location subset[^3]:
249257
250258<details >
251259<summary >
@@ -275,28 +283,29 @@ forecast_plot <-
275283
276284<img src =" man/figures/README-show-single-forecast-1.png " width =" 90% " style =" display : block ; margin : auto ;" />
277285
278- And as a tibble of quantile level -value pairs:
286+ And as a tibble of quantile-value pairs:
279287
280288``` r
281289four_week_ahead$predictions |>
282290 select(-.pred) |>
283291 pivot_quantiles_longer(.pred_distn)
284292# > # A tibble: 20 × 5
285293# > geo_value values quantile_levels forecast_date target_date
286- # > <chr> <dbl> <dbl> <date> <date>
287- # > 1 ca 0.199 0.1 2021-08-01 2021-08-29
288- # > 2 ca 0.285 0.25 2021-08-01 2021-08-29
289- # > 3 ca 0.345 0.5 2021-08-01 2021-08-29
290- # > 4 ca 0.405 0.75 2021-08-01 2021-08-29
291- # > 5 ca 0.491 0.9 2021-08-01 2021-08-29
292- # > 6 ma 0.0285 0.1 2021-08-01 2021-08-29
294+ # > <chr> <dbl> <dbl> <date> <date>
295+ # > 1 ca 0.199 0.1 2021-08-01 2021-08-29
296+ # > 2 ca 0.285 0.25 2021-08-01 2021-08-29
297+ # > 3 ca 0.345 0.5 2021-08-01 2021-08-29
298+ # > 4 ca 0.405 0.75 2021-08-01 2021-08-29
299+ # > 5 ca 0.491 0.9 2021-08-01 2021-08-29
300+ # > 6 ma 0.0285 0.1 2021-08-01 2021-08-29
293301# > # ℹ 14 more rows
294302```
295303
296- The black dot gives the median prediction, while the blue intervals give
304+ The orange dot gives the predicted median, and the blue intervals give
297305the 25-75%, the 10-90%, and 2.5-97.5% inter-quantile ranges[ ^ 4 ] . For
298306this particular day and these locations, the forecasts are relatively
299- accurate, with the true data being at least within the 10-90% interval.
307+ accurate, with the true data being at worst within the 10-90% interval.
308+
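To read interval bounds off quantile-value pairs like those in the tibble above, one option is the small base R sketch below. It reuses the `ca` rows from the printed output; the `central_interval()` helper and the `observed` value are hypothetical and only illustrate the coverage check.

``` r
# Quantile-value pairs for "ca", copied from the printed tibble above.
ca <- data.frame(
  quantile_levels = c(0.1, 0.25, 0.5, 0.75, 0.9),
  values = c(0.199, 0.285, 0.345, 0.405, 0.491)
)

# Bounds of the central interval with the given coverage,
# e.g. coverage = 0.8 picks the 0.1 and 0.9 quantiles.
central_interval <- function(q, coverage) {
  alpha <- (1 - coverage) / 2
  lower <- q$values[abs(q$quantile_levels - alpha) < 1e-8]
  upper <- q$values[abs(q$quantile_levels - (1 - alpha)) < 1e-8]
  c(lower = lower, upper = upper)
}

interval_80 <- central_interval(ca, 0.8)
interval_50 <- central_interval(ca, 0.5)

# Hypothetical observed value, used only to show the check.
observed <- 0.30
covered_80 <- observed >= interval_80[["lower"]] &&
  observed <= interval_80[["upper"]]
```
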
300309A couple of things to note:
301310
3023111 . Our methods are primarily direct forecasters; this means we don’t
@@ -310,12 +319,11 @@ A couple of things to note:
310319## Getting Help
311320
312321If you encounter a bug or have a feature request, feel free to file an
313- [ issue on our github
322+ [issue on our GitHub
314323page](https://github.com/cmu-delphi/epipredict/issues). For other
315324questions, feel free to reach out to the authors, either via this
316- [ contact
317- form] ( https://docs.google.com/forms/d/e/1FAIpQLScqgT1fKZr5VWBfsaSp-DNaN03aV6EoZU4YljIzHJ1Wl_zmtg/viewform ) ,
318- email, or the Insightnet slack.
325+ [contact form](https://docs.google.com/forms/d/e/1FAIpQLScqgT1fKZr5VWBfsaSp-DNaN03aV6EoZU4YljIzHJ1Wl_zmtg/viewform),
326+ email, or the InsightNet Slack.
319327
320328[ ^ 1 ] : This makes it so that any given day of the processed time-series
321329 only depends on the previous week, which means that we avoid leaking