@@ -59,7 +59,7 @@ from JHU data.
5959Creating the dataset using `{epidatr}` and `{epiprocess}`
6060</summary>
6161
62- This dataset can be found in the package as \<TODO DOESN’T EXIST\>; we
62+ This dataset can be found in the package as `covid_case_death_rates`; we
6363demonstrate some of the typically ubiquitous cleaning operations needed
6464to be able to forecast. First we pull both jhu-csse cases and deaths
6565from [`{epidatr}`](https://cmu-delphi.github.io/epidatr/) package:
@@ -84,26 +84,35 @@ deaths <- pub_covidcast(
8484 geo_values = "*"
8585) |>
8686 select(geo_value, time_value, death_rate = value)
87+ ```
88+
89+ Since visualizing the results on every geography is somewhat
90+ overwhelming, we’ll only train on a subset of 4 locations.
91+
92+ ``` r
93+ used_locations <- c("ca", "ma", "ny", "tx")
8794cases_deaths <-
8895 full_join(cases, deaths, by = c("time_value", "geo_value")) |>
96+ filter(geo_value %in% used_locations) |>
8997 as_epi_df(as_of = as.Date("2022-01-01"))
90- plot_locations <- c("ca", "ma", "ny", "tx")
9198# plotting the data as it was downloaded
9299cases_deaths |>
93- filter(geo_value %in% plot_locations) |>
94- pivot_longer(cols = c("case_rate", "death_rate"), names_to = "source") |>
95- ggplot(aes(x = time_value, y = value)) +
96- geom_line() +
97- facet_grid(source ~ geo_value, scale = "free") +
100+ autoplot(
101+ case_rate,
102+ death_rate,
103+ .color_by = "none"
104+ ) +
105+ facet_grid(.response_name ~ geo_value, scale = "free") +
98106 scale_x_date(date_breaks = "3 months", date_labels = "%Y %b") +
99107 theme(axis.text.x = element_text(angle = 90, hjust = 1))
100108```
101109
102- <img src="man/figures/README-case_death-1.png" width="90%" style="display: block; margin: auto;" />
110+ <img src="man/figures/README-date-1.png" width="90%" style="display: block; margin: auto;" />
103111
104112As with basically any dataset, there is some cleaning that we will need
105113to do to make it actually usable; we’ll use some utilities from
106114[`{epiprocess}`](https://cmu-delphi.github.io/epiprocess/) for this.
115+
107116First, to eliminate some of the noise coming from daily reporting, we do
108117 7-day averaging over a trailing window[^1]:
109118
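To make the trailing-window idea concrete, here is a minimal base-R sketch on a made-up series. It is not the `{epiprocess}` call used in this README; the vector `daily` and the use of `stats::filter()` are illustrative assumptions only.

``` r
# Hypothetical toy series of daily counts (not the README's data).
daily <- c(0, 14, 0, 0, 7, 0, 0, 21, 0, 0)

# Trailing 7-day mean: each value averages the current day and the six
# days before it. `sides = 1` restricts stats::filter() to past values,
# so no future observations leak into the average.
trailing_mean <- as.numeric(stats::filter(daily, rep(1 / 7, 7), sides = 1))

# The first six entries are NA because a full 7-day window is not yet available.
print(trailing_mean)
```

Because the window only looks backward, the smoothed value on any day can be computed from data available on that day, which is the property the footnote relies on.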
@@ -129,10 +138,12 @@ cases_deaths <-
129138 group_by(geo_value) |>
130139 mutate(
131140 outlr_death_rate = detect_outlr_rm(
132- time_value, death_rate, detect_negatives = TRUE
141+ time_value, death_rate,
142+ detect_negatives = TRUE
133143 ),
134144 outlr_case_rate = detect_outlr_rm(
135- time_value, case_rate, detect_negatives = TRUE
145+ time_value, case_rate,
146+ detect_negatives = TRUE
136147 )
137148 ) |>
138149 unnest(cols = starts_with("outlr"), names_sep = "_") |>
@@ -142,22 +153,6 @@ cases_deaths <-
142153 case_rate = outlr_case_rate_replacement
143154 ) |>
144155 select(geo_value, time_value, case_rate, death_rate)
145- cases_deaths
146- #> An `epi_df` object, 32,424 x 4 with metadata:
147- #> * geo_type = state
148- #> * time_type = day
149- #> * as_of = 2022-01-01
150- #>
151- #> # A tibble: 32,424 × 4
152- #>   geo_value time_value case_rate death_rate
153- #> * <chr>     <date>         <dbl>      <dbl>
154- #> 1 ak        2020-06-01      2.31          0
155- #> 2 ak        2020-06-02      1.94          0
156- #> 3 ak        2020-06-03      2.63          0
157- #> 4 ak        2020-06-04      2.59          0
158- #> 5 ak        2020-06-05      2.43          0
159- #> 6 ak        2020-06-06      2.35          0
160- #> # ℹ 32,418 more rows
161156```
162157
163158</details >
@@ -173,18 +168,19 @@ Plot
173168``` r
174169forecast_date_label <-
175170 tibble(
176- geo_value = rep(plot_locations, 2),
177- source = c(rep("case_rate", 4), rep("death_rate", 4)),
178- dates = rep(forecast_date - 7 * 2, 2 * length(plot_locations)),
171+ geo_value = rep(used_locations, 2),
172+ .response_name = c(rep("case_rate", 4), rep("death_rate", 4)),
173+ dates = rep(forecast_date - 7 * 2, 2 * length(used_locations)),
179174 heights = c(rep(150, 4), rep(1.0, 4))
180175 )
181176processed_data_plot <-
182177 cases_deaths |>
183- filter(geo_value %in% plot_locations) |>
184- pivot_longer(cols = c("case_rate", "death_rate"), names_to = "source") |>
185- ggplot(aes(x = time_value, y = value)) +
186- geom_line() +
187- facet_grid(source ~ geo_value, scale = "free") +
178+ autoplot(
179+ case_rate,
180+ death_rate,
181+ .color_by = "none"
182+ ) +
183+ facet_grid(.response_name ~ geo_value, scale = "free") +
188184 geom_vline(aes(xintercept = forecast_date)) +
189185 geom_text(
190186 data = forecast_date_label,
@@ -216,7 +212,7 @@ four_week_ahead <- arx_forecaster(
216212four_week_ahead
217213#> ══ A basic forecaster of type ARX Forecaster ════════════════════════════════
218214#>
219- #> This forecaster was fit on 2025-01-24 15:31:46.
215+ #> This forecaster was fit on 2025-01-27 16:36:10.
220216#>
221217#> Training data was an <epi_df> with:
222218#> • Geography: state,
@@ -226,8 +222,8 @@ four_week_ahead
226222#>
227223#> ── Predictions ──────────────────────────────────────────────────────────────
228224#>
229- #> A total of 56 predictions are available for
230- #> • 56 unique geographic regions,
225+ #> A total of 4 predictions are available for
226+ #> • 4 unique geographic regions,
231227#> • At forecast date: 2021-08-01,
232228#> • For target date: 2021-08-29,
233229#>
@@ -246,58 +242,34 @@ Plotting the prediction intervals on our subset above[^3]:
246242Plot
247243</summary >
248244
249- This is the same kind of plot as ` processed_data_plot ` above, but with
250- the past data narrowed somewhat
251-
252245``` r
253- narrow_data_plot <-
254- cases_deaths |>
255- filter(time_value > "2021-04-01") |>
256- filter(geo_value %in% plot_locations) |>
257- pivot_longer(cols = c("case_rate", "death_rate"), names_to = "source") |>
258- ggplot(aes(x = time_value, y = value)) +
259- geom_line() +
260- facet_grid(source ~ geo_value, scale = "free") +
246+ epiworkflow <- four_week_ahead$epi_workflow
247+ restricted_predictions <-
248+ four_week_ahead$predictions |>
249+ rename(time_value = target_date, value = .pred) |>
250+ mutate(.response_name = "death_rate")
251+ forecast_plot <-
252+ four_week_ahead |>
253+ autoplot(plot_data = cases_deaths) +
261254 geom_vline(aes(xintercept = forecast_date)) +
262255 geom_text(
263- data = forecast_date_label,
256+ data = forecast_date_label %>% filter(.response_name == "death_rate"),
264257 aes(x = dates, label = "forecast\ndate", y = heights),
265258 size = 3, hjust = "right"
266259 ) +
267260 scale_x_date(date_breaks = "3 months", date_labels = "%Y %b") +
268261 theme(axis.text.x = element_text(angle = 90, hjust = 1))
269262```
270263
271- Putting that together with a plot of the bands, and a plot of the median
272- prediction.
273-
274- ``` r
275- epiworkflow <- four_week_ahead$epi_workflow
276- restricted_predictions <-
277- four_week_ahead$predictions |>
278- filter(geo_value %in% plot_locations) |>
279- rename(time_value = target_date, value = .pred) |>
280- mutate(source = "death_rate")
281- forecast_plot <-
282- narrow_data_plot |>
283- epipredict:::plot_bands(
284- restricted_predictions,
285- levels = 0.9
286- ) +
287- geom_point(
288- data = restricted_predictions,
289- aes(y = .data$value)
290- )
291- ```
292-
293264</details >
294265
295266<img src="man/figures/README-show-single-forecast-1.png" width="90%" style="display: block; margin: auto;" />
296267
297- The yellow dot gives the median prediction, while the red interval gives
298- the 5-95% inter-quantile range. For this particular day and these
299- locations, the forecasts are relatively accurate, with the true data
300- being within the 25-75% interval. A couple of things to note:
268+ The black dot gives the median prediction, while the blue intervals give
269+ the 25-75%, the 10-90%, and 2.5-97.5% inter-quantile ranges. For this
270+ particular day and these locations, the forecasts are relatively
271+ accurate, with the true data being within the 25-75% interval. A couple
272+ of things to note:
301273
302274 1. Our methods are primarily direct forecasters; this means we don’t
303275 need to predict 1, 2,…, 27 days ahead to then predict 28 days ahead
@@ -312,10 +284,10 @@ being within the 25-75% interval. A couple of things to note:
312284If you encounter a bug or have a feature request, feel free to file an
313285[issue on our github
314286page](https://github.com/cmu-delphi/epipredict/issues). For other
315- questions, feel free to contact [Daniel]([email protected]),
316-
317- [Logan]([email protected]), either via email or on the Insightnet
318- slack.
287+ questions, feel free to reach out to the authors, either via this
288+ [contact
289+ form](https://docs.google.com/forms/d/e/1FAIpQLScqgT1fKZr5VWBfsaSp-DNaN03aV6EoZU4YljIzHJ1Wl_zmtg/viewform),
290+ email, or the Insightnet slack.
319291
320292[^1]: This makes it so that any given day of the processed timeseries
321293 only depends on the previous week, which means that we avoid leaking