Unfortunately, this forecast was not particularly accurate, since for example
377
-
`-1.39` is not remotely in the interval `(-0.01, 0.01]`.
381
+
Unfortunately, this forecast was not particularly accurate. All real growth rates were larger than the predicted growth rates, with California (real growth rate `-1.39`) not remotely in the interval `(-0.01, 0.01]`.
378
382
379
383
380
384
## Fitting multi-key panel data
381
385
382
-
If you have multiple keys that are set in the `epi_df` as `other_keys`,
383
-
`arx_forecaster` will automatically group by those as well.
384
-
For example, predicting the number of graduates in each of the categories in `grad_employ` from above:
386
+
If multiple keys are set in the `epi_df` as `other_keys`,
387
+
`arx_forecaster` will automatically group by those in addition to the required geographic key.
388
+
For example, predicting the number of graduates in each of the categories in `grad_employ_subset` from above:
385
389
386
390
```{r multi_key_forecast, warning=FALSE}
387
391
# only fitting a subset, otherwise there are ~550 distinct pairs, which is bad for plotting
The 8 graphs are all pairs of the `geo_values` (`"Quebec"` and `"British Columbia"`), `edu_quals` (`"Undergraduate degree"` and `"Professional degree"`), and age brackets (`"15 to 34 years"` and `"35 to 64 years"`).
419
+
The 8 graphs represent all combinations of the `geo_values` (`"Quebec"` and `"British Columbia"`), `edu_quals` (`"Undergraduate degree"` and `"Professional degree"`), and age brackets (`"15 to 34 years"` and `"35 to 64 years"`).
411
420
412
421
## Fitting a non-geo-pooled model
413
422
414
-
Because our internal methods fit a single model, to fit a non-geo-pooled model
415
-
that has a different fit for each geography, one either needs a multi-level
416
-
engine (which at the moment parsnip doesn't support), or one needs to map over
423
+
The methods shown so far fit a single model across all geographic regions.
424
+
This is called "geo-pooling".
425
+
To fit a non-geo-pooled model that fits each geography separately, one either needs a multi-level
426
+
engine (which at the moment `{parsnip}` doesn't support), or one needs to loop over
417
427
geographies.
428
+
Here, we're using `purrr::map` to perform the loop.
418
429
419
430
```{r fit_non_geo_pooled, warning=FALSE}
420
431
geo_values <- covid_case_death_rates |>
@@ -441,12 +452,11 @@ all_fits <-
441
452
map_df(all_fits, ~ pluck(., "predictions"))
442
453
```
443
454
444
-
This is both 56 times slower[^7], and uses far less data to fit each model.
445
-
If the geographies are at all comparable, for example by normalization, we would
446
-
get much better results by pooling.
455
+
Fitting separate models for each geography is 56 times slower[^7] than geo-pooling and fits each model on far less data.
456
+
If a dataset contains relatively few observations for each geography, fitting a geo-pooled model is likely to produce better, more stable results.
457
+
However, geo-pooling can only be used if values are comparable in meaning and scale across geographies or can be made comparable, for example by normalization.
447
458
448
-
If we wanted to build a geo-aware model, such as one that sets the constant in a
449
-
linear regression fit to be different for each geography, we would need to build a [Custom workflow](custom_epiworkflows) with geography as a factor.
459
+
If we wanted to build a geo-aware model, such as a linear regression with a different intercept for each geography, we would need to build a [custom workflow](custom_epiworkflows) with geography as a factor.
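As a rough illustration of the geo-aware idea outside of `{epipredict}`, the following base-R sketch fits a linear regression with a shared slope and a separate intercept for each geography by including geography as a factor. The toy data, the column names (`geo_value`, `x`, `y`), and the coefficients are made up for illustration; this is not the custom workflow itself.

```r
# Toy data: two geographies with the same slope but different baselines.
# All names and numbers here are hypothetical.
set.seed(1)
toy <- data.frame(
  geo_value = factor(rep(c("ca", "ny"), each = 50)),
  x = rep(seq(0, 1, length.out = 50), times = 2)
)
toy$y <- ifelse(toy$geo_value == "ca", 1, 3) + 2 * toy$x + rnorm(100, sd = 0.1)

# Geo-pooled fit: one intercept and one slope shared by all geographies.
pooled <- lm(y ~ x, data = toy)

# Geo-aware fit: shared slope, but the factor term gives each geography
# its own intercept (reported as an offset from the first level, "ca").
geo_aware <- lm(y ~ x + geo_value, data = toy)

coef(pooled)
coef(geo_aware)
```

In the geo-aware fit, the `geo_valueny` coefficient estimates how far the `"ny"` intercept sits from the `"ca"` intercept, while the slope on `x` is still estimated from both geographies together.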
@@ -510,7 +520,7 @@ extending `four_week_ahead` using the custom forecaster framework.
510
520
511
521
## Mathematical description
512
522
513
-
Let's describe in more detail the actual fit model for a more minimal version of
523
+
Let's look more closely at the mathematical details of the model, using a minimal version of
514
524
`four_week_ahead`:
515
525
516
526
```{r, four_week_again}
@@ -537,9 +547,9 @@ $$
537
547
For example, $a_1$ is `lag_0_death_rate` above, with a value of `r hardhat::extract_fit_engine(four_week_small$epi_workflow)$coefficients["lag_0_death_rate"] `,
538
548
while $a_5$ is `r hardhat::extract_fit_engine(four_week_small$epi_workflow)$coefficients["lag_7_case_rate"] `.
539
549
540
-
The training data for fitting this linear model is created by creating a series
541
-
of columns shifted by the appropriate amount; this makes it so that each row
542
-
without `NA` values is a training point to fit the coefficients $a_0,\ldots, a_6$.
550
+
The training data for fitting this linear model is constructed within the `arx_forecaster()` function by shifting a series
551
+
of columns by the appropriate amount, based on the requested `lags`.
552
+
Each row containing no `NA` values is used as a training observation to fit the coefficients $a_0,\ldots, a_6$.
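To make the shifting concrete, here is a minimal base-R sketch of the idea for a single geography and a single predictor. It is not the internals of `arx_forecaster()`: the helper `shift_by()`, the toy series, and the choice of lags (0 and 7) and ahead (28) are assumptions for illustration only.

```r
# Hypothetical helper: shift a vector by `k` positions, padding with NA.
# k >= 0 looks back (a lag); k < 0 looks forward (a lead, used for the outcome).
shift_by <- function(x, k) {
  n <- length(x)
  if (k >= 0) {
    c(rep(NA, k), x[seq_len(n - k)])
  } else {
    c(x[(1 - k):n], rep(NA, -k))
  }
}

# Toy daily series for one geography (made-up numbers).
set.seed(1)
toy <- data.frame(time_value = 1:100, death_rate = cumsum(rnorm(100)))

# One column per shifted copy of the signal.
train <- data.frame(
  ahead_28_death_rate = shift_by(toy$death_rate, -28), # outcome, 28 steps ahead
  lag_0_death_rate = shift_by(toy$death_rate, 0),
  lag_7_death_rate = shift_by(toy$death_rate, 7)
)

# Only rows with no NA in any column can serve as training observations.
train_complete <- train[complete.cases(train), ]
nrow(train_complete) # 100 - 7 - 28 = 65 rows

fit <- lm(
  ahead_28_death_rate ~ lag_0_death_rate + lag_7_death_rate,
  data = train_complete
)
coef(fit)
```

The first 7 rows are dropped because the longest lag is undefined there, and the last 28 rows are dropped because the outcome has not yet been observed.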
543
553
544
554
[^4]: in the case of a `{parsnip}` engine which doesn't explicitly predict
545
555
quantiles, these quantiles are created using `layer_residual_quantiles()`,