
using check_enough_train_data in practice #452


Merged

dsweber2 merged 12 commits into dev from arx_forecastCheckEnough on Mar 28, 2025

Conversation

dsweber2
Contributor

Checklist

Please:

  • Make sure this PR is against "dev", not "main".
  • Request a review from one of the current epipredict main reviewers:
    dajmcdon.
  • Make sure to bump the version number in DESCRIPTION and NEWS.md.
    Always increment the patch version number (the third number), unless you are
    making a release PR from dev to main, in which case increment the minor
    version number (the second number).
  • Describe changes made in NEWS.md, making sure breaking changes
    (backwards-incompatible changes to the documented interface) are noted.
    Collect the changes under the next release number (e.g. if you are on
    0.7.2, then write your changes under the 0.8 heading).
  • Consider pinning the epiprocess version in the DESCRIPTION file if
    • You anticipate breaking changes in epiprocess soon
    • You want to co-develop features in epipredict and epiprocess

Change explanations for reviewer

I was going through the pile of issues and #333 was a good test case for check_enough_train_data, so I added it to arx_forecaster(). To get it working, I also had to modify check_enough_train_data by replicating its logic in the bake step, since the check matters most for the test data rather than the training data.
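
A rough sketch of what that looks like in recipe form (the recipe arx_forecaster() builds internally, the column name value, and the exact placement of the check are assumptions here; the point is the skip = FALSE behavior):

library(epipredict)
library(recipes)
library(dplyr)

# epi_data: a hypothetical epi_df with a numeric `value` column.
# skip = FALSE means the check also runs when the recipe is baked on the test
# data, which is where the failure in #333 actually shows up.
r <- epi_recipe(epi_data) %>%
  step_epi_lag(value, lag = c(0, 7, 14)) %>%
  step_epi_ahead(value, ahead = 7) %>%
  check_enough_train_data(all_predictors(), skip = FALSE) %>%
  step_epi_naomit()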

Magic GitHub syntax to mark associated Issue(s) as resolved when this is merged into the default branch

Closes #333

@dsweber2 dsweber2 requested a review from dajmcdon as a code owner March 21, 2025 23:00
@dshemetov
Contributor

A test-time data check seems reasonable! However:

  • don't quite think the n should be the same at prep and bake time
    • for one they seem like they should be separate parameters
    • I'm also concerned the n in the bake step case will tend to always just be 1 (since check_enough_train will come after all predictor construction) ...
    • also feels like this is treading on similar ground to the get_oversized_test_data() refactor
  • the fn name should probably remove train then

For instance, in the test you added

test_that("warns if there's not enough data to predict", {
  edf <- tibble(
    geo_value = "ct",
    time_value = seq(as.Date("2020-10-01"), as.Date("2023-05-31"), by = "day")
  ) %>%
    mutate(value = seq_len(nrow(.)) + rnorm(nrow(.))) %>%
    # Oct to May (flu season, ish) only:
    filter(!between(as.POSIXlt(time_value)$mon + 1L, 6L, 9L)) %>%
    # and actually, pretend we're around mid-October 2022:
    filter(time_value <= as.Date("2022-10-12")) %>%
    as_epi_df(as_of = as.Date("2022-10-12"))
  edf %>% filter(time_value > "2022-08-01")

  debugonce(bake.check_enough_train_data)
  expect_error(
    edf %>% arx_forecaster("value"),
    class = "epipredict__not_enough_train_data"
  )
})

Browse[1]> non_na_data %>%
               group_by(across(all_of(.env$object$epi_keys)))
# A tibble: 444 x 7
   geo_value time_value value lag_0_value lag_7_value lag_14_value ahead_7_value
   <chr>     <date>     <dbl>       <dbl>       <dbl>        <dbl>         <dbl>
 1 ct        2020-10-15  16.4        16.4        7.39        0.565          20.1
 2 ct        2020-10-16  15.9        15.9        9.05        2.86           23.5
 3 ct        2020-10-17  17.6        17.6        8.91        3.24           23.7
 4 ct        2020-10-18  18.2        18.2        8.62        3.11           24.3
 5 ct        2020-10-19  18.3        18.3       13.8         6.69           27.0
 6 ct        2020-10-20  20.1        20.1       13.3         6.16           27.1
 7 ct        2020-10-21  21.0        21.0       16.6         7.52           29.0
 8 ct        2020-10-22  20.1        20.1       16.4         7.39           29.0
 9 ct        2020-10-23  23.5        23.5       15.9         9.05           29.7
10 ct        2020-10-24  23.7        23.7       17.6         8.91           32.4

Browse[1]> object$n
[1] 3

Here n = 3 comes from the number of predictors we have (the default when n is unspecified), but we can certainly make a prediction with only one data point.

@dsweber2
Contributor Author

for one they seem like they should be separate parameters

Good point.

I'm also concerned the n in the bake step case will tend to always just be 1 (since check_enough_train will come after all predictor construction) ...

I was wondering about the multi-ahead case, but I think that too should only need one point to do the prediction.

also feels like this is treading on similar ground that get_oversized_test_data() refactors

That is happening before any steps are applied, so it can't really make this kind of check. For example, get_test_data can't tell you which columns don't have enough data.

I think I'll switch the default back to skip = TRUE, and adjust the usage in arx_forecaster so that it checks for a fixed n (fixed to 1, unless we can think of other cases).

We may just want to change the name to check_enough_data because it was basically flexible enough to cover train or test before I made this PR. The changes I made narrowed behavior so that it only drops NA's from the specified columns to check, and enables bake-time running if skip = FALSE, which are both things we'd want in either case.
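
In recipe terms, the proposal above would look roughly like this (the name check_enough_data and the exact argument set are still proposals at this point, so treat the signature as an assumption):

library(epipredict)
library(recipes)
library(dplyr)

# Hypothetical usage after the rename: a check fixed at n = 1 that runs at bake
# time (skip = FALSE), so prediction only needs one complete row per group, and
# NAs are dropped only from the columns being checked.
r <- epi_recipe(epi_data) %>%
  step_epi_lag(value, lag = c(0, 7, 14)) %>%
  step_epi_ahead(value, ahead = 7) %>%
  check_enough_data(all_predictors(), n = 1, skip = FALSE)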

@dshemetov
Contributor

dshemetov commented Mar 25, 2025

I was wondering about the multi-ahead case

FWIW there are barriers before we can support multi-ahead forecasters.

That is happening before any steps are applied, so it can't really make this kind of check. For example, get_test_data can't tell you which columns don't have enough data.

Right, but wasn't the gist of our refactor plan basically to push the filtering to after the predict step? Then new_data would first pass through hardhat::forge -> recipes::bake, where it would possibly receive the very basic "is there at least 1 data point per group" data check, which doesn't seem essential, since the same effective check can be phrased as "is there at least one prediction per group".

We may just want to change the name to check_enough_data because it was basically flexible enough to cover train or test before I made this PR. The changes I made narrowed behavior so that it only drops NA's from the specified columns to check, and enables bake-time running if skip = FALSE, which are both things we'd want in either case.

Changing the name sounds good to me.

@dsweber2
Contributor Author

since the same effective check can be phrased as "is there at least one prediction per group".

Looking at #333, the problem comes up in the middle of the prediction step, with completely unrelated errors surfacing before we ever get the output from predict, so I think having a step that catches it just before fit would be useful.

A couple of simultaneous problems were making this tricky (a sketch of the resulting logic follows this list):
1. drop_na can completely remove states.
2. Checking each column individually misses cases where combinations across
   columns cause a state's signal to be left out.
3. Checking all columns simultaneously doesn't tell the user which columns
   are the problem.
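
A plain-dplyr sketch of the kind of logic that addresses all three at once (this is not the package's implementation; dat, cols, epi_keys, and min_rows are placeholder names):

library(dplyr)
library(tidyr)

check_enough_sketch <- function(dat, cols, epi_keys, min_rows = 1) {
  all_groups <- distinct(dat, across(all_of(epi_keys)))
  rows_per_group <- function(check_cols) {
    dat %>%
      # drop NAs only in the columns being checked
      drop_na(all_of(check_cols)) %>%
      count(across(all_of(epi_keys)), name = "num_rows") %>%
      # groups wiped out entirely by drop_na (problem 1) reappear with 0 rows
      right_join(all_groups, by = epi_keys) %>%
      mutate(num_rows = coalesce(num_rows, 0L))
  }
  # check all columns together first, so NAs spread across different columns
  # (problem 2) still count against a group
  if (all(rows_per_group(cols)$num_rows >= min_rows)) {
    return(invisible(dat))
  }
  # the joint check failed: re-check column by column so the error can name the
  # offending columns (problem 3); if no single column is at fault, report all
  bad <- cols[vapply(
    cols,
    function(col) any(rows_per_group(col)$num_rows < min_rows),
    logical(1)
  )]
  if (length(bad) == 0) bad <- cols
  stop("The following columns don't have enough data: ",
       paste(bad, collapse = ", "))
}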

@dshemetov dshemetov left a comment


lgtm and I double-checked that it correctly gives a helpful error for Logan's example in #333

@dsweber2 dsweber2 self-assigned this Mar 25, 2025
  # when n is unspecified, default it to the number of columns being checked
  x$n <- length(col_names)
}

# prep-time check; the "train" label makes a failure message refer to training data
check_enough_data_core(training, x, col_names, "train")
Contributor


Is this needed at prep time? Is it enough to only have it process during bake? Prep runs only during model training. Bake runs at train time and at test time (unless skip = TRUE).

Contributor Author


So the one advantage of having it in both is that it allows us to print The following columns don't have enough data to train: x and y. if it breaks during prep and The following columns don't have enough data to predict: x and y. if it breaks during prediction. It ends up only running once either way, and the tests all pass with it removed from prep.
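
For reference, a sketch of what the bake-time counterpart to the prep-time call above could look like (an assumed shape, not the PR's literal code; the "predict" label is inferred from the error message quoted here):

# run the same core check on the data being baked, labeled "predict" so a
# failure at prediction time produces the second message above
bake.check_enough_data <- function(object, new_data, ...) {
  col_names <- object$columns  # columns resolved at prep time (assumed field name)
  check_enough_data_core(new_data, object, col_names, "predict")
  new_data
}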

Contributor


I'm fine to leave it, but I'm not convinced it does what you want. Is there a corner case in which it can pass at prep time but fail at bake time? In that case, model training would error, but the message would say something about predicting.

Contributor Author


I guess one could substitute a different dataset between prepping and baking, but substituting a dataset with less data seems unlikely.

@dsweber2 dsweber2 force-pushed the arx_forecastCheckEnough branch from c18c911 to 1b05e99 Compare March 27, 2025 17:49
@dajmcdon
Contributor

Just a few minor things remaining.

@dsweber2 dsweber2 merged commit 5721d41 into dev Mar 28, 2025
1 check passed
Development

Successfully merging this pull request may close these issues.

Provide better error/recovery when test lags don't exist ("zero-length inputs [...]")