
using check_enough_train_data in practice #452


Merged

dsweber2 merged 12 commits into dev from arx_forecastCheckEnough on Mar 28, 2025

Conversation

dsweber2
Contributor

Checklist

Please:

  • Make sure this PR is against "dev", not "main".
  • Request a review from one of the current epipredict main reviewers:
    dajmcdon.
  • Make sure to bump the version number in DESCRIPTION and NEWS.md.
    Always increment the patch version number (the third number), unless you are
    making a release PR from dev to main, in which case increment the minor
    version number (the second number).
  • Describe changes made in NEWS.md, making sure breaking changes
    (backwards-incompatible changes to the documented interface) are noted.
    Collect the changes under the next release number (e.g. if you are on
    0.7.2, then write your changes under the 0.8 heading).
  • Consider pinning the epiprocess version in the DESCRIPTION file if
    • You anticipate breaking changes in epiprocess soon
    • You want to co-develop features in epipredict and epiprocess

Change explanations for reviewer

I was going through the pile of issues and #333 was a good test case for check_enough_train_data, so I added it to arx_forecaster(). To get it working, I also had to modify check_enough_train_data by replicating its logic in the bake step, since the check matters most for the test data rather than the training data.
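
A rough sketch of what that looks like in recipe form (the recipe arx_forecaster() builds internally, the column name value, and the exact placement of the check are assumptions here; the point is the skip = FALSE behavior):

library(epipredict)
library(recipes)
library(dplyr)

# epi_data: a hypothetical epi_df with a numeric `value` column.
# skip = FALSE means the check also runs when the recipe is baked on the test
# data, which is where the failure in #333 actually shows up.
r <- epi_recipe(epi_data) %>%
  step_epi_lag(value, lag = c(0, 7, 14)) %>%
  step_epi_ahead(value, ahead = 7) %>%
  check_enough_train_data(all_predictors(), skip = FALSE) %>%
  step_epi_naomit()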

Magic GitHub syntax to mark associated Issue(s) as resolved when this is merged into the default branch

Closes #333

@dsweber2 dsweber2 requested a review from dajmcdon as a code owner March 21, 2025 23:00
@dshemetov
Contributor

A test-time data check seems reasonable! However:

  • don't quite think the n should be the same at prep and bake time
    • for one they seem like they should be separate parameters
    • I'm also concerned the n in the bake step case will tend to always just be 1 (since check_enough_train will come after all predictor construction) ...
    • also feels like this is treading on similar ground to the get_oversized_test_data() refactor
  • the fn name should probably remove train then

For instance, in the test you added

test_that("warns if there's not enough data to predict", {
  edf <- tibble(
    geo_value = "ct",
    time_value = seq(as.Date("2020-10-01"), as.Date("2023-05-31"), by = "day")
  ) %>%
    mutate(value = seq_len(nrow(.)) + rnorm(nrow(.))) %>%
    # Oct to May (flu season, ish) only:
    filter(!between(as.POSIXlt(time_value)$mon + 1L, 6L, 9L)) %>%
    # and actually, pretend we're around mid-October 2022:
    filter(time_value <= as.Date("2022-10-12")) %>%
    as_epi_df(as_of = as.Date("2022-10-12"))
  edf %>% filter(time_value > "2022-08-01")

  debugonce(bake.check_enough_train_data)
  expect_error(
    edf %>% arx_forecaster("value"),
    class = "epipredict__not_enough_train_data"
  )
})

Browse[1]> non_na_data %>%
               group_by(across(all_of(.env$object$epi_keys)))
# A tibble: 444 x 7
   geo_value time_value value lag_0_value lag_7_value lag_14_value ahead_7_value
   <chr>     <date>     <dbl>       <dbl>       <dbl>        <dbl>         <dbl>
 1 ct        2020-10-15  16.4        16.4        7.39        0.565          20.1
 2 ct        2020-10-16  15.9        15.9        9.05        2.86           23.5
 3 ct        2020-10-17  17.6        17.6        8.91        3.24           23.7
 4 ct        2020-10-18  18.2        18.2        8.62        3.11           24.3
 5 ct        2020-10-19  18.3        18.3       13.8         6.69           27.0
 6 ct        2020-10-20  20.1        20.1       13.3         6.16           27.1
 7 ct        2020-10-21  21.0        21.0       16.6         7.52           29.0
 8 ct        2020-10-22  20.1        20.1       16.4         7.39           29.0
 9 ct        2020-10-23  23.5        23.5       15.9         9.05           29.7
10 ct        2020-10-24  23.7        23.7       17.6         8.91           32.4

Browse[1]> object$n
[1] 3

Here n = 3 comes from the number of predictors we have (the default when n is unspecified), but we can certainly make a prediction with only one data point.

@dsweber2
Contributor Author

for one they seem like they should be separate parameters

Good point.

I'm also concerned the n in the bake step case will tend to always just be 1 (since check_enough_train will come after all predictor construction) ...

I was wondering about the multi-ahead case, but I think that too should only need one point to do the prediction.

also feels like this is treading on similar ground that get_oversized_test_data() refactors

That is happening before any steps are applied, so it can't really make this kind of check. For example, get_test_data can't tell you which columns don't have enough data.

I think I'll switch the default back to skip = TRUE, and adjust the usage in arx_forecaster so that it checks for a fixed n (fixed to 1, unless we can think of other cases).

We may just want to change the name to check_enough_data because it was basically flexible enough to cover train or test before I made this PR. The changes I made narrowed behavior so that it only drops NA's from the specified columns to check, and enables bake-time running if skip = FALSE, which are both things we'd want in either case.
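
In recipe terms, the proposal above would look roughly like this (the name check_enough_data and the exact argument set are still proposals at this point, so treat the signature as an assumption):

library(epipredict)
library(recipes)
library(dplyr)

# Hypothetical usage after the rename: a check fixed at n = 1 that runs at bake
# time (skip = FALSE), so prediction only needs one complete row per group, and
# NAs are dropped only from the columns being checked.
r <- epi_recipe(epi_data) %>%
  step_epi_lag(value, lag = c(0, 7, 14)) %>%
  step_epi_ahead(value, ahead = 7) %>%
  check_enough_data(all_predictors(), n = 1, skip = FALSE)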

@dshemetov
Contributor

dshemetov commented Mar 25, 2025

I was wondering about the multi-ahead case

FWIW there are barriers before we can support multi-ahead forecasters.

That is happening before any steps are applied, so it can't really make this kind of check. For example, get_test_data can't tell you which columns don't have enough data.

Right, but wasn't the gist of our refactor plan basically to push the filtering to after the predict step? Then new_data would first pass through hardhat::forge -> recipes::bake, where it would possibly receive the very basic "is there at least 1 data point per group" data check, which doesn't seem essential, since the same effective check can be phrased as "is there at least one prediction per group".

We may just want to change the name to check_enough_data because it was basically flexible enough to cover train or test before I made this PR. The changes I made narrowed behavior so that it only drops NA's from the specified columns to check, and enables bake-time running if skip = FALSE, which are both things we'd want in either case.

Changing the name sounds good to me.

@dsweber2
Contributor Author

since the same effective check can be phrased as "is there at least one prediction per group".

Looking at #333, the problem comes up in the middle of the prediction step, with completely unrelated errors surfacing before we ever get the output from predict, so I think having a step that catches it just before fit would be useful.

A couple of simultaneous problems were making this tricky (a sketch of the resulting logic follows this list):
1. drop_na can completely remove states.
2. Checking each column individually misses cases where combinations across
   columns cause a state's signal to be left out.
3. Checking all columns simultaneously doesn't tell the user which columns
   are the problem.
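
A plain-dplyr sketch of the kind of logic that addresses all three at once (this is not the package's implementation; dat, cols, epi_keys, and min_rows are placeholder names):

library(dplyr)
library(tidyr)

check_enough_sketch <- function(dat, cols, epi_keys, min_rows = 1) {
  all_groups <- distinct(dat, across(all_of(epi_keys)))
  rows_per_group <- function(check_cols) {
    dat %>%
      # drop NAs only in the columns being checked
      drop_na(all_of(check_cols)) %>%
      count(across(all_of(epi_keys)), name = "num_rows") %>%
      # groups wiped out entirely by drop_na (problem 1) reappear with 0 rows
      right_join(all_groups, by = epi_keys) %>%
      mutate(num_rows = coalesce(num_rows, 0L))
  }
  # check all columns together first, so NAs spread across different columns
  # (problem 2) still count against a group
  if (all(rows_per_group(cols)$num_rows >= min_rows)) {
    return(invisible(dat))
  }
  # the joint check failed: re-check column by column so the error can name the
  # offending columns (problem 3); if no single column is at fault, report all
  bad <- cols[vapply(
    cols,
    function(col) any(rows_per_group(col)$num_rows < min_rows),
    logical(1)
  )]
  if (length(bad) == 0) bad <- cols
  stop("The following columns don't have enough data: ",
       paste(bad, collapse = ", "))
}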

@dshemetov dshemetov left a comment


lgtm and I double-checked that it correctly gives a helpful error for Logan's example in #333

@dsweber2 dsweber2 self-assigned this Mar 25, 2025
  # when n is unspecified, default it to the number of columns being checked
  x$n <- length(col_names)
}

# prep-time check; the "train" label makes a failure message refer to training data
check_enough_data_core(training, x, col_names, "train")
Contributor


Is this needed at prep time? Is it enough to only have it process during bake? Prep runs only during model training. Bake runs at train time and at test time (unless skip = TRUE).

Contributor Author


So the one advantage of having it in both is that it allows us to print The following columns don't have enough data to train: x and y. if it breaks during prep and The following columns don't have enough data to predict: x and y. if it breaks during prediction. It ends up only running once either way, and the tests all pass with it removed from prep.
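
For reference, a sketch of what the bake-time counterpart to the prep-time call above could look like (an assumed shape, not the PR's literal code; the "predict" label is inferred from the error message quoted here):

# run the same core check on the data being baked, labeled "predict" so a
# failure at prediction time produces the second message above
bake.check_enough_data <- function(object, new_data, ...) {
  col_names <- object$columns  # columns resolved at prep time (assumed field name)
  check_enough_data_core(new_data, object, col_names, "predict")
  new_data
}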

Contributor


I'm fine to leave it, but I'm not convinced it does what you want. Is there a corner case in which it can pass at prep time but fail at bake time? In that case, model training would error, but the message would say something about predicting.

Contributor Author


I guess one could substitute a different dataset between prepping and baking, but substituting a dataset with less data seems unlikely.

@dsweber2 dsweber2 force-pushed the arx_forecastCheckEnough branch from c18c911 to 1b05e99 Compare March 27, 2025 17:49
@dajmcdon
Contributor

Just a few minor things remaining.

@dsweber2 dsweber2 merged commit 5721d41 into dev Mar 28, 2025
1 check passed
Development

Successfully merging this pull request may close these issues.

Provide better error/recovery when test lags don't exist ("zero-length inputs [...]")