using check_enough_train_data in practice #452
Conversation
A test time data check seems reasonable! However:
For instance, in the test you added:

```r
test_that("warns if there's not enough data to predict", {
  edf <- tibble(
    geo_value = "ct",
    time_value = seq(as.Date("2020-10-01"), as.Date("2023-05-31"), by = "day"),
  ) %>%
    mutate(value = seq_len(nrow(.)) + rnorm(nrow(.))) %>%
    # Oct to May (flu season, ish) only:
    filter(!between(as.POSIXlt(time_value)$mon + 1L, 6L, 9L)) %>%
    # and actually, pretend we're around mid-October 2022:
    filter(time_value <= as.Date("2022-10-12")) %>%
    as_epi_df(as_of = as.Date("2022-10-12"))
  edf %>% filter(time_value > "2022-08-01")
  debugonce(bake.check_enough_train_data)
  expect_error(
    edf %>% arx_forecaster("value"),
    class = "epipredict__not_enough_train_data"
  )
})
```
Stepping through in the debugger:

```r
Browse[1]> non_na_data %>%
+   group_by(across(all_of(.env$object$epi_keys)))
# A tibble: 444 x 7
   geo_value time_value value lag_0_value lag_7_value lag_14_value ahead_7_value
   <chr>     <date>     <dbl>       <dbl>       <dbl>        <dbl>         <dbl>
 1 ct        2020-10-15  16.4        16.4        7.39        0.565          20.1
 2 ct        2020-10-16  15.9        15.9        9.05        2.86           23.5
 3 ct        2020-10-17  17.6        17.6        8.91        3.24           23.7
 4 ct        2020-10-18  18.2        18.2        8.62        3.11           24.3
 5 ct        2020-10-19  18.3        18.3       13.8         6.69           27.0
 6 ct        2020-10-20  20.1        20.1       13.3         6.16           27.1
 7 ct        2020-10-21  21.0        21.0       16.6         7.52           29.0
 8 ct        2020-10-22  20.1        20.1       16.4         7.39           29.0
 9 ct        2020-10-23  23.5        23.5       15.9         9.05           29.7
10 ct        2020-10-24  23.7        23.7       17.6         8.91           32.4
```
```r
Browse[1]> object$n
[1] 3
```

Here n = 3 comes from the number of predictors we have (the default when `n` is unspecified), but we can certainly make a prediction with only one data point.
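For context on why one row suffices: with the default `arx_args_list()` lags of `c(0, 7, 14)` (matching the `lag_*_value` columns above), the fallback `n` becomes 3, but a point forecast only needs the latest complete row per group. A minimal sketch, using the `edf` constructed in the test above:

```r
# All that prediction fundamentally requires: the most recent row per geo.
edf %>%
  dplyr::group_by(geo_value) %>%
  dplyr::slice_max(time_value, n = 1)
```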
Good point.
I was wondering about the multi-ahead case, but I think that too should only need one point to do the prediction.
That is happening before any steps are applied, so it can't really make this kind of check. For example, I think I'll switch the default back to [...]. We may just want to change the name to [...].
FWIW there are barriers before we can support multi-ahead forecasters.
Right, but wasn't the gist of our refactor plan basically to push the filtering to after the predict step? Then `new_data` would first pass through `hardhat::forge()` -> `recipes::bake()`, where it could receive the very basic "is there at least one data point per group" check. That check doesn't seem essential, though, since the same effective check can be phrased as "is there at least one prediction per group".
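A rough sketch of that test-time flow (`blueprint` and `fit` are illustrative stand-ins; `hardhat::forge()` does call `bake()` on the prepped recipe internally):

```r
# forge() applies the prepped recipe's bake() to new_data...
processed <- hardhat::forge(new_data, blueprint)$predictors
# ...the model predicts on the baked predictors...
preds <- predict(fit, processed)
# ...and only afterwards would we filter and check
# "is there at least one prediction per group".
```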
Changing the name sounds good to me.
Looking at #333, the problem is coming up in the middle of the prediction step, with completely unrelated errors coming through before we ever get the output from [...].
A couple of simultaneous problems were making this tricky (see the sketch after this list):
1. `drop_na()` can completely remove states.
2. Checking each column individually misses cases where NA patterns across combinations of columns cause a state's signal to be left out.
3. Checking all columns simultaneously doesn't let the user know which columns to check.
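A small, self-contained illustration of problems 1 and 2 (made-up data):

```r
library(dplyr)
library(tidyr)

dat <- tibble(
  geo_value = c("ak", "ak", "ca", "ca"),
  x = c(1, NA, 1, 2),
  y = c(NA, 1, 3, 4)
)

# Column by column, "ak" looks fine: x and y each have one non-NA value.
dat %>%
  group_by(geo_value) %>%
  summarise(across(c(x, y), ~ sum(!is.na(.x))))

# But drop_na() requires complete rows, so every "ak" row disappears at once.
dat %>% drop_na()
```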
LGTM, and I double-checked that it correctly gives a helpful error for Logan's example in #333.
```r
  # default n to the number of selected columns
  x$n <- length(col_names)
}

# run the shared check against the training data at prep time
check_enough_data_core(training, x, col_names, "train")
```
Is this needed at prep time? Is it enough to only have it run during bake? Prep runs only during model training. Bake runs at train time and at test time (unless `skip = TRUE`).
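To make the timing concrete, a minimal recipes example (using `step_naomit()`, whose `skip` already defaults to `TRUE`, purely as an illustration):

```r
library(recipes)

rec <- recipe(mpg ~ ., data = mtcars) %>%
  step_naomit(all_predictors(), skip = TRUE)

# prep() runs once, on the training data:
prepped <- prep(rec, training = mtcars)

# bake() on new_data runs at test time; steps with skip = TRUE are skipped
# there, though they were applied to the training data during prep():
baked_train <- bake(prepped, new_data = NULL)    # processed training data
baked_test  <- bake(prepped, new_data = mtcars)  # test-time path; skip applies
```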
So the one advantage of having it in both is that it allows us to print "The following columns don't have enough data to train: x and y." if it breaks during prep, and "The following columns don't have enough data to predict: x and y." if it breaks during prediction. It ends up only running once either way, and the tests all pass with it removed from prep.
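A rough sketch of how a shared core could produce both messages (the body here is illustrative and ignores the per-`epi_keys` grouping; only the name `check_enough_data_core` and the error class come from this PR):

```r
check_enough_data_core <- function(data, x, col_names, stage) {
  # Columns with fewer than n non-NA values at this stage:
  too_few <- col_names[colSums(!is.na(data[, col_names, drop = FALSE])) < x$n]
  if (length(too_few) > 0) {
    cli::cli_abort(
      "The following columns don't have enough data to {stage}: {.val {too_few}}.",
      class = "epipredict__not_enough_train_data"
    )
  }
  invisible(data)
}
```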
I'm fine to leave it, but I'm not convinced it does what you want. Is there a corner case in which it can pass at prep time but fail at bake time? In that case, model training would error, but the message would say something about predicting.
I guess one could substitute a different dataset between prepping and baking, but substituting a dataset with less data seems unlikely.
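For instance, something like this contrived sequence (`rec`, `big_train`, and `tiny_new_data` are made-up names):

```r
prepped <- prep(rec, training = big_train)  # check passes: plenty of rows
bake(prepped, new_data = tiny_new_data)     # same check fires here, but with
                                            # the "to predict" wording
```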
Just a few minor things remaining.
Checklist

Please:

- Request a review from one of the current main reviewers: dajmcdon.
- Bump the version number in `DESCRIPTION` and `NEWS.md`. Always increment the patch version number (the third number), unless you are making a release PR from dev to main, in which case increment the minor version number (the second number).
- Describe the changes in `NEWS.md`, making sure breaking changes (backwards-incompatible changes to the documented interface) are noted. Collect the changes under the next release number (e.g. if you are on 0.7.2, then write your changes under the 0.8 heading).
- Consider pinning the `epiprocess` version in the `DESCRIPTION` file if the changes depend on `epiprocess` updates landing soon, to keep `epipredict` and `epiprocess` in sync.
Change explanations for reviewer
I was going through the pile of issues, and #333 was a good test case for `check_enough_train_data`, so I added it to `arx_forecaster()`. To get it working, I also needed to modify `check_enough_train_data` by replicating the logic in the bake step, since it should be applied most importantly to the test data, not the train data.

Magic GitHub syntax to mark associated Issue(s) as resolved when this is merged into the default branch