Skip to content

Add filter.epi_archive #651

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 16 commits into from
Apr 16, 2025
Merged

Add filter.epi_archive #651

merged 16 commits into from
Apr 16, 2025

Conversation

brookslogan
Copy link
Contributor

@brookslogan brookslogan commented Apr 2, 2025

Checklist

Please:

  • Make sure this PR is against "dev", not "main" (unless this is a release
    PR).
  • Request a review from one of the current main reviewers:
    brookslogan, nmdefries.
  • Makes sure to bump the version number in DESCRIPTION. Always increment
    the patch version number (the third number), unless you are making a
    release PR from dev to main, in which case increment the minor version
    number (the second number).
  • Describe changes made in NEWS.md, making sure breaking changes
    (backwards-incompatible changes to the documented interface) are noted.
    Collect the changes under the next release number (e.g. if you are on
    1.7.2, then write your changes under the 1.8 heading).
  • See DEVELOPMENT.md for more information on the development
    process.

Change explanations for reviewer

Magic GitHub syntax to mark associated Issue(s) as resolved when this is merged into the default branch

TODO

  • Add tests
  • Complete cut-off comment
  • Consider forbidding use of value columns as well. E.g., value >= 10 will remove revisions of things >= 10 down to < 10, when the intent is probably to treat the epikeytime as missing in corresponding versions (e.g., putting an NA there).
  • [-] Vignette changes?
  • [-] Maybe implement a select.epi_archive and use it in the example to focus on the 7dav signal -> weekly res... not sure if taking Saturdays of Gaussian linear smoothed signal makes sense necessarily, though also not sure if we should have these lined up in the same data structure at all.
    • Deferring to separate PR sometime.

@brookslogan brookslogan marked this pull request as ready for review April 3, 2025 22:30
@brookslogan brookslogan requested a review from dshemetov April 3, 2025 22:32
@brookslogan
Copy link
Contributor Author

@dshemetov GitHub suggested you for review, if you have time. Experimented with some checks to try to error on things that might be expected to work but have details; see tests for some examples. (Probably could map to "uncompactified" logic while remaining scalable & without a bunch of extra logic if we had a duckdb backend.)

}
e <- parent.env(e)
}
TRUE
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like it will error if you use it inside a function that has "version" or "time_value" defined in its environment? I'm reading this as traversing up the environment chain and stopping short of the globalenv(), which would be most likely to have variables like that defined, but intermediate scopes might still have false positives.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah ok, I tested it locally and it seems fine. I guess this works because we hit the data mask environment first and break out before we hit the user's function environment? Seems reasonable.

Copy link
Contributor Author

@brookslogan brookslogan Apr 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, the data mask environment chain in current dplyr&rlang looks something like
rlang wrapper env -> data bindings env (/ env chain input to as_data_mask env) -> quosure env (chain)
where:

  • rlang wrapper env holds the real data pronoun objects, a ~ override I don't quite understand, and some other internals
  • data bindings env is typically just a single env holding (group's) column bindings; in other contexts, it could be an env chain we fed into as_data_mask (with its "top" ancestor reassigned to point at:)
  • the quosure env

We should stop at the data bindings env and reassign things there.

But I did find an issue along those lines

epidatasets::case_death_rate_archive %>% {.$DT <- copy(.$DT)[, e := 1]; .} %>% filter(e < 2)
#> Error in `filter()`:
#> ℹ In argument: `e < 2`.
#> Caused by error in `e < 2`:
#> ! comparison (<) is possible only for atomic and list types
#> Run `rlang::last_trace()` to see where the error occurred.

because I'm leaving around an e in the rlang wrapper env.

Also, I fell for a classic lazy eval + env issue

epidatasets::case_death_rate_archive %>% filter(case_rate_7d_av < 2)
#> Error in `filter()`:
#> ℹ In argument: `case_rate_7d_av < 2`.
#> Caused by error:
#> ! Using `death_rate_7d_av` in `filter.epi_archive` may produce unexpected results.
#> → See `?filter.epi_archive` details for how to proceed.
#> Run `rlang::last_trace()` to see where the error occurred.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... and found some others. Should be fixed now.

Copy link
Contributor

@dshemetov dshemetov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally, this looks good to me. Hard to anticipate all future issues with the archive metadata after the filtering, but I think you made the right calls in choosing the defaults (only reinferring the time_type in the day case and inheriting otherwise) and explaining the counter-intuitive aspects.

@brookslogan
Copy link
Contributor Author

brookslogan commented Apr 16, 2025

  • avoid leaving e binding etc. (via rm or operating within local)
  • fix using presumably the last measurement colname in error for everything (more local?)
  • use :: liberally to ensure we don't attempt to use user function [Don't think this is actually necessary except for in very edge-case-y delayedAssign stuff, since we should have our own quosure env that's not the user's quosure env. User would have to assign within the data mask, which seems unlikely. Also it's very ugly to use :: for binary ops. Maybe turn local approach to 1st point to with approach for this.] [It seemed better to just manually construct an envs with our own namespace as the direct parent.]

@brookslogan
Copy link
Contributor Author

brookslogan commented Apr 16, 2025

@dshemetov along with the lazy eval gotcha fixes, I've patched some (unrelated, but check-triggering due to upstream updates) issues regarding Date typeof strictness in the package code and in the testing code. Maybe I'll just merge for now, but if the latter part raises flags in your mind, please note in #662 and up its priority & timeline.

@brookslogan brookslogan merged commit 4366b59 into dev Apr 16, 2025
3 checks passed
@brookslogan brookslogan deleted the lcb/archive-filter branch April 16, 2025 21:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add filter for epi_archive
2 participants