From f23c5c93085926f774ec72602a919e189d0c6947 Mon Sep 17 00:00:00 2001 From: "Logan C. Brooks" Date: Tue, 11 Mar 2025 13:51:08 -0700 Subject: [PATCH 01/16] docs: link-ify `vignette("backtesting", package="epipredict")` `downlit` fails to link these, maybe because we're not on CRAN yet. --- R/methods-epi_archive.R | 3 ++- man/epix_slide.Rd | 2 +- vignettes/epi_archive.Rmd | 3 ++- 3 files changed, 5 insertions(+), 3 deletions(-) diff --git a/R/methods-epi_archive.R b/R/methods-epi_archive.R index 362fd4ea..4a8cd164 100644 --- a/R/methods-epi_archive.R +++ b/R/methods-epi_archive.R @@ -629,7 +629,8 @@ epix_detailed_restricted_mutate <- function(.data, ...) { #' version-aware: the sliding computation at any given reference time t is #' performed on **data that would have been available as of t**. This function #' is intended for use in accurate backtesting of models; see -#' `vignette("backtesting", package="epipredict")` for a walkthrough. +#' \href{https://cmu-delphi.github.io/epipredict/articles/backtesting.html}{`vignette("backtesting", +#' package="epipredict")`} for a walkthrough. #' #' @param .x An [`epi_archive`] or [`grouped_epi_archive`] object. If ungrouped, #' all data in `x` will be treated as part of a single data group. diff --git a/man/epix_slide.Rd b/man/epix_slide.Rd index cadb1983..1d4009cb 100644 --- a/man/epix_slide.Rd +++ b/man/epix_slide.Rd @@ -115,7 +115,7 @@ behaves similarly to \code{epi_slide()}, with the key exception that it is version-aware: the sliding computation at any given reference time t is performed on \strong{data that would have been available as of t}. This function is intended for use in accurate backtesting of models; see -\code{vignette("backtesting", package="epipredict")} for a walkthrough. +\href{https://cmu-delphi.github.io/epipredict/articles/backtesting.html}{\code{vignette("backtesting", package="epipredict")}} for a walkthrough. } \details{ A few key distinctions between the current function and \code{epi_slide()}: diff --git a/vignettes/epi_archive.Rmd b/vignettes/epi_archive.Rmd index f87ea291..90eadad6 100644 --- a/vignettes/epi_archive.Rmd +++ b/vignettes/epi_archive.Rmd @@ -234,7 +234,8 @@ observation carried forward (LOCF). For more information, see `epix_merge()`. ## Backtesting forecasting models One of the most common use cases of `epiprocess::epi_archive()` object is for -accurate model backtesting. See `vignette("backtesting", package="epipredict")` +accurate model backtesting. See [`vignette("backtesting", +package="epipredict")`](https://cmu-delphi.github.io/epipredict/articles/backtesting.html) for an in-depth demo, using a pre-built forecaster in that package. ## Attribution From 4f90dda7621d0451a6105deb359d659345ce037c Mon Sep 17 00:00:00 2001 From: "Logan C. Brooks" Date: Tue, 11 Mar 2025 14:05:32 -0700 Subject: [PATCH 02/16] docs: fix `epi_slide()` column packing note --- R/slide.R | 8 +++++--- man/epi_slide.Rd | 8 +++++--- 2 files changed, 10 insertions(+), 6 deletions(-) diff --git a/R/slide.R b/R/slide.R index abc7c3b7..abc93259 100644 --- a/R/slide.R +++ b/R/slide.R @@ -19,9 +19,11 @@ #' @template basic-slide-params #' @param .f Function, formula, or missing; together with `...` specifies the #' computation to slide. The return of the computation should either be a -#' scalar or a 1-row data frame. Data frame returns will be -#' `tidyr::unpack()`-ed, if named, and will be [`tidyr::pack`]-ed columns, if -#' not named. See examples. +#' scalar or a 1-row data frame; these outputs will be collected and form a +#' new column or columns in the `epi_slide()` result. Data frame returns will +#' be unpacked into multiple columns in the result by default, or +#' [`tidyr::pack`]ed into a single data-frame-type column if you provide a +#' name for such a column. See examples. #' #' - If `.f` is missing, then `...` will specify the computation via #' tidy-evaluation. This is usually the most convenient way to use diff --git a/man/epi_slide.Rd b/man/epi_slide.Rd index f053909d..f5b9f1c8 100644 --- a/man/epi_slide.Rd +++ b/man/epi_slide.Rd @@ -22,9 +22,11 @@ and any columns in \code{other_keys}. If grouped, we make sure the grouping is b \item{.f}{Function, formula, or missing; together with \code{...} specifies the computation to slide. The return of the computation should either be a -scalar or a 1-row data frame. Data frame returns will be -\code{tidyr::unpack()}-ed, if named, and will be \code{\link[tidyr:pack]{tidyr::pack}}-ed columns, if -not named. See examples. +scalar or a 1-row data frame; these outputs will be collected and form a +new column or columns in the \code{epi_slide()} result. Data frame returns will +be unpacked into multiple columns in the result by default, or +\code{\link[tidyr:pack]{tidyr::pack}}ed into a single data-frame-type column if you provide a +name for such a column. See examples. \itemize{ \item If \code{.f} is missing, then \code{...} will specify the computation via tidy-evaluation. This is usually the most convenient way to use From 51ce893d734f160e7ebb88347b3ad2dfa5a9cded Mon Sep 17 00:00:00 2001 From: "Logan C. Brooks" Date: Wed, 12 Mar 2025 12:40:33 -0700 Subject: [PATCH 03/16] docs: iterate on `epi_slide(.f, ...)` roxygen --- R/slide.R | 57 +++++++++++++++++++++++++++++++++--------------- man/epi_slide.Rd | 45 ++++++++++++++++++++++++-------------- 2 files changed, 68 insertions(+), 34 deletions(-) diff --git a/R/slide.R b/R/slide.R index abc93259..3cf92865 100644 --- a/R/slide.R +++ b/R/slide.R @@ -17,28 +17,49 @@ #' See `vignette("epi_df")` for more examples. #' #' @template basic-slide-params -#' @param .f Function, formula, or missing; together with `...` specifies the -#' computation to slide. The return of the computation should either be a -#' scalar or a 1-row data frame; these outputs will be collected and form a -#' new column or columns in the `epi_slide()` result. Data frame returns will -#' be unpacked into multiple columns in the result by default, or +#' @param .f,... The computation to slide. The input will be a time window of +#' the data for a single `geo_value` --- or a single combination of +#' `geo_value` and any [`other_keys`][as_epi_df] you used to specify +#' demographical breakdowns. The input will always have the same size, +#' determined by `.window_size`, and will fill in any missing `time_values`, +#' using `NA` values for missing measurements. The output should be a scalar +#' value or a 1-row data frame; these outputs will be collected and form a new +#' column or columns in the `epi_slide()` result. Data frame outputs will be +#' unpacked into multiple columns in the result by default, or #' [`tidyr::pack`]ed into a single data-frame-type column if you provide a -#' name for such a column. See examples. +#' name for such a column (e.g., via `.new_col_name`). +#' +#' You can specify the computation in one of the following ways: +#' +#' - Don't provide `.f`, and instead use use one or more +#' [`dplyr::summarize`]-esque ["data-masking"][rlang::args_data_masking] +#' expressions in `...`, e.g., `cases_7dmed = median(cases)`. This is usually +#' the most convenient way to use `epi_slide`. See examples. +#' +#' - Provide a formula in `.f`, e.g., `~ median(.x$cases)`. In this formula, +#' `.x` is an `epi_df` containing data for a single time window as described +#' above, taken from the original `.x` fed into `epi_slide()`. +#' +#' - Provide a function in `.f`. The function should be of the form `function(x, +#' g, t)` or `function(x, g, t, )`, where: #' -#' - If `.f` is missing, then `...` will specify the computation via -#' tidy-evaluation. This is usually the most convenient way to use -#' `epi_slide`. See examples. -#' - If `.f` is a formula, then the formula should use `.x` (not the same as -#' the input `epi_df`) to operate on the columns of the input `epi_df`, e.g. -#' `~mean(.x$var)` to compute a mean of `var`. -#' - If a function, `.f` must have the form `function(x, g, t, ...)`, where: #' - `x` is a data frame with the same column names as the original object, -#' minus any grouping variables, with only the windowed data for one -#' group-`.ref_time_value` combination -#' - `g` is a one-row tibble containing the values of the grouping variables -#' for the associated group +#' minus any grouping variables, with only the windowed data for one +#' group-`.ref_time_value` combination +#' +#' - `g` is a one-row tibble specifying the `geo_value` and value of any +#' `other_keys` for this computation +#' #' - `t` is the `.ref_time_value` for the current window -#' - `...` are additional arguments +#' +#' - If you have a complex `.f` containing ``, you can provide values for those arguments in the `...` +#' argument to `epi_slide()`. +#' +#' The values of `g` and `t` are also available to data-masking expression and +#' formula-based computations as `.group_key` and `.ref_time_value`, +#' respectively. Formula computations also let you use `.y` or `.z`, +#' respectively. #' #' @param ... Additional arguments to pass to the function or formula specified #' via `.f`. Alternatively, if `.f` is missing, then the `...` is interpreted diff --git a/man/epi_slide.Rd b/man/epi_slide.Rd index f5b9f1c8..0405f360 100644 --- a/man/epi_slide.Rd +++ b/man/epi_slide.Rd @@ -20,30 +20,43 @@ epi_slide( and any columns in \code{other_keys}. If grouped, we make sure the grouping is by \code{geo_value} and \code{other_keys}.} -\item{.f}{Function, formula, or missing; together with \code{...} specifies the -computation to slide. The return of the computation should either be a -scalar or a 1-row data frame; these outputs will be collected and form a -new column or columns in the \code{epi_slide()} result. Data frame returns will -be unpacked into multiple columns in the result by default, or +\item{.f, ...}{The computation to slide. The input will be a time window of +the data for a single \code{geo_value} --- or a single combination of +\code{geo_value} and any \code{\link[=as_epi_df]{other_keys}} you used to specify +demographical breakdowns. The input will always have the same size, +determined by \code{.window_size}, and will fill in any missing \code{time_values}, +using \code{NA} values for missing measurements. The output should be a scalar +value or a 1-row data frame; these outputs will be collected and form a new +column or columns in the \code{epi_slide()} result. Data frame outputs will be +unpacked into multiple columns in the result by default, or \code{\link[tidyr:pack]{tidyr::pack}}ed into a single data-frame-type column if you provide a -name for such a column. See examples. +name for such a column (e.g., via \code{.new_col_name}). + +You can specify the computation in one of the following ways: \itemize{ -\item If \code{.f} is missing, then \code{...} will specify the computation via -tidy-evaluation. This is usually the most convenient way to use -\code{epi_slide}. See examples. -\item If \code{.f} is a formula, then the formula should use \code{.x} (not the same as -the input \code{epi_df}) to operate on the columns of the input \code{epi_df}, e.g. -\code{~mean(.x$var)} to compute a mean of \code{var}. -\item If a function, \code{.f} must have the form \verb{function(x, g, t, ...)}, where: +\item Don't provide \code{.f}, and instead use use one or more +\code{\link[dplyr:summarise]{dplyr::summarize}}-esque \link[rlang:args_data_masking]{"data-masking"} +expressions in \code{...}, e.g., \code{cases_7dmed = median(cases)}. This is usually +the most convenient way to use \code{epi_slide}. See examples. +\item Provide a formula in \code{.f}, e.g., \code{~ median(.x$cases)}. In this formula, +\code{.x} is an \code{epi_df} containing data for a single time window as described +above, taken from the original \code{.x} fed into \code{epi_slide()}. +\item Provide a function in \code{.f}. The function should be of the form \verb{function(x, g, t)} or \verb{function(x, g, t, )}, where: \itemize{ \item \code{x} is a data frame with the same column names as the original object, minus any grouping variables, with only the windowed data for one group-\code{.ref_time_value} combination -\item \code{g} is a one-row tibble containing the values of the grouping variables -for the associated group +\item \code{g} is a one-row tibble specifying the \code{geo_value} and value of any +\code{other_keys} for this computation \item \code{t} is the \code{.ref_time_value} for the current window -\item \code{...} are additional arguments +\item If you have a complex \code{.f} containing \verb{}, you can provide values for those arguments in the \code{...} +argument to \code{epi_slide()}. } + +The values of \code{g} and \code{t} are also available to data-masking expression and +formula-based computations as \code{.group_key} and \code{.ref_time_value}, +respectively. Formula computations also let you use \code{.y} or \code{.z}, +respectively. }} \item{...}{Additional arguments to pass to the function or formula specified From 3868c7ddfb1996a73091a71182cab5160c48b64d Mon Sep 17 00:00:00 2001 From: "Logan C. Brooks" Date: Wed, 12 Mar 2025 15:30:26 -0700 Subject: [PATCH 04/16] docs: iterate on `?epi_slide` intro --- R/slide.R | 11 ++++++----- man/epi_slide.Rd | 16 ++++++++-------- 2 files changed, 14 insertions(+), 13 deletions(-) diff --git a/R/slide.R b/R/slide.R index 3cf92865..017359cf 100644 --- a/R/slide.R +++ b/R/slide.R @@ -1,9 +1,10 @@ -#' Slide a function over variables in an `epi_df` object +#' More general form of [`epi_slide_opt`] for rolling/running computations #' -#' @description Slides a given function over variables in an `epi_df` object. -#' This is useful for computations like rolling averages. The function supports -#' many ways to specify the computation, but by far the most common use case is -#' as follows: +#' Check first whether you can use [`epi_slide_mean`], [`epi_slide_sum`], or the +#' medium-generality [`epi_slide_opt`] instead, as they are faster and more +#' convenient to use. You typically only need to use `epi_slide()` if you have a +#' computation that depends on multiple columns simultaneously, outputs multiple +#' columns simultaneously, or produces non-numeric output. #' #' ``` #' # Create new column `cases_7dmed` that contains a 7-day trailing median of cases diff --git a/man/epi_slide.Rd b/man/epi_slide.Rd index 0405f360..1146da31 100644 --- a/man/epi_slide.Rd +++ b/man/epi_slide.Rd @@ -2,7 +2,7 @@ % Please edit documentation in R/slide.R \name{epi_slide} \alias{epi_slide} -\title{Slide a function over variables in an \code{epi_df} object} +\title{More general form of \code{\link{epi_slide_opt}} for rolling/running computations} \usage{ epi_slide( .x, @@ -110,11 +110,13 @@ added. It will be ungrouped if \code{.x} was ungrouped, and have the same groups as \code{.x} if \code{.x} was grouped. } \description{ -Slides a given function over variables in an \code{epi_df} object. -This is useful for computations like rolling averages. The function supports -many ways to specify the computation, but by far the most common use case is -as follows: - +Check first whether you can use \code{\link{epi_slide_mean}}, \code{\link{epi_slide_sum}}, or the +medium-generality \code{\link{epi_slide_opt}} instead, as they are faster and more +convenient to use. You typically only need to use \code{epi_slide()} if you have a +computation that depends on multiple columns simultaneously, outputs multiple +columns simultaneously, or produces non-numeric output. +} +\details{ \if{html}{\out{
}}\preformatted{# Create new column `cases_7dmed` that contains a 7-day trailing median of cases epi_slide(edf, cases_7dmed = median(cases), .window_size = 7) }\if{html}{\out{
}} @@ -124,8 +126,6 @@ faster than \code{epi_slide}: \code{epi_slide_mean()} and \code{epi_slide_sum()} recommend using these functions when possible. See \code{vignette("epi_df")} for more examples. -} -\details{ \subsection{Advanced uses of \code{.f} via tidy evaluation}{ If specifying \code{.f} via tidy evaluation, in addition to the standard \code{\link{.data}} From 618f84129daeb8c2ad5dc71fedfb4c8138ff47b9 Mon Sep 17 00:00:00 2001 From: "Logan C. Brooks" Date: Wed, 12 Mar 2025 17:01:01 -0700 Subject: [PATCH 05/16] docs: iterate on `epix_slide()` roxygen --- R/methods-epi_archive.R | 197 ++++++++++++++++++++-------------------- man/epix_slide.Rd | 192 +++++++++++++++++++-------------------- 2 files changed, 191 insertions(+), 198 deletions(-) diff --git a/R/methods-epi_archive.R b/R/methods-epi_archive.R index 4a8cd164..d6e2815e 100644 --- a/R/methods-epi_archive.R +++ b/R/methods-epi_archive.R @@ -622,67 +622,83 @@ epix_detailed_restricted_mutate <- function(.data, ...) { } -#' Slide a function over variables in an `epi_archive` or `grouped_epi_archive` +#' Take each requested (group and) version in an archive, run a computation (e.g., forecast) #' -#' Slides a given function over variables in an `epi_archive` object. This -#' behaves similarly to `epi_slide()`, with the key exception that it is -#' version-aware: the sliding computation at any given reference time t is -#' performed on **data that would have been available as of t**. This function -#' is intended for use in accurate backtesting of models; see +#' ... and collect the results. This is useful for more accurately simulating +#' how a forecaster, nowcaster, or other algorithm would have behaved in real +#' time, factoring in reporting latency and data revisions; see #' \href{https://cmu-delphi.github.io/epipredict/articles/backtesting.html}{`vignette("backtesting", #' package="epipredict")`} for a walkthrough. #' +#' This is similar to looping over versions and calling [`epix_as_of`], but has +#' some conveniences such as working naturally with [`grouped_epi_archive`]s, +#' optional time windowing, and syntactic sugar to make things shorter to write. +#' #' @param .x An [`epi_archive`] or [`grouped_epi_archive`] object. If ungrouped, #' all data in `x` will be treated as part of a single data group. #' @param .f Function, formula, or missing; together with `...` specifies the -#' computation to slide. To "slide" means to apply a computation over a -#' sliding (a.k.a. "rolling") time window for each data group. The window is -#' determined by the `.before` parameter (see details for more). If a -#' function, `.f` must have the form `function(x, g, t, ...)`, where -#' -#' - "x" is an epi_df with the same column names as the archive's `DT`, minus -#' the `version` column -#' - "g" is a one-row tibble containing the values of the grouping variables -#' for the associated group -#' - "t" is the ref_time_value for the current window -#' - "..." are additional arguments +#' computation. The computation will be run on each requested group-version +#' combination, with a time window filter applied if `.before` is supplied. +#' +#' - If `.f` is a function must have the form `function(x, g, v)` or +#' `function(x, g, v, )`, where +#' +#' - `x` is an `epi_df` with the same column names as the archive's `DT`, +#' minus the `version` column. (Or, if `.all_versions = TRUE`, an +#' `epi_archive` with the requested partial version history.) +#' +#' - `g` is a one-row tibble containing the values of the grouping variables +#' for the associated group. +#' +#' - `v` (length-1) is the associated `version` (one of the requested +#' `.versions`) +#' +#' - `` are optional; you can add such +#' arguments to your function and set them by passing them through the +#' `...` argument to `epix_slide()`. #' #' If a formula, `.f` can operate directly on columns accessed via `.x$var` or #' `.$var`, as in `~ mean (.x$var)` to compute a mean of a column `var` for #' each group-`ref_time_value` combination. The group key can be accessed via -#' `.y` or `.group_key`, and the reference time value can be accessed via `.z` -#' or `.ref_time_value`. If `.f` is missing, then `...` will specify the -#' computation. +#' `.y` or `.group_key`, and the reference time value can be accessed via +#' `.z`, `.version`, or `.ref_time_value`. If `.f` is missing, then `...` will +#' specify the computation. #' @param ... Additional arguments to pass to the function or formula specified #' via `f`. Alternatively, if `.f` is missing, then the `...` is interpreted #' as a ["data-masking"][rlang::args_data_masking] expression or expressions #' for tidy evaluation; in addition to referring columns directly by name, the #' expressions have access to `.data` and `.env` pronouns as in `dplyr` verbs, #' and can also refer to `.x` (not the same as the input epi_archive), -#' `.group_key`, and `.ref_time_value`. See details for more. -#' @param .before How many time values before the `.ref_time_value` -#' should each snapshot handed to the function `.f` contain? If provided, it -#' should be a single value that is compatible with the time_type of the -#' time_value column (more below), but most commonly an integer. This window -#' endpoint is inclusive. For example, if `.before = 7`, `time_type` -#' in the archive is "day", and the `.ref_time_value` is January 8, then the -#' smallest time_value in the snapshot will be January 1. If missing, then the -#' default is no limit on the time values, so the full snapshot is given. -#' @param .versions Reference time values / versions for sliding -#' computations; each element of this vector serves both as the anchor point -#' for the `time_value` window for the computation and the `max_version` -#' `epix_as_of` which we fetch data in this window. If missing, then this will -#' set to a regularly-spaced sequence of values set to cover the range of -#' `version`s in the `DT` plus the `versions_end`; the spacing of values will -#' be guessed (using the GCD of the skips between values). +#' `.group_key` and `.version`/`.ref_time_value`. See details for more. +#' @param .before Optional; applies a `time_value` filter before running each +#' computation. The default is not to apply a `time_value` filter. If +#' provided, it should be a single integer or difftime that is compatible with +#' the time_type of the time_value column. If an integer, then the minimum +#' possible `time_value` included will be that many time steps (according to +#' the `time_type`) before each requested `.version`. This window endpoint is +#' inclusive. For example, if `.before = 14`, the `time_type` in the archive +#' is "day", and the requested `.version` is January 15, then the smallest +#' possible `time_value` possible in the snapshot will be January 1. Note that +#' this does not mean that there will be 14 or 15 distinct `time_value`s +#' actually appearing in the data; for most reporting streams, reporting as of +#' January 15 won't include `time_value`s all the way through January 14, due +#' to reporting latency. Unlike `epi_slide()`, `epix_slide()` won't fill in +#' any missing `time_values` in this window. +#' @param .versions Requested versions on which to run the computation. Each +#' requested `.version` also serves as the anchor point around which for which +#' the `time_value` window specified by `.before` is drawn. If `.versions` is +#' missing, it will be set to a regularly-spaced sequence of values set to +#' cover the range of `version`s in the `DT` plus the `versions_end`; the +#' spacing of values will be guessed (using the GCD of the skips between +#' values). #' @param .new_col_name Either `NULL` or a string indicating the name of the new #' column that will contain the derived values. The default, `NULL`, will use #' the name "slide_value" unless your slide computations output data frames, -#' in which case they will be unpacked into the constituent columns and those -#' names used. If the resulting column name(s) overlap with the column names -#' used for labeling the computations, which are `group_vars(x)` and -#' `"version"`, then the values for these columns must be identical to the -#' labels we assign. +#' in which case they will be unpacked into the constituent columns and the +#' data frame's column names will be used instead. If the resulting column +#' name(s) overlap with the column names used for labeling the computations, +#' which are `group_vars(x)` and `"version"`, then the values for these +#' columns must be identical to the labels we assign. #' @param .all_versions (Not the same as `.all_rows` parameter of `epi_slide`.) #' If `.all_versions = TRUE`, then the slide computation will be passed the #' version history (all `version <= .version` where `.version` is one of the @@ -697,16 +713,17 @@ epix_detailed_restricted_mutate <- function(.data, ...) { #' @details A few key distinctions between the current function and `epi_slide()`: #' 1. In `.f` functions for `epix_slide`, one should not assume that the input #' data to contain any rows with `time_value` matching the computation's -#' `.ref_time_value` (accessible via `attributes()$metadata$as_of`); for -#' typical epidemiological surveillance data, observations pertaining to a -#' particular time period (`time_value`) are first reported `as_of` some -#' instant after that time period has ended. +#' `.version`, due to reporting latency; for typical epidemiological +#' surveillance data, observations pertaining to a particular time period +#' (`time_value`) are first reported `as_of` some instant after that time +#' period has ended. No time window completion is performed as in +#' `epi_slide()`. #' 2. The input class and columns are similar but different: `epix_slide` #' (with the default `.all_versions=FALSE`) keeps all columns and the #' `epi_df`-ness of the first argument to each computation; `epi_slide` only #' provides the grouping variables in the second input, and will convert the #' first input into a regular tibble if the grouping variables include the -#' essential `geo_value` column. (With .all_versions=TRUE`, `epix_slide` will +#' essential `geo_value` column. (With `.all_versions=TRUE`, `epix_slide` will #' will provide an `epi_archive` rather than an `epi-df` to each #' computation.) #' 3. The output class and columns are similar but different: `epix_slide()` @@ -726,75 +743,55 @@ epix_detailed_restricted_mutate <- function(.data, ...) { #' computations are allowed more flexibility in their outputs than in #' `epi_slide`, we can't guess a good representation for missing computations #' for excluded group-`.ref_time_value` pairs. -#' 76. The `.versions` default for `epix_slide` is based on making an +#' 6. The `.versions` default for `epix_slide` is based on making an #' evenly-spaced sequence out of the `version`s in the `DT` plus the #' `versions_end`, rather than the `time_value`s. +#' 7. `epix_slide()` computations can refer to the current element of +#' `.versions` as either `.version` or `.ref_time_value`, while `epi_slide()` +#' computations refer to the current element of `.ref_time_values` with +#' `.ref_time_value`. #' #' Apart from the above distinctions, the interfaces between `epix_slide()` and #' `epi_slide()` are the same. #' -#' Furthermore, the current function can be considerably slower than -#' `epi_slide()`, for two reasons: (1) it must repeatedly fetch -#' properly-versioned snapshots from the data archive (via `epix_as_of()`), -#' and (2) it performs a "manual" sliding of sorts, and does not benefit from -#' the highly efficient `slider` package. For this reason, it should never be -#' used in place of `epi_slide()`, and only used when version-aware sliding is -#' necessary (as it its purpose). -#' #' @examples #' library(dplyr) #' -#' # Reference time points for which we want to compute slide values: -#' versions <- seq(as.Date("2020-06-02"), -#' as.Date("2020-06-15"), -#' by = "1 day" -#' ) +#' # Request only a small set of versions, for example's sake: +#' requested_versions <- +#' seq(as.Date("2020-09-02"), as.Date("2020-09-15"), by = "1 day") #' -#' # A simple (but not very useful) example (see the archive vignette for a more -#' # realistic one): +#' # Investigate reporting lag of `percent_cli` signal (though normally we'd +#' # probably work off of the dedicated `revision_summary()` function instead): #' archive_cases_dv_subset %>% -#' group_by(geo_value) %>% #' epix_slide( -#' .f = ~ mean(.x$case_rate_7d_av), -#' .before = 2, -#' .versions = versions, -#' .new_col_name = "case_rate_7d_av_recent_av" -#' ) %>% -#' ungroup() -#' # We requested time windows that started 2 days before the corresponding time -#' # values. The actual number of `time_value`s in each computation depends on -#' # the reporting latency of the signal and `time_value` range covered by the -#' # archive (2020-06-01 -- 2021-11-30 in this example). In this case, we have -#' # * 0 `time_value`s, for ref time 2020-06-01 --> the result is automatically -#' # discarded -#' # * 1 `time_value`, for ref time 2020-06-02 -#' # * 2 `time_value`s, for the rest of the results -#' # * never the 3 `time_value`s we would get from `epi_slide`, since, because -#' # of data latency, we'll never have an observation -#' # `time_value == .ref_time_value` as of `.ref_time_value`. -#' # The example below shows this type of behavior in more detail. -#' -#' # Examining characteristics of the data passed to each computation with -#' # `all_versions=FALSE`. +#' geowide_percent_cli_max_time = max(time_value[!is.na(percent_cli)]), +#' geowide_percent_cli_rpt_lag = .version - geowide_percent_cli_max_time, +#' .versions = requested_versions +#' ) #' archive_cases_dv_subset %>% #' group_by(geo_value) %>% #' epix_slide( -#' function(x, gk, rtv) { -#' tibble( -#' time_range = if (nrow(x) == 0L) { -#' "0 `time_value`s" -#' } else { -#' sprintf("%s -- %s", min(x$time_value), max(x$time_value)) -#' }, -#' n = nrow(x), -#' class1 = class(x)[[1L]] -#' ) -#' }, -#' .before = 5, .all_versions = FALSE, -#' .versions = versions -#' ) %>% -#' ungroup() %>% -#' arrange(geo_value, version) +#' percent_cli_max_time = max(time_value[!is.na(percent_cli)]), +#' percent_cli_rpt_lag = .version - percent_cli_max_time, +#' .versions = requested_versions +#' ) +#' +#' # Backtest a forecaster "pseudoprospectively" (i.e., faithfully with respect +#' # to the data version history): +#' case_death_rate_archive %>% +#' epix_slide( +#' .versions = as.Date(c("2021-10-01", "2021-10-08")), +#' function(x, g, v) { +#' epipredict::arx_forecaster( +#' x, +#' outcome = "death_rate", +#' predictors = c("death_rate_7d_av", "case_rate_7d_av") +#' )$predictions +#' } +#' ) +#' # See `vignette("backtesting", package="epipredict")` for a full walkthrough +#' # on backtesting forecasters, including plots, etc. #' #' # --- Advanced: --- #' diff --git a/man/epix_slide.Rd b/man/epix_slide.Rd index 1d4009cb..06c4972d 100644 --- a/man/epix_slide.Rd +++ b/man/epix_slide.Rd @@ -4,7 +4,7 @@ \alias{epix_slide} \alias{epix_slide.epi_archive} \alias{epix_slide.grouped_epi_archive} -\title{Slide a function over variables in an \code{epi_archive} or \code{grouped_epi_archive}} +\title{Take each requested (group and) version in an archive, run a computation (e.g., forecast)} \usage{ epix_slide( .x, @@ -41,25 +41,31 @@ epix_slide( all data in \code{x} will be treated as part of a single data group.} \item{.f}{Function, formula, or missing; together with \code{...} specifies the -computation to slide. To "slide" means to apply a computation over a -sliding (a.k.a. "rolling") time window for each data group. The window is -determined by the \code{.before} parameter (see details for more). If a -function, \code{.f} must have the form \verb{function(x, g, t, ...)}, where +computation. The computation will be run on each requested group-version +combination, with a time window filter applied if \code{.before} is supplied. \itemize{ -\item "x" is an epi_df with the same column names as the archive's \code{DT}, minus -the \code{version} column -\item "g" is a one-row tibble containing the values of the grouping variables -for the associated group -\item "t" is the ref_time_value for the current window -\item "..." are additional arguments +\item If \code{.f} is a function must have the form \verb{function(x, g, v)} or +\verb{function(x, g, v, )}, where +\itemize{ +\item \code{x} is an \code{epi_df} with the same column names as the archive's \code{DT}, +minus the \code{version} column. (Or, if \code{.all_versions = TRUE}, an +\code{epi_archive} with the requested partial version history.) +\item \code{g} is a one-row tibble containing the values of the grouping variables +for the associated group. +\item \code{v} (length-1) is the associated \code{version} (one of the requested +\code{.versions}) +\item \verb{} are optional; you can add such +arguments to your function and set them by passing them through the +\code{...} argument to \code{epix_slide()}. +} } If a formula, \code{.f} can operate directly on columns accessed via \code{.x$var} or \code{.$var}, as in \code{~ mean (.x$var)} to compute a mean of a column \code{var} for each group-\code{ref_time_value} combination. The group key can be accessed via -\code{.y} or \code{.group_key}, and the reference time value can be accessed via \code{.z} -or \code{.ref_time_value}. If \code{.f} is missing, then \code{...} will specify the -computation.} +\code{.y} or \code{.group_key}, and the reference time value can be accessed via +\code{.z}, \code{.version}, or \code{.ref_time_value}. If \code{.f} is missing, then \code{...} will +specify the computation.} \item{...}{Additional arguments to pass to the function or formula specified via \code{f}. Alternatively, if \code{.f} is missing, then the \code{...} is interpreted @@ -67,33 +73,39 @@ as a \link[rlang:args_data_masking]{"data-masking"} expression or expressions for tidy evaluation; in addition to referring columns directly by name, the expressions have access to \code{.data} and \code{.env} pronouns as in \code{dplyr} verbs, and can also refer to \code{.x} (not the same as the input epi_archive), -\code{.group_key}, and \code{.ref_time_value}. See details for more.} +\code{.group_key} and \code{.version}/\code{.ref_time_value}. See details for more.} -\item{.before}{How many time values before the \code{.ref_time_value} -should each snapshot handed to the function \code{.f} contain? If provided, it -should be a single value that is compatible with the time_type of the -time_value column (more below), but most commonly an integer. This window -endpoint is inclusive. For example, if \code{.before = 7}, \code{time_type} -in the archive is "day", and the \code{.ref_time_value} is January 8, then the -smallest time_value in the snapshot will be January 1. If missing, then the -default is no limit on the time values, so the full snapshot is given.} +\item{.before}{Optional; applies a \code{time_value} filter before running each +computation. The default is not to apply a \code{time_value} filter. If +provided, it should be a single integer or difftime that is compatible with +the time_type of the time_value column. If an integer, then the minimum +possible \code{time_value} included will be that many time steps (according to +the \code{time_type}) before each requested \code{.version}. This window endpoint is +inclusive. For example, if \code{.before = 14}, the \code{time_type} in the archive +is "day", and the requested \code{.version} is January 15, then the smallest +possible \code{time_value} possible in the snapshot will be January 1. Note that +this does not mean that there will be 14 or 15 distinct \code{time_value}s +actually appearing in the data; for most reporting streams, reporting as of +January 15 won't include \code{time_value}s all the way through January 14, due +to reporting latency. Unlike \code{epi_slide()}, \code{epix_slide()} won't fill in +any missing \code{time_values} in this window.} -\item{.versions}{Reference time values / versions for sliding -computations; each element of this vector serves both as the anchor point -for the \code{time_value} window for the computation and the \code{max_version} -\code{epix_as_of} which we fetch data in this window. If missing, then this will -set to a regularly-spaced sequence of values set to cover the range of -\code{version}s in the \code{DT} plus the \code{versions_end}; the spacing of values will -be guessed (using the GCD of the skips between values).} +\item{.versions}{Requested versions on which to run the computation. Each +requested \code{.version} also serves as the anchor point around which for which +the \code{time_value} window specified by \code{.before} is drawn. If \code{.versions} is +missing, it will be set to a regularly-spaced sequence of values set to +cover the range of \code{version}s in the \code{DT} plus the \code{versions_end}; the +spacing of values will be guessed (using the GCD of the skips between +values).} \item{.new_col_name}{Either \code{NULL} or a string indicating the name of the new column that will contain the derived values. The default, \code{NULL}, will use the name "slide_value" unless your slide computations output data frames, -in which case they will be unpacked into the constituent columns and those -names used. If the resulting column name(s) overlap with the column names -used for labeling the computations, which are \code{group_vars(x)} and -\code{"version"}, then the values for these columns must be identical to the -labels we assign.} +in which case they will be unpacked into the constituent columns and the +data frame's column names will be used instead. If the resulting column +name(s) overlap with the column names used for labeling the computations, +which are \code{group_vars(x)} and \code{"version"}, then the values for these +columns must be identical to the labels we assign.} \item{.all_versions}{(Not the same as \code{.all_rows} parameter of \code{epi_slide}.) If \code{.all_versions = TRUE}, then the slide computation will be passed the @@ -110,28 +122,32 @@ computation, and a column named according to the \code{.new_col_name} argument, containing the slide values. It will be grouped by the grouping variables. } \description{ -Slides a given function over variables in an \code{epi_archive} object. This -behaves similarly to \code{epi_slide()}, with the key exception that it is -version-aware: the sliding computation at any given reference time t is -performed on \strong{data that would have been available as of t}. This function -is intended for use in accurate backtesting of models; see +... and collect the results. This is useful for more accurately simulating +how a forecaster, nowcaster, or other algorithm would have behaved in real +time, factoring in reporting latency and data revisions; see \href{https://cmu-delphi.github.io/epipredict/articles/backtesting.html}{\code{vignette("backtesting", package="epipredict")}} for a walkthrough. } \details{ +This is similar to looping over versions and calling \code{\link{epix_as_of}}, but has +some conveniences such as working naturally with \code{\link{grouped_epi_archive}}s, +optional time windowing, and syntactic sugar to make things shorter to write. + A few key distinctions between the current function and \code{epi_slide()}: \enumerate{ \item In \code{.f} functions for \code{epix_slide}, one should not assume that the input data to contain any rows with \code{time_value} matching the computation's -\code{.ref_time_value} (accessible via \verb{attributes()$metadata$as_of}); for -typical epidemiological surveillance data, observations pertaining to a -particular time period (\code{time_value}) are first reported \code{as_of} some -instant after that time period has ended. +\code{.version}, due to reporting latency; for typical epidemiological +surveillance data, observations pertaining to a particular time period +(\code{time_value}) are first reported \code{as_of} some instant after that time +period has ended. No time window completion is performed as in +\code{epi_slide()}. \item The input class and columns are similar but different: \code{epix_slide} (with the default \code{.all_versions=FALSE}) keeps all columns and the \code{epi_df}-ness of the first argument to each computation; \code{epi_slide} only provides the grouping variables in the second input, and will convert the first input into a regular tibble if the grouping variables include the -essential \code{geo_value} column. (With .all_versions=TRUE\verb{, }epix_slide\verb{will will provide an}epi_archive\verb{rather than an}epi-df` to each +essential \code{geo_value} column. (With \code{.all_versions=TRUE}, \code{epix_slide} will +will provide an \code{epi_archive} rather than an \code{epi-df} to each computation.) \item The output class and columns are similar but different: \code{epix_slide()} returns a tibble containing only the grouping variables, \code{time_value}, and @@ -153,73 +169,53 @@ for excluded group-\code{.ref_time_value} pairs. \item The \code{.versions} default for \code{epix_slide} is based on making an evenly-spaced sequence out of the \code{version}s in the \code{DT} plus the \code{versions_end}, rather than the \code{time_value}s. +\item \code{epix_slide()} computations can refer to the current element of +\code{.versions} as either \code{.version} or \code{.ref_time_value}, while \code{epi_slide()} +computations refer to the current element of \code{.ref_time_values} with +\code{.ref_time_value}. } Apart from the above distinctions, the interfaces between \code{epix_slide()} and \code{epi_slide()} are the same. - -Furthermore, the current function can be considerably slower than -\code{epi_slide()}, for two reasons: (1) it must repeatedly fetch -properly-versioned snapshots from the data archive (via \code{epix_as_of()}), -and (2) it performs a "manual" sliding of sorts, and does not benefit from -the highly efficient \code{slider} package. For this reason, it should never be -used in place of \code{epi_slide()}, and only used when version-aware sliding is -necessary (as it its purpose). } \examples{ library(dplyr) -# Reference time points for which we want to compute slide values: -versions <- seq(as.Date("2020-06-02"), - as.Date("2020-06-15"), - by = "1 day" -) +# Request only a small set of versions, for example's sake: +requested_versions <- + seq(as.Date("2020-09-02"), as.Date("2020-09-15"), by = "1 day") -# A simple (but not very useful) example (see the archive vignette for a more -# realistic one): +# Investigate reporting lag of `percent_cli` signal (though normally we'd +# probably work off of the dedicated `revision_summary()` function instead): archive_cases_dv_subset \%>\% - group_by(geo_value) \%>\% epix_slide( - .f = ~ mean(.x$case_rate_7d_av), - .before = 2, - .versions = versions, - .new_col_name = "case_rate_7d_av_recent_av" - ) \%>\% - ungroup() -# We requested time windows that started 2 days before the corresponding time -# values. The actual number of `time_value`s in each computation depends on -# the reporting latency of the signal and `time_value` range covered by the -# archive (2020-06-01 -- 2021-11-30 in this example). In this case, we have -# * 0 `time_value`s, for ref time 2020-06-01 --> the result is automatically -# discarded -# * 1 `time_value`, for ref time 2020-06-02 -# * 2 `time_value`s, for the rest of the results -# * never the 3 `time_value`s we would get from `epi_slide`, since, because -# of data latency, we'll never have an observation -# `time_value == .ref_time_value` as of `.ref_time_value`. -# The example below shows this type of behavior in more detail. - -# Examining characteristics of the data passed to each computation with -# `all_versions=FALSE`. + geowide_percent_cli_max_time = max(time_value[!is.na(percent_cli)]), + geowide_percent_cli_rpt_lag = .version - geowide_percent_cli_max_time, + .versions = requested_versions + ) archive_cases_dv_subset \%>\% group_by(geo_value) \%>\% epix_slide( - function(x, gk, rtv) { - tibble( - time_range = if (nrow(x) == 0L) { - "0 `time_value`s" - } else { - sprintf("\%s -- \%s", min(x$time_value), max(x$time_value)) - }, - n = nrow(x), - class1 = class(x)[[1L]] - ) - }, - .before = 5, .all_versions = FALSE, - .versions = versions - ) \%>\% - ungroup() \%>\% - arrange(geo_value, version) + percent_cli_max_time = max(time_value[!is.na(percent_cli)]), + percent_cli_rpt_lag = .version - percent_cli_max_time, + .versions = requested_versions + ) + +# Backtest a forecaster "pseudoprospectively" (i.e., faithfully with respect +# to the data version history): +case_death_rate_archive \%>\% + epix_slide( + .versions = as.Date(c("2021-10-01", "2021-10-08")), + function(x, g, v) { + epipredict::arx_forecaster( + x, + outcome = "death_rate", + predictors = c("death_rate_7d_av", "case_rate_7d_av") + )$predictions + } + ) +# See `vignette("backtesting", package="epipredict")` for a full walkthrough +# on backtesting forecasters, including plots, etc. # --- Advanced: --- From 4a3eafc746dfd30caa161cb9a984b310b8f8ff8b Mon Sep 17 00:00:00 2001 From: "Logan C. Brooks" Date: Thu, 13 Mar 2025 10:32:19 -0700 Subject: [PATCH 06/16] docs(epi_slide): use "subpopulation" to help clarify grouping --- R/slide.R | 20 ++++++++++---------- man/epi_slide.Rd | 20 ++++++++++---------- 2 files changed, 20 insertions(+), 20 deletions(-) diff --git a/R/slide.R b/R/slide.R index 017359cf..967f880f 100644 --- a/R/slide.R +++ b/R/slide.R @@ -19,16 +19,16 @@ #' #' @template basic-slide-params #' @param .f,... The computation to slide. The input will be a time window of -#' the data for a single `geo_value` --- or a single combination of -#' `geo_value` and any [`other_keys`][as_epi_df] you used to specify -#' demographical breakdowns. The input will always have the same size, -#' determined by `.window_size`, and will fill in any missing `time_values`, -#' using `NA` values for missing measurements. The output should be a scalar -#' value or a 1-row data frame; these outputs will be collected and form a new -#' column or columns in the `epi_slide()` result. Data frame outputs will be -#' unpacked into multiple columns in the result by default, or -#' [`tidyr::pack`]ed into a single data-frame-type column if you provide a -#' name for such a column (e.g., via `.new_col_name`). +#' the data for a single subpopulation (i.e., a single `geo_value` and single +#' value for any [`other_keys`][as_epi_df] you set up for age groups, etc.). +#' The input will always have the same size, determined by `.window_size`, and +#' will fill in any missing `time_values`, using `NA` values for missing +#' measurements. The output should be a scalar value or a 1-row data frame; +#' these outputs will be collected and form a new column or columns in the +#' `epi_slide()` result. Data frame outputs will be unpacked into multiple +#' columns in the result by default, or [`tidyr::pack`]ed into a single +#' data-frame-type column if you provide a name for such a column (e.g., via +#' `.new_col_name`). #' #' You can specify the computation in one of the following ways: #' diff --git a/man/epi_slide.Rd b/man/epi_slide.Rd index 1146da31..83d1cc61 100644 --- a/man/epi_slide.Rd +++ b/man/epi_slide.Rd @@ -21,16 +21,16 @@ and any columns in \code{other_keys}. If grouped, we make sure the grouping is b \code{geo_value} and \code{other_keys}.} \item{.f, ...}{The computation to slide. The input will be a time window of -the data for a single \code{geo_value} --- or a single combination of -\code{geo_value} and any \code{\link[=as_epi_df]{other_keys}} you used to specify -demographical breakdowns. The input will always have the same size, -determined by \code{.window_size}, and will fill in any missing \code{time_values}, -using \code{NA} values for missing measurements. The output should be a scalar -value or a 1-row data frame; these outputs will be collected and form a new -column or columns in the \code{epi_slide()} result. Data frame outputs will be -unpacked into multiple columns in the result by default, or -\code{\link[tidyr:pack]{tidyr::pack}}ed into a single data-frame-type column if you provide a -name for such a column (e.g., via \code{.new_col_name}). +the data for a single subpopulation (i.e., a single \code{geo_value} and single +value for any \code{\link[=as_epi_df]{other_keys}} you set up for age groups, etc.). +The input will always have the same size, determined by \code{.window_size}, and +will fill in any missing \code{time_values}, using \code{NA} values for missing +measurements. The output should be a scalar value or a 1-row data frame; +these outputs will be collected and form a new column or columns in the +\code{epi_slide()} result. Data frame outputs will be unpacked into multiple +columns in the result by default, or \code{\link[tidyr:pack]{tidyr::pack}}ed into a single +data-frame-type column if you provide a name for such a column (e.g., via +\code{.new_col_name}). You can specify the computation in one of the following ways: \itemize{ From 03bf732fb1b295bc76b664ab77d592178720971a Mon Sep 17 00:00:00 2001 From: "Logan C. Brooks" Date: Thu, 13 Mar 2025 11:37:56 -0700 Subject: [PATCH 07/16] docs(epi_slide_opt desc): cast as main time slide, +NA behavior, edits --- R/slide.R | 20 +++++++++++++++----- man/epi_slide_opt.Rd | 20 +++++++++++++++----- 2 files changed, 30 insertions(+), 10 deletions(-) diff --git a/R/slide.R b/R/slide.R index 967f880f..ae82816f 100644 --- a/R/slide.R +++ b/R/slide.R @@ -573,12 +573,22 @@ get_before_after_from_window <- function(window_size, align, time_type) { list(before = before, after = after) } -#' Optimized slide functions for common cases +#' Calculate rolling or running means, sums, etc., or custom calculations #' -#' @description `epi_slide_opt` allows sliding an n-timestep [data.table::froll] -#' or [slider::summary-slide] function over variables in an `epi_df` object. -#' These functions tend to be much faster than `epi_slide()`. See -#' `vignette("epi_df")` for more examples. +#' @description These methods take each subpopulation (i.e., a single +#' `geo_value` and combination of any `other_keys` you set up for age groups, +#' etc.) and perform a `.window_size`-width time window rolling/sliding +#' computation, or alternatively, a running/cumulative computation (with +#' `.window_size = Inf`) on the requested columns. Explicit `NA` measurements +#' are temporarily added to fill in any time gaps, and, for rolling +#' computations, to pad the time series to ensure that the first & last +#' computations are over exactly `.window_size` values. +#' +#' `epi_slide_opt` allows you to use any [data.table::froll] or +#' [slider::summary-slide] function. If none of the specialized functions here +#' work, you can use `data.table::frollapply` with your own function. See +#' [`epi_slide`] if you need to work with multiple columns at once or output a +#' custom type. #' #' @template basic-slide-params #' @param .col_names <[`tidy-select`][dplyr_tidy_select]> An unquoted column diff --git a/man/epi_slide_opt.Rd b/man/epi_slide_opt.Rd index 4b75e9ff..c33a4208 100644 --- a/man/epi_slide_opt.Rd +++ b/man/epi_slide_opt.Rd @@ -4,7 +4,7 @@ \alias{epi_slide_opt} \alias{epi_slide_mean} \alias{epi_slide_sum} -\title{Optimized slide functions for common cases} +\title{Calculate rolling or running means, sums, etc., or custom calculations} \usage{ epi_slide_opt( .x, @@ -134,10 +134,20 @@ added. It will be ungrouped if \code{.x} was ungrouped, and have the same groups as \code{.x} if \code{.x} was grouped. } \description{ -\code{epi_slide_opt} allows sliding an n-timestep \link[data.table:froll]{data.table::froll} -or \link[slider:summary-slide]{slider::summary-slide} function over variables in an \code{epi_df} object. -These functions tend to be much faster than \code{epi_slide()}. See -\code{vignette("epi_df")} for more examples. +These methods take each subpopulation (i.e., a single +\code{geo_value} and combination of any \code{other_keys} you set up for age groups, +etc.) and perform a \code{.window_size}-width time window rolling/sliding +computation, or alternatively, a running/cumulative computation (with +\code{.window_size = Inf}) on the requested columns. Explicit \code{NA} measurements +are temporarily added to fill in any time gaps, and, for rolling +computations, to pad the time series to ensure that the first & last +computations are over exactly \code{.window_size} values. + +\code{epi_slide_opt} allows you to use any \link[data.table:froll]{data.table::froll} or +\link[slider:summary-slide]{slider::summary-slide} function. If none of the specialized functions here +work, you can use \code{data.table::frollapply} with your own function. See +\code{\link{epi_slide}} if you need to work with multiple columns at once or output a +custom type. \code{epi_slide_mean} is a wrapper around \code{epi_slide_opt} with \code{.f = data.table::frollmean}. From 68d6c573338e2df3931c53dddf5e06e3ab29428d Mon Sep 17 00:00:00 2001 From: "Logan C. Brooks" Date: Thu, 13 Mar 2025 12:42:57 -0700 Subject: [PATCH 08/16] docs(epix_slide): remove inaccurate + misformatted `.version - before` Inaccurate in that `before` isn't `.before` and `-` isn't `time_type`-aware. Misformatted in that `-` was interpreted as a bullet point. --- R/methods-epi_archive.R | 8 ++++---- man/epix_slide.Rd | 9 ++++----- 2 files changed, 8 insertions(+), 9 deletions(-) diff --git a/R/methods-epi_archive.R b/R/methods-epi_archive.R index d6e2815e..42a90f62 100644 --- a/R/methods-epi_archive.R +++ b/R/methods-epi_archive.R @@ -701,10 +701,10 @@ epix_detailed_restricted_mutate <- function(.data, ...) { #' columns must be identical to the labels we assign. #' @param .all_versions (Not the same as `.all_rows` parameter of `epi_slide`.) #' If `.all_versions = TRUE`, then the slide computation will be passed the -#' version history (all `version <= .version` where `.version` is one of the -#' requested `.versions`) for rows having a `time_value` of at least `.version -#' - before`. Otherwise, the slide computation will be passed only the most -#' recent `version` for every unique `time_value`. Default is `FALSE`. +#' version history (all versions `<= .version` where `.version` is one of the +#' requested `.version`s), in `epi_archive` format. Otherwise, the slide +#' computation will be passed only the most recent `version` for every unique +#' `time_value`, in `epi_df` format. Default is `FALSE`. #' @return A tibble whose columns are: the grouping variables (if any), #' `time_value`, containing the reference time values for the slide #' computation, and a column named according to the `.new_col_name` argument, diff --git a/man/epix_slide.Rd b/man/epix_slide.Rd index 06c4972d..6ab2ad26 100644 --- a/man/epix_slide.Rd +++ b/man/epix_slide.Rd @@ -109,11 +109,10 @@ columns must be identical to the labels we assign.} \item{.all_versions}{(Not the same as \code{.all_rows} parameter of \code{epi_slide}.) If \code{.all_versions = TRUE}, then the slide computation will be passed the -version history (all \code{version <= .version} where \code{.version} is one of the -requested \code{.versions}) for rows having a \code{time_value} of at least `.version -\itemize{ -\item before\verb{. Otherwise, the slide computation will be passed only the most recent }version\verb{for every unique}time_value\verb{. Default is }FALSE`. -}} +version history (all versions \verb{<= .version} where \code{.version} is one of the +requested \code{.version}s), in \code{epi_archive} format. Otherwise, the slide +computation will be passed only the most recent \code{version} for every unique +\code{time_value}, in \code{epi_df} format. Default is \code{FALSE}.} } \value{ A tibble whose columns are: the grouping variables (if any), From 01ab5f4095e6f45dd59b492dc973fe9c94eff070 Mon Sep 17 00:00:00 2001 From: "Logan C. Brooks" Date: Thu, 13 Mar 2025 14:18:27 -0700 Subject: [PATCH 09/16] docs(epix_slide): remove duplicated word --- R/methods-epi_archive.R | 2 +- man/epix_slide.Rd | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/R/methods-epi_archive.R b/R/methods-epi_archive.R index 42a90f62..76a1ab17 100644 --- a/R/methods-epi_archive.R +++ b/R/methods-epi_archive.R @@ -723,7 +723,7 @@ epix_detailed_restricted_mutate <- function(.data, ...) { #' `epi_df`-ness of the first argument to each computation; `epi_slide` only #' provides the grouping variables in the second input, and will convert the #' first input into a regular tibble if the grouping variables include the -#' essential `geo_value` column. (With `.all_versions=TRUE`, `epix_slide` will +#' essential `geo_value` column. (With `.all_versions=TRUE`, `epix_slide` #' will provide an `epi_archive` rather than an `epi-df` to each #' computation.) #' 3. The output class and columns are similar but different: `epix_slide()` diff --git a/man/epix_slide.Rd b/man/epix_slide.Rd index 6ab2ad26..601d5811 100644 --- a/man/epix_slide.Rd +++ b/man/epix_slide.Rd @@ -145,7 +145,7 @@ period has ended. No time window completion is performed as in \code{epi_df}-ness of the first argument to each computation; \code{epi_slide} only provides the grouping variables in the second input, and will convert the first input into a regular tibble if the grouping variables include the -essential \code{geo_value} column. (With \code{.all_versions=TRUE}, \code{epix_slide} will +essential \code{geo_value} column. (With \code{.all_versions=TRUE}, \code{epix_slide} will provide an \code{epi_archive} rather than an \code{epi-df} to each computation.) \item The output class and columns are similar but different: \code{epix_slide()} From 576e97d74d14f2d4887b8b90992fa08f3d1958f4 Mon Sep 17 00:00:00 2001 From: "Logan C. Brooks" Date: Mon, 17 Mar 2025 12:11:47 -0700 Subject: [PATCH 10/16] docs(epi_slide): iterate on intro, examples, motivation --- R/slide.R | 132 ++++++++++++++++++++++++++++--------------- man/epi_slide.Rd | 123 ++++++++++++++++++++++++++-------------- man/epi_slide_opt.Rd | 6 +- 3 files changed, 171 insertions(+), 90 deletions(-) diff --git a/R/slide.R b/R/slide.R index ae82816f..863f6812 100644 --- a/R/slide.R +++ b/R/slide.R @@ -1,19 +1,27 @@ #' More general form of [`epi_slide_opt`] for rolling/running computations #' -#' Check first whether you can use [`epi_slide_mean`], [`epi_slide_sum`], or the -#' medium-generality [`epi_slide_opt`] instead, as they are faster and more -#' convenient to use. You typically only need to use `epi_slide()` if you have a -#' computation that depends on multiple columns simultaneously, outputs multiple -#' columns simultaneously, or produces non-numeric output. +#' Most rolling/running computations can be handled by [`epi_slide_mean`], +#' [`epi_slide_sum`], or the medium-generality [`epi_slide_opt`] functions +#' instead, which are much faster. You typically only need to consider +#' `epi_slide()` if you have a computation that depends on multiple columns +#' simultaneously, outputs multiple columns simultaneously, or produces +#' non-numeric output. For example, this computation depends on multiple +#' columns: #' #' ``` -#' # Create new column `cases_7dmed` that contains a 7-day trailing median of cases -#' epi_slide(edf, cases_7dmed = median(cases), .window_size = 7) +#' cases_deaths_subset %>% +#' epi_slide( +#' cfr_estimate_v0 = death_rate_7d_av[[22]]/case_rate_7d_av[[1]], +#' .window_size = 22 +#' ) %>% +#' print(n = 30) #' ``` #' -#' For two very common use cases, we provide optimized functions that are much -#' faster than `epi_slide`: `epi_slide_mean()` and `epi_slide_sum()`. We -#' recommend using these functions when possible. +#' (Here, the value 22 was selected using `epi_cor()` and averaging across +#' `geo_value`s. See +#' \href{https://www.medrxiv.org/content/10.1101/2024.12.27.24319518v1}{this +#' manuscript}{this manuscript} for some warnings & information using similar +#' types of CFR estimators.) #' #' See `vignette("epi_df")` for more examples. #' @@ -34,15 +42,19 @@ #' #' - Don't provide `.f`, and instead use use one or more #' [`dplyr::summarize`]-esque ["data-masking"][rlang::args_data_masking] -#' expressions in `...`, e.g., `cases_7dmed = median(cases)`. This is usually -#' the most convenient way to use `epi_slide`. See examples. +#' expressions in `...`, e.g., `cfr_estimate_v0 = +#' death_rate_7d_av[[22]]/case_rate_7d_av[[1]]`. This way is sometimes more +#' convenient, but also has the most computational overhead. #' -#' - Provide a formula in `.f`, e.g., `~ median(.x$cases)`. In this formula, -#' `.x` is an `epi_df` containing data for a single time window as described -#' above, taken from the original `.x` fed into `epi_slide()`. +#' - Provide a formula in `.f`, e.g., `~ +#' .x$death_rate_7d_av[[22]]/.x$case_rate_7d_av[[1]]`. In this formula, `.x` +#' is an `epi_df` containing data for a single time window as described above, +#' taken from the original `.x` fed into `epi_slide()`. #' -#' - Provide a function in `.f`. The function should be of the form `function(x, -#' g, t)` or `function(x, g, t, )`, where: +#' - Provide a function in `.f`, e.g., `function(x, g, t) +#' x$death_rate_7d_av[[22]]/x$case_rate_7d_av[[1]]`. The function should be of +#' the form `function(x, g, t)` or `function(x, g, t, )`, where: #' #' - `x` is a data frame with the same column names as the original object, #' minus any grouping variables, with only the windowed data for one @@ -60,7 +72,8 @@ #' The values of `g` and `t` are also available to data-masking expression and #' formula-based computations as `.group_key` and `.ref_time_value`, #' respectively. Formula computations also let you use `.y` or `.z`, -#' respectively. +#' respectively, as additional names for these same quantities (similar to +#' [`dplyr::group_modify`]). #' #' @param ... Additional arguments to pass to the function or formula specified #' via `.f`. Alternatively, if `.f` is missing, then the `...` is interpreted @@ -73,6 +86,26 @@ #' be given names that clash with the existing columns of `.x`. #' #' @details +#' +#' ## Motivation and lower-level alternatives +#' +#' `epi_slide()` is focused on preventing errors and providing a convenient +#' interface. If you need computational speed, many computations can be optimized +#' by one of the following: +#' +#' * Performing core sliding operations with `epi_slide_opt()` with +#' `frollapply`, and using potentially-grouped `mutate()`s to transform or +#' combine the results. +#' +#' * Grouping by `geo_value` and any `other_keys`; [`complete()`]ing with +#' `full_seq()` to fill in time gaps; `arrange()`ing by `time_value`s within +#' each group; using `mutate()` with vectorized operations and shift operators +#' like `dplyr::lead()` and `dplyr::lag()` to perform the core operations, +#' being careful to give the desired results for the least and most recent +#' `time_value`s (often `NA`s for the least recent); ungrouping; and +#' `filter()`ing back down to only rows that existed before the `complete()` +#' stage if necessary. +#' #' ## Advanced uses of `.f` via tidy evaluation #' #' If specifying `.f` via tidy evaluation, in addition to the standard [`.data`] @@ -96,34 +129,43 @@ #' @examples #' library(dplyr) #' -#' # Get the 7-day trailing standard deviation of cases and the 7-day trailing mean of cases -#' cases_deaths_subset %>% +#' # Generate some simple time-varying CFR estimates: +#' with_cfr_estimates <- cases_deaths_subset %>% #' epi_slide( -#' cases_7sd = sd(cases, na.rm = TRUE), -#' cases_7dav = mean(cases, na.rm = TRUE), -#' .window_size = 7 -#' ) %>% -#' select(geo_value, time_value, cases, cases_7sd, cases_7dav) -#' # Note that epi_slide_mean could be used to more quickly calculate cases_7dav. +#' cfr_estimate_v0 = death_rate_7d_av[[22]] / case_rate_7d_av[[1]], +#' .window_size = 22 +#' ) +#' with_cfr_estimates %>% +#' print(n = 30) +#' # (Here, the value 22 was selected using `epi_cor()` and averaging across +#' # `geo_value`s. See +#' # https://www.medrxiv.org/content/10.1101/2024.12.27.24319518v1 for some +#' # warnings & information using CFR estimators along these lines.) #' -#' # In addition to the [`dplyr::mutate`]-like syntax, you can feed in a function or -#' # formula in a way similar to [`dplyr::group_modify`]: -#' my_summarizer <- function(window_data) { -#' window_data %>% -#' summarize( -#' cases_7sd = sd(cases, na.rm = TRUE), -#' cases_7dav = mean(cases, na.rm = TRUE) -#' ) +#' # In addition to the [`dplyr::mutate`]-like syntax, you can feed in a +#' # function or formula in a way similar to [`dplyr::group_modify`]; these +#' # often run much more quickly: +#' my_computation <- function(window_data) { +#' tibble( +#' cfr_estimate_v0 = window_data$death_rate_7d_av[[nrow(window_data)]] / +#' window_data$case_rate_7d_av[[1]] +#' ) #' } -#' cases_deaths_subset %>% +#' with_cfr_estimates2 <- cases_deaths_subset %>% #' epi_slide( -#' ~ my_summarizer(.x), -#' .window_size = 7 -#' ) %>% -#' select(geo_value, time_value, cases, cases_7sd, cases_7dav) -#' -#' -#' +#' ~ my_computation(.x), +#' .window_size = 22 +#' ) +#' with_cfr_estimates3 <- cases_deaths_subset %>% +#' epi_slide( +#' function(window_data, g, t) { +#' tibble( +#' cfr_estimate_v0 = window_data$death_rate_7d_av[[nrow(window_data)]] / +#' window_data$case_rate_7d_av[[1]] +#' ) +#' }, +#' .window_size = 22 +#' ) #' #' #' #### Advanced: #### @@ -586,9 +628,9 @@ get_before_after_from_window <- function(window_size, align, time_type) { #' #' `epi_slide_opt` allows you to use any [data.table::froll] or #' [slider::summary-slide] function. If none of the specialized functions here -#' work, you can use `data.table::frollapply` with your own function. See -#' [`epi_slide`] if you need to work with multiple columns at once or output a -#' custom type. +#' work, you can use `data.table::frollapply` together with a non-rolling +#' function (e.g., `median`). See [`epi_slide`] if you need to work with +#' multiple columns at once or output a custom type. #' #' @template basic-slide-params #' @param .col_names <[`tidy-select`][dplyr_tidy_select]> An unquoted column diff --git a/man/epi_slide.Rd b/man/epi_slide.Rd index 83d1cc61..f8860555 100644 --- a/man/epi_slide.Rd +++ b/man/epi_slide.Rd @@ -36,12 +36,13 @@ You can specify the computation in one of the following ways: \itemize{ \item Don't provide \code{.f}, and instead use use one or more \code{\link[dplyr:summarise]{dplyr::summarize}}-esque \link[rlang:args_data_masking]{"data-masking"} -expressions in \code{...}, e.g., \code{cases_7dmed = median(cases)}. This is usually -the most convenient way to use \code{epi_slide}. See examples. -\item Provide a formula in \code{.f}, e.g., \code{~ median(.x$cases)}. In this formula, -\code{.x} is an \code{epi_df} containing data for a single time window as described -above, taken from the original \code{.x} fed into \code{epi_slide()}. -\item Provide a function in \code{.f}. The function should be of the form \verb{function(x, g, t)} or \verb{function(x, g, t, )}, where: +expressions in \code{...}, e.g., \code{cfr_estimate_v0 = death_rate_7d_av[[22]]/case_rate_7d_av[[1]]}. This way is sometimes more +convenient, but also has the most computational overhead. +\item Provide a formula in \code{.f}, e.g., \code{~ .x$death_rate_7d_av[[22]]/.x$case_rate_7d_av[[1]]}. In this formula, \code{.x} +is an \code{epi_df} containing data for a single time window as described above, +taken from the original \code{.x} fed into \code{epi_slide()}. +\item Provide a function in \code{.f}, e.g., \code{function(x, g, t) x$death_rate_7d_av[[22]]/x$case_rate_7d_av[[1]]}. The function should be of +the form \verb{function(x, g, t)} or \verb{function(x, g, t, )}, where: \itemize{ \item \code{x} is a data frame with the same column names as the original object, minus any grouping variables, with only the windowed data for one @@ -56,7 +57,8 @@ argument to \code{epi_slide()}. The values of \code{g} and \code{t} are also available to data-masking expression and formula-based computations as \code{.group_key} and \code{.ref_time_value}, respectively. Formula computations also let you use \code{.y} or \code{.z}, -respectively. +respectively, as additional names for these same quantities (similar to +\code{\link[dplyr:group_map]{dplyr::group_modify}}). }} \item{...}{Additional arguments to pass to the function or formula specified @@ -110,22 +112,50 @@ added. It will be ungrouped if \code{.x} was ungrouped, and have the same groups as \code{.x} if \code{.x} was grouped. } \description{ -Check first whether you can use \code{\link{epi_slide_mean}}, \code{\link{epi_slide_sum}}, or the -medium-generality \code{\link{epi_slide_opt}} instead, as they are faster and more -convenient to use. You typically only need to use \code{epi_slide()} if you have a -computation that depends on multiple columns simultaneously, outputs multiple -columns simultaneously, or produces non-numeric output. +Most rolling/running computations can be handled by \code{\link{epi_slide_mean}}, +\code{\link{epi_slide_sum}}, or the medium-generality \code{\link{epi_slide_opt}} functions +instead, which are much faster. You typically only need to consider +\code{epi_slide()} if you have a computation that depends on multiple columns +simultaneously, outputs multiple columns simultaneously, or produces +non-numeric output. For example, this computation depends on multiple +columns: } \details{ -\if{html}{\out{
}}\preformatted{# Create new column `cases_7dmed` that contains a 7-day trailing median of cases -epi_slide(edf, cases_7dmed = median(cases), .window_size = 7) +\if{html}{\out{
}}\preformatted{cases_deaths_subset \%>\% + epi_slide( + cfr_estimate_v0 = death_rate_7d_av[[22]]/case_rate_7d_av[[1]], + .window_size = 22 + ) \%>\% + print(n = 30) }\if{html}{\out{
}} -For two very common use cases, we provide optimized functions that are much -faster than \code{epi_slide}: \code{epi_slide_mean()} and \code{epi_slide_sum()}. We -recommend using these functions when possible. +(Here, the value 22 was selected using \code{epi_cor()} and averaging across +\code{geo_value}s. See +\href{https://www.medrxiv.org/content/10.1101/2024.12.27.24319518v1}{this +manuscript}{this manuscript} for some warnings & information using similar +types of CFR estimators.) See \code{vignette("epi_df")} for more examples. +\subsection{Motivation and lower-level alternatives}{ + +\code{epi_slide()} is focused on preventing errors and providing a convenient +interface. If you need computational speed, many computations can be optimized +by one of the following: +\itemize{ +\item Performing core sliding operations with \code{epi_slide_opt()} with +\code{frollapply}, and using potentially-grouped \code{mutate()}s to transform or +combine the results. +\item Grouping by \code{geo_value} and any \code{other_keys}; \code{\link[=complete]{complete()}}ing with +\code{full_seq()} to fill in time gaps; \code{arrange()}ing by \code{time_value}s within +each group; using \code{mutate()} with vectorized operations and shift operators +like \code{dplyr::lead()} and \code{dplyr::lag()} to perform the core operations, +being careful to give the desired results for the least and most recent +\code{time_value}s (often \code{NA}s for the least recent); ungrouping; and +\code{filter()}ing back down to only rows that existed before the \code{complete()} +stage if necessary. +} +} + \subsection{Advanced uses of \code{.f} via tidy evaluation}{ If specifying \code{.f} via tidy evaluation, in addition to the standard \code{\link{.data}} @@ -146,34 +176,43 @@ determined the time window for the current computation. \examples{ library(dplyr) -# Get the 7-day trailing standard deviation of cases and the 7-day trailing mean of cases -cases_deaths_subset \%>\% +# Generate some simple time-varying CFR estimates: +with_cfr_estimates <- cases_deaths_subset \%>\% epi_slide( - cases_7sd = sd(cases, na.rm = TRUE), - cases_7dav = mean(cases, na.rm = TRUE), - .window_size = 7 - ) \%>\% - select(geo_value, time_value, cases, cases_7sd, cases_7dav) -# Note that epi_slide_mean could be used to more quickly calculate cases_7dav. - -# In addition to the [`dplyr::mutate`]-like syntax, you can feed in a function or -# formula in a way similar to [`dplyr::group_modify`]: -my_summarizer <- function(window_data) { - window_data \%>\% - summarize( - cases_7sd = sd(cases, na.rm = TRUE), - cases_7dav = mean(cases, na.rm = TRUE) - ) + cfr_estimate_v0 = death_rate_7d_av[[22]] / case_rate_7d_av[[1]], + .window_size = 22 + ) +with_cfr_estimates \%>\% + print(n = 30) +# (Here, the value 22 was selected using `epi_cor()` and averaging across +# `geo_value`s. See +# https://www.medrxiv.org/content/10.1101/2024.12.27.24319518v1 for some +# warnings & information using CFR estimators along these lines.) + +# In addition to the [`dplyr::mutate`]-like syntax, you can feed in a +# function or formula in a way similar to [`dplyr::group_modify`]; these +# often run much more quickly: +my_computation <- function(window_data) { + tibble( + cfr_estimate_v0 = window_data$death_rate_7d_av[[nrow(window_data)]] / + window_data$case_rate_7d_av[[1]] + ) } -cases_deaths_subset \%>\% +with_cfr_estimates2 <- cases_deaths_subset \%>\% epi_slide( - ~ my_summarizer(.x), - .window_size = 7 - ) \%>\% - select(geo_value, time_value, cases, cases_7sd, cases_7dav) - - - + ~ my_computation(.x), + .window_size = 22 + ) +with_cfr_estimates3 <- cases_deaths_subset \%>\% + epi_slide( + function(window_data, g, t) { + tibble( + cfr_estimate_v0 = window_data$death_rate_7d_av[[nrow(window_data)]] / + window_data$case_rate_7d_av[[1]] + ) + }, + .window_size = 22 + ) #### Advanced: #### diff --git a/man/epi_slide_opt.Rd b/man/epi_slide_opt.Rd index c33a4208..687d2ac0 100644 --- a/man/epi_slide_opt.Rd +++ b/man/epi_slide_opt.Rd @@ -145,9 +145,9 @@ computations are over exactly \code{.window_size} values. \code{epi_slide_opt} allows you to use any \link[data.table:froll]{data.table::froll} or \link[slider:summary-slide]{slider::summary-slide} function. If none of the specialized functions here -work, you can use \code{data.table::frollapply} with your own function. See -\code{\link{epi_slide}} if you need to work with multiple columns at once or output a -custom type. +work, you can use \code{data.table::frollapply} together with a non-rolling +function (e.g., \code{median}). See \code{\link{epi_slide}} if you need to work with +multiple columns at once or output a custom type. \code{epi_slide_mean} is a wrapper around \code{epi_slide_opt} with \code{.f = data.table::frollmean}. From 8ac32e54f0e2c4c12af381c827863c49cb818033 Mon Sep 17 00:00:00 2001 From: "Logan C. Brooks" Date: Mon, 17 Mar 2025 13:00:31 -0700 Subject: [PATCH 11/16] Remove reference to removed "advanced" vignette --- R/methods-epi_archive.R | 3 +-- man/epix_slide.Rd | 3 +-- 2 files changed, 2 insertions(+), 4 deletions(-) diff --git a/R/methods-epi_archive.R b/R/methods-epi_archive.R index 76a1ab17..e479ada4 100644 --- a/R/methods-epi_archive.R +++ b/R/methods-epi_archive.R @@ -737,8 +737,7 @@ epix_detailed_restricted_mutate <- function(.data, ...) { #' 4. There are no size stability checks or element/row recycling to maintain #' size stability in `epix_slide`, unlike in `epi_slide`. (`epix_slide` is #' roughly analogous to [`dplyr::group_modify`], while `epi_slide` is roughly -#' analogous to `dplyr::mutate` followed by `dplyr::arrange`) This is detailed -#' in the "advanced" vignette. +#' analogous to [`dplyr::mutate`].) #' 5. `.all_rows` is not supported in `epix_slide`; since the slide #' computations are allowed more flexibility in their outputs than in #' `epi_slide`, we can't guess a good representation for missing computations diff --git a/man/epix_slide.Rd b/man/epix_slide.Rd index 601d5811..7f109055 100644 --- a/man/epix_slide.Rd +++ b/man/epix_slide.Rd @@ -159,8 +159,7 @@ results as they are not supported by tibbles.) \item There are no size stability checks or element/row recycling to maintain size stability in \code{epix_slide}, unlike in \code{epi_slide}. (\code{epix_slide} is roughly analogous to \code{\link[dplyr:group_map]{dplyr::group_modify}}, while \code{epi_slide} is roughly -analogous to \code{dplyr::mutate} followed by \code{dplyr::arrange}) This is detailed -in the "advanced" vignette. +analogous to \code{\link[dplyr:mutate]{dplyr::mutate}}.) \item \code{.all_rows} is not supported in \code{epix_slide}; since the slide computations are allowed more flexibility in their outputs than in \code{epi_slide}, we can't guess a good representation for missing computations From 26cd412cec23b9e0eefaa01f15987bb41736fd75 Mon Sep 17 00:00:00 2001 From: "Logan C. Brooks" Date: Mon, 17 Mar 2025 15:15:44 -0700 Subject: [PATCH 12/16] docs(epix_slide): clarify .versions vs. epi_slide .ref_time_values defaults --- R/methods-epi_archive.R | 2 +- man/epix_slide.Rd | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/R/methods-epi_archive.R b/R/methods-epi_archive.R index e479ada4..a33c91c9 100644 --- a/R/methods-epi_archive.R +++ b/R/methods-epi_archive.R @@ -744,7 +744,7 @@ epix_detailed_restricted_mutate <- function(.data, ...) { #' for excluded group-`.ref_time_value` pairs. #' 6. The `.versions` default for `epix_slide` is based on making an #' evenly-spaced sequence out of the `version`s in the `DT` plus the -#' `versions_end`, rather than the `time_value`s. +#' `versions_end`, rather than all unique `time_value`s. #' 7. `epix_slide()` computations can refer to the current element of #' `.versions` as either `.version` or `.ref_time_value`, while `epi_slide()` #' computations refer to the current element of `.ref_time_values` with diff --git a/man/epix_slide.Rd b/man/epix_slide.Rd index 7f109055..a72e2c68 100644 --- a/man/epix_slide.Rd +++ b/man/epix_slide.Rd @@ -166,7 +166,7 @@ computations are allowed more flexibility in their outputs than in for excluded group-\code{.ref_time_value} pairs. \item The \code{.versions} default for \code{epix_slide} is based on making an evenly-spaced sequence out of the \code{version}s in the \code{DT} plus the -\code{versions_end}, rather than the \code{time_value}s. +\code{versions_end}, rather than all unique \code{time_value}s. \item \code{epix_slide()} computations can refer to the current element of \code{.versions} as either \code{.version} or \code{.ref_time_value}, while \code{epi_slide()} computations refer to the current element of \code{.ref_time_values} with From 7a421f6bc1e83c9116e99a17aac11fa6979cfdac Mon Sep 17 00:00:00 2001 From: "Logan C. Brooks" Date: Mon, 17 Mar 2025 15:32:20 -0700 Subject: [PATCH 13/16] docs(validate_epi_archive): note omitted `*_type` checks --- R/archive.R | 3 +++ man/epi_archive.Rd | 3 +++ 2 files changed, 6 insertions(+) diff --git a/R/archive.R b/R/archive.R index 922f7783..f168e4dd 100644 --- a/R/archive.R +++ b/R/archive.R @@ -329,6 +329,9 @@ new_epi_archive <- function( #' Perform second (costly) round of validation that `x` is a proper `epi_archive` #' +#' Does not validate `geo_type` or `time_type` against `geo_value` and +#' `time_value` columns. These are assumed to have been set to compatibly. +#' #' @rdname epi_archive #' @export validate_epi_archive <- function(x) { diff --git a/man/epi_archive.Rd b/man/epi_archive.Rd index f98666e0..3918b4c1 100644 --- a/man/epi_archive.Rd +++ b/man/epi_archive.Rd @@ -109,6 +109,9 @@ only performs some fast, basic checks on the inputs. \code{validate_epi_archive} can perform more costly validation checks on its output. But most users should use \code{as_epi_archive}, which performs all necessary checks and has some additional features. + +Does not validate \code{geo_type} or \code{time_type} against \code{geo_value} and +\code{time_value} columns. These are assumed to have been set to compatibly. } \details{ An \code{epi_archive} contains a \code{data.table} object \code{DT} (from the From 0c9007e41530b4698a16493c0df043200e190af9 Mon Sep 17 00:00:00 2001 From: "Logan C. Brooks" Date: Mon, 17 Mar 2025 15:46:47 -0700 Subject: [PATCH 14/16] docs: remove duplicate word, mention `time_type` `epi_df` attr --- R/epi_df.R | 8 +++++--- R/slide.R | 2 +- man/epi_df.Rd | 8 +++++--- man/epi_slide.Rd | 2 +- 4 files changed, 12 insertions(+), 8 deletions(-) diff --git a/R/epi_df.R b/R/epi_df.R index 4955ab08..b9d999d9 100644 --- a/R/epi_df.R +++ b/R/epi_df.R @@ -6,7 +6,8 @@ #' which can be seen as measured variables at each key. In brief, an `epi_df` #' represents a snapshot of an epidemiological data set at a point in time. #' -#' @details An `epi_df` is a tibble with (at least) the following columns: +#' @details An `epi_df` is a kind of tibble with (at least) the following +#' columns: #' #' - `geo_value`: A character vector representing the geographical unit of #' observation. This could be a country code, a state name, a county code, @@ -14,10 +15,11 @@ #' - `time_value`: A date or integer vector representing the time of observation. #' #' Other columns can be considered as measured variables, which we also refer to -#' as signal variables. An `epi_df` object also has metadata with (at least) -#' the following fields: +#' as indicators or signals. An `epi_df` object also has metadata with (at +#' least) the following fields: #' #' * `geo_type`: the type for the geo values. +#' * `time_type`: the type for the time values. #' * `as_of`: the time value at which the given data were available. #' #' Most users should use `as_epi_df`. The input tibble `x` to the constructor diff --git a/R/slide.R b/R/slide.R index 863f6812..be112d07 100644 --- a/R/slide.R +++ b/R/slide.R @@ -40,7 +40,7 @@ #' #' You can specify the computation in one of the following ways: #' -#' - Don't provide `.f`, and instead use use one or more +#' - Don't provide `.f`, and instead use one or more #' [`dplyr::summarize`]-esque ["data-masking"][rlang::args_data_masking] #' expressions in `...`, e.g., `cfr_estimate_v0 = #' death_rate_7d_av[[22]]/case_rate_7d_av[[1]]`. This way is sometimes more diff --git a/man/epi_df.Rd b/man/epi_df.Rd index a6782718..0ec5830c 100644 --- a/man/epi_df.Rd +++ b/man/epi_df.Rd @@ -89,7 +89,8 @@ which can be seen as measured variables at each key. In brief, an \code{epi_df} represents a snapshot of an epidemiological data set at a point in time. } \details{ -An \code{epi_df} is a tibble with (at least) the following columns: +An \code{epi_df} is a kind of tibble with (at least) the following +columns: \itemize{ \item \code{geo_value}: A character vector representing the geographical unit of observation. This could be a country code, a state name, a county code, @@ -98,10 +99,11 @@ etc. } Other columns can be considered as measured variables, which we also refer to -as signal variables. An \code{epi_df} object also has metadata with (at least) -the following fields: +as indicators or signals. An \code{epi_df} object also has metadata with (at +least) the following fields: \itemize{ \item \code{geo_type}: the type for the geo values. +\item \code{time_type}: the type for the time values. \item \code{as_of}: the time value at which the given data were available. } diff --git a/man/epi_slide.Rd b/man/epi_slide.Rd index f8860555..8f86466a 100644 --- a/man/epi_slide.Rd +++ b/man/epi_slide.Rd @@ -34,7 +34,7 @@ data-frame-type column if you provide a name for such a column (e.g., via You can specify the computation in one of the following ways: \itemize{ -\item Don't provide \code{.f}, and instead use use one or more +\item Don't provide \code{.f}, and instead use one or more \code{\link[dplyr:summarise]{dplyr::summarize}}-esque \link[rlang:args_data_masking]{"data-masking"} expressions in \code{...}, e.g., \code{cfr_estimate_v0 = death_rate_7d_av[[22]]/case_rate_7d_av[[1]]}. This way is sometimes more convenient, but also has the most computational overhead. From 98fdca40daa4eb27580273f6f59cd30479d7a5d3 Mon Sep 17 00:00:00 2001 From: "Logan C. Brooks" Date: Mon, 17 Mar 2025 16:02:16 -0700 Subject: [PATCH 15/16] docs(NEWS.md): 0.12 NEWS entry + highlights for 0.11 --- NEWS.md | 14 ++++++++++++++ 1 file changed, 14 insertions(+) diff --git a/NEWS.md b/NEWS.md index 3ac814aa..1a122e0e 100644 --- a/NEWS.md +++ b/NEWS.md @@ -2,8 +2,22 @@ Pre-1.0.0 numbering scheme: 0.x will indicate releases, while 0.x.y will indicate PR's. +# epiprocess 0.12 + +## Improvements + +- Various documentation has been updated, simplified, and improved with better + examples. + # epiprocess 0.11 +## Highlights + +`{epiprocess}` should once again not require Rtools or a compiler to be able to +install! We've also updated some function interfaces to be more consistent +throughout the package & with tidyverse, and improved the generality of and +fixed bugs in various functions and documentation. + ## Breaking changes - `growth_rate()` argument order and names have changed. You will need to From 115251c5acc93dae14a2d9693ef641136bfa8f24 Mon Sep 17 00:00:00 2001 From: nmdefries <42820733+nmdefries@users.noreply.github.com> Date: Thu, 10 Apr 2025 14:30:26 -0400 Subject: [PATCH 16/16] wording --- R/methods-epi_archive.R | 4 ++-- R/slide.R | 10 +++++----- man/epi_slide.Rd | 4 ++-- man/epi_slide_opt.Rd | 6 +++--- man/epix_slide.Rd | 34 ++++++++++++++++++---------------- 5 files changed, 30 insertions(+), 28 deletions(-) diff --git a/R/methods-epi_archive.R b/R/methods-epi_archive.R index a33c91c9..42afbf46 100644 --- a/R/methods-epi_archive.R +++ b/R/methods-epi_archive.R @@ -640,7 +640,7 @@ epix_detailed_restricted_mutate <- function(.data, ...) { #' computation. The computation will be run on each requested group-version #' combination, with a time window filter applied if `.before` is supplied. #' -#' - If `.f` is a function must have the form `function(x, g, v)` or +#' If `.f` is a function must have the form `function(x, g, v)` or #' `function(x, g, v, )`, where #' #' - `x` is an `epi_df` with the same column names as the archive's `DT`, @@ -685,7 +685,7 @@ epix_detailed_restricted_mutate <- function(.data, ...) { #' to reporting latency. Unlike `epi_slide()`, `epix_slide()` won't fill in #' any missing `time_values` in this window. #' @param .versions Requested versions on which to run the computation. Each -#' requested `.version` also serves as the anchor point around which for which +#' requested `.version` also serves as the anchor point from which #' the `time_value` window specified by `.before` is drawn. If `.versions` is #' missing, it will be set to a regularly-spaced sequence of values set to #' cover the range of `version`s in the `DT` plus the `versions_end`; the diff --git a/R/slide.R b/R/slide.R index be112d07..c0032691 100644 --- a/R/slide.R +++ b/R/slide.R @@ -28,11 +28,11 @@ #' @template basic-slide-params #' @param .f,... The computation to slide. The input will be a time window of #' the data for a single subpopulation (i.e., a single `geo_value` and single -#' value for any [`other_keys`][as_epi_df] you set up for age groups, etc.). +#' value for any [`other_keys`][as_epi_df] you set up, such as age groups, race, etc.). #' The input will always have the same size, determined by `.window_size`, and #' will fill in any missing `time_values`, using `NA` values for missing #' measurements. The output should be a scalar value or a 1-row data frame; -#' these outputs will be collected and form a new column or columns in the +#' these outputs will be collected into a new column or columns in the #' `epi_slide()` result. Data frame outputs will be unpacked into multiple #' columns in the result by default, or [`tidyr::pack`]ed into a single #' data-frame-type column if you provide a name for such a column (e.g., via @@ -624,11 +624,11 @@ get_before_after_from_window <- function(window_size, align, time_type) { #' `.window_size = Inf`) on the requested columns. Explicit `NA` measurements #' are temporarily added to fill in any time gaps, and, for rolling #' computations, to pad the time series to ensure that the first & last -#' computations are over exactly `.window_size` values. +#' computations use exactly `.window_size` values. #' #' `epi_slide_opt` allows you to use any [data.table::froll] or -#' [slider::summary-slide] function. If none of the specialized functions here -#' work, you can use `data.table::frollapply` together with a non-rolling +#' [slider::summary-slide] function. If none of those specialized functions fit +#' your usecase, you can use `data.table::frollapply` together with a non-rolling #' function (e.g., `median`). See [`epi_slide`] if you need to work with #' multiple columns at once or output a custom type. #' diff --git a/man/epi_slide.Rd b/man/epi_slide.Rd index 8f86466a..734f66d2 100644 --- a/man/epi_slide.Rd +++ b/man/epi_slide.Rd @@ -22,11 +22,11 @@ and any columns in \code{other_keys}. If grouped, we make sure the grouping is b \item{.f, ...}{The computation to slide. The input will be a time window of the data for a single subpopulation (i.e., a single \code{geo_value} and single -value for any \code{\link[=as_epi_df]{other_keys}} you set up for age groups, etc.). +value for any \code{\link[=as_epi_df]{other_keys}} you set up, such as age groups, race, etc.). The input will always have the same size, determined by \code{.window_size}, and will fill in any missing \code{time_values}, using \code{NA} values for missing measurements. The output should be a scalar value or a 1-row data frame; -these outputs will be collected and form a new column or columns in the +these outputs will be collected into a new column or columns in the \code{epi_slide()} result. Data frame outputs will be unpacked into multiple columns in the result by default, or \code{\link[tidyr:pack]{tidyr::pack}}ed into a single data-frame-type column if you provide a name for such a column (e.g., via diff --git a/man/epi_slide_opt.Rd b/man/epi_slide_opt.Rd index 687d2ac0..9b39bdb5 100644 --- a/man/epi_slide_opt.Rd +++ b/man/epi_slide_opt.Rd @@ -141,11 +141,11 @@ computation, or alternatively, a running/cumulative computation (with \code{.window_size = Inf}) on the requested columns. Explicit \code{NA} measurements are temporarily added to fill in any time gaps, and, for rolling computations, to pad the time series to ensure that the first & last -computations are over exactly \code{.window_size} values. +computations use exactly \code{.window_size} values. \code{epi_slide_opt} allows you to use any \link[data.table:froll]{data.table::froll} or -\link[slider:summary-slide]{slider::summary-slide} function. If none of the specialized functions here -work, you can use \code{data.table::frollapply} together with a non-rolling +\link[slider:summary-slide]{slider::summary-slide} function. If none of those specialized functions fit +your usecase, you can use \code{data.table::frollapply} together with a non-rolling function (e.g., \code{median}). See \code{\link{epi_slide}} if you need to work with multiple columns at once or output a custom type. diff --git a/man/epix_slide.Rd b/man/epix_slide.Rd index a72e2c68..c0eb3f2b 100644 --- a/man/epix_slide.Rd +++ b/man/epix_slide.Rd @@ -43,22 +43,24 @@ all data in \code{x} will be treated as part of a single data group.} \item{.f}{Function, formula, or missing; together with \code{...} specifies the computation. The computation will be run on each requested group-version combination, with a time window filter applied if \code{.before} is supplied. -\itemize{ -\item If \code{.f} is a function must have the form \verb{function(x, g, v)} or + +If \code{.f} is a function must have the form \verb{function(x, g, v)} or \verb{function(x, g, v, )}, where -\itemize{ -\item \code{x} is an \code{epi_df} with the same column names as the archive's \code{DT}, -minus the \code{version} column. (Or, if \code{.all_versions = TRUE}, an -\code{epi_archive} with the requested partial version history.) -\item \code{g} is a one-row tibble containing the values of the grouping variables -for the associated group. -\item \code{v} (length-1) is the associated \code{version} (one of the requested -\code{.versions}) -\item \verb{} are optional; you can add such -arguments to your function and set them by passing them through the -\code{...} argument to \code{epix_slide()}. -} -} + +\if{html}{\out{
}}\preformatted{- `x` is an `epi_df` with the same column names as the archive's `DT`, + minus the `version` column. (Or, if `.all_versions = TRUE`, an + `epi_archive` with the requested partial version history.) + +- `g` is a one-row tibble containing the values of the grouping variables + for the associated group. + +- `v` (length-1) is the associated `version` (one of the requested + `.versions`) + +- `` are optional; you can add such + arguments to your function and set them by passing them through the + `...` argument to `epix_slide()`. +}\if{html}{\out{
}} If a formula, \code{.f} can operate directly on columns accessed via \code{.x$var} or \code{.$var}, as in \code{~ mean (.x$var)} to compute a mean of a column \code{var} for @@ -91,7 +93,7 @@ to reporting latency. Unlike \code{epi_slide()}, \code{epix_slide()} won't fill any missing \code{time_values} in this window.} \item{.versions}{Requested versions on which to run the computation. Each -requested \code{.version} also serves as the anchor point around which for which +requested \code{.version} also serves as the anchor point from which the \code{time_value} window specified by \code{.before} is drawn. If \code{.versions} is missing, it will be set to a regularly-spaced sequence of values set to cover the range of \code{version}s in the \code{DT} plus the \code{versions_end}; the