Skip to content

Commit 7d67021

Browse files
committed
Address #153, #124; make clobberable default quiet & msging clearer
For #153: * Make `epix_slide` use the latest version by making its default an evenly spaced sequence guessed from `DT$version`. (Using `guess_period` built on `gcd_num`, `gcd2num`.) * Make `epix_slide` generate error if any `ref_time_values` is outside acceptable range; don't silently filter. For #124: * Make `epi_slide` generate error if any `ref_time_values` is outside acceptable range; don't silently filter. For clobberable messaging: * Make default `clobberable_versions_start` be `NA`. * (Hopefully) improve messaging and docs. Update docs and NEWS.md. Fix invalidated tests.
1 parent d9b741c commit 7d67021

17 files changed

+307
-113
lines changed

NAMESPACE

+3
Original file line numberDiff line numberDiff line change
@@ -14,11 +14,13 @@ S3method(group_by,epi_archive)
1414
S3method(group_by,epi_df)
1515
S3method(group_by,grouped_epi_archive)
1616
S3method(group_by_drop_default,grouped_epi_archive)
17+
S3method(groups,grouped_epi_archive)
1718
S3method(next_after,Date)
1819
S3method(next_after,integer)
1920
S3method(print,epi_df)
2021
S3method(summary,epi_df)
2122
S3method(ungroup,epi_df)
23+
S3method(ungroup,grouped_epi_archive)
2224
S3method(unnest,epi_df)
2325
export("%>%")
2426
export(archive_cases_dv_subset)
@@ -67,6 +69,7 @@ importFrom(dplyr,filter)
6769
importFrom(dplyr,group_by)
6870
importFrom(dplyr,group_by_drop_default)
6971
importFrom(dplyr,group_modify)
72+
importFrom(dplyr,groups)
7073
importFrom(dplyr,mutate)
7174
importFrom(dplyr,relocate)
7275
importFrom(dplyr,rename)

NEWS.md

+10
Original file line numberDiff line numberDiff line change
@@ -23,6 +23,14 @@ development versions. A ".9999" suffix indicates a development version.
2323
`epi_slide`.
2424
* To obtain the old behavior, `dplyr::ungroup` the `epix_slide` result
2525
immediately.
26+
* `epix_slide` now guesses `ref_time_values` to be a regularly spaced sequence
27+
covering all the `DT$version` values and `version_end`, rather than the
28+
distinct `DT$time_value`s. To obtain the old behavior, pass in
29+
`ref_time_values = unique(<ungrouped archive>$DT$time_value)`.
30+
* `epi_archive`'s `clobberable_versions_start`'s default is now `NA`, so there
31+
will be no warnings by default about potential nonreproducibility. To obtain
32+
the old behavior, pass in `clobberable_versions_start =
33+
max_version_with_row_in(x)`.
2634

2735
## Potentially-breaking changes:
2836

@@ -34,6 +42,8 @@ development versions. A ".9999" suffix indicates a development version.
3442
* Changed `bind_rows` on grouped `epi_df`s to not drop the `epi_df` class. Like
3543
with ungrouped `epi_df`s, the metadata of the result is still simply taken
3644
from the first result, and may be inappropriate (#242).
45+
* `epi_slide` and `epix_slide` now raise an error rather than silently filtering
46+
out `ref_time_values` that don't meet their expectations.
3747

3848
## Improvements:
3949

R/archive.R

+26-31
Original file line numberDiff line numberDiff line change
@@ -61,7 +61,7 @@ validate_version_bound = function(version_bound, x, na_ok,
6161
}
6262
}
6363

64-
#' Default arg helper: `max(x$version)`, with error if `x` has 0 rows
64+
#' `max(x$version)`, with error if `x` has 0 rows
6565
#'
6666
#' Exported to make defaults more easily copyable.
6767
#'
@@ -233,15 +233,17 @@ epi_archive =
233233
#' carried forward (LOCF) to interpolate between the version data provided,
234234
#' rows that don't change these LOCF results can potentially be omitted to
235235
#' save space while maintaining the same behavior (with the help of the
236-
#' `clobberable_versions_start` and `versions_end` fields in some
237-
#' edge cases). `TRUE` will remove these rows, `FALSE` will not, and missing
238-
#' or `NULL` will remove these rows and issue a warning. Generally, this can
239-
#' be set to `TRUE`, but if you directly inspect or edit the fields of the
240-
#' `epi_archive` such as its `DT`, you will have to determine whether
241-
#' `compactify=TRUE` will produce the desired results. If compactification
242-
#' here is removing a large proportion of the rows, this may indicate a
243-
#' potential for space, time, or bandwidth savings upstream the data pipeline,
244-
#' e.g., when fetching, storing, or preparing the input data `x`
236+
#' `clobberable_versions_start` and `versions_end` fields in some edge cases).
237+
#' `TRUE` will remove these rows, `FALSE` will not, and missing or `NULL` will
238+
#' remove these rows and issue a warning. Generally, this can be set to
239+
#' `TRUE`, but if you directly inspect or edit the fields of the `epi_archive`
240+
#' such as its `DT`, or rely on redundant updates to achieve a certain
241+
#' behavior of the `ref_time_values` default in `epix_slide`, you will have to
242+
#' determine whether `compactify=TRUE` will produce the desired results. If
243+
#' compactification here is removing a large proportion of the rows, this may
244+
#' indicate a potential for space, time, or bandwidth savings upstream the
245+
#' data pipeline, e.g., by avoiding fetching, storing, or processing these
246+
#' rows of `x`.
245247
#' @param clobberable_versions_start Optional; as in [`as_epi_archive`]
246248
#' @param versions_end Optional; as in [`as_epi_archive`]
247249
#' @return An `epi_archive` object.
@@ -308,7 +310,7 @@ epi_archive =
308310
# Apply defaults and conduct checks and apply defaults for
309311
# `clobberable_versions_start`, `versions_end`:
310312
if (missing(clobberable_versions_start)) {
311-
clobberable_versions_start <- max_version_with_row_in(x)
313+
clobberable_versions_start <- NA
312314
}
313315
if (missing(versions_end)) {
314316
versions_end <- max_version_with_row_in(x)
@@ -465,7 +467,7 @@ epi_archive =
465467
Abort("`max_version` must be at most `self$versions_end`.")
466468
}
467469
if (!is.na(self$clobberable_versions_start) && max_version >= self$clobberable_versions_start) {
468-
Warn('Getting data as of some "clobberable" version that might be hotfixed, synced, or otherwise replaced later with different data using the same version tag. Thus, the snapshot that we produce here might not be reproducible later. See `?epi_archive` for more info and `?epix_as_of` on how to muffle.',
470+
Warn('Getting data as of some recent version which could still be overwritten (under routine circumstances) without assigning a new version number (a.k.a. "clobbered"). Thus, the snapshot that we produce here should not be expected to be reproducible later. See `?epi_archive` for more info and `?epix_as_of` on how to muffle.',
469471
class="epiprocess__snapshot_as_of_clobberable_version")
470472
}
471473

@@ -642,25 +644,18 @@ epi_archive =
642644
#' same `class` and `typeof` as `x$version`, or an `NA` of any `class` and
643645
#' `typeof`: specifically, either (a) the earliest version that could be
644646
#' subject to "clobbering" (being overwritten with different update data, but
645-
#' using the same version tag as the old update data), or (b) `NA`, to
647+
#' using the *same* version tag as the old update data), or (b) `NA`, to
646648
#' indicate that no versions are clobberable. There are a variety of reasons
647-
#' why versions could be clobberable, such as upstream hotfixes to the latest
648-
#' version, or delays in data synchronization that were mistaken for versions
649-
#' with no updates; potential causes vary between different data pipelines.
650-
#' The default value is `max_version_with_row_in(x)`; this default assumes
651-
#' that (i) if a row in `x` (even one that `compactify` would consider
652-
#' redundant) is present with version `ver`, then all previous versions must
653-
#' be finalized and non-clobberable, although `ver` (and onward) might still
654-
#' be modified, (ii) even if we have "observed" empty updates for some
655-
#' versions beyond `max(x$version)` (as indicated by `versions_end`;
656-
#' see below), we can't assume `max(x$version)` has been finalized, because we
657-
#' might see a nonfinalized version + empty subsequent versions due to
658-
#' upstream database replication delays in combination with the upstream
659-
#' replicas using last-version-carried-forward to extrapolate that there were
660-
#' no updates, (iii) "redundant" update rows that would be removed by
661-
#' `compactify` are not redundant, and actually come from an explicit version
662-
#' release that indicates that preceding versions are finalized. If `nrow(x)
663-
#' == 0`, then this argument is mandatory.
649+
#' why versions could be clobberable under routine circumstances, such as (a)
650+
#' today's version of one/all of the columns being published after initially
651+
#' being filled with `NA` or LOCF, (b) a buggy version of today's data being
652+
#' published but then fixed and republished later in the day, or (c) data
653+
#' pipeline delays (e.g., publisher uploading, periodic scraping, database
654+
#' syncing, periodic fetching, etc.) that make events (a) or (b) reflected
655+
#' later in the day (or even on a different day) than expected; potential
656+
#' causes vary between different data pipelines. The default value is `NA`,
657+
#' which doesn't consider any versions to be clobberable. Another setting that
658+
#' may be appropriate for some pipelines is `max_version_with_row_in(x)`.
664659
#' @param versions_end Optional; length-1, same `class` and `typeof` as
665660
#' `x$version`: what is the last version we have observed? The default is
666661
#' `max_version_with_row_in(x)`, but values greater than this could also be
@@ -717,7 +712,7 @@ epi_archive =
717712
as_epi_archive = function(x, geo_type, time_type, other_keys,
718713
additional_metadata = list(),
719714
compactify = NULL,
720-
clobberable_versions_start = max_version_with_row_in(x),
715+
clobberable_versions_start = NA,
721716
versions_end = max_version_with_row_in(x)) {
722717
epi_archive$new(x, geo_type, time_type, other_keys, additional_metadata,
723718
compactify, clobberable_versions_start, versions_end)

R/grouped_epi_archive.R

+14-7
Original file line numberDiff line numberDiff line change
@@ -189,14 +189,21 @@ grouped_epi_archive =
189189
")
190190
}
191191

192-
# If missing, then set ref time values to be everything; else make
193-
# sure we intersect with observed time values
194192
if (missing(ref_time_values)) {
195-
ref_time_values = unique(private$ungrouped$DT$time_value)
196-
}
197-
else {
198-
ref_time_values = ref_time_values[ref_time_values %in%
199-
unique(private$ungrouped$DT$time_value)]
193+
versions_with_updates = c(private$ungrouped$DT$version, private$ungrouped$versions_end)
194+
ref_time_values = tidyr::full_seq(versions_with_updates, guess_period(versions_with_updates))
195+
} else if (length(ref_time_values) == 0L) {
196+
Abort("`ref_time_values` must have at least one element.")
197+
} else if (any(is.na(ref_time_values))) {
198+
Abort("`ref_time_values` must not include `NA`.")
199+
} else if (anyDuplicated(ref_time_values) != 0L) {
200+
Abort("`ref_time_values` must not contain any duplicates; use `unique` if appropriate.")
201+
} else if (any(ref_time_values > private$ungrouped$versions_end)) {
202+
Abort("All `ref_time_values` must be `<=` the `versions_end`.")
203+
} else {
204+
# Sort, for consistency with `epi_slide`, although the current
205+
# implementation doesn't take advantage of it.
206+
ref_time_values = sort(ref_time_values)
200207
}
201208

202209
# Validate and pre-process `before`:

R/methods-epi_archive.R

+10-4
Original file line numberDiff line numberDiff line change
@@ -573,10 +573,13 @@ group_by.epi_archive = function(.data, ..., .add=FALSE, .drop=dplyr::group_by_dr
573573
#' were to hold forecasts, then we would expect data for `time_value`s after
574574
#' January 8, and the sliding window would extend as far after each
575575
#' `ref_time_value` as needed to include all such `time_value`s.)
576-
#' @param ref_time_values Time values for sliding computations, meaning, each
577-
#' element of this vector serves as the reference time point for one sliding
578-
#' window. If missing, then this will be set to all unique time values in the
579-
#' underlying data table, by default.
576+
#' @param ref_time_values Reference time values / versions for sliding
577+
#' computations; each element of this vector serves both as the anchor point
578+
#' for the `time_value` window for the computation and the `max_version`
579+
#' `as_of` which we fetch data in this window. If missing, then this will set
580+
#' to a regularly-spaced sequence of values set to cover the range of
581+
#' `version`s in the `DT` plus the `versions_end`; the spacing of values will
582+
#' be guessed (using the GCD of the skips between values).
580583
#' @param time_step Optional function used to define the meaning of one time
581584
#' step, which if specified, overrides the default choice based on the
582585
#' `time_value` column. This function must take a positive integer and return
@@ -633,6 +636,9 @@ group_by.epi_archive = function(.data, ..., .add=FALSE, .drop=dplyr::group_by_dr
633636
#' `time_value`, and all `other_keys` present in the version data with
634637
#' `time_value` matching one of the `ref_time_values`, this can have unexpected
635638
#' behaviors due reporting latency or reporting dropping in and out.
639+
#' 6. The `ref_time_values` default for `epix_slide` is based on making an
640+
#' evenly-spaced sequence out of the `version`s in the `DT` plus the
641+
#' `versions_end`, rather than the `time_value`s.
636642
#' Apart from this, the interfaces between `epix_slide()` and `epi_slide()` are
637643
#' the same.
638644
#'

R/slide.R

+14-7
Original file line numberDiff line numberDiff line change
@@ -124,16 +124,23 @@ epi_slide = function(x, f, ..., n, ref_time_values,
124124
# Arrange by increasing time_value
125125
x = arrange(x, time_value)
126126

127-
# If missing, then set ref time values to be everything; else make sure we
128-
# intersect with observed time values
129127
if (missing(ref_time_values)) {
130128
ref_time_values = unique(x$time_value)
131-
}
132-
else {
133-
ref_time_values = ref_time_values[ref_time_values %in%
134-
unique(x$time_value)]
135129
}
136-
130+
# Some of the checks below are possible to fail on the above default; just go
131+
# ahead and do the full validation & pre-processing even on the default:
132+
if (length(ref_time_values) == 0L) {
133+
Abort("`ref_time_values` must have at least one element.")
134+
} else if (any(is.na(ref_time_values))) {
135+
Abort("`ref_time_values` must not include `NA`.")
136+
} else if (anyDuplicated(ref_time_values) != 0L) {
137+
Abort("`ref_time_values` must not contain any duplicates; use `unique` if appropriate.")
138+
} else if (!all(ref_time_values %in% unique(x$time_value))) {
139+
Abort("All `ref_time_values` must appear in `x$time_value`.")
140+
} else {
141+
ref_time_values = sort(ref_time_values)
142+
}
143+
137144
# If before is missing, then use align to set up alignment
138145
if (missing(before)) {
139146
align = match.arg(align)

0 commit comments

Comments
 (0)