Allow or help handle epix_as_of last recorded version even when this version has no update #109

Open
brookslogan opened this issue Jun 18, 2022 · 6 comments
Labels: enhancement (New feature or request), P1 (medium priority)


@brookslogan
Contributor

brookslogan commented Jun 18, 2022

When preparing prospective production forecasts, we may want to be able to mirror epix_slide or epix_as_of operations that we use to prepare pseudoprospective analyses. However, this may involve getting a snapshot epix_as_of a version with no update data (e.g., due to a nondaily update cadence, a holiday or some other occurrence causing the data source to be stale, or running the forecasts earlier than the data source has been updated and not being able to distinguish this from the data source responding that there has been no change). Currently, unless we have some "redundant" DT rows duplicating previous values with the latest no-change version, this is going to raise an error (max_version > self_max). (This also means that compactification (#101) could change the error behavior here.) We should consider one of the following:

  • take another arg to the archive constructor that acts as an override for self_max, and would allow this operation to go through with a warning rather than an error, or
  • provide some helper functions, parameters, and messages to deal with this situation: if we are dealing with a stale archive (max(DT$version) < forecast_ref_time), do we interpret this as if the data source reported no change (unless it looks really stale) and allow getting the snapshot, or do we interpret it as an issue fetching the data and, up to some point, allow ourselves to back up to epix_as_of the max DT version, providing this as a parameter to the slide function to adjust things? These are actually different options when you think about fetching backfill-aware training sets and forecast aheads relative to the snapshot version.
  • some sort of combination
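
A rough sketch of the first option (the argument name versions_end_override and the warning behavior are hypothetical, for illustration only):

# Hypothetical: declare that versions through 2022-06-13 were observed,
# even though the last few of them have no update rows in DT.
archive = as_epi_archive(update_tbl, versions_end_override = as.Date("2022-06-13"))

# epix_as_of could then warn rather than error for versions in
# (max(DT$version), versions_end_override], treating them as "no change":
snapshot = epix_as_of(archive, max_version = as.Date("2022-06-13"))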
@brookslogan
Contributor Author

brookslogan commented Jun 18, 2022

To make this more concrete, using HHS COVID-19 hospitalization data: suppose every Monday we prepare a forecast of COVID-19 hospital admissions based on recent admissions data.

library(tidyverse)
library(data.table)
library(delphi.epidata)

covid.admissions.tbl = delphi.epidata::covidcast(
  data_source = "hhs",
  signals = "confirmed_admissions_covid_1d",
  time_type = "day",
  geo_type = "state",
  time_values = epirange(12340601, 20221201),
  geo_values = "ca,fl",
  issues = epirange(12340601, 20221201)
) %>% delphi.epidata::fetch_tbl()

some.dates.any.wday = seq(min(covid.admissions.tbl$issue), max(covid.admissions.tbl$issue), by="day")
some.mondays = some.dates.any.wday[as.POSIXlt(some.dates.any.wday)$wday==1L]

( # (parens for prettier data.table chain)
  data.table(issue=unique(covid.admissions.tbl$issue), key="issue")
  # duplicate the issue column so when we do a rolling join we get both the queried issue and the rolling-join-matched issue
  [, latest_available_nonempty_patch_issue := issue]
  [.(some.mondays), roll=TRUE]
  # (copy() so we rename a copy rather than the locked .SD)
  [, setnames(copy(.SD), "issue", "forecast_date")]
  [, patch_delay_from_anywhere := as.integer(forecast_date - latest_available_nonempty_patch_issue)]
  # (analogue of dplyr::count:)  
  [, .N, by=patch_delay_from_anywhere]
)

yields [at time of running; should have set the end issue to something before today...]

   patch_delay_from_anywhere  N
1:                         0 71
2:                         1 11
3:                         3  1

So on 12 of these Mondays, even if we made forecasts at the very end of the day, we would not have a nonempty patch from that day. Earlier in the day, even more Mondays might have this situation cropping up. We have to consider various possibilities:

  • does HHS occasionally not release updates on Mondays? And if so, is this one of those Mondays? --- if it is one of those Mondays, then it makes sense to allow a snapshot as of the Monday, and grab backfill-aware covariates based on matching (training_issue - training_time_value) to (forecast_date - test_time_value) of covariates.
  • did our data pipeline break, and HHS released an update but it wasn't in the API, and we chose not to / weren't able to rewrite the history to mirror what HHS's version history looked like (rather than what the API's output history looked like)? --- if this is the case, we want to pretend like the forecast_date is earlier and our forecast aheads are longer.
  • is it too early in the day or is the data pipeline slow / being fixed, and there will be a nonempty patch ready later in the day? --- use same action as previous case
  • did a caching mechanism cause us to think that a nonempty patch was unavailable when it actually was? --- same action as previous case, but also, fix the caching issue
  • is something else going on? --- hopefully it will match one of the two approaches above, or some sort of intermediate approach where we pretend the forecast_date is somewhere between the date of the latest available nonempty patch and the actual forecast date
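
The "pretend the forecast_date is earlier" action above might look something like this sketch (archive, forecast_date, and base_aheads are assumed to be defined; the epix_as_of usage mirrors the existing API, but this fallback logic itself is just illustrative):

# Back up to the latest nonempty patch and lengthen the aheads to compensate:
latest_patch = max(archive$DT$version)
effective_forecast_date = min(forecast_date, latest_patch)
extra_latency = as.integer(forecast_date - effective_forecast_date)

snapshot = epix_as_of(archive, max_version = effective_forecast_date)
aheads = base_aheads + extra_latency  # still targeting the same test time_values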

@brookslogan brookslogan added P1 medium priority enhancement New feature or request labels Jun 20, 2022
@brookslogan brookslogan self-assigned this Jun 24, 2022
@brookslogan
Contributor Author

brookslogan commented Jun 27, 2022

Part 1, in progress:

  • add field along the lines of [updates_]finalized_through_version or {versions,updates}_finalized_through OR first_{unstable,tentative,beta,rewriteable,hotfixable,overwriteable,amendable,clobberable,preliminary}_version etc.
  • add field along the lines of [updates_]{unstable,tentative,rewriteable,beta,overwriteable,amendable,clobberable,preliminary}_through_version or seen_through_version or ...

(Should require both max(DT$version) <= updates_hotfixable_through_version and updates_first_hotfixable_version <= updates_hotfixable_through_version [maybe not quite --- there should also be a way to mark all observed versions as finalized]. Set defaults consistent with current behavior; try to have sensible defaults when only one of these args is provided, thinking about actual hotfixes and DB replication delays when deciding the appropriate behavior.)

Part 2: functions/parameters that work off of this information. These might belong here or might belong in epipredict.
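
One possible shape for the Part 1 fields, using two of the candidate names above (a sketch of defaults meant to reproduce current behavior, not a final interface):

as_epi_archive(
  update_tbl,
  # NA = no observed versions are subject to hotfixes/clobbering:
  clobberable_versions_start = NA,
  # default = current behavior (archive ends at the last recorded update):
  versions_end = max(update_tbl$version)
)
# Rough invariants to enforce:
#   max(update_tbl$version) <= versions_end
#   is.na(clobberable_versions_start) || clobberable_versions_start <= versions_end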

@brookslogan
Contributor Author

Part 1 is addressed in #101.

For Part 2: the epix_fill_through_version method might be helpful. However, the current epix_merge might complicate things. We need to check staleness on each data source before merging with any non-"stop" observed_versions_end_conflict resolution, because that resolution may move observed_versions_end forward and obscure potentially huge staleness in individual data sources / component archives. Putting a limit on the amount of observed_versions_end_conflict resolution in the merge doesn't seem right: if a is 4d staler than b and b is 6d stale, then a is 10d stale, and we want to make decisions based on the 10d. Maybe this conflict resolution needs to be removed from the merge and instead done in a vectorized epix_fill_through_version with some stopping capabilities based on individual staleness? Or maybe the merge needs an extra parameter, or a change to the possible observed_versions_end_conflict values, so that it checks each input's overall staleness before filling to a common observed_versions_end (which would need to accommodate extending the observed_versions_end to make the merged archive look fresher than either of its two input archives, in the case that both have some staleness).
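
For example, a pre-merge staleness check might be sketched like this (the wrapper function, its arguments, and the threshold handling are all hypothetical):

check_staleness_then_merge = function(x, y, now = Sys.Date(), max_stale_days = 7L) {
  # Check each component archive *before* merging, since merging with a
  # non-"stop" observed_versions_end_conflict resolution can hide how
  # stale an individual source is:
  for (component in list(x, y)) {
    stale_days = as.integer(now - max(component$DT$version))
    if (stale_days > max_stale_days) {
      stop(sprintf("component archive is %dd stale (limit %dd)", stale_days, max_stale_days))
    }
  }
  # Having surfaced per-source staleness, a non-"stop" resolution is safer here:
  epix_merge(x, y)
}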

@brookslogan
Contributor Author

The naniar package may let us attach missingness reasons, so that doing things in multiple steps doesn't necessarily lose information / warnings we'd want to give.

@dshemetov
Contributor

dshemetov commented Apr 23, 2025

Is there still some work to be done here? I don't quite follow the discussion, but wonder if step_adjust_latency has mitigated some of the core concerns. If there's more to be done, then we probably need to clarify that and then icebox it. In any case, I'm going to unassign myself from it.

@dshemetov dshemetov removed their assignment Apr 23, 2025
@brookslogan
Contributor Author

brookslogan commented Apr 23, 2025

step_adjust_latency's check_latency feature is along the same lines, but we may want something similar in epiprocess that is not necessarily tied to adjusting latency in a forecast, and is more configurable. E.g.,

  • Abort if signal A is more than B days latent or C is more than D days latent... for any epikey?
    • Potentially for various definitions of latent:
      • {now} minus versions_end for a particular signal (currently impossible to do signal-by-signal post-epix_merge)
      • {now} minus max(version) with a "real" diff for any epikey (currently impossible to do signal-by-signal post-epix_merge precisely: can't distinguish an initial report with an explicit "measured" NA vs. an implicit NA made explicit by an epix_merge).
      • {now} minus max(version) with a diff to a non-NA value.
      • {now} minus max(time_value) with one of the above
      • relevant_version - max_relevant_time_value for one of the above
  • Warn instead of abort if within some limit

We might decide to produce forecasts ahead of the max time value rather than horizon ahead of the version, or just not forecast at all.

So there are maybe three potential remaining elements:

  • Making this check functionality available in epiprocess
  • Making check functionality more flexible
  • Making epix_merge format not lose some information about explicit NAs introduced by merge vs. other explicit NAs vs. implicit missingness.
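
The first two bullets might combine into something like this sketch (the function, its arguments, and the latency definition chosen are all hypothetical):

assert_not_too_latent = function(archive, max_latency_days, now = Sys.Date(),
                                 warn_only = FALSE) {
  # "Latent" here = {now} minus versions_end; the other definitions in the
  # list above would substitute a different reference version / time_value.
  latency_days = as.integer(now - archive$versions_end)
  if (latency_days > max_latency_days) {
    msg = sprintf("archive is %dd latent (limit %dd)", latency_days, max_latency_days)
    if (warn_only) warning(msg) else stop(msg)
  }
  invisible(latency_days)
}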
