Allow or help handle epix_as_of last recorded version even when this version has no update #109

Open
brookslogan opened this issue Jun 18, 2022 · 6 comments
Labels: enhancement (New feature or request), P1 (medium priority)


@brookslogan
Contributor

brookslogan commented Jun 18, 2022

When preparing prospective production forecasts, we may want to be able to mirror epix_slide or epix_as_of operations that we use to prepare pseudoprospective analyses. However, this may involve getting a snapshot epix_as_of a version with no update data (e.g., due to a nondaily update cadence, a holiday or some other occurrence causing the data source to be stale, or running the forecasts earlier than the data source has been updated and not being able to distinguish this from the data source responding that there has been no change). Currently, unless we have some "redundant" DT rows duplicating previous values with the latest no-change version, this is going to raise an error (max_version > self_max). (This also means that compactification (#101) could change the error behavior here.) We should consider one of the following:

  • take another arg to the archive constructor that acts as an override for self_max, and would allow this operation to go through with a warning rather than an error, or
  • provide some helper functions, parameters, and messages to deal with this situation: if we are dealing with a stale archive (max(DT$version) < forecast_ref_time), do we interpret this as if the data source reported no change (unless it looks really stale) and allow getting the snapshot, or do we interpret it as an issue fetching the data and, up to some point, allow ourselves to back up to epix_as_of the max DT version, providing this as a parameter to the slide function to adjust things? These are actually different options when you think about fetching backfill-aware training sets and forecast aheads relative to the snapshot version.
  • some sort of combination
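
A rough sketch of the first option (the argument name versions_end_override and the warning behavior are hypothetical, for illustration only):

# Hypothetical: declare that versions through 2022-06-13 were observed,
# even though the last few of them have no update rows in DT.
archive = as_epi_archive(update_tbl, versions_end_override = as.Date("2022-06-13"))

# epix_as_of could then warn rather than error for versions in
# (max(DT$version), versions_end_override], treating them as "no change":
snapshot = epix_as_of(archive, max_version = as.Date("2022-06-13"))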
@brookslogan
Contributor Author

brookslogan commented Jun 18, 2022

To make this more concrete, using HHS COVID-19 hospitalization data: suppose every Monday we prepare a forecast of COVID-19 hospital admissions based on recent admissions data.

library(tidyverse)
library(data.table)
library(delphi.epidata)

covid.admissions.tbl = delphi.epidata::covidcast(
  data_source = "hhs",
  signals = "confirmed_admissions_covid_1d",
  time_type = "day",
  geo_type = "state",
  time_values = epirange(12340601, 20221201),
  geo_values = "ca,fl",
  issues = epirange(12340601, 20221201)
) %>% delphi.epidata::fetch_tbl()

some.dates.any.wday = seq(min(covid.admissions.tbl$issue), max(covid.admissions.tbl$issue), by="day")
some.mondays = some.dates.any.wday[as.POSIXlt(some.dates.any.wday)$wday==1L]

( # (parens for prettier data.table chain)
  data.table(issue=unique(covid.admissions.tbl$issue), key="issue")
  # duplicate the issue column so when we do a rolling join we get both the queried issue and the rolling-join-matched issue
  [, latest_available_nonempty_patch_issue := issue]
  [.(some.mondays), roll=TRUE]
  # (copy() so we rename a copy rather than the locked .SD)
  [, setnames(copy(.SD), "issue", "forecast_date")]
  [, patch_delay_from_anywhere := as.integer(forecast_date - latest_available_nonempty_patch_issue)]
  # (analogue of dplyr::count:)  
  [, .N, by=patch_delay_from_anywhere]
)

yields [at time of running; should have set the end issue to something before today...]

   patch_delay_from_anywhere  N
1:                         0 71
2:                         1 11
3:                         3  1

So on 12 of these Mondays, even if we made forecasts at the very end of the day, we would not have a nonempty patch from that day. Earlier in the day, even more Mondays might have this situation cropping up. We have to consider various possibilities:

  • does HHS occasionally not release updates on Mondays? And if so, is this one of those Mondays? --- if it is one of those Mondays, then it makes sense to allow a snapshot as of the Monday, and grab backfill-aware covariates based on matching (training_issue - training_time_value) to (forecast_date - test_time_value) of covariates.
  • did our data pipeline break, and HHS released an update but it wasn't in the API, and we chose not to / weren't able to rewrite the history to mirror what HHS's version history looked like (rather than what the API's output history looked like)? --- if this is the case, we want to pretend like the forecast_date is earlier and our forecast aheads are longer.
  • is it too early in the day or is the data pipeline slow / being fixed, and there will be a nonempty patch ready later in the day? --- use same action as previous case
  • did a caching mechanism cause us to think that a nonempty patch was unavailable when it actually was? --- same action as previous case, but also, fix the caching issue
  • is something else going on? --- hopefully it will match one of the two approaches above, or some sort of intermediate approach where we pretend the forecast_date is somewhere between the date of the latest available nonempty patch and the actual forecast date
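
The "pretend the forecast_date is earlier" action above might look something like this sketch (archive, forecast_date, and base_aheads are assumed to be defined; the epix_as_of usage mirrors the existing API, but this fallback logic itself is just illustrative):

# Back up to the latest nonempty patch and lengthen the aheads to compensate:
latest_patch = max(archive$DT$version)
effective_forecast_date = min(forecast_date, latest_patch)
extra_latency = as.integer(forecast_date - effective_forecast_date)

snapshot = epix_as_of(archive, max_version = effective_forecast_date)
aheads = base_aheads + extra_latency  # still targeting the same test time_values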

@brookslogan brookslogan added P1 medium priority enhancement New feature or request labels Jun 20, 2022
@brookslogan brookslogan self-assigned this Jun 24, 2022
@brookslogan
Contributor Author

brookslogan commented Jun 27, 2022

Part 1, in progress:

  • add field along the lines of [updates_]finalized_through_version or {versions,updates}_finalized_through OR first_{unstable,tentative,beta,rewriteable,hotfixable,overwriteable,amendable,clobberable,preliminary}_version etc.
  • add field along the lines of [updates_]{unstable,tentative,rewriteable,beta,overwriteable,amendable,clobberable,preliminary}_through_version or seen_through_version or ...

(Should require both max(DT$version) <= updates_hotfixable_through_version and updates_first_hotfixable_version <= updates_hotfixable_through_version [maybe not quite --- there should also be a way to mark all observed versions as finalized]. Set defaults consistent with current behavior; try to have sensible defaults when only one of these args is provided, thinking about actual hotfixes and DB replication delays when deciding the appropriate behavior.)

Part 2: functions/parameters that work off of this information. These might belong here or might belong in epipredict.
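
One possible shape for the Part 1 fields, using two of the candidate names above (a sketch of defaults meant to reproduce current behavior, not a final interface):

as_epi_archive(
  update_tbl,
  # NA = no observed versions are subject to hotfixes/clobbering:
  clobberable_versions_start = NA,
  # default = current behavior (archive ends at the last recorded update):
  versions_end = max(update_tbl$version)
)
# Rough invariants to enforce:
#   max(update_tbl$version) <= versions_end
#   is.na(clobberable_versions_start) || clobberable_versions_start <= versions_end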

@brookslogan
Contributor Author

Part 1 is addressed in #101.

For Part 2: the epix_fill_through_version method might be helpful. However, the current epix_merge might complicate things. We need to check staleness on each data source before merging with any non-"stop" observed_versions_end_conflict resolution, because that resolution may move observed_versions_end forward and obscure potentially huge staleness in individual data sources / component archives. Putting a limit on the amount of observed_versions_end_conflict resolution in the merge doesn't seem right: if a is 4d staler than b and b is 6d stale, then a is 10d stale, and we want to make decisions based on the 10d. Maybe this conflict resolution needs to be removed from the merge and instead done in a vectorized epix_fill_through_version with some stopping capabilities based on individual staleness? Or maybe the merge needs an extra parameter, or a change to the possible observed_versions_end_conflict values, so that it checks each input's overall staleness before filling to a common observed_versions_end (which would need to accommodate extending the observed_versions_end to make the merged archive look fresher than either of its two input archives, in the case that both have some staleness).
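
For example, a pre-merge staleness check might be sketched like this (the wrapper function, its arguments, and the threshold handling are all hypothetical):

check_staleness_then_merge = function(x, y, now = Sys.Date(), max_stale_days = 7L) {
  # Check each component archive *before* merging, since merging with a
  # non-"stop" observed_versions_end_conflict resolution can hide how
  # stale an individual source is:
  for (component in list(x, y)) {
    stale_days = as.integer(now - max(component$DT$version))
    if (stale_days > max_stale_days) {
      stop(sprintf("component archive is %dd stale (limit %dd)", stale_days, max_stale_days))
    }
  }
  # Having surfaced per-source staleness, a non-"stop" resolution is safer here:
  epix_merge(x, y)
}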

@brookslogan
Contributor Author

The naniar package may let us attach missingness reasons, so that doing things in multiple steps doesn't necessarily lose information / warnings we'd want to give.

@dshemetov
Contributor

dshemetov commented Apr 23, 2025

Is there still some work to be done here? I don't quite follow the discussion, but wonder if step_adjust_latency has mitigated some of the core concerns. If there's more to be done, then we probably need to clarify that and then icebox it. In any case, I'm going to unassign myself from it.

@dshemetov dshemetov removed their assignment Apr 23, 2025
@brookslogan
Contributor Author

brookslogan commented Apr 23, 2025

step_adjust_latency's check_latency feature is along the same lines, but we may want something similar in epiprocess that is not necessarily tied to adjusting latency in a forecast, and is more configurable. E.g.,

  • Abort if signal A is more than B days latent or C is more than D days latent... for any epikey?
    • Potentially for various definitions of latent:
      • {now} minus versions_end for a particular signal (currently impossible to do signal-by-signal post-epix_merge)
      • {now} minus max(version) with a "real" diff for any epikey (currently impossible to do signal-by-signal post-epix_merge precisely: can't distinguish an initial report with an explicit "measured" NA vs. an implicit NA made explicit by an epix_merge).
      • {now} minus max(version) with a diff to a non-NA value.
      • {now} minus max(time_value) with one of the above
      • relevant_version - max_relevant_time_value for one of the above
  • Warn instead of abort if within some limit

We might decide to produce forecasts ahead of the max time value rather than horizon ahead of the version, or just not forecast at all.

So there are maybe three potential remaining elements:

  • Making this check functionality available in epiprocess
  • Making check functionality more flexible
  • Making epix_merge format not lose some information about explicit NAs introduced by merge vs. other explicit NAs vs. implicit missingness.
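
The first two bullets might combine into something like this sketch (the function, its arguments, and the latency definition chosen are all hypothetical):

assert_not_too_latent = function(archive, max_latency_days, now = Sys.Date(),
                                 warn_only = FALSE) {
  # "Latent" here = {now} minus versions_end; the other definitions in the
  # list above would substitute a different reference version / time_value.
  latency_days = as.integer(now - archive$versions_end)
  if (latency_days > max_latency_days) {
    msg = sprintf("archive is %dd latent (limit %dd)", latency_days, max_latency_days)
    if (warn_only) warning(msg) else stop(msg)
  }
  invisible(latency_days)
}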
