[Request] Reconsider storing versions in YYYYWW format for NHSN data #1628

dshemetov · 2025-03-27T18:50:45Z

Problem

Our covidcast weekly signals pass through this acquisition code, which detects an epiweek time format YYYYWW in the receiving file name and assigns it as the issue value. As far as I understand, if the source provides two or more issues in the same epiweek, we keep only the latest one. This is a problem for the forecasting team because it makes accurate backtesting impossible: if a source (like NHSN) publishes an update after the forecast date but within the same epiweek, our database will show data that wasn't available at forecast time.

Here is a plot showing NHSN update times on the x-axis and the epiweek that time would be assigned to on the y-axis. The red-dashed line is the forecast date. Points to the right of the forecast date but in the same epiweek will be in our db's historical record for that week, but weren't available at forecast time.
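
To make the collision concrete, here is a minimal Python sketch (not the acquisition code itself) that computes MMWR epiweeks using only the standard library. The helper names `mmwr_year_start` and `epiweek_issue` and the two example dates are illustrative assumptions, not taken from the pipeline.

```python
from datetime import date, timedelta


def mmwr_year_start(year: int) -> date:
    """First day (a Sunday) of MMWR week 1 for the given year."""
    jan1 = date(year, 1, 1)
    days_since_sunday = (jan1.weekday() + 1) % 7  # Sunday -> 0, ..., Saturday -> 6
    start = jan1 - timedelta(days=days_since_sunday)
    # Week 1 must contain at least four days of the new year.
    if days_since_sunday > 3:
        start += timedelta(days=7)
    return start


def epiweek_issue(day: date) -> int:
    """Return the YYYYWW epiweek that a calendar date falls in."""
    year = day.year
    start = mmwr_year_start(year)
    if day < start:
        year -= 1
        start = mmwr_year_start(year)
    elif day >= mmwr_year_start(year + 1):
        year += 1
        start = mmwr_year_start(year)
    return year * 100 + (day - start).days // 7 + 1


wednesday = date(2025, 3, 26)  # hypothetical update on a forecast date
friday = date(2025, 3, 28)     # hypothetical later update in the same week
print(epiweek_issue(wednesday), epiweek_issue(friday))  # 202513 202513
```

Both dates map to the same YYYYWW value, so a store keyed on that value cannot tell the Wednesday version from the Friday one.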

cc @dsweber2 @brookslogan @aysim319 @melange396

Plot generated with:

library(aws.s3)
library(lubridate)
library(MMWRweek)
library(tidyverse)

# Bucket file names look like
# nhsn_data_raw_2024-12-18_11-01-08.124565_prelim.parquet
# where the timestamp is in UTC. ymd_hms assumes UTC by default; we use with_tz
# below to convert to Pacific time (which is more correct for determining day
# boundaries).
get_version_timestamp <- function(filename) ymd_hms(str_extract(filename, "[0-9]{4}-..-.._..-..-..\\.[^._]*"))
get_epiweek_from_timestamp <- function(timestamp) {
  paste0(MMWRweek::MMWRweek(timestamp)$MMWRyear, "-", str_pad(MMWRweek::MMWRweek(timestamp)$MMWRweek, 2, "left", "0"))
}
# Requires credentials to the forecasting-team-data bucket
update_times <- aws.s3::get_bucket_df(prefix = "nhsn_data_raw", bucket = "forecasting-team-data") %>%
  pull(Key) %>%
  get_version_timestamp() %>%
  with_tz(tzone = "America/Los_Angeles")
epiweeks <- update_times %>% get_epiweek_from_timestamp()
# These were the actual forecast dates this season (accounting for holidays and
# other delays)
forecast_dates <- c(
  as.Date(c("2024-11-22", "2024-11-27", "2024-12-04", "2024-12-11", "2024-12-18", "2024-12-26", "2025-01-02")),
  seq.Date(as.Date("2025-01-08"), Sys.Date(), by = 7L)
)
ggplot(data.frame(update_times, epiweeks), aes(x = as.Date(update_times), y = epiweeks)) +
  geom_point() +
  geom_vline(xintercept = forecast_dates, color = "red", linetype = "dashed") +
  theme_minimal() +
  labs(x = "Update at", y = "Epiweek")

Possible Solutions

- Change our NHSN Cronicle schedule so that it does not run after Wednesday. Last I heard, our Cronicle update schedule is "wednesday/friday @ 12:30pm [est]", so currently Thursday and Friday updates overwrite the Wednesday updates. This is simple and requires no code changes, but it's fragile and manual: forecast dates tend to get delayed by holidays and data outages, so we would need to be on call to modify the Cronicle schedule as needed. Possibly the right thing for this season, which is winding down, but a burden long-term.
- Since issue defaults to issue=(date.today(), epi.Week.fromdate(date.today())) in the code, just use date.today() in this case.
  - Sounds simple, but other parts of the code might unexpectedly depend on time_value and issue being the same format.
  - Also, it's unclear how we could carve out an acquisition logic exception for NHSN.
- Store both time_value and issue (for NHSN) as dates and use documentation to clarify that each value represents a weekly sum (the way we already do with 7dav signals).
  - Avoids the possible consistency issues of the previous approach.
  - We have the raw data files for the whole season stored in an S3 bucket, so replaying them through an updated acquisition pipeline is possible.
  - Requires NHSN indicator code changes.

Legacy weekly signals have the same problem, and fixing those is out of scope.
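
As a rough illustration of why issue resolution matters for backtesting, the sketch below compares an "as of" lookup against week-keyed and day-keyed issues. The revision values, the `as_of` helper, and the two in-memory stores are all hypothetical. They only mimic the "keep the latest issue per epiweek" behavior described above and are not the database schema.

```python
from datetime import date

# Hypothetical revisions of one time_value: (issue date, issue epiweek, value).
# Both dates fall in epiweek 202513.
revisions = [
    (date(2025, 3, 26), 202513, 100),  # published on the forecast date
    (date(2025, 3, 28), 202513, 120),  # published two days later
]

by_week = {}  # weekly issues: a later revision in the same epiweek replaces the earlier one
by_day = {}   # daily issues: every revision keeps its own key
for issue_date, issue_week, value in revisions:
    by_week[issue_week] = value
    by_day[issue_date] = value


def as_of(versions, cutoff):
    """Value of the latest issue that is <= cutoff, or None if there is none."""
    eligible = [issue for issue in versions if issue <= cutoff]
    return versions[max(eligible)] if eligible else None


print(as_of(by_week, 202513))            # 120: includes the later revision
print(as_of(by_day, date(2025, 3, 26)))  # 100: what was available at forecast time
```

With weekly issues the backtest sees the Friday revision. With daily issues it recovers the value that was actually available on the forecast date.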

melange396 · 2025-03-28T00:59:34Z

Our NHSN pipeline pulls from two different datasets at data.cdc.gov. Both have the same schema/structure, but one is the "preliminary" dataset that is collected and reported before the week is complete. Our pipeline emits essentially the same signals/indicators from each, but those that come from the "preliminary" dataset have "_prelim" appended to their names. The preliminary dataset is updated on the CDC site on Wednesdays and the other on Fridays, both at around noonish. We currently run our pipeline code on those two days (and only those two) at 12:30pm, very shortly after the updates are expected to have been done by CDC. Importantly, the pipeline code only processes a dataset if it has been updated on the CDC site within the past 24h (they provide a "Last Updated" timestamp) -- this means, barring problems or unexpected situations, the Wednesday run updates only the "*_prelim" signals, and the Friday run updates only the non-"*_prelim" signals.
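
The 24h freshness check can be sketched roughly as follows. This is an illustration under assumptions, not the pipeline's actual code: `should_process`, `FRESHNESS_WINDOW`, the dataset labels, and the timestamps are all made up for the example.

```python
from datetime import datetime, timedelta, timezone

FRESHNESS_WINDOW = timedelta(hours=24)  # assumption: "updated within the past 24h"


def should_process(last_updated: datetime, now: datetime) -> bool:
    """True if the "Last Updated" stamp falls within the window before `now`."""
    age = now - last_updated
    return timedelta(0) <= age <= FRESHNESS_WINDOW


# Hypothetical timestamps (UTC) for a Wednesday run.
run_time = datetime(2025, 3, 26, 16, 30, tzinfo=timezone.utc)
last_updated = {
    "prelim": datetime(2025, 3, 26, 16, 0, tzinfo=timezone.utc),  # updated that Wednesday
    "main": datetime(2025, 3, 21, 16, 0, tzinfo=timezone.utc),    # updated the previous Friday
}
to_process = [name for name, ts in last_updated.items() if should_process(ts, run_time)]
print(to_process)  # ['prelim']
```

Under these assumed timestamps, the Wednesday run touches only the preliminary dataset, which matches the behavior described above.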

There is some timing stuff to be fully worked out -- there are no longer any Thursday runs in the job schedule, and the 12:30pm scheduled start (chosen to make sure the data would be available in our API as early as possible on submission days) may be too early to be practical -- but data imported on a Wednesday run should not overwrite anything but the data from a previous Wednesday run.

[[ Click for demonstration of issue dates per signal ]]

Note this is looking at the regular and prelim versions of a single signal/indicator, and only for the nation-level geo. The results are sorted by time value, then issue, then signal.

import requests

datas = requests.get(
  "http://api.delphi.cmu.edu/epidata/covidcast/?data_source=nhsn"
  "&signals=confirmed_admissions_covid_ew,confirmed_admissions_covid_ew_prelim"
  "&time_type=week&time_values=*"
  "&geo_type=nation&geo_value=us"
).json()["epidata"]

keylist = "time_value issue signal".split()

print(
  "\n".join(sorted(
    [", ".join([f"{k}: {d[k]}" for k in keylist]) for d in datas]
  ))
)

Output from the above code, run today (a Thursday), shows the prelim signal has gotten a new issue, but not the other:

...
time_value: 202507, issue: 202512, signal: confirmed_admissions_covid_ew
time_value: 202507, issue: 202513, signal: confirmed_admissions_covid_ew_prelim
time_value: 202508, issue: 202512, signal: confirmed_admissions_covid_ew
time_value: 202508, issue: 202513, signal: confirmed_admissions_covid_ew_prelim
time_value: 202509, issue: 202512, signal: confirmed_admissions_covid_ew
time_value: 202509, issue: 202513, signal: confirmed_admissions_covid_ew_prelim
time_value: 202510, issue: 202512, signal: confirmed_admissions_covid_ew
time_value: 202510, issue: 202513, signal: confirmed_admissions_covid_ew_prelim
time_value: 202511, issue: 202512, signal: confirmed_admissions_covid_ew
time_value: 202511, issue: 202513, signal: confirmed_admissions_covid_ew_prelim
time_value: 202512, issue: 202513, signal: confirmed_admissions_covid_ew_prelim

Running the same code tomorrow afternoon/evening will show them all with issue 202513.

There are a number of things in the code that assume that the time_value and issue of a single record both conform to the same time_type. It may be possible to change that, but I don't know if it's worth it, plus it could break some users' existing workflows.

If the current state of things is not suitable, it seems prudent to change all of the NHSN signals to time_type=day instead of week, but the pipeline code will need additional (potentially tricky) logic to get rid of the separate "*_prelim" signals/indicators and roll them into intermediate issues of the base signals/indicators.
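
One possible shape for that roll-up logic, sketched under assumptions: records carry a date-resolution issue, and a "_prelim" record becomes an earlier issue of its base signal. The record layout, the values, and `merge_prelim` are hypothetical, and a real implementation would need to handle cases this ignores (for example, differing coverage between the two datasets).

```python
from datetime import date

SUFFIX = "_prelim"

# Hypothetical records: (signal, time_value epiweek, issue date, value).
records = [
    ("confirmed_admissions_covid_ew_prelim", 202512, date(2025, 3, 26), 100),
    ("confirmed_admissions_covid_ew", 202512, date(2025, 3, 28), 120),
]


def merge_prelim(rows):
    """Fold "_prelim" rows into the base signal, keyed by (signal, time_value, issue)."""
    merged = {}
    # Sort by issue date; on a tie, prelim rows come first so the final row wins.
    for signal, time_value, issue, value in sorted(
        rows, key=lambda r: (r[2], not r[0].endswith(SUFFIX))
    ):
        base = signal.removesuffix(SUFFIX)  # requires Python 3.9+
        merged[(base, time_value, issue)] = value
    return merged


for key, value in sorted(merge_prelim(records).items()):
    print(key, value)
```

Here the Wednesday preliminary value and the Friday value end up as two successive issues of confirmed_admissions_covid_ew for the same time_value.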

dshemetov · 2025-03-31T20:39:08Z

Ah right, having the prelim signal separated definitely helps avoid overwrites here, so thanks for the correction @melange396.

After speaking with @biganemone, let me try to separate out a couple of motivations mixed into this issue:

- addressing current NHSN data errors
- making sure the current NHSN pipeline works correctly
- envisioning how future pipelines should work

For the first, Amaris will address it with a patch, and I sent her the raw data already.

- If we choose to stay with time_type=week, the patch will need to be careful about which data is included in any given week, since holidays and other delays caused a lot of variance in the data's upload times.
- If we choose to switch to time_type=day, the indicators code will need some tweaks, but the patch's data inclusion code can be simpler.

(I'm leaning towards the first, so we can focus on long-term pipeline rewrites.)

For the second, given your points about preliminary and non-prelim data being updated on different days, I think our current pipeline and update schedule are fine for this season.

For the third, we should discuss the implications of the problems here for future pipeline design. There might be a way to avoid these issues in the future. I'll plan a design meeting, TBD.

dshemetov changed the title ~~[Request] Store issue in YYYYMMDD format for NHSN to increase version resolution~~ [Request] Store issue in YYYYMMDD format for NHSN to improve backtesting correctness Mar 27, 2025

dshemetov changed the title ~~[Request] Store issue in YYYYMMDD format for NHSN to improve backtesting correctness~~ [Request] Reconsider storing versions in YYYYWW format for NHSN data Mar 31, 2025
