[Request] Reconsider storing versions in YYYYWW format for NHSN data #1628
Comments
Our NHSN pipeline pulls from two different datasets at data.cdc.gov. Both have the same schema/structure, but one is the "preliminary" dataset that is collected and reported before the week is complete. Our pipeline emits essentially the same signals/indicators from each, but those that come from the "preliminary" dataset have "
There is some timing stuff to be fully worked out -- there are no longer any Thursday runs in the job schedule, and the 12:30pm scheduled start (chosen to make sure the data would be available in our API as early as possible on submission days) may be too early to be practical -- but data imported on a Wednesday run should not overwrite anything but the data from a previous Wednesday run.
Demonstration of issue dates per signal:
Note this is looking at the regular and prelim versions of a single signal/indicator, and only for the nation-level geo. The results are sorted by time value, then issue, then signal.
```python
import requests

# Fetch every weekly time value for the regular and prelim versions of one signal.
datas = requests.get(
    "http://api.delphi.cmu.edu/epidata/covidcast/?data_source=nhsn"
    "&signals=confirmed_admissions_covid_ew,confirmed_admissions_covid_ew_prelim"
    "&time_type=week&time_values=*"
    "&geo_type=nation&geo_value=us"
).json()["epidata"]

# One line per row, sorted by time value, then issue, then signal.
keylist = "time_value issue signal".split()
print(
    "\n".join(sorted(
        [", ".join([f"{k}: {d[k]}" for k in keylist]) for d in datas]
    ))
)
```
Output from the above code, run today (a Thursday), shows the prelim signal has gotten a new issue, but not the other:
Running the same code tomorrow afternoon/evening will show them all with issue 202513.
There are a number of things in the code that assume that the
If the current state of things is not suitable, it seems prudent to change all of the NHSN signals to
Ah right, having the prelim signal separated definitely helps avoid overwrites here, so thanks for the correction @melange396. After speaking with @biganemone, let me try to separate out a couple motivations mixed in this issue:
For the first, Amaris will address it with a patch, and I sent her the raw data already.
(I'm leaning towards the first, so we can focus on long-term pipeline rewrites.)
For the second, given your points about preliminary and non-preliminary data being updated on different days, I think our current pipeline and update schedule is fine for this season.
For the third, we should discuss the implications of the problems here for future pipeline design. There might be a way to avoid these issues in the future. I'll plan a design meeting, TBD.
Problem
Our covidcast weekly signals pass through this acquisition code, which detects an epiweek time format YYYYWW in the receiving file name and assigns it as the issue value. As far as I understand, if the source provides two or more issues in the same epiweek, we keep only the latest one. This is a problem for the forecasting team as it makes accurate backtesting impossible: if a source (like NHSN) updates later in the same week than the forecast date, our database will show data that wasn't available at forecast time.
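To make the overwrite concrete, here is a minimal sketch of how two updates in the same epiweek collapse into one stored version when the issue is a YYYYWW value. It is not the pipeline's acquisition code: `epiweek_issue`, the `updates` list, and its values are hypothetical, and the week number is computed with the standard library under the usual MMWR convention (weeks run Sunday to Saturday and belong to the year containing their Wednesday).

```python
from datetime import date, timedelta


def epiweek_issue(d: date) -> int:
    """Return the MMWR epiweek containing d as an integer YYYYWW."""
    # MMWR weeks run Sunday through Saturday.
    sunday = d - timedelta(days=(d.weekday() + 1) % 7)
    # A week belongs to the year that contains its Wednesday,
    # i.e. the year holding at least four of its seven days.
    wednesday = sunday + timedelta(days=3)
    week = (wednesday.timetuple().tm_yday - 1) // 7 + 1
    return wednesday.year * 100 + week


# Hypothetical update history: (date the source published, reported value).
updates = [
    (date(2025, 3, 26), 100.0),  # Wednesday update
    (date(2025, 3, 28), 120.0),  # Friday revision in the same epiweek
]

# Keying versions by YYYYWW means the later update replaces the earlier one.
stored = {}
for published, value in updates:
    stored[epiweek_issue(published)] = value

print(stored)  # only the Friday value remains under issue 202513
```

In this sketch a forecast made on the Wednesday would have seen 100.0, but the stored record for issue 202513 only holds 120.0, which is the backtesting problem described above.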
Here is a plot showing NHSN update times on the x-axis and the epiweek that time would be assigned to on the y-axis. The red-dashed line is the forecast date. Points to the right of the forecast date but in the same epiweek will be in our db's historical record for that week, but weren't available at forecast time.
cc @dsweber2 @brookslogan @aysim319 @melange396
Plot generated with:
Possible Solutions
issue=(date.today(), epi.Week.fromdate(date.today()))
in the code, just use date.today() in this case
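As a rough illustration of why a day-level issue would help backtesting, the sketch below keeps one version per issue date and answers "what was known on this day?" queries. The `as_of` helper, the `versions` list, and its values are hypothetical and are not part of the existing pipeline or API.

```python
from datetime import date

# Hypothetical versions of one (signal, geo, time_value), keyed by issue *date*.
versions = [
    (date(2025, 3, 26), 100.0),  # Wednesday update
    (date(2025, 3, 28), 120.0),  # Friday revision in the same epiweek
]


def as_of(versions, query_date):
    """Return the value from the latest issue on or before query_date, or None."""
    visible = [(issued, value) for issued, value in versions if issued <= query_date]
    if not visible:
        return None
    # Pick the most recent issue among those already published.
    return max(visible, key=lambda pair: pair[0])[1]


forecast_date = date(2025, 3, 26)
print(as_of(versions, forecast_date))      # value available at forecast time
print(as_of(versions, date(2025, 3, 28)))  # value after the Friday revision
```

With daily issues both versions survive, so a backtest run as of the Wednesday forecast date sees the Wednesday value and not the later revision.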