covid_hosp improvements to address and investigate long update running times #1083
Conversation
* includes deployment
* also adds some previously-absent logging
* Switch to executemany
* Add new limited_geocode datatype
* Fix test prep missing from #1030
…ility-running-time
might we want to consider making all the `logger` arguments non-optional? that would save us from having to check `if logger` every time it's used.
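The suggestion above can be sketched with a module-level no-op logger as the default parameter value, so call sites never need an `if logger:` guard. This is a hypothetical illustration, not the actual delphi-epidata API; the function and logger names here are invented.

```python
import logging

# Hypothetical: a null logger used as the default, so the parameter is
# effectively non-optional and `if logger:` checks disappear.
_NULL_LOGGER = logging.getLogger("covid_hosp._null")
_NULL_LOGGER.addHandler(logging.NullHandler())
_NULL_LOGGER.propagate = False

def insert_dataset(rows, logger=_NULL_LOGGER):
    # no guard needed; the null logger silently swallows the call
    logger.info("inserting %d rows", len(rows))
    return len(rows)
```

Callers that want real output pass their own logger; everyone else gets silence without branching.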
sorry that took as long as it did. i pulled this branch to try some things semi-related to this and semi-related to the auto column work/exploration, then ran into problems with my build environment that took a while to track down (which ended up being massive file artifacts in my repo tree that docker copied around all over the place).
@@ -97,7 +97,7 @@ def test_run_skip_old_dataset(self):
     mock_network = MagicMock()
     mock_network.fetch_metadata.return_value = \
         self.test_utils.load_sample_metadata()
-    mock_database = MagicMock()
+    mock_database = MagicMock(**{"__module__":"test_module", "__name__":"test_name"})
does this not work?
- mock_database = MagicMock(**{"__module__":"test_module", "__name__":"test_name"})
+ mock_database = MagicMock(__module__="test_module", __name__="test_name")
oh! it does! i must've assumed `__name__` was reserved
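For the record, both spellings from the exchange above set the same attributes; `__name__` ends in a double underscore, so it is a legal keyword argument and is not subject to name mangling:

```python
from unittest.mock import MagicMock

# Mock constructor kwargs are applied via configure_mock(), which setattr()s
# each one onto the mock; plain dunder attributes like __name__ are allowed.
via_dict = MagicMock(**{"__module__": "test_module", "__name__": "test_name"})
via_kwargs = MagicMock(__module__="test_module", __name__="test_name")

print(via_kwargs.__name__)  # test_name
```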
@@ -242,4 +268,6 @@ def get_max_issue(self):
     for (result,) in cursor:
       if result is not None:
         return pd.Timestamp(str(result))
+    if logger:
+      logger.info("get_max_issue", msg="no matching results in meta table; returning 1900/1/1 epoch")
maybe a `warning`-level call, since this should only happen ~once in the lifetime of a dataset?
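A runnable approximation of the patched method under discussion. This is a simplification: `cursor` here is any iterable of 1-tuples standing in for a DB cursor, and stdlib-style logging stands in for the project's structured logger.

```python
import pandas as pd

EPOCH = pd.Timestamp("1900-01-01")

def get_max_issue(cursor, logger=None):
    # scan for the first non-null max-issue value from the meta table query
    for (result,) in cursor:
        if result is not None:
            return pd.Timestamp(str(result))
    if logger:
        # warning level, per the review: this should fire roughly once
        # in the lifetime of a dataset
        logger.warning("get_max_issue: no matching results in meta table; "
                       "returning 1900/1/1 epoch")
    return EPOCH
```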
Co-authored-by: melange396 <[email protected]>
…:cmu-delphi/delphi-epidata into krivard/covid_hosp-facility-running-time
excellent! now we wait and see how much running time is saved on prod 🍿
Prerequisites:
* targets the `dev` branch
Summary
covid_hosp facility experienced very long running times due to a change in the incoming format of geocode values for two hospitals in the source data. The change was likely human error, since it increases the precision of the lat/long values far beyond millimeter accuracy.
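As a hypothetical illustration of the precision problem (this is not the actual `limited_geocode` implementation), over-precise coordinate strings can be capped at a fixed number of decimal places:

```python
from decimal import Decimal, ROUND_HALF_UP

def limit_precision(coord: str, places: int = 6) -> str:
    """Round a coordinate string to `places` decimals.

    Illustrative helper only. 6 decimal places of lat/long is roughly
    0.1 m of resolution at the equator, already far finer than needed;
    anything beyond that is noise. Trailing zeros are kept so output
    width is uniform.
    """
    quantum = Decimal("1." + "0" * places)
    return str(Decimal(coord).quantize(quantum, rounding=ROUND_HALF_UP))

print(limit_precision("-85.123456789012"))  # -85.123457
```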
This PR makes several changes to properly handle the incoming formatting error, speed up acquisition, and make it easier to debug future problems in this pipeline:
* Switch from `execute` to `executemany`, resulting in a ~1/3 speedup (15 minutes to 9 minutes on a 2022-12 dataset)
* Add `truncate` for the facility key table in integration tests

Original summary of the draft PR...
covid_hosp is experiencing very long running times in the facility dataset (6-8h). The only suspicious bits in the log are
This PR attempts to provide more information about the invalid column and see if it is a possible cause for the extended running time. It includes a logging refactor to share functionality with the covidcast logger.
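The `execute` to `executemany` switch from the summary above can be sketched as follows. This uses sqlite3 so the example is self-contained; the actual acquisition code targets MySQL, and the table and column names here are invented. The win is batching: one prepared statement applied to many parameter rows instead of one driver round trip per row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE facility (id INTEGER, geocode TEXT)")

rows = [(1, "POINT (-85.1 41.9)"), (2, "POINT (-80.2 40.4)")]

# before: one execute() call per row
# for row in rows:
#     cur.execute("INSERT INTO facility VALUES (?, ?)", row)

# after: a single executemany() over the whole batch
cur.executemany("INSERT INTO facility VALUES (?, ?)", rows)
conn.commit()

cur.execute("SELECT COUNT(*) FROM facility")
print(cur.fetchone()[0])  # 2
```

With a real MySQL driver the batched form can also be rewritten into a single multi-row INSERT, which is where speedups like the reported 15→9 minutes come from.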