covid_hosp improvements to address and investigate long update running times #1083


Merged: 7 commits merged into dev from krivard/covid_hosp-facility-running-time on Mar 3, 2023

Conversation


@krivard krivard commented Feb 13, 2023

Prerequisites:

  • Unless it is a documentation hotfix, it should be merged against the dev branch
  • Branch is up-to-date with the branch to be merged with, i.e. dev
  • Build is successful
  • Code is cleaned up and formatted

Summary

covid_hosp facility experienced very long running times due to a change in the incoming format of the geo code values for two hospitals in the source data. The change was likely human error, since it increases the precision of the lat/long values far beyond millimeter accuracy.

This PR makes several changes to properly handle the incoming formatting error, speed up acquisition, and make it easier to debug future problems in this pipeline:

  • Moves logger.py from acquisition.covidcast to a new submodule, acquisition.common
  • Replaces all print statements in covid_hosp with logger statements
  • Adds structured logging to all substantial covid_hosp pipeline phases
  • Adds a new Python type for length-limited geo codes, which can truncate mistakenly-high precision in incoming geo codes
  • Switches from execute to executemany, resulting in a ~1/3 speedup (15 minutes to 9 minutes on a 2022-12 dataset); a minimal sketch of both changes follows this list
  • Updates tests to match
  • Fixes a missed truncate for the facility key table in integration tests
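
Roughly, the geo code truncation and the executemany batching could look like the sketch below. This is a minimal illustration only, not the PR's actual code: the names limited_geocode, MAX_GEOCODE_LENGTH, insert_rows, and the hospital_facility table are assumed for the example.

import re

MAX_GEOCODE_LENGTH = 32  # assumed width of the geocoded_hospital_address column

def limited_geocode(value, max_length=MAX_GEOCODE_LENGTH, precision=6):
    """Round coordinates in a 'POINT (lon lat)' string to a sane precision,
    so a value with absurdly many decimal places still fits its column."""
    rounded = re.sub(
        r"-?\d+\.\d+",
        lambda m: f"{float(m.group(0)):.{precision}f}",
        value,
    )
    if len(rounded) > max_length:
        raise ValueError(f"geo code still too long after truncation: {rounded!r}")
    return rounded

def insert_rows(cursor, rows):
    """Insert every row in one executemany() call instead of one execute() call
    per row; batching the statements is what produced the ~1/3 speedup."""
    sql = "INSERT INTO hospital_facility (id, geocoded_hospital_address) VALUES (%s, %s)"
    cursor.executemany(sql, [(r["id"], limited_geocode(r["geocode"])) for r in rows])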
Original summary of the draft PR...

covid_hosp is experiencing very long running times in the facility dataset (6-8h). The only suspicious bits in the log are:

Traceback (most recent call last):
  File "/home/automation/.pyenv/versions/3.8.2/lib/python3.8/runpy.py", line 193, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/automation/.pyenv/versions/3.8.2/lib/python3.8/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/automation/driver/delphi/epidata/acquisition/covid_hosp/facility/update.py", line 42, in <module>
    Utils.launch_if_main(Update.run, __name__)
  File "/home/automation/driver/delphi/epidata/acquisition/covid_hosp/common/utils.py", line 37, in launch_if_main
    entrypoint()
  File "/home/automation/driver/delphi/epidata/acquisition/covid_hosp/facility/update.py", line 38, in run
    return Utils.update_dataset(Database, network)
  File "/home/automation/driver/delphi/epidata/acquisition/covid_hosp/common/utils.py", line 197, in update_dataset
    db.insert_dataset(issue_int, dataset)
  File "/home/automation/driver/delphi/epidata/acquisition/covid_hosp/common/database.py", line 198, in insert_dataset
    cursor.execute(sql,
  File "/home/automation/.pyenv/versions/3.8.2/lib/python3.8/site-packages/mysql/connector/cursor_cext.py", line 264, in execute
    result = self._cnx.cmd_query(stmt, raw=self._raw,
  File "/home/automation/.pyenv/versions/3.8.2/lib/python3.8/site-packages/mysql/connector/connection_cext.py", line 491, in cmd_query
    raise errors.get_mysql_exception(exc.errno, msg=exc.msg,
mysql.connector.errors.DataError: 1406 (22001): Data too long for column 'geocoded_hospital_address' at row 1

This PR attempts to provide more information about the invalid column and to determine whether it is a possible cause of the extended running time. It includes a logging refactor to share functionality with the covidcast logger.
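
For example (illustrative only, not code from this PR), wrapping the insert so a DataError reports which row was rejected could look like:

import mysql.connector

def insert_row(cursor, sql, row, logger):
    try:
        cursor.execute(sql, row)
    except mysql.connector.errors.DataError:
        # surface the offending values instead of only MySQL's generic
        # "Data too long for column ... at row 1" message
        logger.error(f"insert_dataset: row rejected by MySQL: {row!r}")
        raise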

* also adds some previously-absent logging
@krivard krivard changed the title Convert covid_hosp to use structured logging covid_hosp improvements to address and investigate long update running times Feb 13, 2023
@krivard krivard requested a review from melange396 February 13, 2023 16:06
@krivard krivard marked this pull request as draft February 13, 2023 16:21
* Switch to executemany
* Add new limited_geocode datatype
* Fix test prep missing from #1030
@krivard krivard marked this pull request as ready for review February 27, 2023 16:56
@krivard krivard requested a review from melange396 February 27, 2023 16:56

@melange396 melange396 left a comment


might we want to consider making all the logger arguments non-optional, which will save us from having to check if logger every time it's used?

sorry that took as long as it did. i pulled this branch to try some things semi-related to this and semi-related to the auto column work/exploration, then ran into problems with my build environment that took a while to track down (which ended up being massive file artifacts in my repo tree that docker copied around all over the place).
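
A minimal sketch of that suggestion (stand-in names; the project's own structured-logger factory, moved to acquisition.common in this PR, would replace the stdlib stand-in here):

import logging

def get_structured_logger(name):
    # stand-in for the project's structured-logger factory
    return logging.getLogger(name)

class Database:
    def get_max_issue(self, logger=None):
        # always end up with a real logger, so this method (and the rest of
        # the class) can drop its `if logger:` guards
        logger = logger if logger is not None else get_structured_logger(__name__)
        max_issue = None  # stand-in for the real max-issue lookup in the meta table
        if max_issue is None:
            logger.info("get_max_issue: no matching results in meta table; returning 1900/1/1 epoch")
        return max_issue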

@@ -97,7 +97,7 @@ def test_run_skip_old_dataset(self):
     mock_network = MagicMock()
     mock_network.fetch_metadata.return_value = \
         self.test_utils.load_sample_metadata()
-    mock_database = MagicMock()
+    mock_database = MagicMock(**{"__module__":"test_module", "__name__":"test_name"})
Collaborator

does this not work?

Suggested change
mock_database = MagicMock(**{"__module__":"test_module", "__name__":"test_name"})
mock_database = MagicMock(__module__="test_module", __name__="test_name")

Contributor Author

oh! it does! i must've assumed __name__ was reserved

@@ -242,4 +268,6 @@ def get_max_issue(self):
     for (result,) in cursor:
       if result is not None:
         return pd.Timestamp(str(result))
+    if logger:
+      logger.info("get_max_issue", msg="no matching results in meta table; returning 1900/1/1 epoch")
Collaborator

maybe a warning-level call, since this should only happen ~once in the lifetime of a dataset?

@krivard krivard requested a review from melange396 March 3, 2023 16:15

@melange396 melange396 left a comment


excellent! now we wait and see how much running time is saved on prod 🍿

@krivard krivard merged commit fdf38cc into dev Mar 3, 2023
@krivard krivard deleted the krivard/covid_hosp-facility-running-time branch March 3, 2023 19:57