covid_hosp improvements to address and investigate long update running times #1083


Merged: 7 commits merged into dev from krivard/covid_hosp-facility-running-time on Mar 3, 2023

Conversation


@krivard krivard commented Feb 13, 2023

Prerequisites:

  • Unless it is a documentation hotfix, it should be merged against the dev branch
  • Branch is up-to-date with the branch to be merged with, i.e. dev
  • Build is successful
  • Code is cleaned up and formatted

Summary

covid_hosp facility experienced very long running times due to a change in the incoming format of the geo code values for two hospitals in the source data. The change was likely human error, since it increases the precision of the lat/long values far beyond millimeter accuracy.

This PR makes several changes to properly handle the incoming formatting error, speed up acquisition, and make it easier to debug future problems in this pipeline:

  • Moves logger.py from acquisition.covidcast to a new submodule, acquisition.common
  • Replaces all print statements in covid_hosp with logger statements
  • Adds structured logging to all substantial covid_hosp pipeline phases
  • Adds a new Python type for length-limited geo codes, which can truncate mistakenly-high precision in incoming geo codes
  • Switches from execute to executemany, resulting in a ~1/3 speedup (15 minutes to 9 minutes on a 2022-12 dataset); a minimal sketch of both changes follows this list
  • Updates tests to match
  • Fixes a missed truncate for the facility key table in integration tests
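
Roughly, the geo code truncation and the executemany batching could look like the sketch below. This is a minimal illustration only, not the PR's actual code: the names limited_geocode, MAX_GEOCODE_LENGTH, insert_rows, and the hospital_facility table are assumed for the example.

import re

MAX_GEOCODE_LENGTH = 32  # assumed width of the geocoded_hospital_address column

def limited_geocode(value, max_length=MAX_GEOCODE_LENGTH, precision=6):
    """Round coordinates in a 'POINT (lon lat)' string to a sane precision,
    so a value with absurdly many decimal places still fits its column."""
    rounded = re.sub(
        r"-?\d+\.\d+",
        lambda m: f"{float(m.group(0)):.{precision}f}",
        value,
    )
    if len(rounded) > max_length:
        raise ValueError(f"geo code still too long after truncation: {rounded!r}")
    return rounded

def insert_rows(cursor, rows):
    """Insert every row in one executemany() call instead of one execute() call
    per row; batching the statements is what produced the ~1/3 speedup."""
    sql = "INSERT INTO hospital_facility (id, geocoded_hospital_address) VALUES (%s, %s)"
    cursor.executemany(sql, [(r["id"], limited_geocode(r["geocode"])) for r in rows])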
Original summary of the draft PR...

covid_hosp is experiencing very long running times in the facility dataset (6-8h). The only suspicious bits in the log are:

Traceback (most recent call last):
  File "/home/automation/.pyenv/versions/3.8.2/lib/python3.8/runpy.py", line 193, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/automation/.pyenv/versions/3.8.2/lib/python3.8/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/automation/driver/delphi/epidata/acquisition/covid_hosp/facility/update.py", line 42, in <module>
    Utils.launch_if_main(Update.run, __name__)
  File "/home/automation/driver/delphi/epidata/acquisition/covid_hosp/common/utils.py", line 37, in launch_if_main
    entrypoint()
  File "/home/automation/driver/delphi/epidata/acquisition/covid_hosp/facility/update.py", line 38, in run
    return Utils.update_dataset(Database, network)
  File "/home/automation/driver/delphi/epidata/acquisition/covid_hosp/common/utils.py", line 197, in update_dataset
    db.insert_dataset(issue_int, dataset)
  File "/home/automation/driver/delphi/epidata/acquisition/covid_hosp/common/database.py", line 198, in insert_dataset
    cursor.execute(sql,
  File "/home/automation/.pyenv/versions/3.8.2/lib/python3.8/site-packages/mysql/connector/cursor_cext.py", line 264, in execute
    result = self._cnx.cmd_query(stmt, raw=self._raw,
  File "/home/automation/.pyenv/versions/3.8.2/lib/python3.8/site-packages/mysql/connector/connection_cext.py", line 491, in cmd_query
    raise errors.get_mysql_exception(exc.errno, msg=exc.msg,
mysql.connector.errors.DataError: 1406 (22001): Data too long for column 'geocoded_hospital_address' at row 1

This PR attempts to provide more information about the invalid column and to determine whether it is a possible cause of the extended running time. It includes a logging refactor to share functionality with the covidcast logger.
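
For example (illustrative only, not code from this PR), wrapping the insert so a DataError reports which row was rejected could look like:

import mysql.connector

def insert_row(cursor, sql, row, logger):
    try:
        cursor.execute(sql, row)
    except mysql.connector.errors.DataError:
        # surface the offending values instead of only MySQL's generic
        # "Data too long for column ... at row 1" message
        logger.error(f"insert_dataset: row rejected by MySQL: {row!r}")
        raise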

* also adds some previously-absent logging
@krivard krivard changed the title Convert covid_hosp to use structured logging covid_hosp improvements to address and investigate long update running times Feb 13, 2023
@krivard krivard requested a review from melange396 February 13, 2023 16:06
@krivard krivard marked this pull request as draft February 13, 2023 16:21
* Switch to executemany
* Add new limited_geocode datatype
* Fix test prep missing from #1030
@krivard krivard marked this pull request as ready for review February 27, 2023 16:56
@krivard krivard requested a review from melange396 February 27, 2023 16:56

@melange396 melange396 left a comment


might we want to consider making all the logger arguments non-optional, which will save us from having to check if logger every time it's used?

sorry that took as long as it did. i pulled this branch to try some things semi-related to this and semi-related to the auto column work/exploration, then ran into problems with my build environment that took a while to track down (which ended up being massive file artifacts in my repo tree that docker copied around all over the place).
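
A minimal sketch of that suggestion (stand-in names; the project's own structured-logger factory, moved to acquisition.common in this PR, would replace the stdlib stand-in here):

import logging

def get_structured_logger(name):
    # stand-in for the project's structured-logger factory
    return logging.getLogger(name)

class Database:
    def get_max_issue(self, logger=None):
        # always end up with a real logger, so this method (and the rest of
        # the class) can drop its `if logger:` guards
        logger = logger if logger is not None else get_structured_logger(__name__)
        max_issue = None  # stand-in for the real max-issue lookup in the meta table
        if max_issue is None:
            logger.info("get_max_issue: no matching results in meta table; returning 1900/1/1 epoch")
        return max_issue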

@@ -97,7 +97,7 @@ def test_run_skip_old_dataset(self):
     mock_network = MagicMock()
     mock_network.fetch_metadata.return_value = \
         self.test_utils.load_sample_metadata()
-    mock_database = MagicMock()
+    mock_database = MagicMock(**{"__module__":"test_module", "__name__":"test_name"})
Collaborator

does this not work?

Suggested change
mock_database = MagicMock(**{"__module__":"test_module", "__name__":"test_name"})
mock_database = MagicMock(__module__="test_module", __name__="test_name")

Contributor Author

oh! it does! i must've assumed __name__ was reserved

@@ -242,4 +268,6 @@ def get_max_issue(self):
     for (result,) in cursor:
       if result is not None:
         return pd.Timestamp(str(result))
+    if logger:
+      logger.info("get_max_issue", msg="no matching results in meta table; returning 1900/1/1 epoch")
Collaborator

maybe a warning-level call, since this should only happen ~once in the lifetime of a dataset?

@krivard krivard requested a review from melange396 March 3, 2023 16:15

@melange396 melange396 left a comment


excellent! now we wait and see how much running time is saved on prod 🍿

@krivard krivard merged commit fdf38cc into dev Mar 3, 2023
@krivard krivard deleted the krivard/covid_hosp-facility-running-time branch March 3, 2023 19:57