covid_hosp update running time improvements -- part 2 #1111
Conversation
The first commit is a change from a hard quadratic to a roughly linear operation in a dataset-merging method. I suspect more commits to this PR are possible, depending on the outcome of some experiments I'm running...
I hypothesized that the performance problems might've been happening in the operations in this section. I expected the problematic comprehension step (the one I changed in the commit above) to be an expensive one, but I thought at least one of the other 3 dataframe manipulations that happen around it would also contribute significantly to the running time -- that is not the case. In experiments in my testing environment, they all took ~seconds to run, but the problematic step took ~10h (twice what we saw in prod). After my fix, it also takes just seconds.
@@ -160,7 +160,8 @@ def merge_by_key_cols(dfs, key_cols, logger=False):
     ## repeated concatenation in pandas is expensive, but (1) we don't expect
     ## batch sizes to be terribly large (7 files max) and (2) this way we can
     ## more easily capture the next iteration's updates to any new keys
-    new_rows = df.loc[[i for i in df.index.to_list() if i not in result.index.to_list()]]
+    result_index_set = set(result.index.to_list())
+    new_rows = df.loc[[i for i in df.index.to_list() if i not in result_index_set]]
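For context, a toy sketch of why this change matters (made-up sizes and keys, not the real acquisition data):

```python
import pandas as pd

# Two toy frames sharing most index keys; the real code merges
# per-file dataframes on their key columns.
result = pd.DataFrame({"v": range(50_000)}, index=[f"k{i}" for i in range(50_000)])
df = pd.DataFrame({"v": range(60_000)}, index=[f"k{i}" for i in range(60_000)])

# Before: `i not in result.index.to_list()` rebuilds a Python list and
# scans it linearly for every row of df -- O(n*m), quadratic in practice.
# new_rows = df.loc[[i for i in df.index.to_list() if i not in result.index.to_list()]]

# After: build the set once (O(n)), then each membership test is O(1),
# making the whole comprehension roughly linear.
result_index_set = set(result.index.to_list())
new_rows = df.loc[[i for i in df.index.to_list() if i not in result_index_set]]
```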
Minor suggestion (non-blocking): wonder if pandas.Index.difference would help here. I dug around under the hood and it looks like pandas removes duplicates in both indexes and then punts to the NumPy function in1d, which uses a couple of other tricks. It might not be C-fast since the keys here are probably strings, so NumPy will be holding Python object pointers, but it will avoid the NumPy -> Python list conversion (in to_list), and it looks pretty clean.
new_rows = df.loc[df.index.difference(result.index)]
ugh, you tell me NOW after I dismantled all the scaffolding I had to test this stuff with real data... 😂
that looks like it would certainly do the trick and is probably just as fast (if not faster), but it does a few weird things in treating the arguments as sets (sorting or throwing away ordering). what's in this PR so far should produce equivalent results to what it replaces, and it's fast compared to the DB operations that come later in the pipeline (and it's vastly faster than what was there before), so I'm happy to leave it as-is for now.
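A quick sketch of the ordering behavior I mean (toy index values, not tested against our real data):

```python
import pandas as pd

df = pd.DataFrame({"v": [1, 2, 3]}, index=["c", "a", "b"])
result = pd.DataFrame({"v": [9]}, index=["a"])

# The comprehension in this PR preserves df's original row order: ["c", "b"]
result_keys = set(result.index.to_list())
kept = [i for i in df.index.to_list() if i not in result_keys]

# Index.difference attempts to sort its result by default: ["b", "c"]
diffed = df.index.difference(result.index)

# sort=False should keep df's first-seen order instead: ["c", "b"]
unsorted_diff = df.index.difference(result.index, sort=False)
```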
Great catch!
aka "fixin' 2: electric boogaloo"
a follow-up to #1083, as we are still seeing very long running time for covid_hosp_facility acquisition in production.