covid_hosp update running time improvements -- part 2 #1111


Merged
merged 1 commit into from
Mar 22, 2023

Conversation

melange396
Collaborator

aka "fixin' 2: electric boogaloo"

a follow-up to #1083, as we are still seeing very long running time for covid_hosp_facility acquisition in production.

@melange396
Collaborator Author

The first commit changes a hard-quadratic operation in a dataset merging method into a roughly linear one. I suspect more commits to this PR are possible, depending on the outcome of some experiments I'm running...

@melange396
Collaborator Author

I hypothesized that the performance problems might've been happening in the operations in this section. I expected the problematic comprehension step (the one I changed in the commit above) to be expensive, but I thought at least one of the other three dataframe manipulations that happen around it would also contribute significantly to the running time -- that is not the case. In experiments in my testing environment, they all took on the order of seconds to run, but the problematic step took ~10h (twice what we saw in prod). After my fix, it too takes just seconds.

@melange396 melange396 marked this pull request as ready for review March 20, 2023 19:29
@@ -160,7 +160,8 @@ def merge_by_key_cols(dfs, key_cols, logger=False):
 ## repeated concatenation in pandas is expensive, but (1) we don't expect
 ## batch sizes to be terribly large (7 files max) and (2) this way we can
 ## more easily capture the next iteration's updates to any new keys
-new_rows = df.loc[[i for i in df.index.to_list() if i not in result.index.to_list()]]
+result_index_set = set(result.index.to_list())
+new_rows = df.loc[[i for i in df.index.to_list() if i not in result_index_set]]
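For illustration, here is a minimal self-contained sketch (plain Python, made-up key names) of why this change matters: membership tests against a list cost O(m) each, so filtering n labels inside the comprehension costs O(n·m), while hoisting the lookup structure into a set makes each test O(1), for O(n + m) total.

```python
existing = [f"key{i}" for i in range(50_000)]
incoming = [f"key{i}" for i in range(25_000, 75_000)]

# before: list membership inside the comprehension -> O(n*m)
# new = [k for k in incoming if k not in existing]

# after: build the set once, then each membership test is O(1) -> O(n + m)
existing_set = set(existing)
new = [k for k in incoming if k not in existing_set]

print(len(new))  # 25000 keys that were not already present
```

With 50k keys on each side, the "before" version performs on the order of 2.5 billion comparisons; the "after" version performs about 100k hash lookups.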
Contributor
Minor suggestion (non-blocking): wonder if pandas.Index.difference would help here. I dug around under the hood and it looks like Pandas removes duplicates in both indexes and then punts to the Numpy function in1d, which uses a couple other tricks. Might not be C-fast since the keys here are probably strings, so Numpy will be holding Python object pointers, but will avoid the Numpy -> Python list conversion (in to_list), and it looks pretty clean.

new_rows = df.loc[df.index.difference(result.index)]

Collaborator Author

Ugh, you tell me NOW, after I dismantled all the scaffolding I had to test this stuff with real data... 😂

That looks like it would certainly do the trick and is probably just as fast (if not faster), but it does a few weird things in treating the arguments as sets (sorting or throwing away ordering). What's in this PR so far should produce results equivalent to what it replaces, and it's fast compared to the DB operations that come later in the pipeline (and dramatically faster than what was there before), so I'm happy to leave it as-is for now.
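For reference, a small sketch (with made-up frames) of the ordering difference being discussed: `pandas.Index.difference` returns a sorted index by default, while the list-comprehension approach in this PR preserves `df`'s original row order.

```python
import pandas as pd

df = pd.DataFrame({"v": [1, 2, 3, 4]}, index=["d", "b", "a", "c"])
result = pd.DataFrame({"v": [9]}, index=["b"])

# the approach in this PR: preserves df's original row order
result_index_set = set(result.index.to_list())
new_rows = df.loc[[i for i in df.index.to_list() if i not in result_index_set]]
print(new_rows.index.to_list())   # ['d', 'a', 'c']

# Index.difference: same labels, but sorted (set-like semantics)
diff_rows = df.loc[df.index.difference(result.index)]
print(diff_rows.index.to_list())  # ['a', 'c', 'd']
```

Both select the same rows; only the ordering differs, which is the trade-off weighed above.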

Contributor

@krivard krivard left a comment
Great catch!

@melange396 melange396 merged commit e54eb98 into dev Mar 22, 2023
@melange396 melange396 deleted the c_h_facility_runningtime_part_deux branch March 22, 2023 19:23