covid_hosp update running time improvements -- part 2 #1111
Conversation
The first commit is a change from a hard quadratic to a roughly linear operation in a dataset-merging method. I suspect more commits to this PR are possible, depending on the outcome of some experiments I'm running...
I hypothesized that the performance problems might've been happening in the operations in this section. I expected the problematic comprehension step (the one I changed in the commit above) to be an expensive one, but I thought at least one of the other 3 dataframe manipulations that happen around it would also contribute significantly to the running time -- that is not the case. In experiments in my testing environment, they all took ~seconds to run, but the problematic step took ~10h (twice what we saw in prod). After my fix, it also takes just seconds.
@@ -160,7 +160,8 @@ def merge_by_key_cols(dfs, key_cols, logger=False):
     ## repeated concatenation in pandas is expensive, but (1) we don't expect
     ## batch sizes to be terribly large (7 files max) and (2) this way we can
     ## more easily capture the next iteration's updates to any new keys
-    new_rows = df.loc[[i for i in df.index.to_list() if i not in result.index.to_list()]]
+    result_index_set = set(result.index.to_list())
+    new_rows = df.loc[[i for i in df.index.to_list() if i not in result_index_set]]
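For context, a toy sketch of why this change matters (made-up sizes and keys, not the real acquisition data):

```python
import pandas as pd

# Two toy frames sharing most index keys; the real code merges
# per-file dataframes on their key columns.
result = pd.DataFrame({"v": range(50_000)}, index=[f"k{i}" for i in range(50_000)])
df = pd.DataFrame({"v": range(60_000)}, index=[f"k{i}" for i in range(60_000)])

# Before: `i not in result.index.to_list()` rebuilds a Python list and
# scans it linearly for every row of df -- O(n*m), quadratic in practice.
# new_rows = df.loc[[i for i in df.index.to_list() if i not in result.index.to_list()]]

# After: build the set once (O(n)), then each membership test is O(1),
# making the whole comprehension roughly linear.
result_index_set = set(result.index.to_list())
new_rows = df.loc[[i for i in df.index.to_list() if i not in result_index_set]]
```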
Minor suggestion (non-blocking): wonder if pandas.Index.difference would help here. I dug around under the hood and it looks like pandas removes duplicates in both indexes and then punts to the NumPy function in1d, which uses a couple of other tricks. It might not be C-fast since the keys here are probably strings, so NumPy will be holding Python object pointers, but it will avoid the NumPy -> Python list conversion (in to_list), and it looks pretty clean.
new_rows = df.loc[df.index.difference(result.index)]
ugh, you tell me NOW after I dismantled all the scaffolding I had to test this stuff with real data... 😂
that looks like it would certainly do the trick and is probably just as fast (if not faster), but it does a few weird things in treating the arguments as sets (sorting or throwing away ordering). what's in this PR so far should produce equivalent results to what it replaces, and it's fast compared to the DB operations that come later in the pipeline (and it's vastly faster than what was there before), so I'm happy to leave it as-is for now.
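A quick sketch of the ordering behavior I mean (toy index values, not tested against our real data):

```python
import pandas as pd

df = pd.DataFrame({"v": [1, 2, 3]}, index=["c", "a", "b"])
result = pd.DataFrame({"v": [9]}, index=["a"])

# The comprehension in this PR preserves df's original row order: ["c", "b"]
result_keys = set(result.index.to_list())
kept = [i for i in df.index.to_list() if i not in result_keys]

# Index.difference attempts to sort its result by default: ["b", "c"]
diffed = df.index.difference(result.index)

# sort=False should keep df's first-seen order instead: ["c", "b"]
unsorted_diff = df.index.difference(result.index, sort=False)
```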
Great catch!
aka "fixin' 2: electric boogaloo"
a follow-up to #1083, as we are still seeing very long running time for covid_hosp_facility acquisition in production.