covid_hosp update running time improvements -- part 2 #1111

Merged · 1 commit · Mar 22, 2023
3 changes: 2 additions & 1 deletion src/acquisition/covid_hosp/common/utils.py
@@ -160,7 +160,8 @@ def merge_by_key_cols(dfs, key_cols, logger=False):
 ## repeated concatenation in pandas is expensive, but (1) we don't expect
 ## batch sizes to be terribly large (7 files max) and (2) this way we can
 ## more easily capture the next iteration's updates to any new keys
-new_rows = df.loc[[i for i in df.index.to_list() if i not in result.index.to_list()]]
+result_index_set = set(result.index.to_list())
+new_rows = df.loc[[i for i in df.index.to_list() if i not in result_index_set]]
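Why this one-line change helps: membership tests against a Python list are linear scans, and the old comprehension also rebuilt result's index as a list on every iteration, so the per-batch cost was roughly O(n*m). Hoisting the index into a set makes each lookup O(1) on average. A minimal standalone sketch of the same pattern (the toy DataFrames below are illustrative stand-ins, not the real covid_hosp data):

import pandas as pd

result = pd.DataFrame({"val": [1, 2]}, index=["a", "b"])
df = pd.DataFrame({"val": [20, 30, 40]}, index=["b", "c", "d"])

# Before: result.index.to_list() is rebuilt and linearly scanned for every row of df.
slow = df.loc[[i for i in df.index.to_list() if i not in result.index.to_list()]]

# After: build the set once; each membership check is then an O(1) hash lookup.
result_index_set = set(result.index.to_list())
fast = df.loc[[i for i in df.index.to_list() if i not in result_index_set]]

assert slow.equals(fast)  # same rows ("c" and "d"), in the same order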
Contributor:
Minor suggestion (non-blocking): I wonder if pandas.Index.difference would help here. I dug around under the hood, and it looks like pandas removes duplicates in both indexes and then punts to the NumPy function in1d, which uses a couple of other tricks. It might not be C-fast since the keys here are probably strings, so NumPy will be holding Python object pointers, but it avoids the NumPy -> Python list conversion (in to_list), and it looks pretty clean.

new_rows = df.loc[df.index.difference(result.index)]

Collaborator (Author):
Ugh, you tell me NOW, after I dismantled all the scaffolding I had to test this stuff with real data... 😂

That looks like it would certainly do the trick and is probably just as fast (if not faster), but it does a few weird things in treating the arguments as sets (sorting or throwing away ordering). What's in this PR so far should produce results equivalent to what it replaces, and it's fast compared to the DB operations that come later in the pipeline (and dramatically faster than what was there before), so I'm happy to leave it as-is for now.
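For anyone weighing the suggestion later, a quick illustration of those set-like semantics: Index.difference deduplicates and sorts its result by default, and it also accepts sort=False, though the resulting order is then not guaranteed.

import pandas as pd

left = pd.Index(["d", "b", "b", "a"])
right = pd.Index(["b"])

# difference() drops duplicates and sorts the result by default...
print(left.difference(right))              # Index(['a', 'd'], dtype='object')

# ...while sort=False skips the sort (the order is then unspecified).
print(left.difference(right, sort=False))  # typically Index(['d', 'a'], dtype='object')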

 result = pd.concat([result, new_rows])

 # convert the index rows back to columns