[DO NOT MERGE] Remove unnecessary synchronization (miss-sync) during Parquet reading #18968
base: branch-25.08
…in host pinned memory Signed-off-by: Jigao Luo <[email protected]>
…ion with preparing a pinned_vector Signed-off-by: Jigao Luo <[email protected]>
…ion with preparing a pinned_vector Signed-off-by: Jigao Luo <[email protected]>
…nction with preparing a pinned_vector Signed-off-by: Jigao Luo <[email protected]>
Signed-off-by: Jigao Luo <[email protected]>
Signed-off-by: Jigao Luo <[email protected]>
Signed-off-by: Jigao Luo <[email protected]>
…-sync in parquet reading Signed-off-by: Jigao Luo <[email protected]>
…-sync in parquet reading Signed-off-by: Jigao Luo <[email protected]>
…form and then reduce Signed-off-by: Jigao Luo <[email protected]>
…writing Signed-off-by: Jigao Luo <[email protected]>
…writing Signed-off-by: Jigao Luo <[email protected]>
…writing Signed-off-by: Jigao Luo <[email protected]>
Signed-off-by: Jigao Luo <[email protected]>
Signed-off-by: Jigao Luo <[email protected]>
Signed-off-by: Jigao Luo <[email protected]>
… in host pinned memory Signed-off-by: Jigao Luo <[email protected]>
Signed-off-by: Jigao Luo <[email protected]>
Signed-off-by: Jigao Luo <[email protected]>
….hpp later) Signed-off-by: Jigao Luo <[email protected]>
…ater) Signed-off-by: Jigao Luo <[email protected]>
…ies.hpp later) Signed-off-by: Jigao Luo <[email protected]>
…ater) Signed-off-by: Jigao Luo <[email protected]>
…ies.hpp later) Signed-off-by: Jigao Luo <[email protected]>
…ories.hpp later) Signed-off-by: Jigao Luo <[email protected]>
…ater) Signed-off-by: Jigao Luo <[email protected]>
…ctories.hpp later) Signed-off-by: Jigao Luo <[email protected]>
…ies.hpp later) Signed-off-by: Jigao Luo <[email protected]>
Signed-off-by: Jigao Luo <[email protected]>
Hi @vuule @mhaseeb123 @wence- @GregoryKimball,

I plotted my benchmark in the style of Greg's comment #15620 (comment). The PR draft compiles fully, so feel free to test it when you have time. You can also find my simple command and details here:

How many miss-sync are saved?

You will see where this speedup comes from and how nasty the miss-sync can be:

[Screenshots: Upstream vs. This PR]

All of those miss-syncs come from the Parquet writer. With this patch, no pageable memcpy remains in the Parquet reader.

Command

for t in 1 2 4 8 16 32 64 128;
do
  ./PARQUET_MULTITHREAD_READER_NVBENCH -d 0 -b 0 --axis num_cols=32 --axis run_length=2 --axis total_data_size=$((1024 * 1024 * 128 * t)) --axis num_threads=$t --axis num_iterations=10 --csv <PATH>;
done

I ran this same command on this PR draft and on the upstream branch-25.08 to generate two sets of CSV files.

Hardware setup
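As a side note on the benchmark command above: the loop scales `total_data_size` together with `num_threads`, in steps of 128 MiB per thread count. The sketch below reproduces only that shell arithmetic in C++; the function name `total_data_size_bytes` is made up for illustration and is not part of the benchmark. Whether the benchmark splits the total evenly across threads is an assumption, not something stated in this thread.

```cpp
#include <cstdint>

// Hypothetical helper mirroring the shell expression $((1024 * 1024 * 128 * t)).
// A 64-bit type is used deliberately: from t = 16 upward the result
// (2 GiB and more) no longer fits in a signed 32-bit integer.
std::uint64_t total_data_size_bytes(std::uint64_t num_threads)
{
  constexpr std::uint64_t bytes_per_step = 1024ULL * 1024ULL * 128ULL;  // 128 MiB
  return bytes_per_step * num_threads;
}
```

For the largest configuration in the loop (`t=128`) this gives 16 GiB of total data.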
…19020) Related to #18968 (comment) This PR updates the `batched_memset` cuIO utility to take in a `host_span` type argument instead of a `std::vector` to allow using `cudf::host_vectors` or `cudf::pinned_vectors` in the future as input. Authors: - Muhammad Haseeb (https://github.com/mhaseeb123) Approvers: - David Wendt (https://github.com/davidwendt) - Bradley Dice (https://github.com/bdice) URL: #19020
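The motivation for taking a span instead of a `std::vector` can be illustrated with a small sketch. Everything below is hypothetical: `simple_span` is a simplified stand-in for a span type such as `host_span`, `stand_in_allocator` merely plays the role of a non-default (for example pinned) allocator, and `total_bytes` is not the real `batched_memset` signature. The point is only that a non-owning view accepts vectors with any allocator, whereas a `std::vector<T> const&` parameter accepts exactly one vector type.

```cpp
#include <cstddef>
#include <memory>
#include <numeric>
#include <vector>

// Hypothetical, simplified span: a non-owning (pointer, size) view.
template <typename T>
class simple_span {
 public:
  // Accepts any container exposing data() and size(), whatever its allocator.
  template <typename Container>
  simple_span(Container const& c) : data_{c.data()}, size_{c.size()} {}

  T const* data() const { return data_; }
  std::size_t size() const { return size_; }
  T const* begin() const { return data_; }
  T const* end() const { return data_ + size_; }

 private:
  T const* data_;
  std::size_t size_;
};

// Hypothetical allocator standing in for a pinned-memory allocator.
// It simply forwards to std::allocator; only the distinct type matters here.
template <typename T>
struct stand_in_allocator {
  using value_type = T;
  stand_in_allocator() = default;
  template <typename U>
  stand_in_allocator(stand_in_allocator<U> const&) {}
  T* allocate(std::size_t n) { return std::allocator<T>{}.allocate(n); }
  void deallocate(T* p, std::size_t n) { std::allocator<T>{}.deallocate(p, n); }
};
template <typename T, typename U>
bool operator==(stand_in_allocator<T> const&, stand_in_allocator<U> const&) { return true; }
template <typename T, typename U>
bool operator!=(stand_in_allocator<T> const&, stand_in_allocator<U> const&) { return false; }

// Hypothetical utility: takes a span rather than `std::vector<std::size_t> const&`.
std::size_t total_bytes(simple_span<std::size_t> sizes)
{
  return std::accumulate(sizes.begin(), sizes.end(), std::size_t{0});
}
```

With this shape, both a plain `std::vector<std::size_t>` and a `std::vector<std::size_t, stand_in_allocator<std::size_t>>` can be passed to `total_bytes` without copying into a different container first.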
… (Part 1: device_scalar) (#19055) For issue #18967, this PR is the first part of merging the PR Draft #18968. In this PR, `device_scalar` utilizes explicitly host pinned memory as its internal bounce buffer. Authors: - Jigao Luo (https://github.com/JigaoLuo) - Vukasin Milovanovic (https://github.com/vuule) Approvers: - Vukasin Milovanovic (https://github.com/vuule) - Shruti Shivakumar (https://github.com/shrshi) - Muhammad Haseeb (https://github.com/mhaseeb123) URL: #19055
#19092) Contributes to #18967, part of #18968 In this PR, `hostdevice_vector::element` is removed due to its internal `cudaMemcpy` into host pageable memory. Also, the only call in it is replaced manually. Authors: - Jigao Luo (https://github.com/JigaoLuo) - Muhammad Haseeb (https://github.com/mhaseeb123) Approvers: - Muhammad Haseeb (https://github.com/mhaseeb123) - Vukasin Milovanovic (https://github.com/vuule) URL: #19092
…o unnecessary synchronization (Part 3 of miss-sync) (#19119) For issue #18967, this PR is one part of merging the PR Draft #18968. In this PR, almost all `rmm::device_scalar` calls in libcudf are replaced with `cudf::detail::device_scalar` due to its internal host-pinned bounce buffer. This is also a call to action to use host-pinned memory globally in libcudf, with arguments stated in #18967 and #18968. Authors: - Jigao Luo (https://github.com/JigaoLuo) - Vukasin Milovanovic (https://github.com/vuule) Approvers: - Vukasin Milovanovic (https://github.com/vuule) - Nghia Truong (https://github.com/ttnghia) - David Wendt (https://github.com/davidwendt) URL: #19119
// copy offsets and buff_addrs into host pinned memory
auto host_pinned_offsets = cudf::detail::make_pinned_vector_async<size_type>(offsets, stream);
auto host_pinned_buff_addrs =
  cudf::detail::make_pinned_vector_async<size_type*>(buff_addrs, stream);
Host to Host copy (with make_pinned_vector_async)
CUDF_CUDA_TRY(cudaMemcpyAsync(
  host_scalar.data(), &initial_value, sizeof(OutputType), cudaMemcpyHostToHost, stream.value()));
Host to Host copy
Note: There may be a bug in this draft or somewhere else in the codebase. I ran millions of read operations over a 10-hour period, purely reading, and encountered a single instance of incorrect results, for an estimated trigger rate of just 0.0001%. I cannot tell yet whether the bug is in my draft or in the cudf code (I am still using branch-25.08). I'm leaving this note here as a reminder for myself, and we can consider adding more checks while merging the rest of this draft.

[Selfnote Update] I've observed the same bug even without the patch. The reproduction now seems tied to my metadata caching PR. I'll allocate time to reproduce it more reliably.

Description
For issue #18967, this draft PR aims to remove all unnecessary synchronization points (termed "miss-sync") in the Parquet reader. Please hold off on merging this draft. The plan is to split it into smaller PRs for the actual merge.
TL;DR: This is the scalability gain 🚀 expected once the smaller PRs split out of this draft are merged:
Checklist