-
Notifications
You must be signed in to change notification settings - Fork 983
Replace rmm::device_scalar with cudf::detail::device_scalar due to unnecessary synchronization (Part 3 of miss-sync)
#19119
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Replace rmm::device_scalar with cudf::detail::device_scalar due to unnecessary synchronization (Part 3 of miss-sync)
#19119
Conversation
…other changes for linker and compiler) Signed-off-by: Jigao Luo <[email protected]>
…erator.cuh Signed-off-by: Jigao Luo <[email protected]>
Signed-off-by: Jigao Luo <[email protected]>
Signed-off-by: Jigao Luo <[email protected]>
Signed-off-by: Jigao Luo <[email protected]>
Signed-off-by: Jigao Luo <[email protected]>
Signed-off-by: Jigao Luo <[email protected]>
|
I also updated So, when is such replacement not feasible?Only one instance of cudf/cpp/src/groupby/hash/compute_aggregations.cuh Lines 108 to 126 in 44ee9e5
Replacing it is not feasible as The error log if this replacement is attempted:Non-Parquet-related ChangesChanges in these following files do not impact Parquet:
Replacements were still made for consistency with the |
|
/ok to test 04e5415 |
vuule
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the PR!
Some of the changes need to be reverted, the rest looks good 👍
Do you know which of these caused pageable copies in the Parquet reader?
This reverts commit 06ae1d2.
This comment was marked as resolved.
This comment was marked as resolved.
…uh (and other changes for linker and compiler)" This reverts commit 7f0785a.
|
Addressing rapidsai/rmm#1955 would help improve this issue with rmm directly (we shouldn't block this PR on that making that change though, it's a long-term suggestion for improvement). |
…symbolerror Signed-off-by: Jigao Luo <[email protected]>
Signed-off-by: Jigao Luo <[email protected]>
I think this should better wait until #19608 merges in, then rebase before merging. |
This reverts commit cc6baa9.
23b400c to
be5c1f4
Compare
|
I’ve reverted the previous changes in Additionally, I ran a similar profiling to #18968 (comment) and observed a slight drop in the number of mis-sync: The command:for t in 1 2 4 8 16 32 64 128;
do
./PARQUET_MULTITHREAD_READER_NVBENCH -d 0 -b 0 --axis num_cols=32 --axis run_length=2 --axis total_data_size=$((1024 * 1024 * 128 * t)) --axis num_threads=$t --axis num_iterations=10 --csv <PATH>;
done
# then with nsys profile
nsys export -t sqlite report{thread-case}.nsys-rep
nsys analyze -r cuda_memcpy_async:rows=-1 report{thread-case}.sqlite| wc -lThe current branch 25.10: Before this PR$ nsys analyze -r cuda_memcpy_async:rows=-1 report_upstream_8threads.sqlite | wc -l
63181With this PR$ nsys analyze -r cuda_memcpy_async:rows=-1 report_PR_8threads.sqlite | wc -l
63021 |
|
The target number we're aiming for is the one mentioned in the draft PR comment : ~301, based on the same binary benchmark run. I don’t recall the exact impact of each individual change on the reduction of mis-syncs, but from what I observed, the thrust reduction seemed to trigger the most mis-synchronizations. There are also a lot of calls of thrust reduction. Apologies for the delay—I wasn’t able to find time over the past two days. I also have a major paper deadline coming up, so I want to apologize in advance for any slow responses. |
|
Now I turn this PR Draft into a PR. But let's wait for the #19608 |
Fixes the `cudf::reduction::detail::reduce` internal utility to use the returned `cudf::scalar` instances directly in the CUB calls to simplify the logic. This should help solve the issues for building/running #19119 -- the device-scalar ctors are no longer required. Authors: - David Wendt (https://github.com/davidwendt) Approvers: - Tianyu Liu (https://github.com/kingcrimsontianyu) - Muhammad Haseeb (https://github.com/mhaseeb123) - Nghia Truong (https://github.com/ttnghia) URL: #19608
|
Ok, #19608 is merge. I think this will be good to go once the merge conflict is resolved here. |
…c-cudf-devicescalar
|
Thanks for the notification. I’ve resolved the conflict. |
|
/ok to test 6708c96 |
vuule
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for iterating on this.
|
Thanks again for your help—really appreciate it! Also, big thanks to everyone here, including RMM forks. |
|
/ok to test 5fc3411 |
|
/merge |
Description
For issue #18967, this PR is one part of merging the PR Draft #18968. In this PR, almost all
rmm::device_scalarcalls in libcudf are replaced withcudf::detail::device_scalardue to its internal host-pinned bounce buffer.This is also a call to action to use host-pinned memory globally in libcudf, with arguments stated in #18967 and #18968.
Checklist