
Conversation

@JigaoLuo
Contributor

@JigaoLuo JigaoLuo commented May 26, 2025

Description

For issue #18967, this draft PR aims to remove all unnecessary synchronization points (termed "miss-sync") in the Parquet reader. Please hold off on merging this draft; the plan is to split it into smaller PRs for the actual merge.
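
To make the term concrete: a miss-sync here is an implicit host synchronization caused by copying through pageable host memory. A minimal, illustrative CUDA sketch (not code from this PR; the names are made up):

#include <cuda_runtime.h>

// Illustrative only: reading one device value back to the host two ways.
void read_back(int const* d_value, cudaStream_t stream)
{
  // (1) Pageable destination: despite the "Async" name, a device-to-pageable
  //     copy is staged through a driver buffer and blocks the host (a miss-sync).
  int pageable_result = 0;
  cudaMemcpyAsync(&pageable_result, d_value, sizeof(int), cudaMemcpyDeviceToHost, stream);

  // (2) Pinned destination: the copy is truly asynchronous; the only
  //     synchronization is the explicit one issued when the value is needed.
  int* pinned_result = nullptr;
  cudaMallocHost(&pinned_result, sizeof(int));
  cudaMemcpyAsync(pinned_result, d_value, sizeof(int), cudaMemcpyDeviceToHost, stream);
  cudaStreamSynchronize(stream);
  cudaFreeHost(pinned_result);
}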

TL;DR: this is the scalability gain 🚀 once the smaller PRs split out from this draft are merged:

[Figure: scalability comparison, upstream vs. this draft]

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

JigaoLuo added 29 commits May 25, 2025 17:38
…ion with preparing a pinned_vector

Signed-off-by: Jigao Luo <[email protected]>
…ion with preparing a pinned_vector

Signed-off-by: Jigao Luo <[email protected]>
…nction with preparing a pinned_vector

Signed-off-by: Jigao Luo <[email protected]>
@JigaoLuo JigaoLuo requested a review from a team as a code owner May 26, 2025 22:21
@JigaoLuo
Contributor Author

JigaoLuo commented May 28, 2025

Hi @vuule @mhaseeb123 @wence- @GregoryKimball,
Thanks for the review and discussion! I think some points still need further discussion; once we finalize the details, I’ll address the feedback in separate PRs. I’ll definitely make time in my free time to get this merged.

I’ll also explore the performance impact in benchmarks. [Update: you can find the benchmark results in the following message.]

@JigaoLuo
Contributor Author

JigaoLuo commented May 29, 2025

Hi @vuule @mhaseeb123 @wence- @GregoryKimball,
I’ve run preliminary benchmarks, and as expected, the PR draft shows significant 🚀 gains in scenarios with frequent synchronization stalls (“miss-sync”). This is particularly evident when each thread reads multiple small, concurrent file segments, for example each thread reading different row groups from a large Parquet file. If this use case convinces you, then we are one step closer to a SpeedOfLight Parquet reader.
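
For reference, a minimal sketch of that use case, with each thread reading its own row group of the same file on its own stream (illustrative libcudf usage, not the benchmark’s actual code; the path and the one-row-group-per-thread mapping are placeholders):

#include <cudf/io/parquet.hpp>

#include <rmm/cuda_stream_pool.hpp>

#include <string>
#include <thread>
#include <vector>

// Each thread reads a different row group of the same Parquet file on its own stream.
void read_row_groups_concurrently(std::string const& path, int num_threads)
{
  rmm::cuda_stream_pool stream_pool(num_threads);
  std::vector<std::thread> threads;
  for (int t = 0; t < num_threads; ++t) {
    threads.emplace_back([&, t] {
      auto opts = cudf::io::parquet_reader_options::builder(cudf::io::source_info{path})
                    .row_groups({{t}})  // thread t reads row group t
                    .build();
      auto result = cudf::io::read_parquet(opts, stream_pool.get_stream(t));
      (void)result;  // the table is discarded in this sketch
    });
  }
  for (auto& th : threads) { th.join(); }
}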

I plotted my benchmark in the style of Greg's comment #15620 (comment). The PR draft compiles as-is, so feel free to test it when you have time.

[Figure: benchmark results over thread count, upstream vs. this PR draft]

You can also find my command and setup details below:

How many miss-syncs are saved?

The numbers below show where the speedup comes from and how nasty the miss-syncs can be:

Upstream

$ nsys export --output report_upstream_8threads.sqlite --type sqlite report_upstream_8threads.nsys-rep
$ nsys analyze -r cuda_memcpy_async:rows=-1 report_upstream_8threads.sqlite | wc -l
166861

This PR

$ nsys analyze -r cuda_memcpy_async:rows=-1 report_pr_8threads.sqlite | wc -l
301

All of the remaining miss-syncs come from the Parquet writer; with this patch, no pageable memcpy remains in the Parquet reader.

Command

With num_iterations=10, each thread reads 128 MB ten times, which mimics the use case and also creates more miss-syncs.

for t in 1 2 4 8 16 32 64 128; 
do 
  ./PARQUET_MULTITHREAD_READER_NVBENCH -d 0 -b 0 --axis num_cols=32 --axis run_length=2 --axis total_data_size=$((1024 * 1024 * 128 * t)) --axis num_threads=$t --axis num_iterations=10 --csv <PATH>;
done

I ran the same command on this PR draft and on the upstream branch-25.08 to generate the two sets of CSV files.

Hardware setup

RMM memory resource = pool
CUIO host memory resource = pinned_pool
# Devices

## [0] `NVIDIA A100-SXM4-40GB`
* SM Version: 800 (PTX Version: 800)
* Number of SMs: 108
* SM Default Clock Rate: 1410 MHz
* Global Memory: 19704 MiB Free / 40339 MiB Total
* Global Memory Bus Peak: 1555 GB/sec (5120-bit DDR @1215MHz)
* Max Shared Memory: 164 KiB/SM, 48 KiB/Block
* L2 Cache Size: 40960 KiB
* Maximum Active Blocks: 32/SM
* Maximum Active Threads: 2048/SM, 1024/Block
* Available Registers: 65536/SM, 65536/Block
* ECC Enabled: Yes

rapids-bot bot pushed a commit that referenced this pull request May 29, 2025
…19020)

Related to #18968 (comment)
This PR updates the `batched_memset` cuIO utility to take a `host_span` argument instead of a `std::vector`, to allow using `cudf::host_vectors` or `cudf::pinned_vectors` as input in the future.

Authors:
  - Muhammad Haseeb (https://github.com/mhaseeb123)

Approvers:
  - David Wendt (https://github.com/davidwendt)
  - Bradley Dice (https://github.com/bdice)

URL: #19020
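
As a hedged illustration of why `host_span` helps (the helper below is made up, not the real `batched_memset` signature): a `cudf::host_span` parameter accepts any contiguous host container, so call sites can later switch from a pageable `std::vector` to a pinned host vector without touching the callee.

#include <cudf/utilities/span.hpp>

#include <cstddef>
#include <vector>

// Illustrative helper taking a host_span instead of std::vector const&.
std::size_t total_bytes(cudf::host_span<std::size_t const> sizes)
{
  std::size_t total = 0;
  for (auto s : sizes) { total += s; }
  return total;
}

// Usage: a pageable std::vector today, a pinned host vector tomorrow,
// with the same call site:
//   std::vector<std::size_t> sizes{128, 256};
//   auto n = total_bytes(sizes);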
@JigaoLuo JigaoLuo changed the title Remove unnecessary synchronization (miss-sync) during Parquet reading [DO NOT MERGE] Remove unnecessary synchronization (miss-sync) during Parquet reading Jun 1, 2025
rapids-bot bot pushed a commit that referenced this pull request Jun 4, 2025
… (Part 1: device_scalar) (#19055)

For issue #18967, this PR is the first part of merging the draft PR #18968. In this PR, `device_scalar` explicitly uses host-pinned memory as its internal bounce buffer.

Authors:
  - Jigao Luo (https://github.com/JigaoLuo)
  - Vukasin Milovanovic (https://github.com/vuule)

Approvers:
  - Vukasin Milovanovic (https://github.com/vuule)
  - Shruti Shivakumar (https://github.com/shrshi)
  - Muhammad Haseeb (https://github.com/mhaseeb123)

URL: #19055
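
A conceptual sketch of the idea behind this change (not the actual `cudf::detail::device_scalar` source; the class and member names below are illustrative): keep a pinned host bounce buffer next to the device value, so host/device copies never touch pageable memory and the host synchronizes only when it actually needs the value.

#include <cuda_runtime.h>

#include <rmm/cuda_stream_view.hpp>
#include <rmm/device_scalar.hpp>

// Conceptual sketch: a device scalar backed by a pinned host bounce buffer.
template <typename T>
class pinned_backed_scalar {
 public:
  explicit pinned_backed_scalar(rmm::cuda_stream_view stream) : d_value_{stream}
  {
    cudaMallocHost(&h_bounce_, sizeof(T));  // pinned bounce buffer
  }
  ~pinned_backed_scalar() { cudaFreeHost(h_bounce_); }

  void set_value_async(T const& v, rmm::cuda_stream_view stream)
  {
    *h_bounce_ = v;  // host-to-host write into pinned memory, no implicit sync
    cudaMemcpyAsync(d_value_.data(), h_bounce_, sizeof(T), cudaMemcpyHostToDevice, stream.value());
  }

  T value(rmm::cuda_stream_view stream)
  {
    cudaMemcpyAsync(h_bounce_, d_value_.data(), sizeof(T), cudaMemcpyDeviceToHost, stream.value());
    stream.synchronize();  // the only (intentional) synchronization
    return *h_bounce_;
  }

 private:
  rmm::device_scalar<T> d_value_;
  T* h_bounce_{nullptr};
};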
rapids-bot bot pushed a commit that referenced this pull request Jul 23, 2025
#19092)

Contributes to #18967, part of #18968 

In this PR, `hostdevice_vector::element` is removed because it performs an internal `cudaMemcpy` into host pageable memory; its only call site is replaced manually.

Authors:
  - Jigao Luo (https://github.com/JigaoLuo)
  - Muhammad Haseeb (https://github.com/mhaseeb123)

Approvers:
  - Muhammad Haseeb (https://github.com/mhaseeb123)
  - Vukasin Milovanovic (https://github.com/vuule)

URL: #19092
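
As a general illustration of the replacement pattern (not the exact call-site change in this PR): rather than one small device-to-pageable-host copy per element, stage the whole buffer through pinned memory once and index the host copy afterwards.

#include <cuda_runtime.h>

#include <cstddef>
#include <vector>

// Illustrative only: bulk copy through a pinned staging buffer, then host-side indexing.
template <typename T>
std::vector<T> bulk_to_host(T const* d_data, std::size_t n, cudaStream_t stream)
{
  T* h_pinned = nullptr;
  cudaMallocHost(&h_pinned, n * sizeof(T));  // pinned staging buffer
  cudaMemcpyAsync(h_pinned, d_data, n * sizeof(T), cudaMemcpyDeviceToHost, stream);  // truly async
  cudaStreamSynchronize(stream);  // one intentional sync
  std::vector<T> result(h_pinned, h_pinned + n);
  cudaFreeHost(h_pinned);
  return result;
}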
@JigaoLuo

This comment was marked as outdated.

rapids-bot bot pushed a commit that referenced this pull request Aug 12, 2025
…o unnecessary synchronization (Part 3 of miss-sync) (#19119)

For issue #18967, this PR is one part of merging the draft PR #18968. In this PR, almost all `rmm::device_scalar` uses in libcudf are replaced with `cudf::detail::device_scalar`, which uses an internal host-pinned bounce buffer.

This is also a call to action to use host-pinned memory globally in libcudf, with the arguments stated in #18967 and #18968.

Authors:
  - Jigao Luo (https://github.com/JigaoLuo)
  - Vukasin Milovanovic (https://github.com/vuule)

Approvers:
  - Vukasin Milovanovic (https://github.com/vuule)
  - Nghia Truong (https://github.com/ttnghia)
  - David Wendt (https://github.com/davidwendt)

URL: #19119
Comment on lines +547 to +550
// copy offsets and buff_addrs into host pinned memory
auto host_pinned_offsets = cudf::detail::make_pinned_vector_async<size_type>(offsets, stream);
auto host_pinned_buff_addrs =
  cudf::detail::make_pinned_vector_async<size_type*>(buff_addrs, stream);
Contributor Author

@JigaoLuo JigaoLuo Aug 19, 2025

Host to Host copy (with make_pinned_vector_async)

Comment on lines +72 to +73
CUDF_CUDA_TRY(cudaMemcpyAsync(
host_scalar.data(), &initial_value, sizeof(OutputType), cudaMemcpyHostToHost, stream.value()));
Contributor Author

Host to Host copy

@mhaseeb123 mhaseeb123 removed this from the Parquet continuous improvement milestone Aug 20, 2025
@JigaoLuo
Contributor Author

JigaoLuo commented Sep 21, 2025

Note: there may be a bug in this draft or somewhere else in the codebase. I ran millions of read operations (purely reading) over a 10-hour period and encountered a single instance of incorrect results, an estimated trigger rate of roughly 0.0001%. I cannot be sure where the bug is: in my draft or in the cudf code (I am still using branch-25.08).

I’m leaving this note here as a reminder to myself; we can consider adding more checks while merging the rest of this draft.

[Self-note update] I’ve observed the same bug even without this patch; the reproduction now seems tied to my metadata caching PR. I’ll allocate time to reproduce it more reliably.


Labels

libcudf Affects libcudf (C++/CUDA) code.
