
Conversation

@JigaoLuo
Contributor

@JigaoLuo JigaoLuo commented May 26, 2025

Description

For issue #18967, this draft PR aims to remove all unnecessary synchronization points (termed "miss-sync") in the Parquet reader. Please hold off on merging this draft; the plan is to split it into smaller PRs for the actual merge.
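
To make the term concrete: a miss-sync here is an implicit host synchronization caused by copying through pageable host memory. A minimal, illustrative CUDA sketch (not code from this PR; the names are made up):

#include <cuda_runtime.h>

// Illustrative only: reading one device value back to the host two ways.
void read_back(int const* d_value, cudaStream_t stream)
{
  // (1) Pageable destination: despite the "Async" name, a device-to-pageable
  //     copy is staged through a driver buffer and blocks the host (a miss-sync).
  int pageable_result = 0;
  cudaMemcpyAsync(&pageable_result, d_value, sizeof(int), cudaMemcpyDeviceToHost, stream);

  // (2) Pinned destination: the copy is truly asynchronous; the only
  //     synchronization is the explicit one issued when the value is needed.
  int* pinned_result = nullptr;
  cudaMallocHost(&pinned_result, sizeof(int));
  cudaMemcpyAsync(pinned_result, d_value, sizeof(int), cudaMemcpyDeviceToHost, stream);
  cudaStreamSynchronize(stream);
  cudaFreeHost(pinned_result);
}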

TL;DR: this is the scalability gain 🚀 once the smaller PRs split out from this draft are merged:

[Figure: scalability comparison, upstream vs. this draft]

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

JigaoLuo added 29 commits May 25, 2025 17:38
…ion with preparing a pinned_vector

Signed-off-by: Jigao Luo <[email protected]>
…ion with preparing a pinned_vector

Signed-off-by: Jigao Luo <[email protected]>
…nction with preparing a pinned_vector

Signed-off-by: Jigao Luo <[email protected]>
@JigaoLuo JigaoLuo requested a review from a team as a code owner May 26, 2025 22:21
@JigaoLuo
Contributor Author

JigaoLuo commented May 28, 2025

Hi @vuule @mhaseeb123 @wence- @GregoryKimball,
Thanks for the review and discussion! I think some points still need further discussion; once we finalize the details, I’ll address the feedback in separate PRs. I’ll definitely make time in my free time to get this merged.

I’ll also explore the performance impact in benchmarks. [Update: you can find the benchmark results in the following message.]

@JigaoLuo
Contributor Author

JigaoLuo commented May 29, 2025

Hi @vuule @mhaseeb123 @wence- @GregoryKimball,
I’ve run preliminary benchmarks, and as expected, the PR draft shows significant 🚀 gains in scenarios with frequent synchronization stalls (“miss-sync”). This is particularly evident when each thread reads multiple small, concurrent file segments, for example each thread reading different row groups from a large Parquet file. If this use case convinces you, then we are one step closer to a SpeedOfLight Parquet reader.
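
For reference, a minimal sketch of that use case, with each thread reading its own row group of the same file on its own stream (illustrative libcudf usage, not the benchmark’s actual code; the path and the one-row-group-per-thread mapping are placeholders):

#include <cudf/io/parquet.hpp>

#include <rmm/cuda_stream_pool.hpp>

#include <string>
#include <thread>
#include <vector>

// Each thread reads a different row group of the same Parquet file on its own stream.
void read_row_groups_concurrently(std::string const& path, int num_threads)
{
  rmm::cuda_stream_pool stream_pool(num_threads);
  std::vector<std::thread> threads;
  for (int t = 0; t < num_threads; ++t) {
    threads.emplace_back([&, t] {
      auto opts = cudf::io::parquet_reader_options::builder(cudf::io::source_info{path})
                    .row_groups({{t}})  // thread t reads row group t
                    .build();
      auto result = cudf::io::read_parquet(opts, stream_pool.get_stream(t));
      (void)result;  // the table is discarded in this sketch
    });
  }
  for (auto& th : threads) { th.join(); }
}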

I plotted my benchmark in the style of Greg's comment #15620 (comment). The PR draft compiles as-is, so feel free to test it when you have time.

[Figure: benchmark results over thread count, upstream vs. this PR draft]

You can also find my command and setup details below:

How many miss-syncs are saved?

The numbers below show where the speedup comes from and how nasty the miss-syncs can be:

Upstream

$ nsys export --output report_upstream_8threads.sqlite --type sqlite report_upstream_8threads.nsys-rep
$ nsys analyze -r cuda_memcpy_async:rows=-1 report_upstream_8threads.sqlite | wc -l
166861

This PR

$ nsys analyze -r cuda_memcpy_async:rows=-1 report_pr_8threads.sqlite | wc -l
301

All of the remaining miss-syncs come from the Parquet writer; with this patch, no pageable memcpy remains in the Parquet reader.

Command

With num_iterations=10, each thread reads 128 MB ten times, which mimics the use case and also creates more miss-syncs.

for t in 1 2 4 8 16 32 64 128; 
do 
  ./PARQUET_MULTITHREAD_READER_NVBENCH -d 0 -b 0 --axis num_cols=32 --axis run_length=2 --axis total_data_size=$((1024 * 1024 * 128 * t)) --axis num_threads=$t --axis num_iterations=10 --csv <PATH>;
done

I ran the same command on this PR draft and on the upstream branch-25.08 to generate the two sets of CSV files.

Hardware setup

RMM memory resource = pool
CUIO host memory resource = pinned_pool
# Devices

## [0] `NVIDIA A100-SXM4-40GB`
* SM Version: 800 (PTX Version: 800)
* Number of SMs: 108
* SM Default Clock Rate: 1410 MHz
* Global Memory: 19704 MiB Free / 40339 MiB Total
* Global Memory Bus Peak: 1555 GB/sec (5120-bit DDR @1215MHz)
* Max Shared Memory: 164 KiB/SM, 48 KiB/Block
* L2 Cache Size: 40960 KiB
* Maximum Active Blocks: 32/SM
* Maximum Active Threads: 2048/SM, 1024/Block
* Available Registers: 65536/SM, 65536/Block
* ECC Enabled: Yes

rapids-bot bot pushed a commit that referenced this pull request May 29, 2025
…19020)

Related to #18968 (comment)
This PR updates the `batched_memset` cuIO utility to take a `host_span` argument instead of a `std::vector`, to allow using `cudf::host_vectors` or `cudf::pinned_vectors` as input in the future.

Authors:
  - Muhammad Haseeb (https://github.com/mhaseeb123)

Approvers:
  - David Wendt (https://github.com/davidwendt)
  - Bradley Dice (https://github.com/bdice)

URL: #19020
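
As a hedged illustration of why `host_span` helps (the helper below is made up, not the real `batched_memset` signature): a `cudf::host_span` parameter accepts any contiguous host container, so call sites can later switch from a pageable `std::vector` to a pinned host vector without touching the callee.

#include <cudf/utilities/span.hpp>

#include <cstddef>
#include <vector>

// Illustrative helper taking a host_span instead of std::vector const&.
std::size_t total_bytes(cudf::host_span<std::size_t const> sizes)
{
  std::size_t total = 0;
  for (auto s : sizes) { total += s; }
  return total;
}

// Usage: a pageable std::vector today, a pinned host vector tomorrow,
// with the same call site:
//   std::vector<std::size_t> sizes{128, 256};
//   auto n = total_bytes(sizes);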
@JigaoLuo JigaoLuo changed the title Remove unnecessary synchronization (miss-sync) during Parquet reading [DO NOT MERGE] Remove unnecessary synchronization (miss-sync) during Parquet reading Jun 1, 2025
rapids-bot bot pushed a commit that referenced this pull request Jun 4, 2025
… (Part 1: device_scalar) (#19055)

For issue #18967, this PR is the first part of merging the draft PR #18968. In this PR, `device_scalar` explicitly uses host-pinned memory as its internal bounce buffer.

Authors:
  - Jigao Luo (https://github.com/JigaoLuo)
  - Vukasin Milovanovic (https://github.com/vuule)

Approvers:
  - Vukasin Milovanovic (https://github.com/vuule)
  - Shruti Shivakumar (https://github.com/shrshi)
  - Muhammad Haseeb (https://github.com/mhaseeb123)

URL: #19055
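
A conceptual sketch of the idea behind this change (not the actual `cudf::detail::device_scalar` source; the class and member names below are illustrative): keep a pinned host bounce buffer next to the device value, so host/device copies never touch pageable memory and the host synchronizes only when it actually needs the value.

#include <cuda_runtime.h>

#include <rmm/cuda_stream_view.hpp>
#include <rmm/device_scalar.hpp>

// Conceptual sketch: a device scalar backed by a pinned host bounce buffer.
template <typename T>
class pinned_backed_scalar {
 public:
  explicit pinned_backed_scalar(rmm::cuda_stream_view stream) : d_value_{stream}
  {
    cudaMallocHost(&h_bounce_, sizeof(T));  // pinned bounce buffer
  }
  ~pinned_backed_scalar() { cudaFreeHost(h_bounce_); }

  void set_value_async(T const& v, rmm::cuda_stream_view stream)
  {
    *h_bounce_ = v;  // host-to-host write into pinned memory, no implicit sync
    cudaMemcpyAsync(d_value_.data(), h_bounce_, sizeof(T), cudaMemcpyHostToDevice, stream.value());
  }

  T value(rmm::cuda_stream_view stream)
  {
    cudaMemcpyAsync(h_bounce_, d_value_.data(), sizeof(T), cudaMemcpyDeviceToHost, stream.value());
    stream.synchronize();  // the only (intentional) synchronization
    return *h_bounce_;
  }

 private:
  rmm::device_scalar<T> d_value_;
  T* h_bounce_{nullptr};
};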
rapids-bot bot pushed a commit that referenced this pull request Jul 23, 2025
#19092)

Contributes to #18967, part of #18968 

In this PR, `hostdevice_vector::element` is removed because it performs an internal `cudaMemcpy` into host pageable memory; its only call site is replaced manually.

Authors:
  - Jigao Luo (https://github.com/JigaoLuo)
  - Muhammad Haseeb (https://github.com/mhaseeb123)

Approvers:
  - Muhammad Haseeb (https://github.com/mhaseeb123)
  - Vukasin Milovanovic (https://github.com/vuule)

URL: #19092
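
As a general illustration of the replacement pattern (not the exact call-site change in this PR): rather than one small device-to-pageable-host copy per element, stage the whole buffer through pinned memory once and index the host copy afterwards.

#include <cuda_runtime.h>

#include <cstddef>
#include <vector>

// Illustrative only: bulk copy through a pinned staging buffer, then host-side indexing.
template <typename T>
std::vector<T> bulk_to_host(T const* d_data, std::size_t n, cudaStream_t stream)
{
  T* h_pinned = nullptr;
  cudaMallocHost(&h_pinned, n * sizeof(T));  // pinned staging buffer
  cudaMemcpyAsync(h_pinned, d_data, n * sizeof(T), cudaMemcpyDeviceToHost, stream);  // truly async
  cudaStreamSynchronize(stream);  // one intentional sync
  std::vector<T> result(h_pinned, h_pinned + n);
  cudaFreeHost(h_pinned);
  return result;
}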
@JigaoLuo

This comment was marked as outdated.

rapids-bot bot pushed a commit that referenced this pull request Aug 12, 2025
…o unnecessary synchronization (Part 3 of miss-sync) (#19119)

For issue #18967, this PR is one part of merging the draft PR #18968. In this PR, almost all `rmm::device_scalar` uses in libcudf are replaced with `cudf::detail::device_scalar`, which uses an internal host-pinned bounce buffer.

This is also a call to action to use host-pinned memory globally in libcudf, with the arguments stated in #18967 and #18968.

Authors:
  - Jigao Luo (https://github.com/JigaoLuo)
  - Vukasin Milovanovic (https://github.com/vuule)

Approvers:
  - Vukasin Milovanovic (https://github.com/vuule)
  - Nghia Truong (https://github.com/ttnghia)
  - David Wendt (https://github.com/davidwendt)

URL: #19119
Comment on lines +547 to +550
// copy offsets and buff_addrs into host pinned memory
auto host_pinned_offsets = cudf::detail::make_pinned_vector_async<size_type>(offsets, stream);
auto host_pinned_buff_addrs =
  cudf::detail::make_pinned_vector_async<size_type*>(buff_addrs, stream);
Contributor Author

@JigaoLuo JigaoLuo Aug 19, 2025

Host to Host copy (with make_pinned_vector_async)

Comment on lines +72 to +73
CUDF_CUDA_TRY(cudaMemcpyAsync(
host_scalar.data(), &initial_value, sizeof(OutputType), cudaMemcpyHostToHost, stream.value()));
Contributor Author

Host to Host copy

@mhaseeb123 mhaseeb123 removed this from the Parquet continuous improvement milestone Aug 20, 2025
@JigaoLuo
Contributor Author

JigaoLuo commented Sep 21, 2025

Note: there may be a bug in this draft or somewhere else in the codebase. I ran millions of read operations (purely reading) over a 10-hour period and encountered a single instance of incorrect results, an estimated trigger rate of roughly 0.0001%. I cannot be sure where the bug is: in my draft or in the cudf code (I am still using branch-25.08).

I’m leaving this note here as a reminder to myself; we can consider adding more checks while merging the rest of this draft.

[Self-note update] I’ve observed the same bug even without this patch; the reproduction now seems tied to my metadata caching PR. I’ll allocate time to reproduce it more reliably.


Labels

libcudf Affects libcudf (C++/CUDA) code.
