This is part of my ongoing series of studies on the Parquet reader #18892. Through my analysis, I have identified a key observation when reading Parquet files: unstable performance with pipelining and multistreams. The root cause of this performance instability had not been identified. The instability (termed "pipeline bubbles" by the cuDF team) is also observed in the Velox-CuDF TableScan and in all engines built on top of libcudf.
In this issue, I will explain "mis-sync" (mis-synchronization, i.e., unnecessary synchronization) as the root cause of the instability (the "pipeline bubbles"). This issue and its follow-up PRs are critical for enhancing performance in libcudf and in all downstream applications built on top of it. This issue documents all relevant context and know-how, so subsequent PRs can focus strictly on fixes.
To the cuDF team:
I’m confident that this issue, not any computation kernel, is the bottleneck in the Parquet reader, and my follow-up PR will resolve it to fully unleash the GPU's computational power in the Parquet reader. All unit tests pass locally except for the previously mentioned issue #18455.
I will first open a large draft PR with all my changes. But given that most synchronization gaps occur in unrelated contexts, I can split it into multiple small PRs, each targeting a single mis-sync instance. I have also refactored my earlier mostly-CUB code to reuse existing components from cudf::reduction::detail::reduce to minimize redundancy. Your feedback and review are welcome!
1. What is mis-sync?
TL;DR: Host-pageable memory is not your friend.
Mis-sync refers to unnecessary CUDA synchronization triggered by cudaMemcpy operations involving host-pageable memory. As highlighted in my motivation above (and in Velox-CuDF's TableScan profiling), mis-sync is the primary bottleneck stifling GPU efficiency in the Parquet reader. Unnecessary stream synchronization during memcpy operations disrupts GPU compute pipelines, introduces latency and computation hiccups, and ultimately causes instability in end-to-end query execution.
`cudaMemcpy` with pageable memory in CUDA documentation:
Per the CUDA documentation, such operations implicitly introduce synchronization points:
> Memcpy Asynchronous
> - For transfers between device memory and pageable host memory, the function might be synchronous with respect to host.
From: https://docs.nvidia.com/cuda/cuda-runtime-api/api-sync-behavior.html#api-sync-behavior__memcpy-async
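To make this concrete, here is a minimal standalone sketch (my own illustration, not code from libcudf) contrasting a pageable host buffer with a pinned one on the same stream:

```cpp
// Sketch: the same cudaMemcpyAsync call behaves differently depending on
// whether the destination host buffer is pageable or pinned.
#include <cuda_runtime.h>

int main() {
  cudaStream_t stream;
  cudaStreamCreate(&stream);

  int* d_data;
  cudaMalloc(&d_data, sizeof(int));

  // Pageable host memory (stack/heap): per the CUDA docs, the "async" copy
  // might be synchronous with respect to the host -> mis-sync.
  int h_pageable = 0;
  cudaMemcpyAsync(&h_pageable, d_data, sizeof(int),
                  cudaMemcpyDeviceToHost, stream);

  // Pinned host memory: the copy is truly asynchronous; the host blocks
  // only at the explicit synchronization point we choose.
  int* h_pinned;
  cudaMallocHost(&h_pinned, sizeof(int));
  cudaMemcpyAsync(h_pinned, d_data, sizeof(int),
                  cudaMemcpyDeviceToHost, stream);
  cudaStreamSynchronize(stream);  // explicit, intentional sync

  cudaFreeHost(h_pinned);
  cudaFree(d_data);
  cudaStreamDestroy(stream);
}
```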
2. Where does mis-sync come from?
TL;DR: Thrust is not your friend.
Profiling reveals Thrust as a primary source of mis-sync in my study of the Parquet reader. If a Thrust function has a return value, it stores that value in a stack variable, which is host-pageable memory. The device-to-host copy into this variable inside the Thrust function then forces an implicit CUDA synchronization. As of now, no (straightforward) method exists to redirect Thrust's output to host-pinned memory before the function returns. You can read more here: NVIDIA/cccl#4317
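For illustration, a sketch of the pattern I keep seeing in profiles (the wrapper function is hypothetical, but the shape of the call is the real one):

```cpp
// Sketch: thrust::reduce returns its result by value. That result lands in a
// pageable stack variable on the host, so Thrust must synchronize the stream
// before this function can return, even on an explicit stream.
#include <thrust/device_vector.h>
#include <thrust/system/cuda/execution_policy.h>
#include <thrust/reduce.h>

int sum_on_stream(thrust::device_vector<int> const& d_vec, cudaStream_t stream)
{
  // The returned int lives on the host stack (pageable memory); there is no
  // straightforward way to redirect it into pinned memory -> mis-sync.
  return thrust::reduce(thrust::cuda::par.on(stream),
                        d_vec.begin(), d_vec.end(), 0);
}
```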
Besides Thrust, there are additional sources of mis-sync:
- `device_uvector::element`
- `cudaMemcpy` involving `std::vector` (which is heap-allocated via the default C++ allocator)
- ...
3. How to eliminate mis-sync?
TL;DR: CUB is your friend for going multistream and pipelined.
To address Thrust-related mis-sync: replace Thrust with CUB. CUB's design ensures all memory allocations are explicitly managed outside CUB and then passed into the CUB kernel, eliminating the implicit host-pageable memory copies that cause mis-sync.
Note: The goal of rewriting with CUB is not its low-level nature or lightweight design, but its explicit memory management, a capability Thrust lacks. A sketch of the pattern follows.
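A minimal sketch of the CUB replacement, using plain CUDA allocation calls for brevity (inside libcudf one would allocate the temporary storage and output via rmm instead):

```cpp
// Sketch: cub::DeviceReduce::Sum writes its result to device memory that the
// caller owns. No host-pageable copy happens inside the call, hence no
// implicit synchronization; the caller decides when (if ever) to sync.
#include <cub/device/device_reduce.cuh>
#include <cuda_runtime.h>

void sum_async(int const* d_in, int num_items, int* d_out, cudaStream_t stream)
{
  // First call: query the required temporary storage size.
  void* d_temp      = nullptr;
  size_t temp_bytes = 0;
  cub::DeviceReduce::Sum(d_temp, temp_bytes, d_in, d_out, num_items, stream);

  // Allocation is explicit and owned by the caller, not hidden inside CUB.
  cudaMallocAsync(&d_temp, temp_bytes, stream);

  // Second call: run the reduction; the result is written to d_out on the
  // device, staying in the stream's pipeline.
  cub::DeviceReduce::Sum(d_temp, temp_bytes, d_in, d_out, num_items, stream);

  cudaFreeAsync(d_temp, stream);
}
```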
To address other mis-sync: for non-Thrust cases, duplicate host-pageable memory into host-pinned memory before the GPU operation, so the subsequent copy never touches pageable memory (see the sketch below).
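A minimal sketch of this staging pattern, again with raw CUDA calls for brevity; in libcudf, the pinned-vector factories mentioned below are the idiomatic route:

```cpp
// Sketch: stage pageable data through a pinned bounce buffer before the H2D
// copy, so cudaMemcpyAsync stays asynchronous with respect to the host.
#include <cuda_runtime.h>
#include <cstring>
#include <vector>

void upload(std::vector<int> const& host_pageable, int* d_dst, cudaStream_t stream)
{
  size_t const bytes = host_pageable.size() * sizeof(int);

  // Duplicate pageable data into pinned memory on the host (a cheap memcpy).
  int* h_pinned;
  cudaMallocHost(&h_pinned, bytes);
  std::memcpy(h_pinned, host_pageable.data(), bytes);

  // The device copy is now truly asynchronous: no implicit sync.
  cudaMemcpyAsync(d_dst, h_pinned, bytes, cudaMemcpyHostToDevice, stream);

  // Note: h_pinned must outlive the copy; a real implementation would tie its
  // lifetime to the stream instead of synchronizing here.
  cudaStreamSynchronize(stream);
  cudaFreeHost(h_pinned);
}
```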
I hope to update DEVELOPER_GUIDE.md once my PRs are complete, to document strategies for avoiding mis-sync. Many useful functions are already available, e.g. cudf::detail::make_pinned_vector_async with cudf::get_pinned_memory_resource(), but they are currently under-explained or missing from the DEVELOPER_GUIDE. This would be helpful for external contributors like me.
DEVELOPER_GUIDE:
- `device_scalar`, in "Replace `rmm::device_scalar` with `cudf::detail::device_scalar` due to unnecessary synchronization (Part 3 of miss-sync)" #19119
- more TODO
4. How to detect the source LOC of mis-sync?
TL;DR: nsys is your friend, especially the Expert Systems View.
In Nsight Systems, every CUDA runtime function is traced via a callback each time it is called, so we can see every instance of mis-sync in the Expert Systems View.
- I currently detect mis-sync by first using the nsys Expert Systems View to identify the offending calls, narrowing them down to the source LOC with NVTX ranges (see the sketch below), and then applying the fixes from "3. How to eliminate mis-sync?". But this manual approach does not scale.
- To scale this, I wonder if the process could be automated: the Expert Systems View is backed by an sqlite database that can be queried. If everything is automated in bash, I also wonder if we could introduce this mis-sync check in CI.

(TODO: more checks in CI)
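For reference, the NVTX narrowing step is just wrapping suspect regions in named ranges so that any implicit sync shows up under that range in the nsys timeline; a minimal sketch (the range name is hypothetical):

```cpp
// Sketch: wrap a candidate code path in an NVTX range; if a mis-sync occurs
// inside, nsys attributes it to this range, pinpointing the source LOC.
// (Header location may vary by CUDA toolkit version.)
#include <nvtx3/nvToolsExt.h>

void suspicious_region()
{
  nvtxRangePushA("parquet-reader/suspect-memcpy");
  // ... the code path under investigation ...
  nvtxRangePop();
}
```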
Example: Expert Systems View in Nsys:
And there is an sqlite database that can be queried:
```
$ nsys analyze -r cuda_memcpy_async "/tmp/report15.sqlite"
Processing [/tmp/report15.sqlite] with [/opt/nsight-systems-20250101/host-linux-x64/rules/cuda_memcpy_async.py]...
** CUDA Async Memcpy with Pageable Memory (cuda_memcpy_async):
There were no problems detected related to memcpy operations using pageable memory.
```
I structured the four-step explanation to highlight the mis-sync concept; these issues are project-agnostic and can arise anywhere in the absence of rigorous profiling. While I couldn't address every potential mis-sync due to limited manpower, I've fixed all instances inside the Parquet reader, mostly by reusing cudf::reduction::detail::reduce as a foundation. I'd appreciate your feedback on this. You can also find the performance speedup in this benchmark: #18968 (comment)
I suspect mis-sync issues may also exist in cuDF's database operators, as I still see thrust::reduce in other files. A comprehensive (and tedious) rewrite to address these across the codebase would be beneficial. That's why I structured the four-step explanation: to provide a reusable resource for future CUB rewrites.