Understanding the Algorithm-Level Implementation of Scan, Reduce, Sort in CUB #4530
-
I would like to check if my understanding of the structure and implementation of CUB's scan, reduce, and sort operations is correct. I’ll use scan as an example. After tracing the code starting from ExclusiveSum in device_scan.cuh, the call chain appears to be: From what I can tell, this part handles how the workload is dispatched to warps and threads. However, what I'm particularly interested in is the high-level algorithm used to implement scan — the actual logic behind the scan itself, not just how it's distributed. What I’ve Found So Far
My current guess is that
I'm wondering if this is the location where the high-level algorithm is implemented, or it's deeper? If it's deeper, are there any resources similar to https://nvidia.github.io/cccl/cub/developer_overview.html that explain the algorithm-level structure of If I’ve misunderstood or traced the wrong parts of the code, I’d really appreciate any guidance you could provide. Thank you so much! |
Beta Was this translation helpful? Give feedback.
Replies: 2 comments 1 reply
-
Nice analysis so far! I will let a CUB developer with more experience answer your question in more detail, but I did want to make a quick correction that will perhaps orient you in the right direction:
Everything in If you're using CUB from C++, |
Beta Was this translation helpful? Give feedback.
-
The high level algorithm is described in Single-pass Parallel Prefix Scan with Decoupled Look-back (PDF) and an overview and more recent optimizations are explained in Scan at the Speed of Light (YouTube). |
Beta Was this translation helpful? Give feedback.
The high level algorithm is described in Single-pass Parallel Prefix Scan with Decoupled Look-back (PDF) and an overview and more recent optimizations are explained in Scan at the Speed of Light (YouTube).