Understanding the Algorithm-Level Implementation of Scan, Reduce, Sort in CUB #4530

Wilson1211 · 2025-04-23T15:50:26Z

Wilson1211
Apr 23, 2025

I would like to check if my understanding of the structure and implementation of CUB's scan, reduce, and sort operations is correct. I’ll use scan as an example.

After tracing the code starting from ExclusiveSum in device_scan.cuh, the call chain appears to be:
ExclusiveSum -> DispatchScan::Dispatch -> Invoke -> Invokepasses ->... and so on.

From what I can tell, this part handles how the workload is dispatched to warps and threads.

However, what I'm particularly interested in is the high-level algorithm used to implement scan — the actual logic behind the scan itself, not just how it's distributed.

As shown in this picture.

What I’ve Found So Far
I noticed that DispatchScan::Dispatch(...) creates a DispatchScan object (dispatch), which eventually calls invoke(..., kernel_source.ScanKernel(), ...). That in turn uses launcher_factory to launch the kernel:

        launcher_factory(scan_grid_size, policy.Scan().BlockThreads(), 0, stream)
          .doit(scan_kernel, d_in, d_out, tile_state, start_tile, scan_op, init_value, num_items);

My current guess is that ScanKernel() ultimately calls cccl_device_scan(), which is defined in CCCL/c/parallel/src/scan.cu:

template <cub::ForceInclusive EnforceInclusive>
CUresult cccl_device_scan(
  cccl_device_scan_build_result_t build,
  void* d_temp_storage,
  size_t* temp_storage_bytes,
  cccl_iterator_t d_in,
  cccl_iterator_t d_out,
  uint64_t num_items,
  cccl_op_t op,
  cccl_value_t init,
  CUstream stream)
{
  bool pushed    = false;
  CUresult error = CUDA_SUCCESS;
  try
  {
    pushed = try_push_context();

    CUdevice cu_device;
    check(cuCtxGetDevice(&cu_device));
    auto cuda_error = cub::DispatchScan<
      indirect_arg_t,
      indirect_arg_t,
      indirect_arg_t,
      indirect_arg_t,
      ::cuda::std::size_t,
      void,
      EnforceInclusive,
      scan::dynamic_scan_policy_t<&scan::get_policy>,
      scan::scan_kernel_source,
      cub::detail::CudaDriverLauncherFactory>::
      Dispatch(
        d_temp_storage,
        *temp_storage_bytes,
        d_in,
        d_out,
        op,
        init,
        num_items,
        stream,
        {build},
        cub::detail::CudaDriverLauncherFactory{cu_device, build.cc},
        {scan::get_accumulator_type(op, d_in, init)});
    if (cuda_error != cudaSuccess)
    {
      const char* errorString = cudaGetErrorString(cuda_error); // Get the error string
      std::cerr << "CUDA error: " << errorString << std::endl;
    }
  }
  catch (const std::exception& exc)
  {
    fflush(stderr);
    printf("\nEXCEPTION in cccl_device_scan(): %s\n", exc.what());
    fflush(stdout);
    error = CUDA_ERROR_UNKNOWN;
  }
  if (pushed)
  {
    CUcontext cu_context;
    cuCtxPopCurrent(&cu_context);
  }
  return error;
}

Is this where the high-level algorithm is implemented, or is it deeper? If it is deeper, are there any resources similar to https://nvidia.github.io/cccl/cub/developer_overview.html that explain the algorithm-level structure of scan, reduce, and sort in CUB?

If I’ve misunderstood or traced the wrong parts of the code, I’d really appreciate any guidance you could provide.

Thank you so much!

Answered by pauleonix

Apr 23, 2025

The high level algorithm is described in Single-pass Parallel Prefix Scan with Decoupled Look-back (PDF) and an overview and more recent optimizations are explained in Scan at the Speed of Light (YouTube).


shwina · 2025-04-23T16:00:43Z

shwina
Apr 23, 2025
Collaborator

Nice analysis so far! I will let a CUB developer with more experience answer your question in more detail, but I did want to make a quick correction that will perhaps orient you in the right direction:

"My current guess is that ScanKernel() ultimately calls cccl_device_scan(), which is defined in CCCL/c/parallel/src/scan.cu"

Everything in c/parallel/ currently exists primarily to support the experimental Python library cuda.parallel. If you're using CUB purely from C++, none of the code in c/parallel is particularly relevant.

If you're using CUB from C++, ScanKernel() will call DeviceScanKernel defined here.


pauleonix · 2025-04-23T16:03:52Z

pauleonix
Apr 23, 2025
Collaborator

The high level algorithm is described in Single-pass Parallel Prefix Scan with Decoupled Look-back (PDF) and an overview and more recent optimizations are explained in Scan at the Speed of Light (YouTube).

1 reply

pauleonix Apr 23, 2025
Collaborator

Furthermore there is a similar video about radix sort implementation and optimizations in CUB: A Faster Radix Sort Implementation (NVIDIA On-Demand).
