
Releases: NVIDIA/cccl

python-0.3.1

08 Oct 22:05
1635445

These are the release notes for the cuda-cccl Python package version 0.3.1, dated October 8th, 2025. The previous release was v0.3.0.

cuda.cccl is in "experimental" status, meaning that its API and feature set can change quite rapidly.

Installation

Please refer to the installation instructions.

Features and improvements

  • The cuda.cccl.parallel.experimental package has been renamed to cuda.compute.
  • The cuda.cccl.cooperative.experimental package has been renamed to cuda.coop.
  • The old imports will continue to work for now, but will be removed in a subsequent release; an illustrative sketch of the new import paths follows this list.
  • Documentation at https://nvidia.github.io/cccl/python/ has been updated to reflect these changes.
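
A minimal sketch of the new import paths (the alias names compute and coop are illustrative, not part of the API):

# New package names; the old paths still work for now but are deprecated
import cuda.compute as compute  # was: cuda.cccl.parallel.experimental
import cuda.coop as coop        # was: cuda.cccl.cooperative.experimental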

Bug Fixes

None.

Breaking Changes

  • If you were previously importing subpackages of cuda.cccl.parallel.experimental or cuda.cccl.cooperative.experimental, those imports may not work as expected. Please import from cuda.compute and cuda.coop, respectively.

v3.0.3

07 Oct 00:34
8c04b65

What's Changed

🔄 Other Changes

  • Backport #5442 to branch/3.0x by @shwina in #5469
  • Backport to 3.0: Fix grid dependency sync in cub::DeviceMergeSort (#5456) by @bernhardmgruber in #5461
  • Partial backport to 3.0: Fix SMEM alignment in DeviceTransform by @bernhardmgruber in #5463
  • [Version] Update branch/3.0.x to v3.0.3 by @github-actions[bot] in #5502
  • [Backport branch/3.0.x] NV_TARGET and cuda::ptx for CTK 13 by @fbusato in #5481
  • [BACKPORT 3.0]: Update PTX ISA version for CUDA 13 (#5676) by @miscco in #5700
  • Backport some MSVC test fixes to 3.0 by @miscco in #5819
  • [Backport 3.0]: Work around submdspan compiler issue on MSVC (#5885) by @miscco in #5903
  • Backport pin of llvmlite dependency to branch/3.0x by @shwina in #6000
  • [Backport branch/3.0.x] Ensure that we are actually calling the cuda APIs ... (#4570) by @davebayer in #6098
  • [Backport to 3.0] add a specialization of __make_tuple_types for complex<T> (#6102) by @davebayer in #6117
  • [Backport 3.0.x] Use proper qualification in allocate.h (#4796) by @wmaxey in #6126

Full Changelog: v3.0.2...v3.0.3

CCCL Python Libraries (v0.3.0)

14 Oct 14:42
6c98014

These are the release notes for the cuda-cccl Python package version 0.3.0, dated October 2nd, 2025. The previous release was v0.2.1.

cuda.cccl is in "experimental" status, meaning that its API and feature set can change quite rapidly.

Installation

Please refer to the installation instructions.

Features and improvements

  • ARM64 wheel and conda package support: Installation via pip and conda is now supported on the ARM64 (aarch64) architecture.

  • New algorithm: three-way partitioning: The three_way_partition algorithm enables partitioning an array (or iterator) into three partitions, given two selection operators (an illustrative sketch follows this list).

  • Improved scan performance: The inclusive_scan and exclusive_scan APIs provide improved performance by automatically selecting the optimal tuning for the input data types and device architecture.
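
A hypothetical sketch of calling three_way_partition, modeled loosely on the parameters of the corresponding C++ algorithm (two selection operators, three output ranges, and a count output); the import alias, argument names, and argument order here are assumptions, not the documented signature:

import cupy as cp
import numpy as np
import cuda.cccl.parallel.experimental as parallel  # assumed import alias

def is_small(x):   # first selection operator
    return x < 10

def is_large(x):   # second selection operator
    return x > 100

d_in = cp.arange(200, dtype=np.int32)
d_small = cp.empty_like(d_in)                 # partition selected by is_small
d_large = cp.empty_like(d_in)                 # partition selected by is_large
d_rest = cp.empty_like(d_in)                  # everything else
d_num_selected = cp.empty(2, dtype=np.int64)  # counts for the first two partitions

# Hypothetical call; consult the package documentation for the exact signature
parallel.three_way_partition(
    d_in, d_small, d_large, d_rest, d_num_selected,
    is_small, is_large, d_in.size,
)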

Bug Fixes

None.

Breaking Changes

None.

CCCL Python Libraries v0.1.3.2.0.dev128 (pre-release)

19 Aug 21:56
0c78770

These are the changes in the cuda.cccl libraries introduced in the pre-release 0.1.3.2.0.dev128, dated August 14th, 2025.
cuda.cccl is in "experimental" status, meaning that its API and feature set can change quite rapidly.

Major API improvements

Single-call APIs in cuda.cccl.parallel

Previously, performing an operation like reduce_into required four API invocations to
(1) create a reducer object, (2) compute the amount of temporary storage required for the reduction,
(3) allocate the required amount of temporary memory, and (4) perform the reduction.
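
The snippets below use names like d_input, d_output, add_op, num_items, and h_init. A minimal sketch of that setup, assuming CuPy device arrays, NumPy for the host-side initial value, a plain Python binary operator, and a parallel import alias (all assumptions for illustration, not verbatim from the documentation):

import cupy as cp
import numpy as np
import cuda.cccl.parallel.experimental as parallel  # assumed import alias

def add_op(a, b):  # binary operator used by the reduction
    return a + b

num_items = 1024
d_input = cp.arange(num_items, dtype=np.int32)  # device input array
d_output = cp.empty(1, dtype=np.int32)          # device output (one reduced value)
h_init = np.zeros(1, dtype=np.int32)            # host-side initial value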

In this version, cuda.cccl.parallel introduces simpler, single-call APIs. For example, reduction looks like:

# New API - single function call with automatic temp storage
parallel.reduce_into(d_input, d_output, add_op, num_items, h_init)

If you wish to have more control over temporary memory allocation,
the previous API still exists (and always will). It has been renamed from reduce_into to make_reduce_into:

# Object API - explicit control over temporary storage
reducer = parallel.make_reduce_into(d_input, d_output, add_op, h_init)
# Calling with None computes the required temporary storage size
temp_storage_size = reducer(None, d_input, d_output, num_items, h_init)
temp_storage = cp.empty(temp_storage_size, dtype=np.uint8)
# Calling again with the allocated scratch space performs the reduction
reducer(temp_storage, d_input, d_output, num_items, h_init)

New algorithms

Device-wide histogram

The histogram_even
function provides Python exposure of the corresponding CUB C++ API DeviceHistogram::HistogramEven.
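
A hypothetical sketch of a histogram_even call, modeled on the parameters of the C++ HistogramEven API (samples, histogram output, number of bin levels, lower and upper bounds, sample count); the Python argument names and order are assumptions:

import cupy as cp
import numpy as np
import cuda.cccl.parallel.experimental as parallel  # assumed import alias

num_samples = 10_000
d_samples = cp.random.uniform(0.0, 100.0, num_samples).astype(np.float32)
num_bins = 10
d_histogram = cp.zeros(num_bins, dtype=np.int32)

# Hypothetical call: 10 evenly spaced bins over [0, 100); argument order is an assumption
parallel.histogram_even(d_samples, d_histogram, num_bins + 1, 0.0, 100.0, num_samples)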

StripedToBlocked exchange

cuda.cccl.cooperative adds a block.exchange API,
providing Python exposure of the corresponding CUB C++ API BlockExchange.
Currently, only the StripedToBlocked exchange pattern is supported.

Infrastructure improvements

CuPy dependency replaced with cuda.core

Use of CuPy within the library has been replaced with the lighter-weight cuda.core
package. This means that installing cuda.cccl won't install CuPy as a dependency.

Support for CUDA 13 drivers

cuda.cccl can be used with CUDA 13-compatible drivers. However, the CUDA 13 toolkit (runtime and libraries) is not
yet supported, meaning you still need the CUDA 12 toolkit. Full support for the CUDA 13 toolkit is planned for the next
pre-release.

v3.0.2

29 Jul 23:11
9c40ed1

What's Changed

🔄 Other Changes

Full Changelog: v3.0.1...v3.0.2

v3.0.1

23 Jul 17:39
f19d875

What's Changed

🔄 Other Changes

  • [Version] Update branch/3.0.x to v3.0.1 by @github-actions[bot] in #5256
  • [Backport branch/3.0.x] Disable assertions for QNX, they do not provide the machinery with their libc by @github-actions[bot] in #5258
  • [BACKPORT 3.0] Make sure that nested tuple and pair have the expected size (#5246) by @miscco in #5265
  • [BACKPORT] Add missed specializations of the new aligned vector types to cub (#5264) by @miscco in #5271
  • [BACKPORT 3.0] Backport diagnostic suppression machinery by @miscco in #5281

Full Changelog: v3.0.0...v3.0.1

v3.0.0

16 Jul 17:45
e944297

CCCL 3.0 Release

The 3.0 release of the CUDA Core Compute Libraries (CCCL) marks our first major version since unifying the Thrust, CUB, and libcudacxx libraries under a single repository. This release reflects over a year of work focused on cleanup, consolidation, and modernizing the codebase to support future growth.

While this release includes a number of breaking changes, many involve the consolidation of APIs—particularly in the thrust:: and cub:: namespaces—as well as cleanup of internal details that were never intended for public use. In many cases, redundant functionality from thrust:: or cub:: has been replaced with equivalent or improved abstractions from the cuda:: or cuda::std:: namespaces. Impact should be minimal for most users. For full details and recommended migration steps, please consult the CCCL 2.x to 3.0 Migration Guide.

Key Changes in CCCL 3.0

Requirements

  • C++17 or newer is now required (support for C++11 and C++14 has been dropped #3255)
  • CUDA Toolkit 12.0+ is now required (support for CTK 11.0+ has been dropped). For details on version compatibility, see the README.
  • Compilers:
    • GCC 7+ (support for GCC < 7 has been dropped #3268)
    • Clang 14+ (support for Clang < 14 has been dropped #3309)
    • MSVC 2019+ (support for MSVC 2017 has been dropped #3287, #3553)
  • Dropped support for

Header Directory Changes in CUDA Toolkit 13.0

CCCL 3.0 will be included with an upcoming CUDA Toolkit 13.0 release. In this release, the bundled CCCL headers have moved to new top-level directories under ${CTK_ROOT}/include/cccl/.

Before CUDA 13.0               After CUDA 13.0
${CTK_ROOT}/include/cuda/      ${CTK_ROOT}/include/cccl/cuda/
${CTK_ROOT}/include/cub/       ${CTK_ROOT}/include/cccl/cub/
${CTK_ROOT}/include/thrust/    ${CTK_ROOT}/include/cccl/thrust/

These changes only affect the on-disk location of CCCL headers within the CUDA Toolkit installation.

What you need to know

  • ❌ Do NOT write #include <cccl/...> — this will break.
  • If using CCCL headers only in files compiled with nvcc:
    • ✅ No action needed. This is the default for most users.
  • If using CCCL headers in files compiled exclusively with a host compiler (e.g., GCC, Clang, MSVC):
    • Using CMake and linking CCCL::CCCL
      • ✅ No action needed. (This is the recommended path. See example)
    • Other build systems
      • ⚠️ Add ${CTK_ROOT}/include/cccl to your compiler’s include search path (e.g., with -I)

These changes prevent issues when mixing CCCL headers bundled with the CUDA Toolkit and those from external package managers. For more detail, see the CCCL 2.x to 3.0 Migration Guide.

Major API Changes

Hundreds of macros, internal types, and implementation details were removed or relocated to internal namespaces. This significantly reduces surface area and eliminates long-standing technical debt, improving both compile times and maintainability.

Removed Macros

Over 50 legacy macros have been removed in favor of modern C++ alternatives; see the CCCL 2.x to 3.0 Migration Guide for the complete list.

Removed Functions and Classes

  • thrust::optional: use cuda::std::optional instead #4172
  • thrust::tuple: use cuda::std::tuple instead #2395
  • thrust::pair: use cuda::std::pair instead #2395
  • thrust::numeric_limits: use cuda::std::numeric_limits instead #3366
  • cub::BFE: use cuda::bitfield_insert and cuda::bitfield_extract instead #4031
  • cub::ConstantInputIterator: use thrust::constant_iterator instead #3831
  • cub::CountingInputIterator: use thrust::counting_iterator instead #3831
  • cub::GridBarrier: use cooperative groups instead #3745
  • cub::DeviceSpmv: use cuSPARSE instead #3320
  • cub::Mutex: use cuda::std::mutex instead #3251
  • See CCCL 2.x to 3.0 Migration Guide for complete list

New Features

C++

cuda::

  • cuda::std::numeric_limits now supports __float128 #4059
  • cuda::std::optional<T&> implementation (P2988) #3631
  • cuda::std::numbers header for mathematical constants #3355
  • NVFP8/6/4 extended floating-point types support in <cuda/std/cmath> #3843
  • cuda::overflow_cast for safe numeric conversions #4151
  • cuda::ilog2 and cuda::ilog10 integer logarithms #4100
  • cuda::round_up and cuda::round_down utilities #3234

cub::

  • `cub::DeviceSegmentedReduce` now supports a large number of segments #3746
  • `cub::DeviceCopy::Batched` now supports a large number of buffers #4129
  • `cub::DeviceMemcpy::Batched` now supports a large number of buffers #4065

thrust::

  • New `thrust::offset_iterator` iterator #4073
  • Temporary storage allocations in parallel algorithms now respect `par_nosync` #4204

Python

CUDA Python Core Libraries are now available on PyPI through the cuda-cccl package.

pip install cuda-cccl

cuda.cccl.cooperative

  • Block-level sorting now supports multi-dimensional thread blocks #4035, #4028
  • Block-level data movement now supports multi-dimensional thread blocks #3161
  • New block-level inclusive sum algorithm #3921

cuda.cccl.parallel

  • New device-level segmented-reduce algorithm #3906
  • New device-level unique-by-key algorithm #3947
  • New device-level merge-sort algorithm #3763

What's Changed

🚀 Thrust / CUB


v2.8.5

30 May 22:06
d108fb0

What's Changed

  • Avoid plain assert in device code by @miscco in #4707
  • Do not use open-coded INFINITY for tests that also test extended floating points by @miscco in #4744
  • [Version] Update branch/2.8.x to v2.8.5 by @github-actions in #4755
  • [Backport branch/2.8.x] Update Blackwell PTX instruction availability tables by @github-actions in #3900

Full Changelog: v2.8.4...v2.8.5

v2.8.4

09 May 19:12
e80fa6c

What's Changed

  • [BACKPORT] Do not use pack indexing with clang-19 by @miscco in #4447
  • [Backport branch/2.8.x] Always bypass automatic atomic storage checks to prevent potential compiler issues by @github-actions in #4616
  • [Version] Update branch/2.8.x to v2.8.4 by @github-actions in #4655

Full Changelog: v2.8.3...v2.8.4

v2.8.3

12 Apr 18:03
0d328e0

What's Changed

  • [BACKPORT: 2.8] Set NO_CMAKE_FIND_ROOT_PATH for cudax. (#4162) by @miscco in #4216
  • [BACKPORT 2.8] Fix the cuda python setup by @miscco in #4218
  • Backport PR #4221 to branch/2.8.x — Remove python/cuda_cooperative/setup.py by @rwgk in #4235
  • [Backport branch/2.8.x] Remove invalid single # in builtin.h by @github-actions in #4326
  • [BACKPORT 2.8] Allow rapids to avoid unrolling some loops in sort (#4253) by @miscco in #4387
  • [Backport branch/2.8.x] Fix uninitialized read in local atomic code path. by @github-actions in #4424
  • [Version] Update branch/2.8.x to v2.8.3 by @github-actions in #4423

Full Changelog: v2.8.2...v2.8.3