Releases: NVIDIA/cccl
python-0.3.1
These are the release notes for the `cuda-cccl` Python package version 0.3.1, dated October 8th, 2025. The previous release was v0.3.0.
`cuda.cccl` is in "experimental" status, meaning that its API and feature set can change quite rapidly.
Installation
Please refer to the install instructions here
Features and improvements
- The `cuda.cccl.parallel.experimental` package has been renamed to `cuda.compute`.
- The `cuda.cccl.cooperative.experimental` package has been renamed to `cuda.coop`.
- The old imports will continue to work for now, but will be removed in a subsequent release.
- Documentation at https://nvidia.github.io/cccl/python/ has been updated to reflect these changes.
Bug Fixes
Breaking Changes
- If you were previously importing subpackages of `cuda.cccl.parallel.experimental` or `cuda.cccl.cooperative.experimental`, those imports may not work as expected. Please import from `cuda.compute` and `cuda.coop` respectively; see the sketch below.
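A minimal sketch of the import migration (the module aliases are illustrative):

```python
# Old imports (deprecated; they still work for now but are slated for removal):
# import cuda.cccl.parallel.experimental as parallel
# import cuda.cccl.cooperative.experimental as coop

# New imports:
import cuda.compute as compute
import cuda.coop as coop
```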
v3.0.3
What's Changed
🔄 Other Changes
- Backport #5442 to branch/3.0x by @shwina in #5469
- Backport to 3.0: Fix grid dependency sync in cub::DeviceMergeSort (#5456) by @bernhardmgruber in #5461
- Partial backport to 3.0: Fix SMEM alignment in DeviceTransform by @bernhardmgruber in #5463
- [Version] Update branch/3.0.x to v3.0.3 by @github-actions[bot] in #5502
- [Backport branch/3.0.x] NV_TARGET and cuda::ptx for CTK 13 by @fbusato in #5481
- [BACKPORT 3.0]: Update PTX ISA version for CUDA 13 (#5676) by @miscco in #5700
- Backport some MSVC test fixes to 3.0 by @miscco in #5819
- [Backport 3.0]: Work around `submdspan` compiler issue on MSVC (#5885) by @miscco in #5903
- Backport pin of llvmlite dependency to branch/3.0x by @shwina in #6000
- [Backport branch/3.0.x] Ensure that we are actually calling the cuda APIs ... (#4570) by @davebayer in #6098
- [Backport to 3.0] add a specialization of `__make_tuple_types` for `complex<T>` (#6102) by @davebayer in #6117
- [Backport 3.0.x] Use proper qualification in allocate.h (#4796) by @wmaxey in #6126
Full Changelog: v3.0.2...v3.0.3
CCCL Python Libraries (v0.3.0)
These are the release notes for the `cuda-cccl` Python package version 0.3.0, dated October 2nd, 2025. The previous release was v0.2.1.
`cuda.cccl` is in "experimental" status, meaning that its API and feature set can change quite rapidly.
Installation
Please refer to the install instructions here
Features and improvements
- ARM64 wheel and conda package support: Installation via `pip` and `conda` is now supported on ARM64 (aarch64) architecture.
- New algorithm: three-way partitioning: The `three_way_partition` algorithm enables partitioning an array (or iterator) into three partitions, given two selection operators.
- Improved scan performance: The `inclusive_scan` and `exclusive_scan` APIs provide improved performance by automatically selecting the optimal tuning for the input data types and device architecture. A hedged usage sketch follows below.
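As an illustration of the scan APIs, here is a sketch whose call shape mirrors the single-call `reduce_into` example in the 0.1.3.2.0.dev128 notes further down; the exact argument order is an assumption, so consult the package documentation before relying on it:

```python
import cupy as cp
import numpy as np
import cuda.cccl.parallel.experimental as parallel

def add_op(a, b):
    return a + b

d_input = cp.arange(10, dtype=np.int32)
d_output = cp.empty_like(d_input)
h_init = np.zeros(1, dtype=np.int32)  # initial value; shape and dtype assumed

# Assumed signature, mirroring reduce_into(d_in, d_out, op, num_items, h_init)
parallel.inclusive_scan(d_input, d_output, add_op, d_input.size, h_init)
```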
Bug Fixes
None.
Breaking Changes
None.
CCCL Python Libraries v0.1.3.2.0.dev128 (pre-release)
These are the changes in the `cuda.cccl` libraries introduced in the pre-release 0.1.3.2.0.dev128, dated August 14th, 2025.
`cuda.cccl` is in "experimental" status, meaning that its API and feature set can change quite rapidly.
Major API improvements
Single-call APIs in cuda.cccl.parallel
Previously, performing an operation like `reduce_into` required four API invocations:
(1) create a reducer object, (2) compute the amount of temporary storage required for the reduction,
(3) allocate the required amount of temporary memory, and (4) perform the reduction.
In this version, `cuda.cccl.parallel` introduces simpler, single-call APIs. For example, reduction looks like:
# New API - single function call with automatic temp storage
parallel.reduce_into(d_input, d_output, add_op, num_items, h_init)
If you wish to have more control over temporary memory allocation,
the previous API still exists (and always will). It has been renamed from `reduce_into` to `make_reduce_into`:
import cupy as cp
import numpy as np
import cuda.cccl.parallel.experimental as parallel

# Object API
reducer = parallel.make_reduce_into(d_input, d_output, add_op, h_init)
# Passing None as the temp buffer queries the required temporary storage size
temp_storage_size = reducer(None, d_input, d_output, num_items, h_init)
temp_storage = cp.empty(temp_storage_size, dtype=np.uint8)
reducer(temp_storage, d_input, d_output, num_items, h_init)
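For completeness, a self-contained version of the single-call example above; the array setup and `add_op` definition are illustrative assumptions:

```python
import cupy as cp
import numpy as np
import cuda.cccl.parallel.experimental as parallel

def add_op(a, b):
    return a + b

num_items = 1000
d_input = cp.arange(num_items, dtype=np.int32)
d_output = cp.empty(1, dtype=np.int32)    # holds the single reduced value
h_init = np.zeros(1, dtype=np.int32)      # initial value for the reduction

# Single call: temporary storage is allocated and freed automatically
parallel.reduce_into(d_input, d_output, add_op, num_items, h_init)
print(d_output.get())                     # expect sum(0..999) == 499500
```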
New algorithms
Device-wide histogram
The `histogram_even` function provides Python exposure of the corresponding CUB C++ API `DeviceHistogram::HistogramEven`.
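A hedged sketch of what a call might look like; the parameter names and order below are borrowed from the C++ `DeviceHistogram::HistogramEven` and are assumptions here, not the documented Python signature:

```python
import cupy as cp
import numpy as np
import cuda.cccl.parallel.experimental as parallel

d_samples = cp.random.uniform(0.0, 100.0, size=1000).astype(np.float32)
num_levels = 11                                   # 10 evenly spaced bins
d_histogram = cp.zeros(num_levels - 1, dtype=np.int32)

# Assumed call shape, mirroring cub::DeviceHistogram::HistogramEven;
# check the cuda.cccl.parallel documentation for the exact signature.
parallel.histogram_even(d_samples, d_histogram, num_levels,
                        0.0, 100.0, d_samples.size)
```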
StripedToBlock exchange
`cuda.cccl.cooperative` adds a `block.exchange` API, providing Python exposure of the corresponding CUB C++ API `BlockExchange`.
Currently, only the `StripedToBlock` exchange pattern is supported.
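To see what the striped-to-block pattern does, independent of the library API, here is a small NumPy illustration of the index mapping (thread and item counts are arbitrary):

```python
import numpy as np

num_threads, items_per_thread = 4, 2
data = np.arange(num_threads * items_per_thread)

# Striped: item i of thread t sits at index i * num_threads + t
striped = data.reshape(items_per_thread, num_threads)  # column t = thread t's items
# Blocked: item i of thread t sits at index t * items_per_thread + i
blocked = striped.T                                    # row t = thread t's items

print(striped[:, 0])  # thread 0's striped items: [0 4]
print(blocked[0, :])  # after the exchange, thread 0 holds them contiguously: [0 4]
```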
Infrastructure improvements
CuPy dependency replaced with cuda.core
Use of CuPy within the library has been replaced with the lighter weight cuda.core
package. This means that installing cuda.cccl
won't install CuPy as a dependency.
Support for CUDA 13 drivers
`cuda.cccl` can be used with CUDA 13-compatible drivers. However, the CUDA 13 toolkit (runtime and libraries) is not
yet supported, meaning you still need the CUDA 12 toolkit. Full support for the CUDA 13 toolkit is planned for the next
pre-release.
v3.0.2
What's Changed
🔄 Other Changes
- [Version] Update branch/3.0.x to v3.0.2 by @github-actions[bot] in #5348
- Backport to 3.0: Add a macro to disable PDL (#5316) by @bernhardmgruber in #5330
- [Backport branch/3.0x] Add gitlab devcontainers (#5325) by @wmaxey in #5352
Full Changelog: v3.0.1...v3.0.2
v3.0.1
What's Changed
🔄 Other Changes
- [Version] Update branch/3.0.x to v3.0.1 by @github-actions[bot] in #5256
- [Backport branch/3.0.x] Disable assertions for QNX, they do not provide the machinery with their libc by @github-actions[bot] in #5258
- [BACKPORT 3.0] Make sure that nested `tuple` and `pair` have the expected size (#5246) by @miscco in #5265
- [BACKPORT] Add missed specializations of the new aligned vector types to cub (#5264) by @miscco in #5271
- [BACKPORT 3.0] Backport diagnostic suppression machinery by @miscco in #5281
Full Changelog: v3.0.0...v3.0.1
v3.0.0
CCCL 3.0 Release
The 3.0 release of the CUDA Core Compute Libraries (CCCL) marks our first major version since unifying the Thrust, CUB, and libcudacxx libraries under a single repository. This release reflects over a year of work focused on cleanup, consolidation, and modernizing the codebase to support future growth.
While this release includes a number of breaking changes, many involve the consolidation of APIs (particularly in the `thrust::` and `cub::` namespaces) as well as cleanup of internal details that were never intended for public use. In many cases, redundant functionality from `thrust::` or `cub::` has been replaced with equivalent or improved abstractions from the `cuda::` or `cuda::std::` namespaces. Impact should be minimal for most users. For full details and recommended migration steps, please consult the CCCL 2.x to 3.0 Migration Guide.
Key Changes in CCCL 3.0
Requirements
- C++17 or newer is now required (support for C++11 and C++14 has been dropped #3255)
- CUDA Toolkit 12.0+ is now required (support for CTK 11.0+ has been dropped). For details on version compatibility, see the README.
- Compilers:
  - Dropped support for
Header Directory Changes in CUDA Toolkit 13.0
CCCL 3.0 will be included with an upcoming CUDA Toolkit 13.0 release. In this release, the bundled CCCL headers have moved to new top-level directories under `${CTK_ROOT}/include/cccl/`.

| Before CUDA 13.0 | After CUDA 13.0 |
|---|---|
| `${CTK_ROOT}/include/cuda/` | `${CTK_ROOT}/include/cccl/cuda/` |
| `${CTK_ROOT}/include/cub/` | `${CTK_ROOT}/include/cccl/cub/` |
| `${CTK_ROOT}/include/thrust/` | `${CTK_ROOT}/include/cccl/thrust/` |
These changes only affect the on-disk location of CCCL headers within the CUDA Toolkit installation.
What you need to know
- ❌ Do NOT write `#include <cccl/...>`; this will break.
- If using CCCL headers only in files compiled with nvcc:
  - ✅ No action needed. This is the default for most users.
- If using CCCL headers in files compiled exclusively with a host compiler (e.g., GCC, Clang, MSVC):
  - Using CMake and linking `CCCL::CCCL`:
    - ✅ No action needed. (This is the recommended path. See example.)
  - Other build systems:
    - ⚠️ Add `${CTK_ROOT}/include/cccl` to your compiler's include search path (e.g., with `-I`)
These changes prevent issues when mixing CCCL headers bundled with the CUDA Toolkit and those from external package managers. For more detail, see the CCCL 2.x to 3.0 Migration Guide.
Major API Changes
Hundreds of macros, internal types, and implementation details were removed or relocated to internal namespaces. This significantly reduces surface area and eliminates long-standing technical debt, improving both compile times and maintainability.
Removed Macros
Over 50 legacy macros have been removed in favor of modern C++ alternatives:
- `CUB_{MIN,MAX}`: use `cuda::std::{min,max}` instead #3821
- `THRUST_NODISCARD`: use `[[nodiscard]]` instead #3746
- `THRUST_INLINE_CONSTANT`: use `inline constexpr` instead #3746
- See CCCL 2.x to 3.0 Migration Guide for complete list
Removed Functions and Classes
- `thrust::optional`: use `cuda::std::optional` instead #4172
- `thrust::tuple`: use `cuda::std::tuple` instead #2395
- `thrust::pair`: use `cuda::std::pair` instead #2395
- `thrust::numeric_limits`: use `cuda::std::numeric_limits` instead #3366
- `cub::BFE`: use `cuda::bitfield_insert` and `cuda::bitfield_extract` instead #4031
- `cub::ConstantInputIterator`: use `thrust::constant_iterator` instead #3831
- `cub::CountingInputIterator`: use `thrust::counting_iterator` instead #3831
- `cub::GridBarrier`: use cooperative groups instead #3745
- `cub::DeviceSpmv`: use cuSPARSE instead #3320
- `cub::Mutex`: use `cuda::std::mutex` instead #3251
- See CCCL 2.x to 3.0 Migration Guide for complete list
New Features
C++
cuda::
- `cuda::std::numeric_limits` now supports `__float128` #4059
- `cuda::std::optional<T&>` implementation (P2988) #3631
- `cuda::std::numbers` header for mathematical constants #3355
- NVFP8/6/4 extended floating-point types support in `<cuda/std/cmath>` #3843
- `cuda::overflow_cast` for safe numeric conversions #4151
- `cuda::ilog2` and `cuda::ilog10` integer logarithms #4100
- `cuda::round_up` and `cuda::round_down` utilities #3234
cub::
- `cub::DeviceSegmentedReduce` now supports a large number of segments #3746
- `cub::DeviceCopy::Batched` now supports a large number of buffers #4129
- `cub::DeviceMemcpy::Batched` now supports a large number of buffers #4065
thrust::
- New `thrust::offset_iterator` iterator #4073
- Temporary storage allocations in parallel algorithms now respect `par_nosync` #4204
Python
CUDA Python Core Libraries are now available on PyPI through the `cuda-cccl` package:
pip install cuda-cccl
cuda.cccl.cooperative
- Block-level sorting now supports multi-dimensional thread blocks #4035, #4028
- Block-level data movement now supports multi-dimensional thread blocks #3161
- New block-level inclusive sum algorithm #3921
cuda.cccl.parallel
- New device-level segmented-reduce algorithm #3906
- New device-level unique-by-key algorithm #3947
- New device-level merge-sort algorithm #3763
What's Changed
🚀 Thrust / CUB
- Drop cub::Mutex by @bernhardmgruber in #3251
- Remove legacy macros from CUB util_arch.cuh by @bernhardmgruber in #3257
- Remove thrust::[unary|binary]_traits by @bernhardmgruber in #3260
- Drop thrust not1 and not2 by @bernhardmgruber in #3264
- Deprecate GridBarrier and GridBarrierLifetime by @bernhardmgruber in #3258
- Drop thrust::[unary|binary]_function by @bernhardmgruber in #3274
- Enable thrust::identity test for non-MSVC by @bernhardmgruber in #3281
- Enable PDL in triple chevron launch by @bernhardmgruber in #3282
- Drop Thrust legacy arch macros by @bernhardmgruber in #3298
- Drop Thrust's compiler_fence.h by @bernhardmgruber in #3300
- Drop CUB's util_compiler.cuh by @bernhardmgruber in #3302
- Drop Thrust's deprecated compiler macros by @bernhardmgruber in #3301
- Drop CUB_RUNTIME_ENABLED and THRUST_HAS_CUDART by @bernhardmgruber in #3305
- Require C++17 for compiling Thrust and CUB by @bernhardmgruber in #3255
- Deprecate Thrust's cpp_compatibility.h macros by @bernhardmgruber in #3299
- Deprecate cub::IterateThreadStore by @bernhardmgruber in #3337
- Drop CUB's BinaryFlip operator by @bernhardmgruber in #3332
- Deprecate cub::Swap by @bernhardmgruber in #3333
- Drop CUB APIs with a debug_synchronous parameter by ...
v2.8.5
What's Changed
- Avoid plain `assert` in device code by @miscco in #4707
- Do not use open-coded `INFINITY` for tests that also test extended floating points by @miscco in #4744
- [Version] Update branch/2.8.x to v2.8.5 by @github-actions in #4755
- [Backport branch/2.8.x] Update Blackwell PTX instruction availability tables by @github-actions in #3900
Full Changelog: v2.8.4...v2.8.5
v2.8.4
What's Changed
- [BACKPORT] Do not use pack indexing with clang-19 by @miscco in #4447
- [Backport branch/2.8.x] Always bypass automatic atomic storage checks to prevent potential compiler issues by @github-actions in #4616
- [Version] Update branch/2.8.x to v2.8.4 by @github-actions in #4655
Full Changelog: v2.8.3...v2.8.4
v2.8.3
What's Changed
- [BACKPORT: 2.8] Set NO_CMAKE_FIND_ROOT_PATH for cudax. (#4162) by @miscco in #4216
- [BACKPORT 2.8] Fix the cuda python setup by @miscco in #4218
- Backport PR #4221 to branch/2.8.x — Remove python/cuda_cooperative/setup.py by @rwgk in #4235
- [Backport branch/2.8.x] Remove invalid single `#` in builtin.h by @github-actions in #4326
- [BACKPORT 2.8] Allow rapids to avoid unrolling some loops in sort (#4253) by @miscco in #4387
- [Backport branch/2.8.x] Fix uninitialized read in local atomic code path. by @github-actions in #4424
- [Version] Update branch/2.8.x to v2.8.3 by @github-actions in #4423
Full Changelog: v2.8.2...v2.8.3