
Releases: NVIDIA/cccl

v2.8.2

02 Apr 16:26
207d20f

What's Changed

  • [Version] Update branch/2.8.x to v2.8.2 by @github-actions in #4079
  • Ignore Wmaybe-uninitialized in dispatch_reduce.cuh. by @bdice in #4054
  • backport: fix numeric_limits digits for nvfp8/6/4 (#4070) by @miscco in #4130
  • [BACKPORT]: Avoid compiler issue with MSVC and span constructor by @miscco in #4127

Full Changelog: v2.8.1...v2.8.2

v2.8.1

11 Mar 00:24
ac7eb2a

What's Changed

Full Changelog: v2.8.0...v2.8.1

CCCL 2.8.0

03 Mar 21:03
6d02e11

Release highlights

  • With the addition of Python APIs, we changed the project's name from CUDA C++ Core Libraries to CUDA Core Compute Libraries
  • This is the last release supporting C++11 and C++14. CCCL 3.0 will require C++17.
  • We deprecated many entities that we plan to remove in CCCL 3.0; see the migration guide
  • Internal assertions are now turned on by default for users in CMake Debug builds. Non-Debug and non-CMake builds are not affected.
  • Many CUB algorithms were tuned for Blackwell GPUs

New C++ APIs

  • thrust::transform_inclusive_scan with an initial value
  • thrust::universal_host_pinned_vector, a Thrust vector with managed and pinned memory
  • cub::DeviceTransform using bulk copy on Hopper+ or prefetch, used by thrust::transform
  • cub::DeviceFor::ForEachInExtents to iterate over a multidimensional index space described by cuda::std::extents
  • cuda::mr::cuda_async_memory_resource, a memory resource for stream-ordered allocations
  • cuda::get_device_address, which returns the device address of a device object
  • New function objects cuda::minimum and cuda::maximum (see the sketch after this list)
  • New standard C++ facilities in cuda::std: assume_aligned, source_location, dims, ignore, invoke_r, byteswap
  • Exposure of several new PTX instructions in cuda::ptx: cp.async.mbarrier.arrive, mbarrier.expect_tx, clusterlaunchcontrol.*, st.bulk, multimem.ld_reduce, multimem.st, multimem.red, and many tcgen05 instructions
  • cuda::experimental::basic_any: a utility for defining type-erasing wrappers in terms of an interface description
  • Add general support for sm_101, sm_101a and sm_120 architectures
  • The following CUB APIs now support large numbers of items (> 2^32): DeviceSelect, DevicePartition, DeviceScan::*ByKey, DeviceReduce::{ArgMin,ArgMax}, DeviceTransform
  • cuda::std::numeric_limits now supports FP16 (half and bfloat16) and FP8 types
  • cuda::is_floating_point_v now supports standard and extended (FP16, FP8) floating-point types
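
The following is a minimal sketch of the new cuda::minimum/cuda::maximum function objects, used here as the reduction operator of thrust::reduce. The <cuda/functional> header placement is an assumption; check the libcu++ documentation.

    // Hedged sketch: cuda::maximum as the binary operator of a Thrust reduction.
    // Assumes cuda::maximum is available from <cuda/functional>.
    #include <cuda/functional>
    #include <thrust/device_vector.h>
    #include <thrust/reduce.h>
    #include <climits>

    int main() {
      int data[] = {3, 1, 4, 1, 5, 9, 2, 6};
      thrust::device_vector<int> v(data, data + 8);
      // Compute the maximum element on the device with the new function object.
      int mx = thrust::reduce(v.begin(), v.end(), INT_MIN, cuda::maximum<int>{});
      return mx == 9 ? 0 : 1;
    }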

Fixes

  • cuda::atomic_ref now supports 8- and 16-bit operations (sketched below) - #2255
  • Atomics placed in thread-local memory no longer exhibit undefined behavior - #2586
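
A minimal sketch of the 8/16-bit cuda::atomic_ref support; the kernel and counter type are illustrative.

    #include <cuda/atomic>
    #include <cstdint>

    // Before this fix, sub-32-bit types could not be used with cuda::atomic_ref.
    __global__ void count_events(std::uint16_t* counter) {
      cuda::atomic_ref<std::uint16_t, cuda::thread_scope_device> ref(*counter);
      ref.fetch_add(1);  // 16-bit atomic increment, device scope
    }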

What's Changed in Detail

Read more

CCCL 2.7.0

06 Jan 22:12
v2.7.0
b5fe509

What’s New

C++

Thrust / CUB

  • Inclusive scan now supports an initial value #1940
  • Inclusive and exclusive scan now support problem sizes exceeding 2^31 elements #2171
  • New cub::DeviceMerge::MergeKeys and cub::DeviceMerge::MergePairs algorithms #1817 (see the sketch after this list)
  • New thrust::tabulate_output_iterator fancy iterator #2282
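
Below is a minimal sketch of cub::DeviceMerge::MergeKeys that merges two sorted device ranges, following the usual CUB two-phase temporary-storage convention. The header path and exact parameter order are assumptions; consult the CUB documentation.

    // Hedged sketch: merge two sorted device ranges with cub::DeviceMerge::MergeKeys.
    #include <cub/device/device_merge.cuh>
    #include <cuda/std/functional>
    #include <thrust/device_vector.h>
    #include <cstddef>

    int main() {
      int a_h[] = {1, 3, 5, 7};
      int b_h[] = {2, 4, 6, 8};
      thrust::device_vector<int> a(a_h, a_h + 4), b(b_h, b_h + 4), out(8);
      int* d_a   = thrust::raw_pointer_cast(a.data());
      int* d_b   = thrust::raw_pointer_cast(b.data());
      int* d_out = thrust::raw_pointer_cast(out.data());

      // Phase 1: query the required amount of temporary storage.
      void* d_temp = nullptr;
      std::size_t temp_bytes = 0;
      cub::DeviceMerge::MergeKeys(d_temp, temp_bytes, d_a, 4, d_b, 4, d_out,
                                  cuda::std::less<int>{});

      // Phase 2: run the merge with the allocated temporary storage.
      thrust::device_vector<unsigned char> temp(temp_bytes);
      d_temp = thrust::raw_pointer_cast(temp.data());
      cub::DeviceMerge::MergeKeys(d_temp, temp_bytes, d_a, 4, d_b, 4, d_out,
                                  cuda::std::less<int>{});
      return 0;
    }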

Libcudacxx

  • Assertions can now be enabled on host and device, depending on the user's choice
  • C++26 inplace_vector has been implemented and backported to C++14
  • Improved support for the extended floating-point types __half and __nv_bfloat16, both for cmath functions and for complex
  • cuda::std::tuple is now trivially copyable if the stored types are trivially copyable
  • Reworked our atomics implementation
  • Improved <cuda/std/bit> conformance
  • Implemented <cuda/std/bitset> and backported to C++14
  • Implemented and backported C++20 bit_cast. It is available in all standard modes and is constexpr where compiler support allows (see the sketch after this list)
  • Various backports and constexpr improvements (bool_constant, cuda::std::max)
  • Moved the experimental memory resources from <cuda/memory_resource> into <cuda/experimental/memory_resource.cuh>
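
A minimal sketch of the backported cuda::std::bit_cast, usable in host and device code; the helper function is illustrative.

    #include <cuda/std/bit>
    #include <cstdint>

    // Reinterpret the object representation of a float as a 32-bit integer.
    __host__ __device__ std::uint32_t float_bits(float f) {
      return cuda::std::bit_cast<std::uint32_t>(f);
    }

    int main() {
      return float_bits(1.0f) == 0x3f800000u ? 0 : 1;
    }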

Python

cuda.cooperative

CCCL's best practices for making your CUDA kernels easier to write and faster to execute are now available in Python through the cuda.cooperative module. This module currently supports block- and warp-level algorithms within numba.cuda kernels, offering speed-of-light reductions, prefix sums, radix and merge sort. You can customize cuda.cooperative algorithms with user-defined data types and operators, implemented directly in Python.

Block- and warp-level cooperative algorithms are now available in Python #1973.
Experimental versions of reduce, scan, merge sort, and radix sort are available in numba.cuda kernels.

cuda.parallel

In addition to the device-side cooperative algorithms, CCCL 2.7 provides experimental host-side parallel algorithms as part of the cuda.parallel module. This release includes parallel reduction.

What's Changed

Read more

CCCL 2.6.1

10 Sep 18:45
v2.6.1
9019a6a

This release includes backports of PRs #2332 and #2341. Please see release 2.6.0 for the full list of changes.

What's Changed

Full Changelog: v2.6.0...v2.6.1

CCCL 2.6.0

04 Sep 17:42
c67b1c3

What's Changed

Full Changelog: v2.5.0...v2.6.0

CCCL 2.5.0

17 Jun 18:00
69be18c

What's New

This release includes several notable improvements and new features:

  • CUB device-level algorithms now support NVTX ranges in Nsight Systems. This integration makes it easier to identify and analyze the time spent in CUB algorithms. Please note that profiling with this feature requires at least C++14.
  • We have added a new cub::DeviceSelect::FlaggedIf API, which selects items by applying a predicate to their flags (sketched below). This addition provides more flexibility and control over item selection.
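
A minimal sketch of cub::DeviceSelect::FlaggedIf keeping the items whose flag satisfies a predicate. The parameter order is assumed to mirror cub::DeviceSelect::Flagged with a trailing predicate; check the CUB documentation.

    #include <cub/device/device_select.cuh>
    #include <thrust/device_vector.h>
    #include <cstddef>

    struct IsPositive {
      __host__ __device__ bool operator()(int flag) const { return flag > 0; }
    };

    int main() {
      int in_h[]    = {10, 20, 30, 40};
      int flags_h[] = { 1, -1,  2,  0};
      thrust::device_vector<int> in(in_h, in_h + 4), flags(flags_h, flags_h + 4);
      thrust::device_vector<int> out(4), num_selected(1);

      void* d_temp = nullptr;
      std::size_t temp_bytes = 0;
      auto call = [&] {                               // same call for both phases
        return cub::DeviceSelect::FlaggedIf(
            d_temp, temp_bytes,
            thrust::raw_pointer_cast(in.data()),
            thrust::raw_pointer_cast(flags.data()),
            thrust::raw_pointer_cast(out.data()),
            thrust::raw_pointer_cast(num_selected.data()),
            4, IsPositive{});
      };
      call();                                         // phase 1: size query
      thrust::device_vector<unsigned char> temp(temp_bytes);
      d_temp = thrust::raw_pointer_cast(temp.data());
      call();                                         // phase 2: selection
      return 0;
    }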

What's Changed

Read more

v2.4.0

23 Apr 21:30
1c009d2

What’s New

We are still hard at work paying down technical debt in CCCL, improving infrastructure, and making various other simplifications as part of the unification of Thrust/CUB/libcu++. In addition to various fixes and documentation improvements, the following notable improvements have been made to Thrust, CUB, and libcudacxx.

Thrust

As part of our kernel consolidation effort, the kernels of the thrust::unique_by_key, thrust::copy_if, and thrust::partition algorithms are now consolidated in CUB. Kernel consolidation achieves two goals: first, it delivers the latest optimizations of the CUB algorithms to Thrust users; second, it brings support for large problem sizes (64-bit offsets) to these Thrust algorithms.

CUB

  • cub::DeviceSelect::UniqueByKey now supports a custom equality operator and large problem sizes.
  • The new cub::DeviceFor family of algorithms goes beyond the conventional cub::DeviceFor::ForEach; for example, cub::DeviceFor::ForEachCopy can provide additional performance benefits from vectorized memory accesses (see the sketch after this list).
  • Many CUB algorithms now support CUDA graph capture mode.
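
A minimal sketch of cub::DeviceFor::ForEachCopy, which applies a functor to a copy of each input item while reading the input with vectorized loads. The overload without temporary storage and the header path are assumptions; check the CUB documentation.

    #include <cub/device/device_for.cuh>
    #include <thrust/device_vector.h>

    struct AddTo {
      int* sum;
      __device__ void operator()(int x) const { atomicAdd(sum, x); }
    };

    int main() {
      int data_h[] = {1, 2, 3, 4};
      thrust::device_vector<int> data(data_h, data_h + 4), sum(1, 0);
      int* d_data = thrust::raw_pointer_cast(data.data());
      // Sum the items via an atomic accumulator; sum should end up holding 10.
      cub::DeviceFor::ForEachCopy(d_data, d_data + 4,
                                  AddTo{thrust::raw_pointer_cast(sum.data())});
      cudaDeviceSynchronize();
      return 0;
    }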

libcudacxx

  • Added the new cuda::ptx namespace with wrappers for inline PTX instructions
  • cuda::std::complex specializations for the CUDA extended floating-point types __half and __nv_bfloat16 (sketched below).
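
A minimal sketch of the cuda::std::complex specialization for __half (the __nv_bfloat16 case is analogous); including <cuda_fp16.h> to enable the specialization is an assumption.

    #include <cuda_fp16.h>        // assumed to enable the __half specialization
    #include <cuda/std/complex>

    __global__ void multiply(__half* out) {
      cuda::std::complex<__half> a(__float2half(1.0f), __float2half(2.0f));
      cuda::std::complex<__half> b(__float2half(3.0f), __float2half(-1.0f));
      auto c = a * b;             // complex multiply with half-precision parts
      out[0] = c.real();
      out[1] = c.imag();
    }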

What's Changed

Read more

v2.3.2

12 Mar 20:22
64d3a5f

What's Changed

Full Changelog: v2.3.1...v2.3.2

v2.3.1

23 Apr 21:29
299eb62

What's Changed

  • [BACKPORT]: Fix bug in stream_ref::wait by @miscco in #1283
  • Revert "Refactor thrust::complex as a struct derived from cuda::std::complex (#454)" by @miscco in #1286
  • Create patch 2.3.1 by @wmaxey in #1287

Full Changelog: v2.3.0...v2.3.1