Releases: NVIDIA/cccl
python-0.3.1
These are the release notes for the `cuda-cccl` Python package version 0.3.1, dated October 8th, 2025. The previous release was v0.3.0.
`cuda.cccl` is in "experimental" status, meaning that its API and feature set can change quite rapidly.
Installation
Please refer to the install instructions here
Features and improvements
- The `cuda.cccl.parallel.experimental` package has been renamed to `cuda.compute`.
- The `cuda.cccl.cooperative.experimental` package has been renamed to `cuda.coop`.
- The old imports will continue to work for now, but will be removed in a subsequent release.
- Documentation at https://nvidia.github.io/cccl/python/ has been updated to reflect these changes.
Bug Fixes
Breaking Changes
- If you were previously importing subpackages of `cuda.cccl.parallel.experimental` or `cuda.cccl.cooperative.experimental`, those imports may not work as expected. Please import from `cuda.compute` and `cuda.coop` respectively; see the sketch below.
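A minimal sketch of the import migration (the module aliases are illustrative):

```python
# Old imports (deprecated; they still work for now but are slated for removal):
# import cuda.cccl.parallel.experimental as parallel
# import cuda.cccl.cooperative.experimental as coop

# New imports:
import cuda.compute as compute
import cuda.coop as coop
```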
v3.0.3
What's Changed
🔄 Other Changes
- Backport #5442 to branch/3.0x by @shwina in #5469
- Backport to 3.0: Fix grid dependency sync in cub::DeviceMergeSort (#5456) by @bernhardmgruber in #5461
- Partial backport to 3.0: Fix SMEM alignment in DeviceTransform by @bernhardmgruber in #5463
- [Version] Update branch/3.0.x to v3.0.3 by @github-actions[bot] in #5502
- [Backport branch/3.0.x] NV_TARGET and cuda::ptx for CTK 13 by @fbusato in #5481
- [BACKPORT 3.0]: Update PTX ISA version for CUDA 13 (#5676) by @miscco in #5700
- Backport some MSVC test fixes to 3.0 by @miscco in #5819
- [Backport 3.0]: Work around `submdspan` compiler issue on MSVC (#5885) by @miscco in #5903
- Backport pin of llvmlite dependency to branch/3.0x by @shwina in #6000
- [Backport branch/3.0.x] Ensure that we are actually calling the cuda APIs ... (#4570) by @davebayer in #6098
- [Backport to 3.0] add a specialization of `__make_tuple_types` for `complex<T>` (#6102) by @davebayer in #6117
- [Backport 3.0.x] Use proper qualification in allocate.h (#4796) by @wmaxey in #6126
Full Changelog: v3.0.2...v3.0.3
CCCL Python Libraries (v0.3.0)
These are the release notes for the `cuda-cccl` Python package version 0.3.0, dated October 2nd, 2025. The previous release was v0.2.1.
`cuda.cccl` is in "experimental" status, meaning that its API and feature set can change quite rapidly.
Installation
Please refer to the install instructions here
Features and improvements
- ARM64 wheel and conda package support: Installation via `pip` and `conda` is now supported on ARM64 (aarch64) architecture.
- New algorithm: three-way partitioning: The `three_way_partition` algorithm enables partitioning an array (or iterator) into three partitions, given two selection operators.
- Improved scan performance: The `inclusive_scan` and `exclusive_scan` APIs provide improved performance by automatically selecting the optimal tuning for the input data types and device architecture. A hedged usage sketch follows below.
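As an illustration of the scan APIs, here is a sketch whose call shape mirrors the single-call `reduce_into` example in the 0.1.3.2.0.dev128 notes further down; the exact argument order is an assumption, so consult the package documentation before relying on it:

```python
import cupy as cp
import numpy as np
import cuda.cccl.parallel.experimental as parallel

def add_op(a, b):
    return a + b

d_input = cp.arange(10, dtype=np.int32)
d_output = cp.empty_like(d_input)
h_init = np.zeros(1, dtype=np.int32)  # initial value; shape and dtype assumed

# Assumed signature, mirroring reduce_into(d_in, d_out, op, num_items, h_init)
parallel.inclusive_scan(d_input, d_output, add_op, d_input.size, h_init)
```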
Bug Fixes
None.
Breaking Changes
None.
CCCL Python Libraries v0.1.3.2.0.dev128 (pre-release)
These are the changes in the `cuda.cccl` libraries introduced in the pre-release 0.1.3.2.0.dev128, dated August 14th, 2025.
`cuda.cccl` is in "experimental" status, meaning that its API and feature set can change quite rapidly.
Major API improvements
Single-call APIs in cuda.cccl.parallel
Previously, performing an operation like `reduce_into` required four API invocations:
(1) create a reducer object, (2) compute the amount of temporary storage required for the reduction,
(3) allocate the required amount of temporary memory, and (4) perform the reduction.
In this version, `cuda.cccl.parallel` introduces simpler, single-call APIs. For example, reduction looks like:
# New API - single function call with automatic temp storage
parallel.reduce_into(d_input, d_output, add_op, num_items, h_init)
If you wish to have more control over temporary memory allocation,
the previous API still exists (and always will). It has been renamed from `reduce_into` to `make_reduce_into`:
import cupy as cp
import numpy as np
import cuda.cccl.parallel.experimental as parallel

# Object API
reducer = parallel.make_reduce_into(d_input, d_output, add_op, h_init)
# Passing None as the temp buffer queries the required temporary storage size
temp_storage_size = reducer(None, d_input, d_output, num_items, h_init)
temp_storage = cp.empty(temp_storage_size, dtype=np.uint8)
reducer(temp_storage, d_input, d_output, num_items, h_init)
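For completeness, a self-contained version of the single-call example above; the array setup and `add_op` definition are illustrative assumptions:

```python
import cupy as cp
import numpy as np
import cuda.cccl.parallel.experimental as parallel

def add_op(a, b):
    return a + b

num_items = 1000
d_input = cp.arange(num_items, dtype=np.int32)
d_output = cp.empty(1, dtype=np.int32)    # holds the single reduced value
h_init = np.zeros(1, dtype=np.int32)      # initial value for the reduction

# Single call: temporary storage is allocated and freed automatically
parallel.reduce_into(d_input, d_output, add_op, num_items, h_init)
print(d_output.get())                     # expect sum(0..999) == 499500
```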
New algorithms
Device-wide histogram
The `histogram_even` function provides Python exposure of the corresponding CUB C++ API `DeviceHistogram::HistogramEven`.
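A hedged sketch of what a call might look like; the parameter names and order below are borrowed from the C++ `DeviceHistogram::HistogramEven` and are assumptions here, not the documented Python signature:

```python
import cupy as cp
import numpy as np
import cuda.cccl.parallel.experimental as parallel

d_samples = cp.random.uniform(0.0, 100.0, size=1000).astype(np.float32)
num_levels = 11                                   # 10 evenly spaced bins
d_histogram = cp.zeros(num_levels - 1, dtype=np.int32)

# Assumed call shape, mirroring cub::DeviceHistogram::HistogramEven;
# check the cuda.cccl.parallel documentation for the exact signature.
parallel.histogram_even(d_samples, d_histogram, num_levels,
                        0.0, 100.0, d_samples.size)
```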
StripedToBlock exchange
`cuda.cccl.cooperative` adds a `block.exchange` API, providing Python exposure of the corresponding CUB C++ API `BlockExchange`.
Currently, only the `StripedToBlock` exchange pattern is supported.
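To see what the striped-to-block pattern does, independent of the library API, here is a small NumPy illustration of the index mapping (thread and item counts are arbitrary):

```python
import numpy as np

num_threads, items_per_thread = 4, 2
data = np.arange(num_threads * items_per_thread)

# Striped: item i of thread t sits at index i * num_threads + t
striped = data.reshape(items_per_thread, num_threads)  # column t = thread t's items
# Blocked: item i of thread t sits at index t * items_per_thread + i
blocked = striped.T                                    # row t = thread t's items

print(striped[:, 0])  # thread 0's striped items: [0 4]
print(blocked[0, :])  # after the exchange, thread 0 holds them contiguously: [0 4]
```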
Infrastructure improvements
CuPy dependency replaced with cuda.core
Use of CuPy within the library has been replaced with the lighter weight cuda.core
package. This means that installing cuda.cccl
won't install CuPy as a dependency.
Support for CUDA 13 drivers
`cuda.cccl` can be used with CUDA 13-compatible drivers. However, the CUDA 13 toolkit (runtime and libraries) is not
yet supported, meaning you still need the CUDA 12 toolkit. Full support for the CUDA 13 toolkit is planned for the next
pre-release.
v3.0.2
What's Changed
🔄 Other Changes
- [Version] Update branch/3.0.x to v3.0.2 by @github-actions[bot] in #5348
- Backport to 3.0: Add a macro to disable PDL (#5316) by @bernhardmgruber in #5330
- [Backport branch/3.0x] Add gitlab devcontainers (#5325) by @wmaxey in #5352
Full Changelog: v3.0.1...v3.0.2
v3.0.1
What's Changed
🔄 Other Changes
- [Version] Update branch/3.0.x to v3.0.1 by @github-actions[bot] in #5256
- [Backport branch/3.0.x] Disable assertions for QNX, they do not provide the machinery with their libc by @github-actions[bot] in #5258
- [BACKPORT 3.0] Make sure that nested `tuple` and `pair` have the expected size (#5246) by @miscco in #5265
- [BACKPORT] Add missed specializations of the new aligned vector types to cub (#5264) by @miscco in #5271
- [BACKPORT 3.0] Backport diagnostic suppression machinery by @miscco in #5281
Full Changelog: v3.0.0...v3.0.1
v3.0.0
CCCL 3.0 Release
The 3.0 release of the CUDA Core Compute Libraries (CCCL) marks our first major version since unifying the Thrust, CUB, and libcudacxx libraries under a single repository. This release reflects over a year of work focused on cleanup, consolidation, and modernizing the codebase to support future growth.
While this release includes a number of breaking changes, many involve the consolidation of APIs (particularly in the `thrust::` and `cub::` namespaces) as well as cleanup of internal details that were never intended for public use. In many cases, redundant functionality from `thrust::` or `cub::` has been replaced with equivalent or improved abstractions from the `cuda::` or `cuda::std::` namespaces. Impact should be minimal for most users. For full details and recommended migration steps, please consult the CCCL 2.x to 3.0 Migration Guide.
Key Changes in CCCL 3.0
Requirements
- C++17 or newer is now required (support for C++11 and C++14 has been dropped #3255)
- CUDA Toolkit 12.0+ is now required (support for CTK 11.0+ has been dropped). For details on version compatibility, see the README.
- Compilers:
  - Dropped support for
Header Directory Changes in CUDA Toolkit 13.0
CCCL 3.0 will be included with an upcoming CUDA Toolkit 13.0 release. In this release, the bundled CCCL headers have moved to new top-level directories under `${CTK_ROOT}/include/cccl/`.

| Before CUDA 13.0 | After CUDA 13.0 |
|---|---|
| `${CTK_ROOT}/include/cuda/` | `${CTK_ROOT}/include/cccl/cuda/` |
| `${CTK_ROOT}/include/cub/` | `${CTK_ROOT}/include/cccl/cub/` |
| `${CTK_ROOT}/include/thrust/` | `${CTK_ROOT}/include/cccl/thrust/` |
These changes only affect the on-disk location of CCCL headers within the CUDA Toolkit installation.
What you need to know
- ❌ Do NOT write `#include <cccl/...>`; this will break.
- If using CCCL headers only in files compiled with nvcc:
  - ✅ No action needed. This is the default for most users.
- If using CCCL headers in files compiled exclusively with a host compiler (e.g., GCC, Clang, MSVC):
  - Using CMake and linking `CCCL::CCCL`:
    - ✅ No action needed. (This is the recommended path. See example.)
  - Other build systems:
    - ⚠️ Add `${CTK_ROOT}/include/cccl` to your compiler's include search path (e.g., with `-I`)
These changes prevent issues when mixing CCCL headers bundled with the CUDA Toolkit and those from external package managers. For more detail, see the CCCL 2.x to 3.0 Migration Guide.
Major API Changes
Hundreds of macros, internal types, and implementation details were removed or relocated to internal namespaces. This significantly reduces surface area and eliminates long-standing technical debt, improving both compile times and maintainability.
Removed Macros
Over 50 legacy macros have been removed in favor of modern C++ alternatives:
- `CUB_{MIN,MAX}`: use `cuda::std::{min,max}` instead #3821
- `THRUST_NODISCARD`: use `[[nodiscard]]` instead #3746
- `THRUST_INLINE_CONSTANT`: use `inline constexpr` instead #3746
- See CCCL 2.x to 3.0 Migration Guide for complete list
Removed Functions and Classes
- `thrust::optional`: use `cuda::std::optional` instead #4172
- `thrust::tuple`: use `cuda::std::tuple` instead #2395
- `thrust::pair`: use `cuda::std::pair` instead #2395
- `thrust::numeric_limits`: use `cuda::std::numeric_limits` instead #3366
- `cub::BFE`: use `cuda::bitfield_insert` and `cuda::bitfield_extract` instead #4031
- `cub::ConstantInputIterator`: use `thrust::constant_iterator` instead #3831
- `cub::CountingInputIterator`: use `thrust::counting_iterator` instead #3831
- `cub::GridBarrier`: use cooperative groups instead #3745
- `cub::DeviceSpmv`: use cuSPARSE instead #3320
- `cub::Mutex`: use `cuda::std::mutex` instead #3251
- See CCCL 2.x to 3.0 Migration Guide for complete list
New Features
C++
cuda::
- `cuda::std::numeric_limits` now supports `__float128` #4059
- `cuda::std::optional<T&>` implementation (P2988) #3631
- `cuda::std::numbers` header for mathematical constants #3355
- NVFP8/6/4 extended floating-point types support in `<cuda/std/cmath>` #3843
- `cuda::overflow_cast` for safe numeric conversions #4151
- `cuda::ilog2` and `cuda::ilog10` integer logarithms #4100
- `cuda::round_up` and `cuda::round_down` utilities #3234
cub::
- `cub::DeviceSegmentedReduce` now supports a large number of segments #3746
- `cub::DeviceCopy::Batched` now supports a large number of buffers #4129
- `cub::DeviceMemcpy::Batched` now supports a large number of buffers #4065
thrust::
- New `thrust::offset_iterator` iterator #4073
- Temporary storage allocations in parallel algorithms now respect `par_nosync` #4204
Python
CUDA Python Core Libraries are now available on PyPI through the `cuda-cccl` package:
pip install cuda-cccl
cuda.cccl.cooperative
- Block-level sorting now supports multi-dimensional thread blocks #4035, #4028
- Block-level data movement now supports multi-dimensional thread blocks #3161
- New block-level inclusive sum algorithm #3921
cuda.cccl.parallel
- New device-level segmented-reduce algorithm #3906
- New device-level unique-by-key algorithm #3947
- New device-level merge-sort algorithm #3763
What's Changed
🚀 Thrust / CUB
- Drop cub::Mutex by @bernhardmgruber in #3251
- Remove legacy macros from CUB util_arch.cuh by @bernhardmgruber in #3257
- Remove thrust::[unary|binary]_traits by @bernhardmgruber in #3260
- Drop thrust not1 and not2 by @bernhardmgruber in #3264
- Deprecate GridBarrier and GridBarrierLifetime by @bernhardmgruber in #3258
- Drop thrust::[unary|binary]_function by @bernhardmgruber in #3274
- Enable thrust::identity test for non-MSVC by @bernhardmgruber in #3281
- Enable PDL in triple chevron launch by @bernhardmgruber in #3282
- Drop Thrust legacy arch macros by @bernhardmgruber in #3298
- Drop Thrust's compiler_fence.h by @bernhardmgruber in #3300
- Drop CUB's util_compiler.cuh by @bernhardmgruber in #3302
- Drop Thrust's deprecated compiler macros by @bernhardmgruber in #3301
- Drop CUB_RUNTIME_ENABLED and THRUST_HAS_CUDART by @bernhardmgruber in #3305
- Require C++17 for compiling Thrust and CUB by @bernhardmgruber in #3255
- Deprecate Thrust's cpp_compatibility.h macros by @bernhardmgruber in #3299
- Deprecate cub::IterateThreadStore by @bernhardmgruber in #3337
- Drop CUB's BinaryFlip operator by @bernhardmgruber in #3332
- Deprecate cub::Swap by @bernhardmgruber in #3333
- Drop CUB APIs with a debug_synchronous parameter by ...
v2.8.5
What's Changed
- Avoid plain `assert` in device code by @miscco in #4707
- Do not use open-coded `INFINITY` for tests that also test extended floating points by @miscco in #4744
- [Version] Update branch/2.8.x to v2.8.5 by @github-actions in #4755
- [Backport branch/2.8.x] Update Blackwell PTX instruction availability tables by @github-actions in #3900
Full Changelog: v2.8.4...v2.8.5
v2.8.4
What's Changed
- [BACKPORT] Do not use pack indexing with clang-19 by @miscco in #4447
- [Backport branch/2.8.x] Always bypass automatic atomic storage checks to prevent potential compiler issues by @github-actions in #4616
- [Version] Update branch/2.8.x to v2.8.4 by @github-actions in #4655
Full Changelog: v2.8.3...v2.8.4
v2.8.3
What's Changed
- [BACKPORT: 2.8] Set NO_CMAKE_FIND_ROOT_PATH for cudax. (#4162) by @miscco in #4216
- [BACKPORT 2.8] Fix the cuda python setup by @miscco in #4218
- Backport PR #4221 to branch/2.8.x — Remove python/cuda_cooperative/setup.py by @rwgk in #4235
- [Backport branch/2.8.x] Remove invalid single `#` in builtin.h by @github-actions in #4326
- [BACKPORT 2.8] Allow rapids to avoid unrolling some loops in sort (#4253) by @miscco in #4387
- [Backport branch/2.8.x] Fix uninitialized read in local atomic code path. by @github-actions in #4424
- [Version] Update branch/2.8.x to v2.8.3 by @github-actions in #4423
Full Changelog: v2.8.2...v2.8.3