
Optimized Device-to-Device Tensor Copy (cudax)#7823

Open
fbusato wants to merge 145 commits into NVIDIA:main from fbusato:SoL-d2d-copy_bytes

Conversation

@fbusato
Contributor

@fbusato fbusato commented Feb 27, 2026

Description

Provide an optimized version of device-to-device copy between two multi-dimensional tensors with compatible extents and arbitrary strides.
The feature was initially based, experimentally, on cuTe (CUTLASS), pending an evaluation of whether that dependency could be removed without reimplementing cuTe in CCCL. The functionality has since been reimplemented without CUTLASS/cuTe.

The code contains the following optimizations:

  1. Fast path for "empty" tensors and scalars.
  2. Removal of size-1 dimensions.
  3. Stride sorting.
  4. Negative-stride optimization.
  5. Coalescing of tensor dimensions.
  6. Dispatch of contiguous tensors to cub::DeviceTransform.
  7. Vectorization.
  8. Tiling.
  9. Common kernel optimizations: __restrict__, __grid_constant__.

To explore:

  • Shared memory/TMA for tensor "transpose" use case.

The PR has been rebased from #7676 (prerequisite).

The PR contains:

  • Execution code.
  • Unit tests: basic, edge cases, cuTe vectorization, from nvmath.
  • Benchmarks.
  • Documentation.

Requires #8095

fbusato and others added 30 commits January 27, 2026 14:57
* Add native type system for cuda.compute

* Add JIT infrastructure and intrinsics

* Decouple struct.py from numba

* Decouple core interop infrastructure from Numba

* Decouple iterator type system from numba

* Decouple algorithms from numba type system

* Move iterator type inference logic to _jit.py

* Some items from review

* Bump copyright

---------

Co-authored-by: Ashwin Srinath <shwina@users.noreply.github.com>
* Add runtime check if memory pools are supported

* Fix 12.X build

* Fix typo

* Also apply to is_pointer_accessible test

* Fix extra assert

* I love MSVC

---------

Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com>
Contributor

@NaderAlAwar NaderAlAwar left a comment


If the comments I left reveal bugs, could we also add tests that expose them?


inline constexpr int __bytes_in_flight = 64 * 1024; // 64KB
Contributor


Question: we have arch_to_min_bytes_in_flight in tuning_transform.cuh where this exact value is repeated. Is it possible to reuse this here? Or at least have a common helper somewhere both can use?

Contributor Author


That's actually a good point. On the other hand, I'm worried that arch_to_min_bytes_in_flight is specifically tuned for cub::DeviceTransform. Also, this would mean introducing a policy chain to get the value at compile time, which is pretty invasive. @bernhardmgruber any thoughts?

@fbusato
Contributor Author

fbusato commented Apr 10, 2026

/ok to test 9f9466e

@github-actions

This comment has been minimized.

@fbusato
Contributor Author

fbusato commented Apr 10, 2026

/ok to test 5796e40

Comment on lines +121 to +129
[[nodiscard]] _CCCL_HOST_API inline int __bytes_in_flight() noexcept
{
const auto __dev_id = ::cuda::__driver::__cudevice_to_ordinal(::cuda::__driver::__ctxGetDevice());
const auto __dev = ::cuda::devices[__dev_id];
const auto __major = __dev.attribute<::cudaDevAttrComputeCapabilityMajor>();
const auto __minor = __dev.attribute<::cudaDevAttrComputeCapabilityMinor>();
const auto __arch = ::cuda::arch_id{__major * 10 + __minor};
return CUB_NS_QUALIFIER::detail::transform::arch_to_min_bytes_in_flight(__arch);
}
Contributor Author


@pciolkosz it would be nice to have a utility to avoid all these calls every time

@fbusato fbusato requested a review from NaderAlAwar April 11, 2026 00:01
@github-actions

This comment has been minimized.

Comment on lines +161 to +162
auto src_ptr = thrust::raw_pointer_cast(d_src.data()) + src_offset;
auto dst_ptr = thrust::raw_pointer_cast(d_dst.data()) + dst_offset;
Contributor


Important: I think this layout_stride_relaxed construction is inconsistent with the documented model. Here the pointer is shifted by src_offset/dst_offset, but the mapping itself is still created with offset == 0. The docs describe layout_stride_relaxed the other way around: keep the data pointer at the base, and store the compensation in mapping.offset() so mapping(indices...) = offset + sum(index_i * stride_i) remains nonnegative and required_span_size() reflects the actual span, especially for negative strides. See https://github.com/NVIDIA/cccl/blob/main/docs/libcudacxx/extended_api/mdspan/dlpack_to_mdspan.rst#semantics. As written, this helper seems to encode the offset in the pointer instead of the mapping.

Contributor Author


Good point. The PR was created before layout_stride_relaxed was merged.

@fbusato
Contributor Author

fbusato commented Apr 13, 2026

/ok to test b77dafe

@fbusato fbusato enabled auto-merge (squash) April 13, 2026 18:01
@github-actions

This comment has been minimized.

@fbusato
Contributor Author

fbusato commented Apr 14, 2026

/ok to test 8f9a243

@github-actions
Contributor

😬 CI Workflow Results

🟥 Finished in 31m 02s: Pass: 92%/53 | Total: 11h 40m | Max: 31m 02s | Hits: 98%/29863

See results here.

@fbusato
Contributor Author

fbusato commented Apr 14, 2026

/ok to test 224c50c


Labels

cudax Feature intended for the cudax experimental library

Projects

Status: In Review


8 participants