
Optimized Device-to-Device Tensor Copy (cudax)#7823

Open
fbusato wants to merge 145 commits into NVIDIA:main from fbusato:SoL-d2d-copy_bytes

Conversation

@fbusato
Contributor

@fbusato fbusato commented Feb 27, 2026

Description

Provide an optimized version of device-to-device copy between two multi-dimensional tensors with compatible extents and arbitrary strides.
The feature was initially based, experimentally, on cuTe (CUTLASS), pending an evaluation of whether that dependency could be removed without reimplementing cuTe in CCCL. The functionality has since been reimplemented without CUTLASS/cuTe.

The code contains the following optimizations:

  1. Fast path for "empty" tensors and scalars.
  2. Removal of size-1 dimensions.
  3. Stride sorting.
  4. Negative-stride optimization.
  5. Coalescing of tensor dimensions.
  6. Dispatch of contiguous tensors to cub::DeviceTransform.
  7. Vectorization.
  8. Tiling.
  9. Common kernel optimizations: __restrict__, __grid_constant__.

To explore:

  • Shared memory/TMA for tensor "transpose" use case.

The PR has been rebased from #7676 (prerequisite).

The PR contains:

  • Execution code.
  • Unit tests: basic, edge cases, cuTe vectorization, from nvmath.
  • Benchmarks.
  • Documentation.

Requires #8095

fbusato and others added 30 commits January 27, 2026 14:57
* Add native type system for cuda.compute

* Add JIT infrastructure and intrinsics

* Decouple struct.py from numba

* Decouple core interop infrastructure from Numba

* Decouple iterator type system from numba

* Decouple algorithms from numba type system

* Move iterator type inference logic to _jit.py

* Some items from review

* Bump copyright

---------

Co-authored-by: Ashwin Srinath <shwina@users.noreply.github.com>
* Add runtime check if memory pools are supported

* Fix 12.X build

* Fix typo

* Also apply to is_pointer_accessible test

* Fix extra assert

* I love MSVC

---------

Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com>
Contributor

@NaderAlAwar NaderAlAwar left a comment


If the comments I left reveal bugs, could we also add tests that expose them?


inline constexpr int __bytes_in_flight = 64 * 1024; // 64KB
Contributor


Question: we have arch_to_min_bytes_in_flight in tuning_transform.cuh where this exact value is repeated. Is it possible to reuse this here? Or at least have a common helper somewhere both can use?

Contributor Author


That's actually a good point. On the other hand, I'm worried that arch_to_min_bytes_in_flight is specifically tuned for cub::DeviceTransform. Also, this would mean introducing a policy chain to get the value at compile time, which is pretty invasive. @bernhardmgruber any thoughts?

@fbusato
Contributor Author

fbusato commented Apr 10, 2026

/ok to test 9f9466e

@github-actions

This comment has been minimized.

@fbusato
Contributor Author

fbusato commented Apr 10, 2026

/ok to test 5796e40

Comment on lines +121 to +129
[[nodiscard]] _CCCL_HOST_API inline int __bytes_in_flight() noexcept
{
const auto __dev_id = ::cuda::__driver::__cudevice_to_ordinal(::cuda::__driver::__ctxGetDevice());
const auto __dev = ::cuda::devices[__dev_id];
const auto __major = __dev.attribute<::cudaDevAttrComputeCapabilityMajor>();
const auto __minor = __dev.attribute<::cudaDevAttrComputeCapabilityMinor>();
const auto __arch = ::cuda::arch_id{__major * 10 + __minor};
return CUB_NS_QUALIFIER::detail::transform::arch_to_min_bytes_in_flight(__arch);
}
Contributor Author


@pciolkosz it would be nice to have a utility to avoid all these calls every time

@fbusato fbusato requested a review from NaderAlAwar April 11, 2026 00:01
@github-actions

This comment has been minimized.

Comment on lines +161 to +162
auto src_ptr = thrust::raw_pointer_cast(d_src.data()) + src_offset;
auto dst_ptr = thrust::raw_pointer_cast(d_dst.data()) + dst_offset;
Contributor


Important: I think this layout_stride_relaxed construction is inconsistent with the documented model. Here the pointer is shifted by src_offset/dst_offset, but the mapping itself is still created with offset == 0. The docs describe layout_stride_relaxed the other way around: keep the data pointer at the base, and store the compensation in mapping.offset() so mapping(indices...) = offset + sum(index_i * stride_i) remains nonnegative and required_span_size() reflects the actual span, especially for negative strides. See https://github.com/NVIDIA/cccl/blob/main/docs/libcudacxx/extended_api/mdspan/dlpack_to_mdspan.rst#semantics. As written, this helper seems to encode the offset in the pointer instead of the mapping.

Contributor Author


Good point. The PR was created before layout_stride_relaxed was merged.

@fbusato
Contributor Author

fbusato commented Apr 13, 2026

/ok to test b77dafe

@fbusato fbusato enabled auto-merge (squash) April 13, 2026 18:01
@github-actions

This comment has been minimized.

@fbusato
Contributor Author

fbusato commented Apr 14, 2026

/ok to test 8f9a243

@github-actions
Contributor

😬 CI Workflow Results

🟥 Finished in 31m 02s: Pass: 92%/53 | Total: 11h 40m | Max: 31m 02s | Hits: 98%/29863

See results here.

@fbusato
Contributor Author

fbusato commented Apr 14, 2026

/ok to test 224c50c


Labels

cudax Feature intended for the cudax experimental library

Projects

Status: In Review


8 participants