Optimized Device-to-Device Tensor Copy (cudax) #7823

fbusato wants to merge 145 commits into NVIDIA:main

Conversation
* Add native type system for cuda.compute
* Add JIT infrastructure and intrinsics
* Decouple struct.py from numba
* Decouple core interop infrastructure from Numba
* Decouple iterator type system from numba
* Decouple algorithms from numba type system
* Move iterator type inference logic to _jit.py
* Some items from review
* Bump copyright

Co-authored-by: Ashwin Srinath <shwina@users.noreply.github.com>
* Add runtime check if memory pools are supported
* Fix 12.X build
* Fix typo
* Also apply to is_pointer_accessible test
* Fix extra assert
* I love MSVC

Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com>
NaderAlAwar left a comment:
If the comments I left reveal bugs, could we also add tests that expose them?
```cpp
inline constexpr int __bytes_in_flight = 64 * 1024; // 64 KB
```
Question: we have arch_to_min_bytes_in_flight in tuning_transform.cuh where this exact value is repeated. Is it possible to reuse this here? Or at least have a common helper somewhere both can use?
That's actually a good point. On the other hand, I'm worried that arch_to_min_bytes_in_flight is specifically tuned for cub::DeviceTransform. Also, this would mean introducing a policy chain to get the value at compile time, which is pretty invasive. @bernhardmgruber any thoughts?
/ok to test 9f9466e
/ok to test 5796e40
```cpp
[[nodiscard]] _CCCL_HOST_API inline int __bytes_in_flight() noexcept
{
  const auto __dev_id = ::cuda::__driver::__cudevice_to_ordinal(::cuda::__driver::__ctxGetDevice());
  const auto __dev    = ::cuda::devices[__dev_id];
  const auto __major  = __dev.attribute<::cudaDevAttrComputeCapabilityMajor>();
  const auto __minor  = __dev.attribute<::cudaDevAttrComputeCapabilityMinor>();
  const auto __arch   = ::cuda::arch_id{__major * 10 + __minor};
  return CUB_NS_QUALIFIER::detail::transform::arch_to_min_bytes_in_flight(__arch);
}
```
@pciolkosz it would be nice to have a utility to avoid repeating all these calls every time.
cudax/test/copy/copy_common.cuh (Outdated)
```cpp
auto src_ptr = thrust::raw_pointer_cast(d_src.data()) + src_offset;
auto dst_ptr = thrust::raw_pointer_cast(d_dst.data()) + dst_offset;
```
Important: I think this layout_stride_relaxed construction is inconsistent with the documented model. Here the pointer is shifted by src_offset/dst_offset, but the mapping itself is still created with offset == 0. The docs describe layout_stride_relaxed the other way around: keep the data pointer at the base, and store the compensation in mapping.offset() so mapping(indices...) = offset + sum(index_i * stride_i) remains nonnegative and required_span_size() reflects the actual span, especially for negative strides. See https://github.com/NVIDIA/cccl/blob/main/docs/libcudacxx/extended_api/mdspan/dlpack_to_mdspan.rst#semantics. As written, this helper seems to encode the offset in the pointer instead of the mapping.
Good point. The PR was created before layout_stride_relaxed was merged.
/ok to test b77dafe
/ok to test 8f9a243
😬 CI Workflow Results

🟥 Finished in 31m 02s | Pass: 92%/53 | Total: 11h 40m | Max: 31m 02s | Hits: 98%/29863

See results here.
/ok to test 224c50c
Description
Provide an optimized version of device-to-device copy between two multi-dimensional tensors with compatible extents and arbitrary strides.
~~The feature is experimentally based on cuTe (CUTLASS). We need to evaluate if it is possible to remove such dependency without reimplementing cuTe in CCCL before production.~~ The functionality has been reimplemented without CUTLASS/cuTe.

The code contains the following optimizations:
* `cub::DeviceTransform`
* `__restrict__`
* `__grid_constant__`

To explore:
The PR has been rebased from #7676 (prerequisite).
The PR contains:
Requires #8095