Which component has the problem?
CuTe DSL
Bug Report
Describe the bug
I was unable to run the dense blockscaled gemm example on B200 with k-major, FP8 operands, and with a batch size of 1. I boiled it down to this example:
```python
import cutlass
import cutlass.torch as cutlass_torch

l, m, k = 1, 512, 256
is_m_major = False
a_dtype = cutlass.Float8E5M2

# Create tensor A/B/C
a_ref = cutlass_torch.matrix(l, m, k, is_m_major, cutlass.Float32)
a_tensor, a_torch = cutlass_torch.cute_tensor_like(
    a_ref, a_dtype, is_dynamic_layout=True, assumed_align=16
)
```

which throws this error:

```
RuntimeError: Expected strides[leading_dim] == 1, but got 131072.
```
If I change `l` to 2 or `is_m_major` to `True`, then it works.
I think the issue is that `convert_cute_tensor` uses `torch.Tensor.dim_order()` to find the contiguous mode, but `dim_order()` handles size-1 modes in a funny way:
```python
>>> x = torch.empty((512, 1, 256), dtype=torch.int8, device="cuda")
>>> x.stride()
(256, 256, 1)
>>> x.dim_order()
(0, 1, 2)
>>> y = x.permute((0, 2, 1))
>>> y.stride()
(256, 1, 256)
>>> y.dim_order()
(0, 1, 2)
```
On the other hand, the analogous code in `cute_tensor_like` looks directly at the strides to find the contiguous mode.
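A stride-based check sidesteps the `dim_order()` ambiguity entirely. As a rough sketch of that approach (`leading_dim_by_stride` is a hypothetical helper for illustration, not CUTLASS API, and the tie-breaking for multiple stride-1 modes is my assumption):

```python
import torch

def leading_dim_by_stride(t: torch.Tensor) -> int:
    # Hypothetical helper: pick the mode whose stride is 1, preferring
    # modes of size > 1 so that size-1 modes (whose strides are
    # meaningless) cannot shadow the real contiguous mode.
    candidates = [i for i, s in enumerate(t.stride()) if s == 1]
    for i in candidates:
        if t.size(i) != 1:
            return i
    return candidates[0]

x = torch.empty((512, 1, 256), dtype=torch.int8)
y = x.permute((0, 2, 1))
print(leading_dim_by_stride(x))  # 2: innermost mode is contiguous
print(leading_dim_by_stride(y))  # 1: permute moved the stride-1 mode
```

Unlike `dim_order()`, this gives a different (and correct) answer for `x` and `y` above.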
Expected behavior
The example code should be able to construct tensors of any shape and majorness.
Environment details (please complete the following information):
- Environment location: B200 (can replicate the above error on a lower-capability GPU)
jayhshah