
[BUG] Inconsistency in identifying leading modes in cutlass_torch #2673

@allispaul

Description


Which component has the problem?

CuTe DSL

Bug Report

Describe the bug
I was unable to run the dense blockscaled gemm example on B200 with k-major, FP8 operands, and a batch size of 1. I boiled it down to this reproducer:

import cutlass
import cutlass.torch as cutlass_torch

l, m, k = 1, 512, 256

is_m_major = False
a_dtype = cutlass.Float8E5M2

# Create tensor A/B/C
a_ref = cutlass_torch.matrix(l, m, k, is_m_major, cutlass.Float32)

a_tensor, a_torch = cutlass_torch.cute_tensor_like(
    a_ref, a_dtype, is_dynamic_layout=True, assumed_align=16
)

which throws this error:

RuntimeError: Expected strides[leading_dim] == 1, but got 131072.

If I change l to 2 or is_m_major to True, then it works.

I think the issue is that convert_cute_tensor uses torch.Tensor.dim_order() to find the contiguous mode, but dim_order() is ambiguous for size-1 modes: a size-1 dimension's stride carries no layout information, so permuting it changes the strides without changing the reported dim order:

>>> x = torch.empty((512, 1, 256), dtype=torch.int8, device="cuda")
>>> x.stride()
(256, 256, 1)
>>> x.dim_order()
(0, 1, 2)
>>> y = x.permute((0, 2, 1))
>>> y.stride()
(256, 1, 256)
>>> y.dim_order()
(0, 1, 2)

On the other hand, the analogous code in cute_tensor_like inspects the strides directly to find the contiguous mode, which is why the two code paths disagree.
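
For reference, a stride-based check stays unambiguous even in the presence of size-1 modes. Here is a minimal sketch of that approach (find_leading_dim is a hypothetical helper for illustration, not the actual cutlass_torch code):

import torch

def find_leading_dim(t: torch.Tensor) -> int:
    """Return the index of the contiguous (stride-1) mode, skipping size-1 modes."""
    for dim, (size, stride) in enumerate(zip(t.shape, t.stride())):
        # A size-1 mode's stride is arbitrary and carries no layout
        # information, so it cannot identify the leading mode.
        if size != 1 and stride == 1:
            return dim
    raise ValueError("no contiguous mode found")

y = torch.empty((512, 1, 256), dtype=torch.int8).permute((0, 2, 1))
print(y.stride())           # (256, 1, 256)
print(y.dim_order())        # (0, 1, 2) -- ambiguous
print(find_leading_dim(y))  # 1 -- the actual stride-1 mode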

Expected behavior
The example code should be able to construct tensors of any shape and majorness.

Environment details (please complete the following information):

  • Environment location: B200 (the same error reproduces on a lower-capability GPU)
