To implement generic kernels that support arbitrary problem sizes, we sometimes have to rely on predicates to decide whether a given operation should be performed in each thread. Typically we first construct an identity tensor whose size appears to be large, and then partition that tensor across threads by thread id:

```cpp
Tensor identity = make_identity_tensor(shape(mC));
// Partition: each thread gets a smaller coordinate tensor.
// ...
```

My question is: suppose the identity tensor is stored in local memory. Because accessing local memory has the same performance as accessing global memory, wouldn't this reduce performance to some extent for memory-bound kernels?
These are implicit Tensors that have no storage. There will be very few URs (uniform registers) or registers to possibly represent the offset, but otherwise there is no physical storage and the elements of the tensor are generated on the fly.
https://github.com/NVIDIA/cutlass/blob/main/media/docs/cpp/cute/0z_tma_tensors.md