Hi, I have a question regarding vectorized loads in copy().

I have a float register tensor (Trm) with shape 4 and stride 1, and a float global memory tensor (Tgm), also with shape 4 and stride 1. When I do a copy(Trm, Tgm), the generated PTX uses ld.global.v2.u64.

What I'd like to ask is:

1. Is there a way to force the copy() to use ld.global.v4.b32 instead, perhaps by passing a custom CopyAtom or another object?
2. Is this decision made entirely at compile-time, or is there something in the runtime tensor metadata that affects it?
3. Is there any performance difference between ld.global.v2.u64 and ld.global.v4.b32? They both read 128 bits, but are there cases where one would be preferable?

Thanks in advance!
Replies: 1 comment
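No, nothing in the runtime tensor metadata affects it: these are fully static, compile-time decisions. You can, however, have branches in your program that dispatch to different copy loops.

On the performance question: not sure, but as long as both forms compile to the same SASS, it should not matter.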
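To illustrate the "branches that dispatch to different copy loops" idea, here is a minimal host-side C++ sketch. It is not CuTe code and does not use any CuTe API; the names copy_vec and copy_dispatch are hypothetical, and the alignment check is just one assumed example of a runtime condition. Each copy loop has its vector width fixed at compile time, and a runtime branch only picks which statically generated loop to run.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical copy loop: the vector width in bytes is a template
// parameter, so it is fixed at compile time for each instantiation.
template <std::size_t VecBytes>
void copy_vec(float* dst, const float* src, std::size_t n_floats) {
    static_assert(VecBytes % sizeof(float) == 0,
                  "vector width must be a whole number of floats");
    assert(n_floats == 0 || (dst != nullptr && src != nullptr));
    constexpr std::size_t kPerVec = VecBytes / sizeof(float);

    std::size_t i = 0;
    // Main loop: move one fixed-size chunk per iteration.
    for (; i + kPerVec <= n_floats; i += kPerVec) {
        std::memcpy(dst + i, src + i, VecBytes);
    }
    // Tail: remaining elements that do not fill a whole chunk.
    for (; i < n_floats; ++i) {
        dst[i] = src[i];
    }
}

// Hypothetical runtime dispatch: inspect pointer alignment at run time and
// branch to one of the statically compiled loops. Returns the chosen width
// in bytes so the caller can see which path was taken.
inline std::size_t copy_dispatch(float* dst, const float* src, std::size_t n) {
    auto aligned = [](const void* p, std::size_t a) {
        return reinterpret_cast<std::uintptr_t>(p) % a == 0;
    };
    if (aligned(dst, 16) && aligned(src, 16)) {
        copy_vec<16>(dst, src, n);
        return 16;
    }
    if (aligned(dst, 8) && aligned(src, 8)) {
        copy_vec<8>(dst, src, n);
        return 8;
    }
    copy_vec<4>(dst, src, n);
    return 4;
}
```

The same pattern carries over to device code in principle: each branch would call a copy whose vectorization is decided statically, and only the choice between branches depends on runtime values.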