Replies: 2 comments 5 replies
I am open to discussing extending the tuple protocol to more types. I am a bit wary of the `#pragma unroll` suggestion, because it is a nontrivial change to the code and might also affect other things, e.g. binary size.
This makes a lot of sense. Considering also that the maximum access size exposed in CUDA 12.9 - PTX is 32 bytes, so even
There are situations where the compiler doesn't actually unroll a loop, even with
I just saw #4674 from @davebayer introducing the tuple protocol for vector types which sounds great for generic/templated (tunable) device code.
Are there any plans to support vector types that are not part of CUDA because they have more than 4 elements, but that would still fit within vectorized loads, e.g. `char16`? Using the tuple protocol, these should fit in nicely.

Also, for generic code using the tuple protocol, an unrolled loop template like https://stackoverflow.com/a/46873787/10107454 makes sense (as it gives actual `constexpr` loop indices, unlike `#pragma unroll`), so maybe that could also be part of libcu++'s extended API?
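To make the request a bit more concrete, here is a rough sketch of what I have in mind. Everything in it is an assumption for illustration: the `char16` struct, its `get` overloads, and the `static_for` and `sum_elements` helpers are hypothetical names, not existing CUDA or libcu++ APIs, and the loop helper is only written in the spirit of the linked answer. It is plain host C++ (C++14); device code would additionally need `__host__ __device__` annotations.

```cpp
#include <cstddef>
#include <tuple>
#include <type_traits>
#include <utility>

// Hypothetical 16-element vector type (NOT part of CUDA or libcu++).
// 16 bytes with 16-byte alignment, so it could map onto one vectorized access.
struct alignas(16) char16 {
    signed char v[16];
};

// Tuple protocol for the hypothetical type: get<I>, tuple_size, tuple_element.
template <std::size_t I>
constexpr signed char& get(char16& c) noexcept {
    static_assert(I < 16, "index out of range");
    return c.v[I];
}

template <std::size_t I>
constexpr const signed char& get(const char16& c) noexcept {
    static_assert(I < 16, "index out of range");
    return c.v[I];
}

namespace std {
template <>
struct tuple_size<char16> : integral_constant<size_t, 16> {};

template <size_t I>
struct tuple_element<I, char16> {
    using type = signed char;
};
}  // namespace std

// Hypothetical unrolled loop: calls f(std::integral_constant<std::size_t, I>{})
// for every I in [0, N), so the index is usable as a template argument.
template <class F, std::size_t... Is>
constexpr void static_for_impl(F&& f, std::index_sequence<Is...>) {
    // Array-initializer pack expansion keeps this valid in C++14;
    // the leading 0 covers the N == 0 case.
    int expand[] = {0, (f(std::integral_constant<std::size_t, Is>{}), 0)...};
    (void)expand;
}

template <std::size_t N, class F>
constexpr void static_for(F&& f) {
    static_for_impl(std::forward<F>(f), std::make_index_sequence<N>{});
}

// Generic code written only against the tuple protocol.
template <class T>
int sum_elements(const T& t) {
    int total = 0;
    static_for<std::tuple_size<T>::value>([&](auto i) {
        // decltype(i)::value is a compile-time constant, which a runtime loop
        // counter under #pragma unroll cannot provide.
        total += get<decltype(i)::value>(t);
    });
    return total;
}
```

The point of passing an `integral_constant` rather than a plain `int` is that `get<I>` needs its index at compile time, so a generic function like `sum_elements` cannot be written with an ordinary loop, whether or not the compiler ends up unrolling it.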