Why Does Auto-Vectorizing Copy Use ld.global.v2.u64 Instead of ld.global.v4.b32 #2552

N1GHTR0 · 2025-08-07T15:51:01Z


Hi, I have a question regarding vectorized loads in copy().

I have a float register tensor (Trm) with shape 4 and stride 1, and a float global memory tensor (Tgm), also with shape 4 and stride 1. When I load from global memory into registers with copy(Tgm, Trm) (source first, destination second), the generated PTX uses:

ld.global.v2.u64

I've been reading through copy.hpp, especially trying to understand the logic behind AutoVectorizingCopyWithAssumedAlignment. However, I'm still not sure why the code generation ends up choosing ld.global.v2.u64 instead of, say, ld.global.v4.b32.
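For readers following along, here is a rough host-side model of how a vector width can be chosen purely from compile-time facts. It is only a sketch under stated assumptions: `vector_bits` is a hypothetical helper, not a CuTe function, and the rule it encodes (the widest power-of-two access that fits the contiguous extent, the assumed alignment, and a 128-bit cap) is a simplification rather than the actual logic in copy.hpp.

```cpp
#include <cstddef>

// Hypothetical helper (NOT part of CuTe): pick the widest access, in bits, that
//  - does not exceed the contiguous extent (elem_bits * contiguous_elems),
//  - does not exceed the alignment the copy is allowed to assume,
//  - does not exceed max_vec_bits, and
//  - evenly tiles the contiguous extent.
// Assumes elem_bits is a power of two.
constexpr std::size_t vector_bits(std::size_t elem_bits,
                                  std::size_t contiguous_elems,
                                  std::size_t assumed_align_bits,
                                  std::size_t max_vec_bits = 128) {
  const std::size_t total = elem_bits * contiguous_elems;
  std::size_t limit = total;
  if (assumed_align_bits < limit) limit = assumed_align_bits;
  if (max_vec_bits < limit) limit = max_vec_bits;

  std::size_t bits = elem_bits;
  // Keep doubling while the wider access still fits and still tiles the extent.
  while (bits * 2 <= limit && total % (bits * 2) == 0) {
    bits *= 2;
  }
  return bits;
}

// Four contiguous floats with 128-bit assumed alignment: one 128-bit access.
static_assert(vector_bits(32, 4, 128) == 128, "expected a single 128-bit access");
// Same data, but only element (32-bit) alignment may be assumed: scalar accesses.
static_assert(vector_bits(32, 4, 32) == 32, "expected 32-bit accesses");
```

Under this model the case in the question resolves to a single 128-bit access. The model says nothing about how the compiler spells that access in PTX (v2.u64 versus v4.b32).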

What I’d like to ask is:

Is there a way to force the copy() to use ld.global.v4.b32 instead — perhaps by passing a custom CopyAtom or another object?

Is this decision made entirely at compile-time, or is there something in the runtime tensor metadata that affects it?

Is there any performance difference between ld.global.v2.u64 and ld.global.v4.b32? They both read 128 bits, but are there cases where one would be preferable?

Thanks in advance!

Answered by thakkarV


thakkarV (Collaborator) · 2025-08-12T15:02:42Z

Is this decision made entirely at compile-time, or is there something in the runtime tensor metadata that affects it?

No, the decision is fully static (made at compile time). You can, however, add branches in your program to dispatch to different copy loops.
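As a host-side analogy for "branches that dispatch to different copy loops", the sketch below checks alignment at runtime and then calls one of several loops whose chunk width is a template parameter. The names `copy_chunks` and `copy_dispatch` are hypothetical and nothing here uses CuTe. In a real kernel the branches would presumably select between copy calls with different compile-time alignment assumptions.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Copy n floats in fixed-size chunks. VecBytes is a compile-time constant, so
// each instantiation is a separate, statically specialized loop.
// Precondition: n * sizeof(float) is a multiple of VecBytes.
template <std::size_t VecBytes>
void copy_chunks(const float* src, float* dst, std::size_t n) {
  static_assert(VecBytes % sizeof(float) == 0, "chunk must hold whole floats");
  const std::size_t bytes = n * sizeof(float);
  const char* s = reinterpret_cast<const char*>(src);
  char* d = reinterpret_cast<char*>(dst);
  for (std::size_t off = 0; off + VecBytes <= bytes; off += VecBytes) {
    std::memcpy(d + off, s + off, VecBytes);
  }
}

// Runtime branch that picks one of the statically specialized loops.
// Returns the chunk width in bytes that was selected.
std::size_t copy_dispatch(const float* src, float* dst, std::size_t n) {
  const std::uintptr_t s = reinterpret_cast<std::uintptr_t>(src);
  const std::uintptr_t d = reinterpret_cast<std::uintptr_t>(dst);
  const std::size_t bytes = n * sizeof(float);
  // OR-ing the values lets one test cover both addresses and the byte count.
  const std::uintptr_t combined = s | d | bytes;
  if (combined % 16 == 0) { copy_chunks<16>(src, dst, n); return 16; }
  if (combined % 8 == 0)  { copy_chunks<8>(src, dst, n);  return 8; }
  copy_chunks<4>(src, dst, n);
  return 4;
}
```

The runtime check only selects which loop runs. The width used inside each loop is still fixed when the program is compiled.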

Is there any performance difference between ld.global.v2.u64 and ld.global.v4.b32? They both read 128 bits, but are there cases where one would be preferable?

Not sure, but as long as they compile to the same SASS, it should not matter.
