Streamline group Hadamard ComputeKernel loads #2810
cael-ling wants to merge 13 commits into NVIDIA:main
Greptile Summary
This PR refactors the
Confidence Score: 5/5
This PR is safe to merge; the previously flagged stale-register bug is resolved and all execution paths are mathematically correct.
All three template-path combinations have been verified: the transposed path uses a correct direct load; the pre-RHT path correctly reuses transposed fragments for a max-abs reduction that is transpose-invariant; and the identity path always reloads fresh data, directly addressing the previously raised concern.
No new logic defects, data-integrity issues, or correctness risks are introduced.
No files require special attention.
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
Entry["ComputeKernel(b_frag_i, b_frag_t, in_sh_ptr, swizzle_idx, ...)"]
Entry --> T{kReturnTransposedAmax?}
T -- yes --> TL["ldmatrix_x4 transposed load → a_frag[0..3]"]
TL --> TMMA["MMA (transposed layout)\na_frag[0,2,1,3] × b_frag_t → c_frag\nupdate local_amax_t_reg"]
TMMA --> P
T -- no --> P{kReturnPreRhtAmax?}
P -- yes, Transposed ran --> PMax["max-abs reduction over a_frag[0..3]\n(transposed frags; result is transpose-invariant)\nupdate local_pre_rht_amax_reg"]
P -- yes, Transposed did NOT run --> PR["ldmatrix_x4 row-major load → a_frag[0..3]"]
PR --> PMax
PMax --> I
P -- no --> I{kReturnIdentityAmax?}
I -- yes --> IL["ldmatrix_x4 row-major load → a_frag[0..3]\n(unconditional fresh reload)"]
IL --> IMMA["MMA (identity layout)\na_frag[0..3] × b_frag_i → c_frag\nupdate local_amax_reg"]
IMMA --> Done["return"]
I -- no --> Done
Reviews (5): Last reviewed commit: "[pre-commit.ci] auto fixes from pre-comm..."
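The summary's claim that the pre-RHT max-abs reduction is transpose-invariant can be illustrated on the host. The following is a minimal sketch, not the kernel code: the 16x16 tile size, the `Tile` alias, and the helpers `transpose`, `max_abs`, and `make_example_tile` are all hypothetical names introduced only for this illustration.

```cpp
#include <algorithm>
#include <array>
#include <cmath>
#include <cstddef>

// Assumed tile size, chosen only for illustration.
constexpr std::size_t kDim = 16;
using Tile = std::array<std::array<float, kDim>, kDim>;

// Hypothetical host-side stand-in for "the fragments were loaded transposed".
Tile transpose(const Tile& in) {
  Tile out{};
  for (std::size_t r = 0; r < kDim; ++r) {
    for (std::size_t c = 0; c < kDim; ++c) {
      out[c][r] = in[r][c];
    }
  }
  return out;
}

// Max of |x| over every element; the traversal order does not matter
// because max is commutative and associative.
float max_abs(const Tile& t) {
  float m = 0.0f;
  for (const auto& row : t) {
    for (float v : row) {
      m = std::max(m, std::fabs(v));
    }
  }
  return m;
}

// Deterministic example data: element (r, c) holds r - 2c.
Tile make_example_tile() {
  Tile t{};
  for (std::size_t r = 0; r < kDim; ++r) {
    for (std::size_t c = 0; c < kDim; ++c) {
      t[r][c] = static_cast<float>(r) - 2.0f * static_cast<float>(c);
    }
  }
  return t;
}
```

Because a transpose only permutes element positions, `max_abs(transpose(t))` equals `max_abs(t)` for any tile, which is why a reduction over transposed fragments can stand in for one over row-major fragments.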
if (kReturnTransposedAmax || (!kReturnTransposedAmax && !kReturnPreRhtAmax)) {
  ldmatrix_x4_m8n8_shared_b16<false>(a_frag[0], a_frag[1], a_frag[2], a_frag[3],
                                     reinterpret_cast<uint4*>(in_sh_ptr) + swizzle_idx);
}
Stale a_frag used for identity MMA when only Pre-RHT + Identity are enabled
The condition on line 93 simplifies via Boolean algebra to kReturnTransposedAmax || !kReturnPreRhtAmax, which means the reload is skipped when kReturnTransposedAmax=false and kReturnPreRhtAmax=true.
In that code path the pre-RHT block just ran (lines 72–90) and left a_frag[0] and a_frag[2] overwritten with intermediate max-reduction results:
a_frag[0] ← max(|a_frag[0]|, |a_frag[1]|) // line 79
a_frag[2] ← max(|a_frag[2]|, |a_frag[3]|) // line 82
a_frag[0] ← max(|a_frag[0]|, |a_frag[2]|) // line 85
When the identity MMA then runs without a fresh load, it consumes these scalar amax values instead of the original matrix fragment data, producing an incorrect identity amax.
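The Boolean simplification and the register clobbering described above can be checked exhaustively on the host. This is a small sketch under stated assumptions: the function names are hypothetical, and plain `float` values stand in for the packed fragment registers, so only the control flow and the overwrite pattern are modeled.

```cpp
#include <algorithm>
#include <array>
#include <cmath>

// Guard exactly as written in the reviewed snippet.
constexpr bool guard_as_written(bool transposed, bool pre_rht) {
  return transposed || (!transposed && !pre_rht);
}

// Simplified form stated in the review comment.
constexpr bool guard_simplified(bool transposed, bool pre_rht) {
  return transposed || !pre_rht;
}

// Compare both forms over all four flag combinations.
constexpr bool guards_agree() {
  for (int t = 0; t < 2; ++t) {
    for (int p = 0; p < 2; ++p) {
      if (guard_as_written(t != 0, p != 0) != guard_simplified(t != 0, p != 0)) {
        return false;
      }
    }
  }
  return true;
}

static_assert(guards_agree(), "both guard forms must match on every input");
// The reload is skipped only for transposed = false, pre_rht = true.
static_assert(!guard_as_written(false, true), "reload skipped in this case");

// Hypothetical model of the in-place reduction quoted in the review:
// slots 0 and 2 are overwritten, so the original fragment data is lost.
std::array<float, 4> reduce_in_place(std::array<float, 4> a) {
  a[0] = std::max(std::fabs(a[0]), std::fabs(a[1]));
  a[2] = std::max(std::fabs(a[2]), std::fabs(a[3]));
  a[0] = std::max(std::fabs(a[0]), std::fabs(a[2]));
  return a;
}
```

For the input `{1, -5, 2, -7}`, slots 0 and 2 both end up holding 7 rather than the original 1 and 2, which is the stale data the identity MMA would consume if the reload were skipped.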
Because a fresh row-major load is required in every caller configuration (transposed data from the kReturnTransposedAmax branch is unusable, and kReturnPreRhtAmax corrupts registers regardless), the guard should be dropped and the load made unconditional:
- if (kReturnTransposedAmax || (!kReturnTransposedAmax && !kReturnPreRhtAmax)) {
-   ldmatrix_x4_m8n8_shared_b16<false>(a_frag[0], a_frag[1], a_frag[2], a_frag[3],
-                                      reinterpret_cast<uint4*>(in_sh_ptr) + swizzle_idx);
- }
+ ldmatrix_x4_m8n8_shared_b16<false>(a_frag[0], a_frag[1], a_frag[2], a_frag[3],
+                                    reinterpret_cast<uint4*>(in_sh_ptr) + swizzle_idx);
valid concern, can you take a look? @cael-ling
Updated kReturnIdentityAmax path: if it is true, perform one extra reload of the values to guarantee correct behavior. @zhongbozhu
Signed-off-by: Cael Ling <caell@nvidia.com>
for more information, see https://pre-commit.ci
…ranspose' into refactor/grp-hadamard-ldmatrix-transpose Made-with: Cursor
cc3e5f5 to d90ac01
}

if (kReturnIdentityAmax) {
  if (kReturnTransposedAmax || (!kReturnTransposedAmax && !kReturnPreRhtAmax)) {
the double if looks confusing
Let's not mix the three cases. We should be able to use constexpr to remove the overhead of the if, so handling each case separately, even with some duplicated code, is more readable.
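One way to read this suggestion is to give each flag its own `if constexpr` block, so that disabled paths are discarded at compile time. The sketch below is a host-side model, not the kernel: `compute_kernel_steps` and the step labels are hypothetical, and the function records which loads and reductions would run instead of issuing ldmatrix or MMA instructions. The ordering follows the flowchart in the review summary.

```cpp
#include <string>
#include <vector>

// Hypothetical host-side model of the control flow only.
// Each enabled flag contributes its own block; disabled blocks are
// removed at compile time by `if constexpr` (C++17).
template <bool kReturnTransposedAmax, bool kReturnPreRhtAmax, bool kReturnIdentityAmax>
std::vector<std::string> compute_kernel_steps() {
  std::vector<std::string> steps;

  if constexpr (kReturnTransposedAmax) {
    steps.push_back("load_transposed");
    steps.push_back("mma_transposed");
  }

  if constexpr (kReturnPreRhtAmax) {
    // Max-abs is transpose-invariant, so fragments from the transposed
    // load can be reused; otherwise a row-major load is needed first.
    if constexpr (!kReturnTransposedAmax) {
      steps.push_back("load_row_major");
    }
    steps.push_back("max_abs_reduce");
  }

  if constexpr (kReturnIdentityAmax) {
    // Always reload: earlier blocks may have left transposed or
    // partially reduced data in the fragment registers.
    steps.push_back("load_row_major");
    steps.push_back("mma_identity");
  }

  return steps;
}
```

In this model, `compute_kernel_steps<false, true, true>()` yields a row-major load, the reduction, a second row-major load, and the identity MMA, which is the case the earlier review comment flagged.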
…com:cael-ling/TransformerEngine into refactor/grp-hadamard-ldmatrix-transpose
…:cael-ling/TransformerEngine into refactor/grp-hadamard-ldmatrix-transpose
The change has been applied to the variants: group_hadamard_transform.cu, hadamard_transform.cu, and graph_safe_group_hadamard_transform.cu.
Description
Superseded: This work has been rolled into #2820; please review that PR instead.
Reorders ComputeKernel to Transposed → Pre-RHT → Identity (per enabled kReturn* flags).
Transposed path: uses a transposed ldmatrix_x4_m8n8_shared_b16 load in place of a row-major load followed by four in-register transposes, feeding the same WMMA operand pattern. This reduces instruction count and warp-synchronous work on the hot path, improving performance.