[Common] Reduced padding kernel compilation time by Oleg-Goncharov · Pull Request #2827 · NVIDIA/TransformerEngine

Oleg-Goncharov · 2026-04-02T14:56:51Z

Description

This PR reduces the compilation time of padding.cu from approximately 600 seconds to 15 seconds by lowering the outer-loop unroll factor.

Type of change

Documentation change (change only to the documentation, either a fix or a new content)
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Infra/Build change
Code refactoring

Changes

Please list the changes introduced in this PR:

Changed the outer #pragma unroll directive from a full unroll (effectively 8 iterations) to #pragma unroll 4.
Reduced compile-time overhead in the padding kernel.

Checklist:

I have read and followed the contributing guidelines
The functionality is complete
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

greptile-apps · 2026-04-02T14:58:17Z

Greptile Summary

This PR reduces padding.cu compilation time from ~600s to ~15s by changing #pragma unroll (full unroll) to #pragma unroll 4 in the outer loop of both multi_padding_kernel and multi_unpadding_kernel. Since n_iterations is a compile-time constant of 8 (THREADS_PER_WARP / n_warps_per_tile = 32 / 4), the original bare directive caused the compiler to fully unroll all 8 iterations across every template instantiation in TRANSFORMER_ENGINE_TYPE_SWITCH_ALL, producing massive code and very long compile times. Halving the unroll factor to 4 is a sensible trade-off with negligible runtime impact on these memory-bandwidth-bound kernels.
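The arithmetic in this summary can be sketched on the host side. The block below is a minimal C++ model, not the code from padding.cu: `pad_rows`, its signature, and the row-major zero-padding behaviour are assumptions made for illustration. Only `THREADS_PER_WARP`, `n_warps_per_tile`, `n_iterations`, and the `#pragma unroll 4` hint are taken from the summary.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Constants as quoted in the review summary: 32 / 4 = 8.
constexpr int THREADS_PER_WARP = 32;
constexpr int n_warps_per_tile = 4;
constexpr int n_iterations = THREADS_PER_WARP / n_warps_per_tile;

static_assert(n_iterations == 8, "trip count is a compile-time constant of 8");
static_assert(n_iterations % 4 == 0, "an unroll factor of 4 divides the trip count evenly");

// Hypothetical host-side stand-in for the kernel's outer loop: copies a
// row-major rows x cols matrix into a zero-filled padded_rows x cols buffer.
std::vector<float> pad_rows(const std::vector<float>& src, std::size_t rows,
                            std::size_t cols, std::size_t padded_rows) {
  assert(src.size() == rows * cols);
  assert(padded_rows >= rows);
  std::vector<float> dst(padded_rows * cols, 0.0f);

  for (std::size_t tile_row = 0; tile_row < rows; tile_row += n_iterations) {
    // Partial unroll hint: the 8 iterations are emitted as 2 groups of 4
    // instead of 8 fully expanded copies of the loop body.
#if defined(__CUDACC__) || defined(__clang__)
#pragma unroll 4
#elif defined(__GNUC__)
#pragma GCC unroll 4
#endif
    for (int iter = 0; iter < n_iterations; ++iter) {
      const std::size_t row = tile_row + static_cast<std::size_t>(iter);
      if (row < rows) {
        for (std::size_t col = 0; col < cols; ++col) {
          dst[row * cols + col] = src[row * cols + col];
        }
      }
    }
  }
  return dst;
}
```

For instance, calling `pad_rows(src, 3, 2, 16)` on a 3x2 input yields a 16x2 buffer whose first three rows match the input and whose remaining rows are zero. The unroll hint affects only how the compiler emits the loop, not the result it computes.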

Confidence Score: 5/5

Safe to merge — minimal two-line change with no correctness risk and a large compile-time benefit.

n_iterations is a compile-time constant of 8, so #pragma unroll 4 produces two clean groups of 4 iterations with correct semantics. The kernels are memory-bandwidth-bound so the reduced unroll factor has negligible runtime impact. No logic, data, or API changes were made.

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| transformer_engine/common/util/padding.cu | Changed `#pragma unroll` to `#pragma unroll 4` in both `multi_padding_kernel` and `multi_unpadding_kernel`; reduces compile time ~40x with minimal expected runtime impact. |

Reviews (1): Last reviewed commit: "Reduced padding kernel compilation time"

Reduced padding kernel compilation time

0136e94

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

Oleg-Goncharov requested a review from ptrendx April 2, 2026 14:57

Oleg-Goncharov wants to merge 1 commit into NVIDIA:main from Oleg-Goncharov:pr_reduced_padding_kernel_compilation
