[Common] Reduced padding kernel compilation time #2827

Open
Oleg-Goncharov wants to merge 1 commit into NVIDIA:main from Oleg-Goncharov:pr_reduced_padding_kernel_compilation

@Oleg-Goncharov
Collaborator

Description

This PR reduces the compilation time of padding.cu from approximately 600 seconds to 15 seconds by lowering the outer-loop unroll factor.

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Lowered the unroll factor of the outer loop's #pragma unroll directive from 8 to 4.
  • Reduced compile-time overhead in the padding kernel.

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
@Oleg-Goncharov Oleg-Goncharov requested a review from ptrendx April 2, 2026 14:57
@greptile-apps
Contributor

greptile-apps bot commented Apr 2, 2026

Greptile Summary

This PR reduces padding.cu compilation time from ~600s to ~15s by changing #pragma unroll (full unroll) to #pragma unroll 4 in the outer loop of both multi_padding_kernel and multi_unpadding_kernel. Since n_iterations is a compile-time constant of 8 (THREADS_PER_WARP / n_warps_per_tile = 32 / 4), the original bare directive caused the compiler to fully unroll all 8 iterations across every template instantiation in TRANSFORMER_ENGINE_TYPE_SWITCH_ALL, producing massive code and very long compile times. Halving the unroll factor to 4 is a sensible trade-off with negligible runtime impact on these memory-bandwidth-bound kernels.

Confidence Score: 5/5

Safe to merge — minimal two-line change with no correctness risk and a large compile-time benefit.

n_iterations is a compile-time constant of 8, so #pragma unroll 4 produces two clean groups of 4 iterations with correct semantics. The kernels are memory-bandwidth-bound so the reduced unroll factor has negligible runtime impact. No logic, data, or API changes were made.
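
As a rough illustration of the "two groups of 4" claim, the sketch below hand-expands what an unroll factor of 4 asks the compiler to do for a loop with a constant trip count of 8. It is plain host-side C++, not the code from padding.cu: the loop body, the function names sum_rolled and sum_unrolled_by_4, and the lower-case constant names are assumptions made for illustration; only the 32 / 4 = 8 arithmetic comes from the summary above.

```cpp
#include <array>

// Hypothetical stand-ins mirroring the arithmetic quoted in the review summary.
// They are not copied from padding.cu.
constexpr int threads_per_warp = 32;
constexpr int n_warps_per_tile = 4;
constexpr int n_iterations = threads_per_warp / n_warps_per_tile;  // 8

static_assert(n_iterations == 8, "trip count is known at compile time");
static_assert(n_iterations % 4 == 0, "factor 4 divides the trip count, so no remainder loop");

// Reference: the loop as written in source, one body per iteration.
inline int sum_rolled(const std::array<int, n_iterations>& v) {
  int acc = 0;
  for (int i = 0; i < n_iterations; ++i) {
    acc += v[i] * (i + 1);  // placeholder body, assumed for illustration
  }
  return acc;
}

// Hand-written equivalent of unrolling by 4: the body is replicated four
// times and the loop itself runs 8 / 4 = 2 times.
inline int sum_unrolled_by_4(const std::array<int, n_iterations>& v) {
  int acc = 0;
  for (int i = 0; i < n_iterations; i += 4) {
    acc += v[i] * (i + 1);
    acc += v[i + 1] * (i + 2);
    acc += v[i + 2] * (i + 3);
    acc += v[i + 3] * (i + 4);
  }
  return acc;
}
```

Both functions visit the same eight indices in the same order, so they return the same value. The difference is only in how many copies of the body the compiler has to generate and optimize: four per instantiation instead of eight under a full unroll.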

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| transformer_engine/common/util/padding.cu | Changed #pragma unroll to #pragma unroll 4 in both multi_padding_kernel and multi_unpadding_kernel; reduces compile time ~40x with minimal expected runtime impact. |

Reviews (1): Last reviewed commit: "Reduced padding kernel compilation time"
