GEMM + SwiGLU fused Grouped MLP for MXFP8 #2769
Conversation
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Greptile Summary

This PR introduces a fused GEMM + SwiGLU kernel for MXFP8 Grouped MLP on SM100 (Blackwell) GPUs, using CuTe DSL kernels from the cuDNN front-end. It adds … Many correctness concerns (non-contiguous …) remain.

Confidence Score: 2/5. Not safe to merge: multiple prior-round P1 correctness issues remain present in the current head with no fix. Several confirmed runtime-crash or silent-wrong-result bugs flagged in earlier review rounds remain unaddressed:

1. `fc1_dgrad_kernel_out[d_tensor].view(in_shape)` will raise `RuntimeError` because the quant kernel output requires `.permute(2, 0, 1)` before `.view()`.
2. `grouped_fc1_dy` data fields are 2-D and the scales are unpermuted, producing wrong wgrad.
3. `mark_grouped_tensor` hard-asserts `columnwise_data is not None`, crashing on frozen-weight + grad-input.
4. `overwrite_main_grad=True` + `single_grouped_weight=True` silently writes wgrad to a scratch buffer, never updating `main_grad`.

`transformer_engine/pytorch/ops/fused/backward_grouped_mlp.py` (lines 504-516, 611) and `transformer_engine/pytorch/utils.py` (`mark_grouped_tensor`) need the most attention before merge.
Sequence Diagram

```mermaid
sequenceDiagram
    participant Input
    participant FC1_GLU as FC1 GLU Kernel
    participant FC2_QUANT as FC2 QUANT Kernel
    participant Output
    participant FC2_DGLU as FC2 dGLU Kernel
    participant FC1_DGRAD as FC1 dGrad Kernel
    Note over Input,Output: FORWARD PASS
    Input->>FC1_GLU: a_tensor (MXFP8), b_tensor (weight), prob_tensor (scales), bias_tensor
    FC1_GLU-->>FC2_QUANT: d_tensor (MXFP8 act), sfd_row_tensor
    FC1_GLU-->>FC1_GLU: c_tensor (swiglu_in bf16), d_col_tensor, sfd_col_tensor
    FC2_QUANT-->>Output: d_tensor permute(2,0,1).view()
    Note over Input,Output: BACKWARD PASS
    Output->>FC2_DGLU: a_tensor (dy MXFP8), c_tensor (swiglu_in), b_tensor (FC2 weight col)
    FC2_DGLU-->>FC1_DGRAD: d_row_tensor (fc1_dy), sfd_row_tensor
    FC2_DGLU-->>FC2_DGLU: dprob_tensor (grad_scales), dbias_tensor
    FC1_DGRAD-->>Input: d_tensor .view() needs permute(2,0,1) first
    FC2_DGLU-->>FC2_DGLU: wgrad via general_grouped_gemm
    FC1_DGRAD-->>FC1_DGRAD: wgrad via general_grouped_gemm
```
Resolved review thread (outdated): `transformer_engine/pytorch/tensor/storage/grouped_tensor_storage.py`
```python
fc1_x_data = grouped_fc1_x.rowwise_data.view(in_shape[0], in_shape[1])
fc1_x_data = fc1_x_data.view(dtype=torch.float8_e4m3fn)
fc1_x_data = fc1_x_data.unsqueeze(0).permute(1, 2, 0)
fc1_x_scales = grouped_fc1_x.scale_inv
fc1_x_scales = fc1_x_scales.view(dtype=torch.float8_e8m0fnu)
fc1_x_scales = fc1_x_scales.view(
    1,
    in_shape[0] // 128,
    in_shape[1] // 128,
    32,
    4,
    4,
)
fc1_x_scales = fc1_x_scales.permute(3, 4, 1, 5, 2, 0)
```
No validation that total token count is divisible by 128
The scale tensor view at lines 272-279 uses integer division `in_shape[0] // 128` to reshape the MXFP8 scale buffer. If `in_shape[0]` (i.e., `sum(split_sizes)`) is not divisible by 128, the view shape product will not match the actual buffer size, and the code will either produce incorrect behavior (wrong permute dimensions) or raise a runtime error with a confusing message.
The constructor checks that in_features % 256 == 0 and out_features % 256 == 0, but nothing validates that the token dimension sum(split_sizes) is divisible by 128 (required by the MXFP8 block-scaling layout). A user passing split sizes like [64, 65] would hit this silently.
The same assumption appears in the backward pass at backward_grouped_mlp.py lines 243–250.
Consider adding a guard before the view:

```python
if in_shape[0] % 128 != 0:
    raise ValueError(
        f"Total token count must be divisible by 128 for MXFP8 fused kernel, "
        f"but got sum(split_sizes)={in_shape[0]}."
    )
ptrendx left a comment:
Not yet done with the full review, but cursory glance shows some leftover debugging code and some other random things that should be cleaned up.
```cpp
Tensor setup_ws("setup_ws", std::vector<size_t>{setup_ws_bytes}, DType::kByte);
Tensor cublas_ws("cublas_ws", std::vector<size_t>{cublas_ws_bytes}, DType::kByte);

nvte_grouped_gemm_with_discrete_out(grouped_A.get_handle(),
```
Not a fan of this name, but it was added in another PR, so not a problem here.
transformer_engine/common/include/transformer_engine/transformer_engine.h
```python
        self._apply_delay_wgrad_param_hooks()

    def _apply_delay_wgrad_param_hooks(self) -> None:
        """Set ``skip_backward_post_hook`` on weights when delaying wgrad (bias uses main backward)."""
        if not self.wgrad_store.delay_wgrad_compute():
            return
        if self.single_grouped_weight:
            self.weight.skip_backward_post_hook = True
        else:
            for group_idx in range(self.num_groups):
                getattr(self, f"weight{group_idx}").skip_backward_post_hook = True
```
AttributeError when device="meta" + single_grouped_weight=True + delay_wgrad_compute=True
When device.type == "meta", line 168 skips reset_parameters(), so make_grouped_weights() is never called and self.weight is never registered. _apply_delay_wgrad_param_hooks() is then called unconditionally at line 173 and immediately accesses self.weight (line 180) → AttributeError at construction time.
This breaks any framework (e.g. Megatron-LM) that initialises models on device="meta" with delayed wgrad and single grouped weights. A simple guard fixes it:
Suggested change (leave `_apply_delay_wgrad_param_hooks` itself unchanged; only the call site in `__init__` gains the check):

```python
if device.type != "meta":
    self._apply_delay_wgrad_param_hooks()
```
pre_first_fuser_forward will call reset_parameters() (and re-invoke _apply_delay_wgrad_param_hooks) once real device params are materialised.
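A minimal stand-alone repro of the construction-order issue (toy stand-ins only: `SimpleNamespace` replaces the real grouped weight, and the class is not TE code):

```python
from types import SimpleNamespace


class FusedGroupedMLPSketch:
    """Toy stand-in for the grouped-MLP construction path (not TE code)."""

    def __init__(self, device_type: str, delay_wgrad: bool = True):
        if device_type != "meta":
            # reset_parameters() -> make_grouped_weights() is skipped on meta,
            # so self.weight is never registered there.
            self.weight = SimpleNamespace()
        self.delay_wgrad = delay_wgrad
        if device_type != "meta":  # the proposed guard
            self._apply_delay_wgrad_param_hooks()

    def _apply_delay_wgrad_param_hooks(self) -> None:
        if not self.delay_wgrad:
            return
        # Without the guard in __init__, this line raises AttributeError on
        # meta devices because self.weight does not exist yet.
        self.weight.skip_backward_post_hook = True


FusedGroupedMLPSketch("meta")  # constructs cleanly with the guard in place
m = FusedGroupedMLPSketch("cuda")
assert m.weight.skip_backward_post_hook is True
```

Removing the `device_type != "meta"` check around the call reproduces the `AttributeError` at construction time.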
```python
fc1_dy_tensor_offsets = fc1_ctx.base_split_offsets * fc1_weight_shape[0]
grouped_fc1_dy = GroupedTensor(
    shape=(out_shape[0], fc1_weight_shape[0]),
    dtype=dtype,
    num_tensors=num_groups,
    quantizer=fc1_ctx.grad_output_quantizer,
    data=fc1_dy_row_data,
    columnwise_data=fc1_dy_col_data,
    scale_inv=fc1_dy_row_scale,
    columnwise_scale_inv=fc1_dy_col_scale,
    first_dims=split_sizes,
    tensor_offsets=fc1_dy_tensor_offsets,
    with_gemm_swizzled_scales=True,
)
```
grouped_fc1_dy data tensors not flattened — inconsistency with forward may produce incorrect wgrad
data=fc1_dy_row_data and columnwise_data=fc1_dy_col_data are passed as 2-D tensors (after .view(out_shape[0], fc1_weight_shape[0])), while GroupedTensorStorage expects 1-D flattened buffers (documented at grouped_tensor_storage.py line 44: "ALL data fields are stored as 1D flattened arrays"). The forward's equivalent construction for grouped_fc2_x explicitly flattens every field:
```python
# forward_grouped_mlp.py - what the forward does
data=fc2_in_row_data.reshape(-1),
columnwise_data=fc2_in_col_data.reshape(-1),
scale_inv=fc2_in_row_scale.reshape(-1),
columnwise_scale_inv=fc2_in_col_scale.reshape(-1),
```

The backward omits `.reshape(-1)` for both data tensors and omits the required `permute(5, 2, 4, 0, 1, 3)` + `reshape(-1)` for the scale tensors. When `general_grouped_gemm_for_grouped_tensor` indexes into the per-group data via element-level offsets, it may read from the wrong memory locations, producing silently wrong FC1 weight gradients.
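To illustrate the layout contract (a NumPy stand-in for torch; the shapes and the 6-D scale blocking are illustrative, not the PR's real dimensions), the storage expects every field flattened to 1-D, with scales permuted into the swizzle order before flattening:

```python
import numpy as np

M, N = 256, 512  # illustrative token count and hidden size (multiples of 128)

# Data fields: 2-D buffers must be flattened before GroupedTensor construction.
row_data = np.zeros((M, N), dtype=np.uint8)
flat_row_data = row_data.reshape(-1)  # what the forward does; the backward omits it

# Scale fields: viewed as 6-D blocks, permuted into the GEMM swizzle order
# quoted in this review thread, then flattened.
scales = np.arange((M // 128) * (N // 128) * 32 * 4 * 4, dtype=np.uint8)
scales_6d = scales.reshape(1, M // 128, N // 128, 32, 4, 4)
swizzled = scales_6d.transpose(5, 2, 4, 0, 1, 3).reshape(-1)

assert flat_row_data.ndim == 1 and swizzled.ndim == 1
assert swizzled.size == scales.size  # the swizzle permutes, never drops, elements
```

The point is purely structural: a GEMM that indexes these buffers by flat element offsets only lands on the right bytes if the producer flattened (and, for scales, swizzled) them the same way.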
```python
accumulate_into_main_grad = not getattr(weight_param, "overwrite_main_grad", False)
if accumulate_into_main_grad:
    grouped_wgrad = GroupedTensor.make_grouped_tensor_from_rowwise_data(
        num_tensors=num_groups,
        tensor_shape=weight_shape,
        rowwise_data=main_grad,
        dtype=main_grad.dtype,
    )

if grouped_wgrad is None:
    grouped_wgrad = GroupedTensor.make_grouped_tensor_with_shapes(
        num_tensors=num_groups,
        shapes=[weight_shape] * num_groups,
        quantizer=None,
        device=device,
        dtype=dtype,
    )
```
overwrite_main_grad=True + single_grouped_weight=True silently drops wgrad into a scratch buffer
When weight_param.overwrite_main_grad is True, accumulate_into_main_grad is set to False (line 97). Because the if accumulate_into_main_grad: branch is skipped, grouped_wgrad remains None and the fallback at line 107 allocates a new scratch buffer entirely unrelated to main_grad. The GEMM writes the weight gradient into this temporary buffer, which is then discarded — main_grad is never updated.
Compare with the single_grouped_weight=False path (lines 116–128): w_list[idx] = wp.main_grad is set unconditionally before the accumulate_into_main_grad determination, so the GEMM always targets main_grad regardless of overwrite_main_grad.
The fix is to populate `grouped_wgrad` from `main_grad` before computing `accumulate_into_main_grad`:

```python
# single_grouped_weight path, before the accumulate flag computation:
grouped_wgrad = GroupedTensor.make_grouped_tensor_from_rowwise_data(
    num_tensors=num_groups,
    tensor_shape=weight_shape,
    rowwise_data=main_grad,
    dtype=main_grad.dtype,
)
accumulate_into_main_grad = not getattr(weight_param, "overwrite_main_grad", False)
```

Comments have been addressed and CI is green now.
Squashed commit history:

* GEMM + Swiglu fused Grouped MLP for MXFP8
* cleanup/lint
* Properly cache the alpha tensor
* nD dummy grad
* [pre-commit.ci] auto fixes from pre-commit.com hooks
* 0 tokens in entire rank
* tmp downgrade cublas version check
* delayed wgrad tests pass for basic gl
* merge everything
* Rebase into fused_mxfp8_grouped_mlp; unit tests for delayed wgrad working
* Fix tests being skipped for fusible ops
* Integrate mxfp8 dbias kernel in group_quantize
* Add bias/dbias fused support with cute GEMMs
* Check bias/dbias support
* Pack biases more efficiently
* GroupedTensor for biases to avoid concat
* format
* Support 1D grouped tensor shape for bias and fix checkpointing
* Fixes and tests
* Refactor grouped tensor marking for paged stashing
* Remove setting logical_shape in mark_grouped_tensor
* Cleanup logical_shape
* pass the tests for now
* address some review comments
* address review comments
* more cleanups
* cleanup
* refactor wgrad logic
* Rename argument from single_grouped_parameter to single_grouped_weight
* Check wgrad store context is not empty for 0 token case
* Test only checks for fusion if fused kernel is available
* fix the tolerance to be of bf16 for the cute gemm
* Update transformer_engine/pytorch/ops/fused/forward_grouped_mlp.py
* address further review comments
* address more review comments
* address more review comments + test for zero grouped tensor work case
* cublaslt remove zero work gemm avoidance
* fix the wgrad test
* split dbias functionality from gq api
* Format and lint
* port fixes and add better doc for page stashing war
* Guard fusion via env
* Change to trigger CI; Remove unnecessary blank line in docstring
* To retrigger CI
* Space to trigger the pipeline
* fix zero work cublas gemm

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Varun Thumbe <vthumbe@nvidia.com>
Co-authored-by: Vasudevan Rengasamy <vrengasamy@nvidia.com>
Co-authored-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Description
Please include a brief summary of the changes, relevant motivation and context.
Fixes # (issue)
Type of change
Changes
Please list the changes introduced in this PR:
Checklist: