
Conversation

@zufangzhu zufangzhu (Collaborator) commented Oct 10, 2025

Add oneDNN W4A16 GEMM and unit tests.
Please see the corresponding vLLM change in https://github.com/intel-sandbox/vllm-xpu/pull/362/files

Signed-off-by: Zhu, Zufang <[email protected]>
Copilot AI left a comment

Pull Request Overview

This PR implements OneDNN backend support for W4A16 int4 and FP8 quantized matrix multiplication operations. It migrates and consolidates quantization functionality from IPEX while maintaining a unified interface.

Key changes:

  • Add int4 W4A16 GEMM operation with support for symmetric/asymmetric quantization and group quantization (see the reference sketch after this list)
  • Add FP8 W8A16 GEMM operation supporting multiple FP8 formats (e4m3fn, e5m2)
  • Refactor OneDNN type mappers and bias handling utilities for better code reuse
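
For reference, here is a minimal sketch of the W4A16 math the int4 path implements: dequantize the weight per group, then run the GEMM in higher precision. The function name, the unpacked int8 weight layout, and the default group size are illustrative only; the fused op added by this PR is registered through tests/register_ops.py and its exact signature is not reproduced here.

```python
import torch


def ref_w4a16_gemm(x, qweight, scales, zeros=None, group_size=128):
    """Reference W4A16: dequantize group-quantized int4 weights, then GEMM.

    x:       [M, K] fp16 activations
    qweight: [K, N] int4 values stored unpacked as int8 (illustrative layout)
    scales:  [K // group_size, N] per-group scales
    zeros:   [K // group_size, N] per-group zero points, or None for symmetric
    """
    K, N = qweight.shape
    w = qweight.to(torch.float32).reshape(K // group_size, group_size, N)
    if zeros is not None:                              # asymmetric quantization
        w = w - zeros.to(torch.float32).unsqueeze(1)
    w = w * scales.to(torch.float32).unsqueeze(1)      # apply per-group scale
    w = w.reshape(K, N)
    out = x.to(torch.float32) @ w                      # reference GEMM in fp32
    return out.to(x.dtype)
```

A unit test would typically compare the fused kernel's fp16 output against such a reference within a small tolerance.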

Reviewed Changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.

File | Description
tests/test_int4_gemm_onednn.py | Test suite for int4 W4A16 GEMM with various quantization modes and activation ordering
tests/test_fp8_gemm_onednn.py | Test suite for FP8 W8A16 GEMM with different data types and tensor layouts
tests/register_ops.py | Python bindings for new int4 and FP8 GEMM operations
csrc/xpu/torch_bindings.cpp | C++ torch binding registration for int4 GEMM operation
csrc/xpu/ops.h | Function declaration for int4 GEMM operation
csrc/xpu/onednn/onednn_matmul.cpp | Main implementation of FP8 and int4 GEMM operations with tensor validation
csrc/xpu/onednn/onednn_ext.h | Refactored type mappers and bias utilities to support 3-tuple returns and consolidated bias handling
csrc/xpu/onednn/int4_gemm_w4a16.h | OneDNN-specific implementation for int4 W4A16 matrix multiplication
csrc/xpu/onednn/fp8_gemm_w8a16.h | Simplified FP8 W8A16 implementation using refactored bias utilities (see the reference sketch below)
csrc/xpu/onednn/fp8_gemm_w8a16.cpp | Removed standalone FP8 implementation (consolidated into onednn_matmul.cpp)
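
For reference, a minimal sketch of the W8A16 FP8 path: upcast the fp8 weight, apply its scale, and run the GEMM. The per-tensor weight scale and the helper name are assumptions for illustration and may not match the kernel's actual layout; the example needs a PyTorch build that ships the float8 dtypes.

```python
import torch


def ref_w8a16_fp8_gemm(x, w_fp8, w_scale, bias=None):
    """Reference W8A16: upcast an fp8 (e4m3fn or e5m2) weight, apply its
    per-tensor scale, and run the GEMM in fp32 as a reference."""
    w = w_fp8.to(torch.float32) * w_scale
    out = x.to(torch.float32) @ w
    if bias is not None:
        out = out + bias.to(torch.float32)
    return out.to(x.dtype)


# Example: quantize a random fp16 weight to e4m3fn with a per-tensor scale.
M, K, N = 4, 256, 128
w = torch.randn(K, N, dtype=torch.float16)
w_scale = w.abs().max().float() / 448.0        # 448 = max finite value of e4m3fn
w_fp8 = (w.float() / w_scale).to(torch.float8_e4m3fn)
x = torch.randn(M, K, dtype=torch.float16)
out = ref_w8a16_fp8_gemm(x, w_fp8, w_scale)    # [M, N] fp16
```

The same reference works for torch.float8_e5m2 by swapping the dtype and its quantization range.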


@zufangzhu zufangzhu force-pushed the zufang/onednn_w4a16_int4 branch from ffc81ab to 64cc20f on October 13, 2025 at 05:16
@zufangzhu zufangzhu added onednn and removed WIP labels Oct 13, 2025
@baodii baodii (Collaborator) left a comment

LGTM

```python
import torch.nn as nn


class GPTQShuffle(nn.Module):
```
A collaborator commented on this snippet:
Will we use this on the vLLM side? I feel this should be renamed to GPTQUtils and not inherit nn.Module.
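
For illustration, a sketch of what the suggested rename might look like: a plain helper class that does not inherit from nn.Module. The methods shown and their names are illustrative, not the PR's actual implementation.

```python
import torch


class GPTQUtils:
    """Plain helper for GPTQ weight preparation. It holds no parameters and
    has no forward(), so it does not need to inherit from nn.Module."""

    def __init__(self, bits: int = 4, group_size: int = 128):
        self.bits = bits
        self.group_size = group_size

    @staticmethod
    def act_order_perm(g_idx: torch.Tensor) -> torch.Tensor:
        # GPTQ activation ordering stores a group index per input channel;
        # sorting it gives the permutation that restores contiguous groups.
        return torch.argsort(g_idx)
```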

@zufangzhu zufangzhu (Author) replied:
Got it.

Signed-off-by: Zhu, Zufang <[email protected]>
@zufangzhu zufangzhu force-pushed the zufang/onednn_w4a16_int4 branch from af406ba to d2cf1fa on October 16, 2025 at 01:37
Signed-off-by: Zhu, Zufang <[email protected]>
