[OneDNN] Zufang/onednn w4a16 int4 #49
base: main
Conversation
Signed-off-by: Zhu, Zufang <[email protected]>
Pull Request Overview
This PR implements OneDNN backend support for W4A16 int4 and FP8 quantized matrix multiplication operations. It migrates and consolidates quantization functionality from IPEX while maintaining a unified interface.
Key changes:
- Add an int4 W4A16 GEMM operation with support for symmetric/asymmetric quantization and group quantization (reference semantics sketched below)
- Add an FP8 W8A16 GEMM operation supporting multiple FP8 formats (e4m3fn, e5m2); a reference sketch follows the file summary below
- Refactor OneDNN type mappers and bias-handling utilities for better code reuse
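For reference, here is a minimal sketch of what the W4A16 int4 semantics amount to, assuming unpacked int4 weights, per-group scales/zero points, and fp16 activations. The function name, argument names, and tensor layout are illustrative assumptions, not the op's actual signature:

```python
import torch

def ref_w4a16_gemm(x, qweight, scales, zeros, group_size, bias=None):
    """Dequantize int4 weights per group, then run a plain matmul.

    Hypothetical reference path; shapes and layout are assumptions:
      x:       [M, K] fp16 activations
      qweight: [K, N] int4 values stored unpacked as int8 in [0, 15]
      scales:  [K // group_size, N] per-group scales
      zeros:   [K // group_size, N] zero points (constant 8 for symmetric mode)
    """
    K, _ = qweight.shape
    group_idx = torch.arange(K, device=x.device) // group_size
    # Broadcast each group's scale/zero point over its rows along K.
    w = (qweight.float() - zeros.float()[group_idx]) * scales.float()[group_idx]
    out = x.float() @ w
    if bias is not None:
        out = out + bias.float()
    return out.to(x.dtype)
```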
Reviewed Changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.
File | Description |
---|---|
tests/test_int4_gemm_onednn.py | Test suite for int4 W4A16 GEMM with various quantization modes and activation ordering |
tests/test_fp8_gemm_onednn.py | Test suite for FP8 W8A16 GEMM with different data types and tensor layouts |
tests/register_ops.py | Python bindings for new int4 and FP8 GEMM operations |
csrc/xpu/torch_bindings.cpp | C++ torch binding registration for int4 GEMM operation |
csrc/xpu/ops.h | Function declaration for int4 GEMM operation |
csrc/xpu/onednn/onednn_matmul.cpp | Main implementation of FP8 and int4 GEMM operations with tensor validation |
csrc/xpu/onednn/onednn_ext.h | Refactored type mappers and bias utilities to support 3-tuple returns and consolidated bias handling |
csrc/xpu/onednn/int4_gemm_w4a16.h | OneDNN-specific implementation for int4 W4A16 matrix multiplication |
csrc/xpu/onednn/fp8_gemm_w8a16.h | Simplified FP8 W8A16 implementation using refactored bias utilities |
csrc/xpu/onednn/fp8_gemm_w8a16.cpp | Removed standalone FP8 implementation (consolidated into onednn_matmul.cpp) |
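For the FP8 W8A16 path exercised by tests/test_fp8_gemm_onednn.py, a comparable reference sketch (assumed names and layout, not the op's real signature) is to upcast the FP8 weights to the activation dtype, apply the weight scale, and run a standard matmul:

```python
import torch

def ref_w8a16_fp8_gemm(x, weight_fp8, weight_scale, bias=None):
    """Hypothetical reference path for FP8 W8A16 GEMM.

    Assumed shapes:
      x:            [M, K] fp16/bf16 activations
      weight_fp8:   [K, N] torch.float8_e4m3fn or torch.float8_e5m2 weights
      weight_scale: scalar (or broadcastable per-channel) dequantization scale
    """
    # Upcast FP8 weights to the activation dtype and rescale before the matmul.
    w = weight_fp8.to(x.dtype) * weight_scale
    out = x @ w
    if bias is not None:
        out = out + bias
    return out
```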
Force-pushed from ffc81ab to 64cc20f.
LGTM
import torch.nn as nn
...
class GPTQShuffle(nn.Module):
Will we use this on the vLLM side? I feel this should be renamed to GPTQUtils and should not inherit from nn.Module.
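To illustrate the suggestion, a minimal hypothetical sketch of the renamed helper without the nn.Module inheritance (the real class body is not visible in this diff, so the constructor arguments and method are placeholders):

```python
class GPTQUtils:
    """Plain helper class; no nn.Module inheritance or forward() needed."""

    def __init__(self, bits: int = 4, group_size: int = 128):
        # Hypothetical fields; the actual state depends on what
        # GPTQShuffle currently stores.
        self.bits = bits
        self.group_size = group_size

    def shuffle(self, qweight, g_idx):
        # Placeholder for the weight-reordering logic that GPTQShuffle
        # currently runs in its forward pass.
        raise NotImplementedError
```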
Got it.
Signed-off-by: Zhu, Zufang <[email protected]>
Force-pushed from af406ba to d2cf1fa.
Signed-off-by: Zhu, Zufang <[email protected]>
add onednn w4a16 gemm and ut
Please see the corresponding vLLM change in https://github.com/intel-sandbox/vllm-xpu/pull/362/files