add CPU FP8 QDQ doc #2240

24 changes: 19 additions & 5 deletions docs/source/3x/PT_FP8Quant.md

FP8 Quantization
=======

1. [Introduction](#introduction)
2. [Support Matrix](#support-matrix)
3. [Supported Parameters](#supported-parameters)
4. [Get Start with FP8 Quantization](#get-start-with-fp8-quantization)
5. [Optimum-habana LLM example](#optimum-habana-LLM-example)
6. [VLLM example](#VLLM-example)

## Introduction

Floating point 8 (FP8) is a promising data type for low precision quantization which …

Intel Gaudi2, also known as HPU, provides this data type capability for low precision quantization, which includes `E4M3` and `E5M2`. For more information about these two data types, please refer to [link](https://arxiv.org/abs/2209.05433).

To harness FP8 capabilities, which offer reduced memory usage and lower computational cost, Intel Neural Compressor provides general quantization APIs to generate FP8 models.
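
Below is a minimal sketch of that flow, assuming the `FP8Config`, `prepare`, and `convert` entry points from `neural_compressor.torch.quantization`; the toy model and calibration data are placeholders, and the complete walkthrough is in [Get Start with FP8 Quantization](#get-start-with-fp8-quantization).

```python
import torch
from neural_compressor.torch.quantization import FP8Config, prepare, convert

# Toy model used only for illustration.
model = torch.nn.Sequential(torch.nn.Linear(128, 256), torch.nn.ReLU())

config = FP8Config(fp8_config="E4M3")   # choose the FP8 format (E4M3 or E5M2)
model = prepare(model, config)          # insert observers / measurement hooks

# Calibration: run a few representative samples so scales can be measured.
with torch.no_grad():
    model(torch.randn(8, 128))

model = convert(model)                  # produce the FP8-quantized model
```

On Gaudi the model and calibration inputs are typically moved to the `hpu` device first; on CPU the same flow is expected to take the FP8 QDQ path described in the support matrix below.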

## Support Matrix

| Hardware | FP8 mode | FP8 QDQ mode |
| :------- |:--------|:---------|
| HPU | ✔ | ✔ |
| CPU | ✕ | ✔ |

In FP8 mode, all tensors are represented in FP8 format and kernels are explicitly replaced with their FP8 versions.

In FP8 QDQ mode, activations remain in high precision and quantize/dequantize (QDQ) pairs are inserted. Frameworks can then compile and fuse the operators of the FP8 QDQ model according to their own capabilities.

At runtime, Intel Neural Compressor detects the hardware automatically; HPU takes priority over CPU.
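
To make the QDQ behavior concrete, here is a small framework-level illustration (not the Intel Neural Compressor API) of what a single quant/dequant pair does numerically, using PyTorch's native `torch.float8_e4m3fn` dtype; the helper name `fp8_qdq` and the fixed scale are hypothetical.

```python
import torch

def fp8_qdq(t: torch.Tensor, scale: float = 1.0) -> torch.Tensor:
    """Quantize to FP8 (E4M3) and immediately dequantize back to the original dtype."""
    q = (t / scale).to(torch.float8_e4m3fn)  # quantize: values are rounded/clamped to FP8
    return q.to(t.dtype) * scale             # dequantize: compute stays in high precision

x = torch.randn(4, 4, dtype=torch.bfloat16)
y = fp8_qdq(x)
print((x - y).abs().max())                   # small error introduced by FP8 rounding
```

Because the output of such a pair is back in high precision, downstream kernels can run unchanged, and a compiler that recognizes the pattern may fuse it into a true FP8 kernel, which is what the framework-level fusion mentioned above refers to.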

## Supported Parameters
