What
We have developed several PTQ wrappers for transformer-based architectures (e.g., LLaMA, Fairseq layers). To broaden applicability, we should extend the PTQ framework to cover BERT and similar on-device optimized models (e.g., DistilBERT, MobileBERT, TinyBERT).
This involves implementing quantization-friendly wrappers that integrate smoothly into the existing PTQWrapper and QuantModuleBase design, while following the inference-focused principles we already applied (see the sketch below the list):
- Wrap linear/attention modules for PTQ
- (optional) Keep LayerNorm and non-linear ops in FP
- Maintain original I/O shapes and behavior for compatibility with HuggingFace/BERT implementations
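A minimal sketch of what such a wrapper could look like, written in plain PyTorch with a hypothetical `FakeQuantLinear` helper standing in for whatever `PTQWrapper` / `QuantModuleBase` provide in the actual framework; it wraps only the linear projections, leaves LayerNorm and non-linear ops in FP, and keeps module names and I/O shapes intact:

```python
import torch
import torch.nn as nn


class FakeQuantLinear(nn.Module):
    """Wraps an existing nn.Linear: collects activation ranges during
    calibration and simulates symmetric int8 quantization afterwards."""

    def __init__(self, linear: nn.Linear):
        super().__init__()
        self.linear = linear
        self.calibrating = True
        self.register_buffer("act_max", torch.zeros(1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.calibrating:
            # Track the running max of the input activation for scale estimation.
            self.act_max = torch.maximum(self.act_max, x.abs().max().detach())
            return self.linear(x)
        # Symmetric per-tensor fake quantization of the input activation.
        scale = (self.act_max / 127.0).clamp(min=1e-8)
        x_q = torch.clamp(torch.round(x / scale), -128, 127) * scale
        return self.linear(x_q)


def wrap_bert_layer_linears(layer: nn.Module) -> nn.Module:
    """Replace every nn.Linear inside a BertLayer with a FakeQuantLinear,
    preserving child module names so I/O shapes and state_dict keys line up."""
    for name, child in layer.named_children():
        if isinstance(child, nn.Linear):
            setattr(layer, name, FakeQuantLinear(child))
        else:
            wrap_bert_layer_linears(child)  # recurse; LayerNorm etc. stay in FP
    return layer
```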
Motivation
BERT-based models are still widely used in on-device NLP tasks such as classification, QA, and NLU. Providing ready-to-use wrappers for these models will demonstrate the generality of our PTQ framework. It also establishes a baseline for applying advanced PTQ algorithms (e.g., GPTQ, SmoothQuant) to BERT-family models.
Tasks
- Identify core modules in HuggingFace BERT (e.g., BertSelfAttention, BertIntermediate, BertOutput) to be wrapped.
- Implement QuantBertAttention, QuantBertFeedForward, and QuantBertLayer wrappers.
- Ensure compatibility with QuantConfig and calibration flows.
- Add unit tests comparing FP32 vs. PTQ outputs (see the sketch after this list).
- Provide an example script: PTQ BERT → evaluate on a downstream task (e.g., a GLUE subset).
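As a starting point for the unit-test task, a hedged sketch that compares FP32 vs. PTQ outputs on a single HuggingFace `BertLayer`. It reuses the hypothetical `FakeQuantLinear` / `wrap_bert_layer_linears` helpers sketched above; the real test would go through `PTQWrapper` and `QuantConfig` instead:

```python
import copy

import torch
from transformers import BertConfig
from transformers.models.bert.modeling_bert import BertLayer


def test_bert_layer_ptq_matches_fp32():
    torch.manual_seed(0)
    config = BertConfig(hidden_size=64, num_attention_heads=4, intermediate_size=128)
    fp32_layer = BertLayer(config).eval()
    # PTQ copy: same weights, with every nn.Linear wrapped for fake quantization.
    ptq_layer = wrap_bert_layer_linears(copy.deepcopy(fp32_layer)).eval()

    hidden = torch.randn(2, 8, config.hidden_size)

    # Calibration pass: observers record activation ranges, outputs stay FP32.
    with torch.no_grad():
        ptq_layer(hidden)
    # Switch from calibration to (fake-)quantized inference.
    for m in ptq_layer.modules():
        if isinstance(m, FakeQuantLinear):
            m.calibrating = False

    with torch.no_grad():
        ref = fp32_layer(hidden)[0]
        quant = ptq_layer(hidden)[0]

    # Shapes must match exactly; the error tolerance below is illustrative only.
    assert ref.shape == quant.shape
    assert (ref - quant).abs().mean() < 1e-1
```

The same calibrate-then-freeze flow generalizes to the example script: wrap a full BERT model, run a small calibration set through it, switch to quantized inference, and evaluate on the downstream task.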