What
We have developed several PTQ wrappers for transformer-based architectures (e.g., LLaMA, Fairseq layers). To broaden applicability, we should extend the PTQ framework to cover BERT and similar on-device optimized models (e.g., DistilBERT, MobileBERT, TinyBERT).
This involves implementing quantization-friendly wrappers that integrate smoothly into the existing PTQWrapper and QuantModuleBase design, while following the inference-focused principles we already applied (see the sketch below the list):
- Wrap linear/attention modules for PTQ
- (optional) Keep LayerNorm and non-linear ops in FP
- Maintain original I/O shapes and behavior for compatibility with HuggingFace/BERT implementations
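A minimal sketch of what such a wrapper could look like, written in plain PyTorch with a hypothetical `FakeQuantLinear` helper standing in for whatever `PTQWrapper` / `QuantModuleBase` provide in the actual framework; it wraps only the linear projections, leaves LayerNorm and non-linear ops in FP, and keeps module names and I/O shapes intact:

```python
import torch
import torch.nn as nn


class FakeQuantLinear(nn.Module):
    """Wraps an existing nn.Linear: collects activation ranges during
    calibration and simulates symmetric int8 quantization afterwards."""

    def __init__(self, linear: nn.Linear):
        super().__init__()
        self.linear = linear
        self.calibrating = True
        self.register_buffer("act_max", torch.zeros(1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.calibrating:
            # Track the running max of the input activation for scale estimation.
            self.act_max = torch.maximum(self.act_max, x.abs().max().detach())
            return self.linear(x)
        # Symmetric per-tensor fake quantization of the input activation.
        scale = (self.act_max / 127.0).clamp(min=1e-8)
        x_q = torch.clamp(torch.round(x / scale), -128, 127) * scale
        return self.linear(x_q)


def wrap_bert_layer_linears(layer: nn.Module) -> nn.Module:
    """Replace every nn.Linear inside a BertLayer with a FakeQuantLinear,
    preserving child module names so I/O shapes and state_dict keys line up."""
    for name, child in layer.named_children():
        if isinstance(child, nn.Linear):
            setattr(layer, name, FakeQuantLinear(child))
        else:
            wrap_bert_layer_linears(child)  # recurse; LayerNorm etc. stay in FP
    return layer
```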
Motivation
BERT-based models are still widely used in on-device NLP tasks such as classification, QA, and NLU. Providing ready-to-use wrappers for these models will demonstrate the generality of our PTQ framework. It also establishes a baseline for applying advanced PTQ algorithms (e.g., GPTQ, SmoothQuant) to BERT-family models.
Tasks
- Identify core modules in HuggingFace BERT (e.g., BertSelfAttention, BertIntermediate, BertOutput) to be wrapped.
- Implement QuantBertAttention, QuantBertFeedForward, and QuantBertLayer wrappers.
- Ensure compatibility with QuantConfig and calibration flows.
- Add unit tests comparing FP32 vs. PTQ outputs (see the sketch after this list).
- Provide an example script: PTQ BERT → evaluate on a downstream task (e.g., a GLUE subset).
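As a starting point for the unit-test task, a hedged sketch that compares FP32 vs. PTQ outputs on a single HuggingFace `BertLayer`. It reuses the hypothetical `FakeQuantLinear` / `wrap_bert_layer_linears` helpers sketched above; the real test would go through `PTQWrapper` and `QuantConfig` instead:

```python
import copy

import torch
from transformers import BertConfig
from transformers.models.bert.modeling_bert import BertLayer


def test_bert_layer_ptq_matches_fp32():
    torch.manual_seed(0)
    config = BertConfig(hidden_size=64, num_attention_heads=4, intermediate_size=128)
    fp32_layer = BertLayer(config).eval()
    # PTQ copy: same weights, with every nn.Linear wrapped for fake quantization.
    ptq_layer = wrap_bert_layer_linears(copy.deepcopy(fp32_layer)).eval()

    hidden = torch.randn(2, 8, config.hidden_size)

    # Calibration pass: observers record activation ranges, outputs stay FP32.
    with torch.no_grad():
        ptq_layer(hidden)
    # Switch from calibration to (fake-)quantized inference.
    for m in ptq_layer.modules():
        if isinstance(m, FakeQuantLinear):
            m.calibrating = False

    with torch.no_grad():
        ref = fp32_layer(hidden)[0]
        quant = ptq_layer(hidden)[0]

    # Shapes must match exactly; the error tolerance below is illustrative only.
    assert ref.shape == quant.shape
    assert (ref - quant).abs().mean() < 1e-1
```

The same calibrate-then-freeze flow generalizes to the example script: wrap a full BERT model, run a small calibration set through it, switch to quantized inference, and evaluate on the downstream task.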