Add PTQ wrapper support for BERT #318

@mhs4670go

What

We have developed several PTQ wrappers for transformer-based architectures (e.g., LLaMA, Fairseq layers). To broaden applicability, we should extend the PTQ framework to cover BERT and similar models optimized for on-device use (e.g., DistilBERT, MobileBERT, TinyBERT).

This involves implementing quantization-friendly wrappers that integrate smoothly with the existing PTQWrapper and QuantModuleBase design, following the inference-focused principles we already applied elsewhere (see the sketch after this list):

  • Wrap linear/attention modules for PTQ
  • (optional) Keep LayerNorm and non-linear ops in FP
  • Maintain original I/O shapes and behavior for compatibility with HuggingFace/BERT implementations
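
As a concrete reference for the wrapping pattern, here is a minimal sketch. It does not use the actual PTQWrapper / QuantModuleBase / QuantConfig APIs (their signatures aren't shown in this issue); the FakeQuantLinear and wrap_for_ptq names are hypothetical stand-ins that only illustrate the three points above: wrap nn.Linear modules for PTQ, leave LayerNorm and non-linear ops in FP, and keep I/O shapes unchanged.

```python
# Hypothetical sketch: the real wrappers should subclass QuantModuleBase and be
# created through PTQWrapper. FakeQuantLinear here is only a stand-in
# observer + fake-quant module used to illustrate the wrapping pattern.
import copy

import torch
import torch.nn as nn


class FakeQuantLinear(nn.Module):
    """Stand-in for a PTQ-wrapped nn.Linear (asymmetric per-tensor fake quant)."""

    def __init__(self, linear: nn.Linear, n_bits: int = 8):
        super().__init__()
        self.linear = linear  # reuse the original FP32 weights/bias
        self.n_bits = n_bits
        self.calibrating = True
        self.register_buffer("act_min", torch.tensor(float("inf")))
        self.register_buffer("act_max", torch.tensor(float("-inf")))

    def _fake_quant(self, x, lo, hi):
        qmax = 2 ** self.n_bits - 1
        scale = (hi - lo).clamp(min=1e-8) / qmax
        return ((x - lo) / scale).round().clamp(0, qmax) * scale + lo

    def forward(self, x):
        if self.calibrating:  # calibration pass: only observe activation range
            self.act_min = torch.minimum(self.act_min, x.detach().min())
            self.act_max = torch.maximum(self.act_max, x.detach().max())
            return self.linear(x)
        x = self._fake_quant(x, self.act_min, self.act_max)
        w = self._fake_quant(self.linear.weight,
                             self.linear.weight.min(), self.linear.weight.max())
        return nn.functional.linear(x, w, self.linear.bias)  # same output shape


def wrap_for_ptq(module: nn.Module) -> nn.Module:
    """Return a copy of `module` with every nn.Linear wrapped for PTQ.
    LayerNorm, softmax, GELU, dropout, etc. stay in FP, and the forward
    signature / I/O shapes are unchanged."""
    module = copy.deepcopy(module)
    targets = [(parent, name, child)
               for parent in module.modules()
               for name, child in parent.named_children()
               if isinstance(child, nn.Linear)]
    for parent, name, child in targets:
        setattr(parent, name, FakeQuantLinear(child))
    return module
```

Applied to a HuggingFace BertLayer, this sketch would wrap the query/key/value/output projections and the BertIntermediate / BertOutput dense layers while leaving both LayerNorms untouched, so the wrapped layer stays a drop-in replacement during calibration and evaluation.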

Motivation

BERT-based models are still widely used in on-device NLP tasks such as classification, QA, and NLU. Providing ready-to-use wrappers for these models will demonstrate the generality of our PTQ framework and establish a baseline for applying advanced PTQ algorithms (e.g., GPTQ, SmoothQuant) to BERT-family models.

Tasks

  • Identify core modules in HuggingFace BERT (e.g., BertSelfAttention, BertIntermediate, BertOutput) to be wrapped.
  • Implement QuantBertAttention, QuantBertFeedForward, and QuantBertLayer wrappers.
  • Ensure compatibility with QuantConfig and calibration flows.
  • Add unit tests comparing FP32 vs PTQ outputs (a sketch follows this list).
  • Provide an example script: PTQ BERT → evaluate on downstream task (e.g., GLUE subset).
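
For the unit-test task, here is a hedged sketch of an FP32-vs-PTQ output comparison. Since QuantBertAttention / QuantBertLayer don't exist yet, it uses PyTorch's dynamic int8 quantization of nn.Linear as a stand-in for our PTQ path, applied to a tiny randomly initialized HuggingFace BertModel; the tolerance is illustrative and would need tuning for the real wrappers and calibration flow.

```python
# Hypothetical test sketch. Once QuantBertLayer / QuantBertAttention exist, the
# quantize_dynamic call below would be replaced by the project's PTQWrapper-based
# flow (wrap -> calibrate -> evaluate).
import unittest

import torch
import torch.nn as nn
from transformers import BertConfig, BertModel


class TestBertPTQOutputs(unittest.TestCase):
    def test_fp32_vs_int8_outputs_are_close(self):
        torch.manual_seed(0)
        config = BertConfig(hidden_size=64, num_hidden_layers=2,
                            num_attention_heads=4, intermediate_size=128,
                            vocab_size=1000, max_position_embeddings=64)
        fp32_model = BertModel(config).eval()

        # Stand-in PTQ path: dynamic int8 quantization of all nn.Linear modules.
        int8_model = torch.ao.quantization.quantize_dynamic(
            fp32_model, {nn.Linear}, dtype=torch.qint8)

        input_ids = torch.randint(0, config.vocab_size, (2, 16))
        with torch.no_grad():
            fp32_out = fp32_model(input_ids).last_hidden_state
            int8_out = int8_model(input_ids).last_hidden_state

        # I/O shapes must be preserved by the quantized path.
        self.assertEqual(fp32_out.shape, int8_out.shape)
        # Illustrative tolerance; tighten/loosen for the real wrappers.
        self.assertLess((fp32_out - int8_out).abs().max().item(), 0.5)


if __name__ == "__main__":
    unittest.main()
```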
