Accuracy drop: DeiT-tiny & LLaMA behavior differences between FP / Decomposition / FQ / TOSA #16426

@hth000

🐛 Describe the bug

Hi ExecuTorch Arm backend team,

I’m validating the lowering pipeline and collected the comparison below across several models. Quantization is 8-bit only. "Decompose with aten" means decomposition by the Arm backend's quantizer.transform_for_annotation. The error metrics are sketched in code after the table.

| Model | FP vs Decompose with aten | FP vs Decompose with quant aten (FQ) | FP vs TOSA | FQ vs TOSA |
| --- | --- | --- | --- | --- |
| Mobilenetv2 | max error: 0.0<br>mean error: 0.0<br>cosine: 1.0 | max error: 8.0<br>mean error: 1.187<br>cosine: 0.9997984690887474 | max error: 13.0<br>mean error: 1.731<br>cosine: 0.9995700057524048 | max error: 8.0<br>mean error: 1.226<br>cosine: 0.9997850644158716 |
| Resnet18 | max error: 0.0<br>mean error: 0.0<br>cosine: 0.9999999999999999 | max error: 6.0<br>mean error: 1.525<br>cosine: 0.9990599672781051 | max error: 9.0<br>mean error: 1.815<br>cosine: 0.9985450487548194 | max error: 7.0<br>mean error: 1.49<br>cosine: 0.9990350537962956 |
| Deit-tiny | max error: 0.0<br>mean error: 0.0<br>cosine: 1.0 | max error: 96.0<br>mean error: 21.23<br>cosine: 0.8994378800187951 | max error: 83.0<br>mean error: 18.682<br>cosine: 0.9264412351657058 | max error: 96.0<br>mean error: 21.434<br>cosine: 0.898735675973962 |
| LLaMA | max error: 89.0<br>mean error: 20.6328<br>cosine: 0.7638448052472464 | max error: 82.0<br>mean error: 19.37109375<br>cosine: 0.7869657464752766 | max error: 82.0<br>mean error: 19.37109375<br>cosine: 0.7869657464752766 | max error: 0.0<br>mean error: 0.0<br>cosine: 1.0 |
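Each cell reports the max and mean absolute error and the cosine similarity between the two outputs being compared. A minimal sketch of one way to compute these metrics (the helper name compare_outputs is just for illustration and is reused further below):

```python
import torch


def compare_outputs(ref: torch.Tensor, test: torch.Tensor) -> dict:
    """Max/mean absolute error and cosine similarity between two flattened outputs."""
    ref_f = ref.detach().flatten().float()
    test_f = test.detach().flatten().float()
    abs_err = (ref_f - test_f).abs()
    return {
        "max error": abs_err.max().item(),
        "mean error": abs_err.mean().item(),
        "cosine": torch.nn.functional.cosine_similarity(ref_f, test_f, dim=0).item(),
    }
```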

The experimental setup is as follows:

  • ExecuTorch commit: 913436a
  • TOSA results are obtained by running inference with tosa_reference_model.
  • For all models, the test input is torch.rand(input_shape) with batch size 1.
  • For this verification only, the calibration data is identical to the test input (the FQ flow is sketched after this list).
  • The models come from executorch/example/models.
  • LLaMA weights are re-initialized as follows:

```python
torch.manual_seed(0)
for p in self.model_.parameters():
    # p.data.fill_(0)
    torch.nn.init.normal_(p, mean=0.0, std=0.02)
for b in self.model_.buffers():
    b.data.fill_(0)
```
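For completeness, the FQ column is produced with the standard PT2E flow, roughly as below. Treat this only as a sketch: the Arm quantizer import path and constructor have changed across ExecuTorch commits (e.g. ArmQuantizer vs TOSAQuantizer), and the use of torch.export.export_for_training is an assumption.

```python
import torch
from torch.ao.quantization.quantize_pt2e import prepare_pt2e, convert_pt2e

# Assumed import; the quantizer class and module location differ between ExecuTorch versions.
from executorch.backends.arm.quantizer.arm_quantizer import (
    ArmQuantizer,
    get_symmetric_quantization_config,
)


def fake_quantize(model: torch.nn.Module, example_inputs: tuple):
    # Export to an ATen-level graph (API name assumed; older commits used a different capture API).
    graph_module = torch.export.export_for_training(model, example_inputs).module()

    quantizer = ArmQuantizer().set_global(get_symmetric_quantization_config())

    # prepare_pt2e calls quantizer.transform_for_annotation() before inserting observers.
    prepared = prepare_pt2e(graph_module, quantizer)

    # Calibration: in these experiments the calibration data equals the test input.
    prepared(*example_inputs)

    return convert_pt2e(prepared)
```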

Based on the conditions above, I have some questions about the comparison results:

  1. I can roughly understand why DeiT-tiny shows a large difference between FP and its quantized (FQ) output; this may come from limited quantization granularity or the nature of the model itself. In that sense, it is reasonable that the TOSA output also deviates significantly from FP.
    However, what I am not fully sure about is FQ vs TOSA: in theory these two should be very close to each other, as they are for the other models.
  2. For LLaMA, there is already a very large deviation from the FP model immediately after applying quantizer.transform_for_annotation, even before FakeQuant is applied or the model is lowered to TOSA (a sketch for isolating this step follows the questions).

Is it possible that these models are currently lowerable, but numerical correctness is not yet guaranteed for them?
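
To help isolate question 2, the deviation can be measured on the graph returned by quantizer.transform_for_annotation alone, before any observers or fake-quant nodes are inserted. A sketch, reusing the compare_outputs helper and the assumed export/quantizer APIs from above, and assuming a single-tensor model output:

```python
def check_decompose_only(model: torch.nn.Module, example_inputs: tuple, quantizer) -> dict:
    # Reference output from the unmodified FP model.
    ref_out = model(*example_inputs)

    graph_module = torch.export.export_for_training(model, example_inputs).module()

    # The "Decompose with aten" column: only the Arm backend's pre-annotation decompositions.
    decomposed = quantizer.transform_for_annotation(graph_module)
    test_out = decomposed(*example_inputs)

    return compare_outputs(ref_out, test_out)
```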

cc @freddan80 @per @zingo @oscarandersson8218 @digantdesai
