Accuracy drop: DeiT-tiny & LLaMA behavior differences between FP / Decomposition / FQ / TOSA #16426

@hth000

🐛 Describe the bug

Hi ExecuTorch Arm backend team,

I’m validating the lowering pipeline and collected the comparison below across several models. Quantization is 8-bit only. "Decompose with aten" means decomposition by the Arm backend's quantizer.transform_for_annotation. The error metrics are sketched in code after the table.

| Model | FP vs Decompose with aten | FP vs Decompose with quant aten (FQ) | FP vs TOSA | FQ vs TOSA |
| --- | --- | --- | --- | --- |
| Mobilenetv2 | max error: 0.0<br>mean error: 0.0<br>cosine: 1.0 | max error: 8.0<br>mean error: 1.187<br>cosine: 0.9997984690887474 | max error: 13.0<br>mean error: 1.731<br>cosine: 0.9995700057524048 | max error: 8.0<br>mean error: 1.226<br>cosine: 0.9997850644158716 |
| Resnet18 | max error: 0.0<br>mean error: 0.0<br>cosine: 0.9999999999999999 | max error: 6.0<br>mean error: 1.525<br>cosine: 0.9990599672781051 | max error: 9.0<br>mean error: 1.815<br>cosine: 0.9985450487548194 | max error: 7.0<br>mean error: 1.49<br>cosine: 0.9990350537962956 |
| Deit-tiny | max error: 0.0<br>mean error: 0.0<br>cosine: 1.0 | max error: 96.0<br>mean error: 21.23<br>cosine: 0.8994378800187951 | max error: 83.0<br>mean error: 18.682<br>cosine: 0.9264412351657058 | max error: 96.0<br>mean error: 21.434<br>cosine: 0.898735675973962 |
| LLaMA | max error: 89.0<br>mean error: 20.6328<br>cosine: 0.7638448052472464 | max error: 82.0<br>mean error: 19.37109375<br>cosine: 0.7869657464752766 | max error: 82.0<br>mean error: 19.37109375<br>cosine: 0.7869657464752766 | max error: 0.0<br>mean error: 0.0<br>cosine: 1.0 |
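Each cell reports the max and mean absolute error and the cosine similarity between the two outputs being compared. A minimal sketch of one way to compute these metrics (the helper name compare_outputs is just for illustration and is reused further below):

```python
import torch


def compare_outputs(ref: torch.Tensor, test: torch.Tensor) -> dict:
    """Max/mean absolute error and cosine similarity between two flattened outputs."""
    ref_f = ref.detach().flatten().float()
    test_f = test.detach().flatten().float()
    abs_err = (ref_f - test_f).abs()
    return {
        "max error": abs_err.max().item(),
        "mean error": abs_err.mean().item(),
        "cosine": torch.nn.functional.cosine_similarity(ref_f, test_f, dim=0).item(),
    }
```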

The experimental setup is as follows:

  • ExecuTorch commit: 913436a
  • TOSA results are obtained by running inference with tosa_reference_model.
  • For all models, the test input is torch.rand(input_shape) with batch size 1.
  • For this verification only, the calibration data is identical to the test input (the FQ flow is sketched after this list).
  • The models come from executorch/example/models.
  • LLaMA weights are re-initialized as follows:

```python
torch.manual_seed(0)
for p in self.model_.parameters():
    # p.data.fill_(0)
    torch.nn.init.normal_(p, mean=0.0, std=0.02)
for b in self.model_.buffers():
    b.data.fill_(0)
```
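For completeness, the FQ column is produced with the standard PT2E flow, roughly as below. Treat this only as a sketch: the Arm quantizer import path and constructor have changed across ExecuTorch commits (e.g. ArmQuantizer vs TOSAQuantizer), and the use of torch.export.export_for_training is an assumption.

```python
import torch
from torch.ao.quantization.quantize_pt2e import prepare_pt2e, convert_pt2e

# Assumed import; the quantizer class and module location differ between ExecuTorch versions.
from executorch.backends.arm.quantizer.arm_quantizer import (
    ArmQuantizer,
    get_symmetric_quantization_config,
)


def fake_quantize(model: torch.nn.Module, example_inputs: tuple):
    # Export to an ATen-level graph (API name assumed; older commits used a different capture API).
    graph_module = torch.export.export_for_training(model, example_inputs).module()

    quantizer = ArmQuantizer().set_global(get_symmetric_quantization_config())

    # prepare_pt2e calls quantizer.transform_for_annotation() before inserting observers.
    prepared = prepare_pt2e(graph_module, quantizer)

    # Calibration: in these experiments the calibration data equals the test input.
    prepared(*example_inputs)

    return convert_pt2e(prepared)
```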

Based on the conditions above, I have some questions about the comparison results:

  1. I can roughly understand why DeiT-tiny shows a large difference between FP and its quantized (FQ) output; this may come from limited quantization granularity or the nature of the model itself. In that sense, it is reasonable that the TOSA output also deviates significantly from FP.
    However, what I am not fully sure about is FQ vs TOSA: in theory these two should be very close to each other, as they are for the other models.
  2. For LLaMA, there is already a very large deviation from the FP model immediately after applying quantizer.transform_for_annotation, even before FakeQuant is applied or the model is lowered to TOSA (a sketch for isolating this step follows the questions).

Is it possible that these models are currently lowerable, but numerical correctness is not yet guaranteed for them?
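
To help isolate question 2, the deviation can be measured on the graph returned by quantizer.transform_for_annotation alone, before any observers or fake-quant nodes are inserted. A sketch, reusing the compare_outputs helper and the assumed export/quantizer APIs from above, and assuming a single-tensor model output:

```python
def check_decompose_only(model: torch.nn.Module, example_inputs: tuple, quantizer) -> dict:
    # Reference output from the unmodified FP model.
    ref_out = model(*example_inputs)

    graph_module = torch.export.export_for_training(model, example_inputs).module()

    # The "Decompose with aten" column: only the Arm backend's pre-annotation decompositions.
    decomposed = quantizer.transform_for_annotation(graph_module)
    test_out = decomposed(*example_inputs)

    return compare_outputs(ref_out, test_out)
```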

cc @freddan80 @per @zingo @oscarandersson8218 @digantdesai
