
[TorchAO][BMG] The RTN performance of Llama 3.2-1b and Qwen2.5-1.5B shows a regression in next-token latency compared with WW38. #2202

@MingxuZh


🐛 Describe the bug

The RTN (round-to-nearest) weight-only quantization performance of Llama-3.2-1B and Qwen2.5-1.5B shows a regression in next-token latency compared with the WW38 baseline.

Next-token latency regression: ~6%

Script:
https://github.com/intel-innersource/frameworks.ai.pytorch.gpu-models/blob/dev/client_gpu_models/LLM/inference/run_generation.py
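For context on the metric: next-token latency is the per-step decode time, excluding the prefill (first-token) step. Below is a minimal sketch of how it is typically measured; this is not the run_generation.py code, and the model name, prompt, step count, and greedy decode loop are illustrative assumptions only.

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumption: a PyTorch build with torch.xpu (e.g. the 2.x XPU nightlies above).
device = "xpu" if torch.xpu.is_available() else "cpu"
tok = AutoTokenizer.from_pretrained("gpt2")  # small stand-in model for illustration
model = AutoModelForCausalLM.from_pretrained("gpt2").to(device).eval()

inputs = tok("The quick brown fox", return_tensors="pt").to(device)
step_latencies = []
with torch.no_grad():
    out = model(**inputs, use_cache=True)      # prefill: produces the first token
    next_tok = out.logits[:, -1:].argmax(-1)
    past = out.past_key_values
    for _ in range(32):                        # each iteration is one decode step
        t0 = time.perf_counter()
        out = model(input_ids=next_tok, past_key_values=past, use_cache=True)
        if device == "xpu":
            torch.xpu.synchronize()            # flush async kernels before reading the clock
        step_latencies.append(time.perf_counter() - t0)
        next_tok = out.logits[:, -1:].argmax(-1)
        past = out.past_key_values

print(f"mean next-token latency: {sum(step_latencies) / len(step_latencies) * 1e3:.2f} ms")
```

The ~6% regression above refers to this per-step average, so a slowdown in the quantized matmul kernels shows up directly in it.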

Command (the duplicated trailing `--device xpu --token-latency` flags from the original paste have been removed):

```bash
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=2
python -u run_generation.py -m microsoft/Phi-3.5-mini-instruct \
  --input-tokens 1024 --max-new-tokens 1024 --num-iter 8 --num-warmup 4 \
  --batch-size 1 --device xpu --token-latency --attn-type paged_attention \
  --num-beams 1 --inductor --sub-model-name phi3.5-mini-3.8b \
  --use-hf-code False --use-static-cache --woq --woq-type rtn \
  --quant-dtype uint4 --group-size 128
```

Versions

torch: 2.10.0.dev20251012+xpu
torchao: 0.15.0.dev20251012+xpu
transformers: 4.55.4
