[BUG] 3bit quant and/or inference regression vs AutoGPTQ #1278
Comments
I am aware of the regression. We recently ran an arc test for all bit-widths and 3bit scored unexpectedly lower than 2bit. We will backtrack and find out which commit(s) broke 3bit quant or inference.
@sidhantls We have now bisected the commits and the 3bit ppl/accuracy regression happened between v1.6.0 and v1.8.0 (v1.6.0 had correct 3bit quality). We should have a fix isolated soon.
@sidhantls Please check.
Thank you! Will do in the next few days.
@Qubitium Hey, I just checked, and it still does not seem to work. Does the CI test you used include evaluation on downstream data? For Llama-3.2-3B at 3 bits, I get much higher performance with AutoGPTQ than here. For example, MMLU is 50% using AutoGPTQ but with this repo it is 24%.
@sidhantls Can you send us the command and/or script code that you used to generate the MMLU score? We need to replicate both the quant and scoring code for bug-fix alignment. Thanks! On a side note, gptqmodel calibration data handling is very different from autogptq and would inherently create two different quants: by default, we do not concat calibration data together. With that said, the difference is too large to not be a bug. You can think of GPTQModel as a fork of autogptq in the beginning, but it is now vastly different in almost every possible way. With the pending 2.0 release, you will be hard-pressed to recognize a single contiguous block of code shared with autogptq without a microscope. =)
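To illustrate the calibration-handling point, here is a minimal sketch of preparing calibration data as independent samples rather than one concatenated stream, i.e. the default behaviour described above. The C4 shard and sample count are illustrative placeholders, not values taken from this thread.

```python
# Sketch: keep calibration samples as separate rows (no concatenation),
# matching the GPTQModel default described in the comment above.
# The C4 shard and the 512-sample count are illustrative placeholders.
from datasets import load_dataset

rows = load_dataset(
    "allenai/c4",
    data_files="en/c4-train.00001-of-01024.json.gz",
    split="train",
).select(range(512))

# Each element remains an independent calibration sample; nothing is
# joined into one long text/token stream before quantization.
calibration_dataset = [row["text"] for row in rows]
```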
Also please post your complete MMLU scores if possible for the two models, as MMLU (sum) is sub-divided into many categories. Currently our CI tests use the faster arc-challenge for downstream validation in many tests, but I need to check whether the bits tests use ppl or the more rigorous lm-eval.
@sidhantls Ran out of time today, but for tonight we got the following result, benchmarked using the Torch kernel, all 4bits, group_size=128, C4. We will do more testing tomorrow. Please send us your benchmarking script so we can align our test env. [benchmark results table attached as an image]
@Qubitium Great, thanks for following up. I'm using the evaluation harness; I can share the script. Also, I had used group size 64. I can test with group size 128 and see the result.
@sidhantls Please send us the exact CLI command or short script that triggered lm-eval, since depending on the executor of the model, the kernels used, and the args, we can get 5%+ different scores between you and me. I will also check again with group size 64.
I ran 3bit quantization with group-size 128. Params:
Metrics (used the Evaluation Harness):
How to Reproduce: I have attached a Google Colab notebook that reproduces these results. The outputs are printed and the notebook should run as well. @Qubitium Thank you, appreciate you looking into this.
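For reference, a minimal sketch of scoring a quantized checkpoint on MMLU with the lm-eval harness Python API; the local model path and batch size are hypothetical placeholders, and exact task/group names may vary by lm-eval version.

```python
# Sketch: MMLU scoring via lm-eval's Python API (simple_evaluate).
# "./llama-3.2-3b-gptq-3bit" is a hypothetical local path to the
# quantized checkpoint, not a path taken from this thread.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                     # transformers backend
    model_args="pretrained=./llama-3.2-3b-gptq-3bit",
    tasks=["mmlu"],
    batch_size=8,
)
print(results["results"])  # per-task (and group) metrics
```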
@sidhantls Thanks for the script. Do you have the script for autogptq + eval as well? We need to evaluate both sides exactly.
@Qubitium Sure, I can share a script to reproduce the AutoGPTQ result in some time. Any idea, though, where my ModelCloud script deviates from yours so as to produce different results? Maybe if you shared the ModelCloud script that generated the table above, I could also check what's different.
@sidhantls We just merged a huge PR into main. We did get low scores for 3bits, but the score is somewhat consistent across GPTQModel b2 vs b3 vs b4 vs b8, so that's why we need the autogptq scripts/eval code to cross-reference and see exactly where it deviated for us, or whether there is an alignment issue with evaluation.
Thanks for sharing the table. I cannot find tests/tests_bits_new.py. I only see tests/tests_bits.py.
Try removing your clone and re-cloning. I had to force-push over a bad merge yesterday, so if you cloned in that 8-hour window, you need to re-clone: https://github.com/ModelCloud/GPTQModel/blob/main/tests/test_bits_new.py
Good news with the latest
Describe the bug
Quantizing a model to 3 bits using this repo leads to completely deteriorated performance: on MMLU, it gets 22%. However, when I quantize it using https://github.com/AutoGPTQ/AutoGPTQ (which is where this repo was forked from?), I get 57%. This was using Llama-3.1-8B-Instruct.
Using this repo, for 4bit I get 66% on MMLU, which is in line with what AutoGPTQ gets for 4 bits.
Has anyone else noticed that 3bit doesn't work here but works in AutoGPTQ?
Software Info
Operating System/Version + Python Version
python 3.10
To Reproduce
Quantize model to 3 bits:
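A minimal sketch of a 3bit quantization run along the lines of gptqmodel's load/quantize/save flow, assuming the current `QuantizeConfig` / `GPTQModel` API; the model id, calibration texts, and output path are illustrative placeholders (group_size=128 matches the setting discussed elsewhere in this thread).

```python
# Sketch: 3bit GPTQ quantization, assuming gptqmodel's documented
# QuantizeConfig / GPTQModel.load / quantize / save flow.
# The model id, calibration texts, and output path are placeholders.
from gptqmodel import GPTQModel, QuantizeConfig

model_id = "meta-llama/Llama-3.1-8B-Instruct"
quant_path = "Llama-3.1-8B-Instruct-gptq-3bit"

# Tiny placeholder calibration set; in practice you would pass a few
# hundred to ~1k real text samples (e.g. rows from C4).
calibration_dataset = [
    "The quick brown fox jumps over the lazy dog.",
    "GPTQ uses calibration activations to choose per-group quantized weights.",
]

quant_config = QuantizeConfig(bits=3, group_size=128)

model = GPTQModel.load(model_id, quant_config)
model.quantize(calibration_dataset, batch_size=1)
model.save(quant_path)
```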
Expected behavior
Performance should not break at 3 bits and should align with the AutoGPTQ library.