Add support for W8A8 quantization with CPU weight offloading #1078
Hi @NeoChen1024, do you mind providing the code snippet of what you have tried to run with SmoothQuant?
My own generalized quantization script is here: https://github.com/NeoChen1024/scripts/blob/master/llm-compressor-quantize.py
@NeoChen1024 have you tried providing a device map when running GPTQ? E.g. passing a device_map when loading the model, along the lines of the sketch below.
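This is not the original snippet from the comment (which isn't preserved above), only a minimal sketch of what loading a model with a device map before calling oneshot might look like; the model ID, dataset, recipe values, calibration settings, and output directory are placeholders:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
from llmcompressor.transformers import oneshot

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder model

# Let accelerate split the weights between GPU VRAM and CPU RAM.
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, device_map="auto", torch_dtype="auto"
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# SmoothQuant + GPTQ W8A8 recipe, skipping the output head.
recipe = [
    SmoothQuantModifier(smoothing_strength=0.8),
    GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"]),
]

oneshot(
    model=model,
    tokenizer=tokenizer,
    dataset="open_platypus",
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
    output_dir=f"{MODEL_ID.split('/')[-1]}-W8A8",
)
```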
I tried that before, but apparently oneshot doesn't like meta tensors; it expects all tensors to be in GPU memory. So I can't quantize models larger than 8B on my single 24GiB GPU.
Hi @NeoChen1024, can you share which versions of the packages you used the device map with?
Newest transformers (4.48.0) and newest llm-compressor (0.3.1).
If it helps, here's a bit of code I tried to use for Mistral-Small-24B-2501 on my 4090 (24GB of VRAM). Throws:
I always face problems when trying to run oneshot calibration with CPU-offloaded tensors; this prevents GPUs with 24GiB of VRAM from quantizing models larger than 7~8B.
Describe the solution you'd like
Add support for block swap / layer-wise loading (a rough sketch of the idea is shown below).
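This is not llm-compressor code; `layerwise_forward` and the stand-in blocks are hypothetical, just to illustrate block swap / layer-wise loading in plain PyTorch, keeping only one block's weights in VRAM at a time during the calibration forward pass:

```python
import torch
from torch import nn


@torch.no_grad()
def layerwise_forward(blocks: nn.ModuleList, hidden: torch.Tensor,
                      device: str = "cuda") -> torch.Tensor:
    """Run a stack of blocks one at a time, keeping only the current
    block's weights in VRAM (block swap / layer-wise loading)."""
    hidden = hidden.to(device)
    for block in blocks:
        block.to(device)        # swap this block's weights into VRAM
        hidden = block(hidden)  # calibration hooks would observe activations here
        block.to("cpu")         # swap the block back out to free VRAM
    torch.cuda.empty_cache()
    return hidden.cpu()


# Example with stand-in blocks (real decoder layers take extra arguments):
blocks = nn.ModuleList(nn.Linear(4096, 4096) for _ in range(8))
out = layerwise_forward(blocks, torch.randn(1, 16, 4096))
```

The repeated host-to-device copies are what would make this slower than keeping the whole model resident, as noted under Additional context below.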
Describe alternatives you've considered
accelerate's offloading (device_map="auto") and DeepSpeed ZeRO stage 3 didn't work when I tried them with oneshot.
Additional context
Doing block swap / layer-wise loading would probably slow down SmoothQuantModifier calibration by many times, but it would enable running calibration with longer context / bigger models.