Releases: ModelCloud/GPTQModel
GPTQModel v1.0.0
What's Changed
40% faster multi-threaded packing, new lm_eval api, and fixed Python 3.9 compatibility.
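The new lm_eval api hooks GPTQModel into lm-evaluation-harness. A minimal sketch of the equivalent harness flow over a quantized checkpoint, assuming access to the wrapped HF model via .model (the model id and task list are placeholders; the wrapper's own entry point is not reproduced here):

```python
from transformers import AutoTokenizer
from gptqmodel import GPTQModel
from lm_eval import simple_evaluate
from lm_eval.models.huggingface import HFLM

model_id = "ModelCloud/opt-125m-gptq-4bit"  # placeholder quantized checkpoint

# Load the quantized model and its tokenizer.
model = GPTQModel.from_quantized(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Wrap the underlying HF model so lm-evaluation-harness can drive it.
lm = HFLM(pretrained=model.model, tokenizer=tokenizer, batch_size=8)

# Run an illustrative task list and print the per-task metrics.
results = simple_evaluate(model=lm, tasks=["arc_easy", "hellaswag"])
print(results["results"])
```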
- Add lm_eval api by @PZS-ModelCloud in #338
- Multi-threaded packing in quantization by @PZS-ModelCloud in #354
- [CI] Add TGI unit test by @PZS-ModelCloud in #348
- [CI] Updates by @CSY-ModelCloud in #347, #352, #353, #355, #357
- Fix python 3.9 compat by @PZS-ModelCloud in #358
Full Changelog: v0.9.11...v1.0.0
GPTQModel v0.9.11
What's Changed
Added LG EXAONE 3.0 model support. New dynamic per-layer/module flexible quantization, where each layer/module can use different bits/params. Added proper sharding support to backend.BITBLAS. Auto-heal quantization errors caused by small damp values.
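A minimal sketch of the new dynamic per-module override, assuming the dynamic mapping keys are regexes matched against module names and the values override the base bits/params (the exact key/value schema, model id, and pattern are assumptions):

```python
from transformers import AutoTokenizer
from gptqmodel import GPTQModel, QuantizeConfig

model_id = "facebook/opt-125m"  # small placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_id)

# A real run needs a proper calibration set; one tokenized sample keeps the sketch short.
calibration_dataset = [tokenizer("GPTQModel is an LLM quantization toolkit.")]

# Base 4-bit config with per-module overrides via `dynamic`: modules whose
# names match the regex get their own bits/group_size.
quantize_config = QuantizeConfig(
    bits=4,
    group_size=128,
    dynamic={r".*\.mlp\..*": {"bits": 8, "group_size": 64}},
)

model = GPTQModel.from_pretrained(model_id, quantize_config)
model.quantize(calibration_dataset)
model.save_quantized("opt-125m-gptq-dynamic")
```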
- [CORE] add support for pack and shard to bitblas by @LRL-ModelCloud in #316
- Add dynamic bits by @PZS-ModelCloud in #311, #319, #321, #323, #327
- [MISC] Adjust the validate order of QuantLinear when BACKEND is AUTO by @ZX-ModelCloud in #318
- add save_quantized log model total size by @PZS-ModelCloud in #320
- Auto damp recovery by @CSY-ModelCloud in #326
- [FIX] add missing original_infeatures by @CSY-ModelCloud in #337
- Update Transformers to 4.44.0 by @Qubitium in #336
- [MODEL] add exaone model support by @LRL-ModelCloud in #340
- [CI] Upload wheel to local server by @CSY-ModelCloud in #339
- [MISC] Fix assert by @CSY-ModelCloud in #342
Full Changelog: v0.9.10...v0.9.11
GPTQModel v0.9.10
What's Changed
Ported the vllm/nm gptq_marlin inference kernel with expanded bits (8-bit), group_size (64, 32), and desc_act support for all GPTQ models with format = FORMAT.GPTQ. Auto-calculate auto-round nsamples/seqlen parameters based on the calibration dataset. Fixed save_quantized() when called on pre-quantized models with non-supported backends. HF Transformers dependency updated to ensure Llama 3.1 fixes are correctly applied to both the quantization and inference stages.
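A minimal sketch of loading an existing FORMAT.GPTQ checkpoint on the ported gptq_marlin kernel; the BACKEND enum spelling and member name are assumptions for this version, and the model id is a placeholder:

```python
from transformers import AutoTokenizer
from gptqmodel import GPTQModel, BACKEND

model_id = "TheBloke/Llama-2-7B-Chat-GPTQ"  # placeholder FORMAT.GPTQ checkpoint

# Explicitly select the marlin inference kernel; 8-bit, group_size 64/32,
# and desc_act checkpoints are now accepted by this backend.
model = GPTQModel.from_quantized(model_id, backend=BACKEND.MARLIN)
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=16)[0]))
```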
- [CORE] add marlin inference kernel by @ZX-ModelCloud in #310
- [CI] Increase timeout to 40m by @CSY-ModelCloud in #295, #299
- [FIX] save_quantized() by @ZX-ModelCloud in #296
- [FIX] autoround nsample/seqlen to be actual size of calibration_dataset by @LRL-ModelCloud in #297, #298
- Update HF transformers to 4.43.3 by @Qubitium in #305
- [CI] remove test_marlin_hf_cache_serialization() by @ZX-ModelCloud in #314
Full Changelog: v0.9.9...v0.9.10
GPTQModel v0.9.9
What's Changed
Added Llama-3.1 support, Gemma2 27B quant inference support via vLLM, auto pad_token normalization, fixed auto-round quant compat for vLLM/SGLang.
- [CI] by @CSY-ModelCloud in #238, #236, #237, #241, #242, #243, #246, #247, #250
- [FIX] explicitly call torch.no_grad() by @LRL-ModelCloud in #239
- Bitblas update by @Qubitium in #249
- [FIX] calib avg for calib dataset arg passed as tensors by @Qubitium, @LRL-ModelCloud in #254, #258
- [MODEL] gemma2 27b can load with vLLM now by @LRL-ModelCloud in #257
- [OPTIMIZE] to optimize vllm inference, set an environment variable 'VLLM_ATTENTI… by @LRL-ModelCloud in #260
- [FIX] hard set batch_size to 1 for transformers 4.43.0 due to compat/regression by @LRL-ModelCloud in #279
- FIX vllm llama 3.1 support by @Qubitium in #280
- Use better default values for quantization config by @Qubitium in #281
- [REFACTOR] Cleanup backend and model_type usage by @LRL-ModelCloud in #276
- [FIX] allow auto_round lm_head quantization by @LRL-ModelCloud in #282
- [FIX] [MODEL] Llama-3.1-8B-Instruct's eos_token_id is a list by @CSY-ModelCloud in #284
- [FIX] add release_vllm_model, and import destroy_model_parallel in release_vllm_model by @LRL-ModelCloud in #288
- [FIX] autoround quants compat with vllm/sglang by @Qubitium in #287
Full Changelog: v0.9.8...v0.9.9
GPTQModel v0.9.8
What's Changed
- Marlin end-to-end in/out feature padding for max model support
- Run quantized models (FORMAT.GPTQ) directly using the fast vLLM backend! (see the sketch after this list)
- Run quantized models (FORMAT.GPTQ) directly using the fast SGLang backend!
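A minimal sketch of running a FORMAT.GPTQ checkpoint through the new vLLM backend; the BACKEND enum spelling and member names are assumptions for this release, the model id is a placeholder, and the sampling kwargs follow vLLM conventions here as an assumption:

```python
from gptqmodel import GPTQModel, BACKEND

# Load an existing GPTQ-format checkpoint directly on the vLLM engine
# (use BACKEND.SGLANG for the SGLang engine).
model = GPTQModel.from_quantized(
    "TheBloke/Llama-2-7B-Chat-GPTQ",  # placeholder FORMAT.GPTQ checkpoint
    backend=BACKEND.VLLM,
)

# Generation is routed to the selected engine.
print(model.generate(prompts="The capital of France is", temperature=0.8, top_p=0.95))
```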
- 🚀 🚀 [CORE] Marlin end-to-end in/out feature padding by @LRL-ModelCloud in #183 #192
- 🚀 🚀 [CORE] Add vLLM Backend for FORMAT.GPTQ by @PZS-ModelCloud in #190
- 🚀 🚀 [CORE] Add SGLang Backend by @PZS-ModelCloud in #191
- 🚀 [CORE] Use Triton v2 to pack gptq/gptqv2 formats by @LRL-ModelCloud in #202
- ✨ [CLEANUP] remove triton warmup by @Qubitium in #200
- 👾 [FIX] 8bit choosing wrong packer by @Qubitium in #199
- ✨ [CI] [CLEANUP] Improve Unit Tests by CSY, PSY, and ZYC
- ✨ [DOC] Consolidate Examples by ZYC in #225
Full Changelog: v0.9.7...v0.9.8
GPTQModel v0.9.7
What's Changed
- 🚀 [MODEL] InternLM 2.5 support by @LRL-ModelCloud in #182
Full Changelog: v0.9.6...v0.9.7
GPTQModel v0.9.6
What's Changed
Intel/AutoRound QUANT_METHOD support added for potentially higher-quality quantization, with lm_head module quantization support for even more VRAM reduction; format export to FORMAT.GPTQ for max inference compatibility.
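A minimal sketch of selecting AutoRound as the quantizer with lm_head quantization enabled; the config class name, its import path, and the lm_head flag are assumptions for this version, and the model id is a placeholder:

```python
from transformers import AutoTokenizer
from gptqmodel import GPTQModel
# Class name and import path below are assumptions for this version.
from gptqmodel.quantization.config import AutoRoundQuantizeConfig

model_id = "facebook/opt-125m"  # small placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_id)
calibration_dataset = [tokenizer("GPTQModel is an LLM quantization toolkit.")]

# AutoRound quantizer, 4-bit, with lm_head quantization enabled
# (lm_head flag name assumed); the export format stays FORMAT.GPTQ.
quantize_config = AutoRoundQuantizeConfig(bits=4, group_size=128, lm_head=True)

model = GPTQModel.from_pretrained(model_id, quantize_config)
model.quantize(calibration_dataset)
model.save_quantized("opt-125m-autoround-4bit")
```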
- 🚀 [CORE] Add AutoRound as Quantizer option by @LRL-ModelCloud in #166
- 👾 [FIX] [CI] Update test by @CSY-ModelCloud in #177
- 👾 Cleanup Triton by @Qubitium in #178
Full Changelog: v0.9.5...v0.9.6
GPTQModel v0.9.5
What's Changed
Another large update with added support for Intel/QBits quantization/inference on CPU. CUDA kernels have been fully deprecated in favor of the better-performing Exllama (v1/v2), Marlin, and Triton kernels.
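A minimal sketch of CPU inference on the new Intel QBits path; the BACKEND enum spelling, the QBITS member name, the device handling, and the model id are all assumptions:

```python
from transformers import AutoTokenizer
from gptqmodel import GPTQModel, BACKEND

model_id = "ModelCloud/opt-125m-gptq-4bit"  # placeholder quantized checkpoint

# Run a [2, 3, 4, 8]-bit quantized checkpoint on CPU via the Intel QBits kernels.
model = GPTQModel.from_quantized(model_id, device="cpu", backend=BACKEND.QBITS)
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer("GPTQModel runs on CPU via QBits:", return_tensors="pt")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=16)[0]))
```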
- 🚀🚀 [KERNEL] Added Intel QBits support with [2, 3, 4, 8] bits quantization/inference on CPU by @CSY-ModelCloud in #137
- ✨ [CORE] BaseQuantLinear add SUPPORTED_DEVICES by @ZX-ModelCloud in #174
- ✨ [DEPRECATION] Remove Backend.CUDA and Backend.CUDA_OLD by @ZX-ModelCloud in #165
- 👾 [CI] FIX test perplexity by @ZYC-ModelCloud in #160
Full Changelog: v0.9.4...v0.9.5
GPTQModel v0.9.4
What's Changed
- 🚀 [FEATURE] Added Transformers Integration via monkeypatch by @ZX-ModelCloud in #147
- 👾 [FIX] Typo causing Gemma 2 errors by @LRL-ModelCloud in #158
Full Changelog: v0.9.3...v0.9.4
GPTQModel v0.9.3
What's Changed
- 🚀 [MODEL] Add Gemma 2 support by @LRL-ModelCloud in #131
- 🚀 [OTHER] Calculate ppl on gpu by @ZYC-ModelCloud in #135
- ✨ [REFACTOR] BaseQuantLinear and avoid using shared QuantLinear cls name by @PZS-ModelCloud in #116
- ✨ [KERNEL] Bitblas cache stability by @Qubitium in #129
- 👾 [FIX] Export TORCH_CUDA_ARCH_LIST in install.sh by @LeiWang1999 in #133
- 👾 [FIX] Limit Bitblas numexpr thread usage by @Qubitium in #125
- 👾 [FIX] Revert "Skip opt fc1/fc2 for quantization" due to inference regressions (#118) by @Qubitium in #149
- ✨ [REFACTOR] remove max_memory arg by @CL-ModelCloud in #144
- 🤖 [CI] Fix test was skipped by @CSY-ModelCloud in #145
- 🤖 [CI] Add GPU selector for runner by @CSY-ModelCloud in #148
New Contributors
- @LeiWang1999 made their first contribution in #133
Full Changelog: v0.9.2...v0.9.3