Releases: ModelCloud/GPTQModel

GPTQModel v1.0.0

14 Aug 00:29
4a028d5

What's Changed

40% faster multi-threaded packing, a new lm_eval API, and fixed Python 3.9 compatibility.
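
The notes don't show the new lm_eval entry point itself, so here is a minimal sketch that evaluates a quantized checkpoint through the lm-evaluation-harness Python API instead; the checkpoint path and task choice are placeholders:

```python
# Sketch: evaluate a GPTQ-quantized checkpoint with lm-evaluation-harness.
# Uses lm_eval's own Python API; the GPTQModel-side helper may differ.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                    # standard HF model loader
    model_args="pretrained=./my-model-gptq-4bit",  # hypothetical local checkpoint
    tasks=["arc_easy"],
)
print(results["results"])
```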

Full Changelog: v0.9.11...v1.0.0

GPTQModel v0.9.11

09 Aug 10:33
f2fcdc8

What's Changed

Added LG EXAONE 3.0 model support. New dynamic per-layer/module quantization, where each layer/module may use different bits/params. Added proper sharding support to backend.BITBLAS. Auto-heal quantization errors caused by small damp values.
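
A minimal sketch of the dynamic per-module config, assuming `QuantizeConfig` takes a `dynamic` mapping whose keys are regex patterns matched against module names (the exact key syntax and override fields are assumptions; check the repo docs):

```python
from gptqmodel import GPTQModel, QuantizeConfig

# Base config is 4-bit / group_size 128; entries in `dynamic` override
# matching modules with their own bits/params.
quant_config = QuantizeConfig(
    bits=4,
    group_size=128,
    dynamic={
        r".*\.mlp\..*": {"bits": 8, "group_size": 64},  # wider bits for MLP modules
    },
)

model = GPTQModel.from_pretrained("facebook/opt-125m", quant_config)
```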

Full Changelog: v0.9.10...v0.9.11

GPTQModel v0.9.10

30 Jul 19:04
233548b

What's Changed

Ported the vLLM/Neural Magic gptq_marlin inference kernel with expanded bits (8-bit), group_size (64, 32), and desc_act support for all GPTQ models with format = FORMAT.GPTQ. Auto-calculate auto-round nsamples/seqlen parameters based on the calibration dataset. Fixed save_quantized() when called on pre-quantized models with unsupported backends. HF transformers dependency updated to ensure Llama 3.1 fixes are correctly applied to both the quantization and inference stages.
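
A minimal sketch of selecting the ported kernel at load time, assuming the `BACKEND` enum and `from_quantized` signature exposed by the package (the checkpoint path is a placeholder):

```python
from gptqmodel import GPTQModel, BACKEND

# Force the ported gptq_marlin kernel instead of automatic kernel selection.
model = GPTQModel.from_quantized(
    "./my-model-gptq-4bit",   # hypothetical FORMAT.GPTQ checkpoint
    backend=BACKEND.MARLIN,
)
```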

Full Changelog: v0.9.9...v0.9.10

GPTQModel v0.9.9

24 Jul 16:42
519fbe3

What's Changed

Added Llama 3.1 support, Gemma 2 27B quantized inference support via vLLM, and auto pad_token normalization; fixed auto-round quantization compatibility for vLLM/SGLang.

Full Changelog: v0.9.8...v0.9.9

GPTQModel v0.9.8

13 Jul 12:55
0d263f3

What's Changed

  1. Marlin end-to-end in/out feature padding for maximum model support
  2. Run quantized models (FORMAT.GPTQ) directly using the fast vLLM backend!
  3. Run quantized models (FORMAT.GPTQ) directly using the fast SGLang backend! (see the sketch after this list)
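
A minimal sketch of routing inference through the new backends, assuming `BACKEND.VLLM` and `BACKEND.SGLANG` enum values matching the notes (the checkpoint path is a placeholder):

```python
from gptqmodel import GPTQModel, BACKEND

# Serve the checkpoint through vLLM; swap in BACKEND.SGLANG for the SGLang path.
model = GPTQModel.from_quantized(
    "./my-model-gptq-4bit",   # hypothetical FORMAT.GPTQ checkpoint
    backend=BACKEND.VLLM,
)
```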

Full Changelog: v0.9.7...v0.9.8

GPTQModel v0.9.7

08 Jul 11:21
0935662

What's Changed

Full Changelog: v0.9.6...v0.9.7

GPTQModel v0.9.6

08 Jul 02:59
4fade4c

What's Changed

Intel/AutoRound QUANT_METHOD support added for potentially higher-quality quantization, with lm_head module quantization support for even more VRAM reduction. Format exports to FORMAT.GPTQ for max inference compatibility.
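
A minimal sketch of an AutoRound quantization run, heavily hedged: the `AutoRoundQuantizeConfig` class name, its import path, and the `lm_head` flag are assumptions based only on these notes, and the calibration sample format should be checked against the repo docs:

```python
from gptqmodel import GPTQModel
from gptqmodel.quantization import AutoRoundQuantizeConfig  # assumed import path

quant_config = AutoRoundQuantizeConfig(
    bits=4,
    group_size=128,
    lm_head=True,  # assumed flag: also quantize lm_head for extra VRAM savings
)

calibration_dataset = ["GPTQModel quantizes large language models."]  # toy sample
model = GPTQModel.from_pretrained("facebook/opt-125m", quant_config)
model.quantize(calibration_dataset)
model.save_quantized("./opt-125m-autoround")  # exported as FORMAT.GPTQ
```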

Full Changelog: v0.9.5...v0.9.6

GPTQModel v0.9.5

05 Jul 13:48
f0a1ee8

What's Changed

Another large update, with added support for Intel/QBits quantization and inference on CPU. CUDA kernels have been fully deprecated in favor of the better-performing ExLlama (v1/v2), Marlin, and Triton kernels.
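
A minimal sketch of CPU inference via the QBits backend, assuming a `BACKEND.QBITS` enum value and a `device` keyword on `from_quantized` (both inferred from the notes, not confirmed by them):

```python
from gptqmodel import GPTQModel, BACKEND

# CPU-only inference path; no CUDA device required.
model = GPTQModel.from_quantized(
    "./my-model-gptq-4bit",   # hypothetical quantized checkpoint
    backend=BACKEND.QBITS,
    device="cpu",
)
```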

Full Changelog: v0.9.4...v0.9.5

GPTQModel v0.9.4

04 Jul 05:41
527cffb

What's Changed

Full Changelog: v0.9.3...v0.9.4

GPTQModel v0.9.3

02 Jul 18:05
26b3dc0

What's Changed

Full Changelog: v0.9.2...v0.9.3