QuantLLM is a Python library designed for developers, researchers, and teams who want to fine-tune and deploy large language models (LLMs) efficiently using 4-bit and 8-bit quantization techniques. It provides a modular and flexible framework for:
- Loading and quantizing models with advanced configurations
- LoRA / QLoRA-based fine-tuning with customizable parameters
- Dataset management with preprocessing and splitting
- Training and evaluation with comprehensive metrics
- Model checkpointing and versioning
- Hugging Face Hub integration for model sharing
The goal of QuantLLM is to democratize LLM training, especially in low-resource environments, while keeping the workflow intuitive, modular, and production-ready.
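QuantLLM's own high-level API is covered in the documentation linked below; to make the workflow concrete, here is a minimal sketch of 4-bit loading with LoRA adapters using the Hugging Face stack (Transformers, bitsandbytes, PEFT) that QuantLLM builds on. The model name and LoRA hyperparameters are placeholder assumptions, not QuantLLM defaults.

```python
# Minimal 4-bit load + LoRA sketch using the underlying Hugging Face stack
# (illustrative only; QuantLLM's own API may differ).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "facebook/opt-1.3b"  # placeholder; any causal LM on the Hub works

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # 4-bit weights via bitsandbytes
    bnb_4bit_quant_type="nf4",              # NormalFloat4, as used by QLoRA
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 for stability
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # requires the accelerate package
)

# Prepare the quantized model for training and attach LoRA adapters.
model = prepare_model_for_kbit_training(model)
lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM")
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # only the small adapter matrices train
```

Only the LoRA adapter weights receive gradients; the 4-bit base weights stay frozen, which is what makes fine-tuning feasible on a single consumer GPU.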
| Feature | Description |
|---|---|
| ✅ Quantized Model Loading | Load any Hugging Face model in 4-bit or 8-bit precision with customizable quantization settings |
| ✅ Advanced Dataset Management | Load, preprocess, and split datasets with flexible configurations |
| ✅ LoRA / QLoRA Fine-Tuning | Memory-efficient fine-tuning with customizable LoRA parameters |
| ✅ Comprehensive Training | Advanced training loop with mixed precision, gradient accumulation, and early stopping |
| ✅ Model Evaluation | Flexible evaluation with custom metrics and batch processing |
| ✅ Checkpoint Management | Save, resume, and manage training checkpoints with versioning |
| ✅ Hub Integration | Push models and checkpoints to the Hugging Face Hub with authentication |
| ✅ Configuration Management | YAML/JSON config support for reproducible experiments |
| ✅ Logging and Monitoring | Comprehensive logging and Weights & Biases integration |
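To illustrate the dataset-management and training features above, here is a hedged sketch using the `datasets` library and the Hugging Face `Trainer` that QuantLLM builds on. It assumes the `model` and `tokenizer` from the loading sketch earlier; the dataset name and hyperparameters are placeholders.

```python
# Sketch: dataset loading/splitting plus training with mixed precision,
# gradient accumulation, and early stopping (illustrative placeholders).
from datasets import load_dataset
from transformers import (DataCollatorForLanguageModeling, EarlyStoppingCallback,
                          Trainer, TrainingArguments)

raw = load_dataset("imdb", split="train[:2000]")       # placeholder dataset
splits = raw.train_test_split(test_size=0.1, seed=42)  # 90/10 split

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

train_ds = splits["train"].map(tokenize, batched=True, remove_columns=raw.column_names)
eval_ds = splits["test"].map(tokenize, batched=True, remove_columns=raw.column_names)

args = TrainingArguments(
    output_dir="./checkpoints",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,    # effective batch size 32
    bf16=True,                        # mixed precision (needs a recent GPU)
    evaluation_strategy="steps",      # "eval_strategy" on transformers >= 4.41
    eval_steps=200,
    save_strategy="steps",
    save_steps=200,
    load_best_model_at_end=True,      # required by EarlyStoppingCallback
    metric_for_best_model="eval_loss",
    report_to="none",                 # set to "wandb" for W&B tracking
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
trainer.train()
```

After training, the standard PEFT/Transformers `model.push_to_hub("your-username/your-repo")` uploads the adapter to the Hugging Face Hub (authenticate first with `huggingface-cli login`).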
```bash
pip install quantllm
```
For detailed usage examples and API documentation, see the docs and community links at the end of this README.
Minimum (CPU-only):
- CPU: 4+ cores
- RAM: 16GB
- Storage: 20GB free space
- Python: 3.8+ (3.10+ for the latest release; see the compatibility table below)
Recommended (GPU training):
- GPU: NVIDIA GPU with 8GB+ VRAM
- RAM: 32GB
- Storage: 50GB+ SSD
- CUDA: 11.7+
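A quick way to check a machine against the recommended specs is a few lines of standard PyTorch (the 8GB threshold below simply mirrors this list):

```python
# Check the local GPU against the recommended specs above.
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    vram_gb = props.total_memory / 1024**3
    print(f"GPU: {props.name}, VRAM: {vram_gb:.1f} GB")
    if vram_gb < 8:
        print("Below the recommended 8GB VRAM; prefer 4-bit loading and small batches.")
else:
    print("No CUDA GPU detected; CPU-only training will be slow for 7B+ models.")
```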
| Model Size | 4-bit (GPU RAM) | 8-bit (GPU RAM) | CPU RAM (min) |
|---|---|---|---|
| 3B params | ~6GB | ~9GB | 16GB |
| 7B params | ~12GB | ~18GB | 32GB |
| 13B params | ~20GB | ~32GB | 64GB |
| 70B params | ~90GB | ~140GB | 256GB |
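For intuition, raw weight storage is about 0.5 bytes per parameter at 4-bit and 1 byte at 8-bit; the larger figures in the table also account for activations, gradients, and optimizer state during fine-tuning. A small helper for the raw-weight part (the training overhead on top varies with sequence length and batch size, so it is deliberately left out):

```python
def weight_footprint_gb(params_billions: float, bits: int) -> float:
    """Raw storage for quantized weights alone, in GiB."""
    bytes_per_param = bits / 8
    return params_billions * 1e9 * bytes_per_param / 1024**3

for size in (3, 7, 13, 70):
    print(f"{size}B params: 4-bit weights ~{weight_footprint_gb(size, 4):.1f} GiB, "
          f"8-bit ~{weight_footprint_gb(size, 8):.1f} GiB")
```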
| QuantLLM | Python | PyTorch | Transformers | CUDA |
|---|---|---|---|---|
| latest | ≥3.10 | ≥2.0.0 | ≥4.30.0 | ≥11.7 |
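The installed versions can be compared against this table in a few lines:

```python
# Print installed versions for comparison with the compatibility table.
import sys
import torch
import transformers

print(f"Python      : {sys.version.split()[0]}")    # needs >= 3.10
print(f"PyTorch     : {torch.__version__}")         # needs >= 2.0.0
print(f"Transformers: {transformers.__version__}")  # needs >= 4.30.0
print(f"CUDA (torch): {torch.version.cuda}")        # needs >= 11.7; None on CPU builds
```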
- Multi-GPU training support
- AutoML for hyperparameter tuning
- More quantization methods
- Custom model architecture support
- Enhanced logging and visualization
- Model compression techniques
- Deployment optimizations
We welcome contributions! Please see our CONTRIBUTE.md for guidelines and setup instructions.
This project is licensed under the MIT License - see the LICENSE file for details.
- Hugging Face for their amazing Transformers library
- bitsandbytes for quantization
- PEFT for parameter-efficient fine-tuning
- Weights & Biases for experiment tracking
- GitHub Issues: Create an issue
- Documentation: Read the docs
- Discord: Join our community
- Email: [email protected]