diff --git a/README.md b/README.md
index 72b0fc08..3ee5b7a6 100644
--- a/README.md
+++ b/README.md
@@ -22,6 +22,7 @@ A more accessible, comprehensive, and efficient toolkit for large model compress

 ## 📣Latest News

+- [26/05/08] We have released the STQ1_0 kernel for 1.25-bit models and submitted [PR #22836](https://github.com/ggml-org/llama.cpp/pull/22836) to llama.cpp! If you have any questions or suggestions about STQ_0, feel free to comment on the PR! 🔥🔥🔥
 - [26/04/29] We have released 2-bit and 1.25-bit versions of the Tencent Hy-MT1.5-1.8B Translation Model: [Hy-MT1.5-1.8B-2bit](https://huggingface.co/AngelSlim/Hy-MT1.5-1.8B-2bit) and [Hy-MT1.5-1.8B-1.25bit](https://huggingface.co/AngelSlim/Hy-MT1.5-1.8B-1.25bit). Additionally, we have made an [offline translation demo](https://huggingface.co/AngelSlim/Hy-MT1.5-1.8B-1.25bit/blob/main/Hy-MT-demo.apk) for you to try out. We invite you to give it a spin! 🔥🔥🔥
 - [26/04/23] We now support FP8-Static quantization for **Hy3-preview** (MoE A20B).
 - [26/03/25] We have released **DAQ**, a quantization algorithm that preserves the knowledge acquired during post-training when the parameter updates are relatively small. [[Paper]](https://arxiv.org/abs/2603.22324) | [[Docs]](docs/source/features/quantization/daq.md)
diff --git a/README_cn.md b/README_cn.md
index d5342d4b..7e6ace5e 100644
--- a/README_cn.md
+++ b/README_cn.md
@@ -22,6 +22,7 @@

 ## 📣Latest News

+- [26/05/08] We have released the STQ1_0 kernel for 1.25-bit models and submitted [PR #22836](https://github.com/ggml-org/llama.cpp/pull/22836) to llama.cpp! If you have any questions or suggestions about STQ_0, feel free to comment on the PR! 🔥🔥🔥
 - [26/04/29] We have released 2-bit and 1.25-bit versions of the Tencent Hunyuan translation model: [Hy-MT1.5-1.8B-2bit](https://huggingface.co/AngelSlim/Hy-MT1.5-1.8B-2bit) and [Hy-MT1.5-1.8B-1.25bit](https://huggingface.co/AngelSlim/Hy-MT1.5-1.8B-1.25bit). We have also built an [offline translation demo](https://huggingface.co/AngelSlim/Hy-MT1.5-1.8B-1.25bit/blob/main/Hy-MT-demo.apk). Give it a try! 🔥🔥🔥
 - [26/04/23] We now support FP8-Static quantization for the **Hy3-preview** (MoE A20B) model.
 - [26/03/25] We have released the DAQ quantization algorithm, which preserves the quantized model's capabilities when post-training parameter updates are small. [[Paper]](https://arxiv.org/abs/2603.22324) | [[Docs]](docs/source/features/quantization/daq.md)
diff --git a/docs/source/models/Hy-MT1.5/hy-mt1.5.md b/docs/source/models/Hy-MT1.5/hy-mt1.5.md
index 62cb02ea..95f6149c 100644
--- a/docs/source/models/Hy-MT1.5/hy-mt1.5.md
+++ b/docs/source/models/Hy-MT1.5/hy-mt1.5.md
@@ -100,7 +100,78 @@ Demo device: Snapdragon 7+ Gen 2, 16GB RAM.
 :::

 ## 💻 Deployment
-Our llama.cpp kernel (including STQ kernel) is coming soon.
+
+### Clone llama.cpp
+
+```bash
+git clone https://github.com/ggml-org/llama.cpp.git
+```
+
+### Enter the llama.cpp folder
+
+```bash
+cd llama.cpp
+```
+
+### Fetch and check out the PR branch
+
+```bash
+git fetch origin pull/22836/head:pr-22836-stq_0
+git checkout pr-22836-stq_0
+```
+
+### Build llama.cpp
+
+```bash
+pip install -r requirements.txt
+cmake -B build
+cmake --build build --config Release
+```
+
+### Download the HF model
+
+
+```bash
+pip install huggingface_hub
+huggingface-cli download AngelSlim/Hy-MT1.5-1.8B-1.25bit \
+  --local-dir model_zoo/Hy-MT1.5-1.8B-1.25bit
+```
+
+### Convert HF → bf16 GGUF
+
+```bash
+python convert_hf_to_gguf.py model_zoo/Hy-MT1.5-1.8B-1.25bit \
+  --outfile model_zoo/Hy-MT1.5-1.8B-bf16.gguf \
+  --outtype bf16
+```
+
+### Quantize bf16 → STQ1_0
+
+```bash
+./build/bin/llama-quantize \
+  model_zoo/Hy-MT1.5-1.8B-bf16.gguf \
+  model_zoo/Hy-MT1.5-1.8B-STQ1_0.gguf \
+  STQ1_0
+```
+
+### Run a completion example
+
+The prompt format can be viewed at [HY-MT1.5-1.8B](https://huggingface.co/tencent/HY-MT1.5-1.8B).
+
+```bash
+./build/bin/llama-completion \
+  --model model_zoo/Hy-MT1.5-1.8B-STQ1_0.gguf \
+  -p "Translate the following segment into Chinese, without additional explanation. Hello " \
+  --jinja \
+  -ngl 0 \
+  -n 64 -st
+```
+
+### Run the llama.cpp benchmark
+
+```bash
+./build/bin/llama-bench -m model_zoo/Hy-MT1.5-1.8B-STQ1_0.gguf -ngl 0
+```

 ## 📥 Download Links