From 7eb34915c781e0855f4bc96b2c240bb5dbca27e5 Mon Sep 17 00:00:00 2001
From: Kaushal Powar <90775147+kaushalpowar@users.noreply.github.com>
Date: Thu, 18 Jan 2024 22:31:26 +0530
Subject: [PATCH] Update typo in README.md

There was a typo (pack -> back).

Old: Each expert per layer is offloaded separately and only brought pack to GPU when needed.

Changed: Each expert per layer is offloaded separately and only brought back to GPU when needed.
---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index a380f20..848d4b3 100644
--- a/README.md
+++ b/README.md
@@ -7,7 +7,7 @@ This project implements efficient inference of [Mixtral-8x7B models](https://mis
 In summary, we achieve efficient inference of Mixtral-8x7B models through a combination of techniques:
 
 * **Mixed quantization with HQQ**. We apply separate quantization schemes for attention layers and experts to fit the model into the combined GPU and CPU memory.
-* **MoE offloading strategy**. Each expert per layer is offloaded separately and only brought pack to GPU when needed. We store active experts in a LRU cache to reduce GPU-RAM communication when computing activations for adjacent tokens.
+* **MoE offloading strategy**. Each expert per layer is offloaded separately and only brought back to GPU when needed. We store active experts in a LRU cache to reduce GPU-RAM communication when computing activations for adjacent tokens.
 
 For more detailed information about our methods and results, please refer to our [tech-report](https://arxiv.org/abs/2312.17238).
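
For context on the sentence being corrected: the README bullet describes keeping recently used experts resident on the GPU and evicting the least recently used ones back to CPU RAM. The sketch below only illustrates that LRU idea in PyTorch; it is not the project's actual implementation, and the class and method names (`ExpertLRUCache`, `get`) are made up for this example.

```python
import collections

import torch
import torch.nn as nn


class ExpertLRUCache:
    """Toy LRU cache keeping at most `capacity` experts on the GPU.

    Illustrative sketch only: experts are stand-in nn.Module objects and
    "offloading" is just .to("cpu") / .to(device).
    """

    def __init__(self, experts: dict, capacity: int, device: str = "cuda"):
        self.experts = experts          # expert_id -> nn.Module, stored on CPU
        self.capacity = capacity
        self.device = device
        self.resident = collections.OrderedDict()  # expert_id -> module on GPU

    def get(self, expert_id) -> nn.Module:
        # Cache hit: mark as most recently used and reuse the GPU copy.
        if expert_id in self.resident:
            self.resident.move_to_end(expert_id)
            return self.resident[expert_id]
        # Cache miss: evict the least recently used expert back to CPU RAM...
        if len(self.resident) >= self.capacity:
            _evicted_id, evicted = self.resident.popitem(last=False)
            evicted.to("cpu")
        # ...and bring the requested expert to the GPU when it is needed.
        module = self.experts[expert_id].to(self.device)
        self.resident[expert_id] = module
        return module


if __name__ == "__main__":
    device = "cuda" if torch.cuda.is_available() else "cpu"
    # Eight tiny stand-in "experts"; real Mixtral experts are large MLP blocks.
    experts = {i: nn.Linear(16, 16) for i in range(8)}
    cache = ExpertLRUCache(experts, capacity=2, device=device)
    x = torch.randn(1, 16, device=device)
    for expert_id in [0, 1, 0, 3]:  # adjacent tokens tend to reuse experts
        y = cache.get(expert_id)(x)
    print("ok", y.shape)
```

Because adjacent tokens tend to route to overlapping sets of experts, a hit in such a cache avoids a CPU-to-GPU transfer, which is the communication saving the README bullet refers to.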