From 7eb34915c781e0855f4bc96b2c240bb5dbca27e5 Mon Sep 17 00:00:00 2001
From: Kaushal Powar <90775147+kaushalpowar@users.noreply.github.com>
Date: Thu, 18 Jan 2024 22:31:26 +0530
Subject: [PATCH] Update typo in README.md

There was a typo (pack -> back).

Old: Each expert per layer is offloaded separately and only brought pack to GPU when needed.

Changed: Each expert per layer is offloaded separately and only brought back to GPU when needed.
---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index a380f20..848d4b3 100644
--- a/README.md
+++ b/README.md
@@ -7,7 +7,7 @@ This project implements efficient inference of [Mixtral-8x7B models](https://mis
 In summary, we achieve efficient inference of Mixtral-8x7B models through a combination of techniques:
 
 * **Mixed quantization with HQQ**. We apply separate quantization schemes for attention layers and experts to fit the model into the combined GPU and CPU memory.
-* **MoE offloading strategy**. Each expert per layer is offloaded separately and only brought pack to GPU when needed. We store active experts in a LRU cache to reduce GPU-RAM communication when computing activations for adjacent tokens.
+* **MoE offloading strategy**. Each expert per layer is offloaded separately and only brought back to GPU when needed. We store active experts in a LRU cache to reduce GPU-RAM communication when computing activations for adjacent tokens.
 
 For more detailed information about our methods and results, please refer to our [tech-report](https://arxiv.org/abs/2312.17238).
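
For context on the sentence being corrected: the README bullet describes keeping recently used experts resident on the GPU and evicting the least recently used ones back to CPU RAM. The sketch below only illustrates that LRU idea in PyTorch; it is not the project's actual implementation, and the class and method names (`ExpertLRUCache`, `get`) are made up for this example.

```python
import collections

import torch
import torch.nn as nn


class ExpertLRUCache:
    """Toy LRU cache keeping at most `capacity` experts on the GPU.

    Illustrative sketch only: experts are stand-in nn.Module objects and
    "offloading" is just .to("cpu") / .to(device).
    """

    def __init__(self, experts: dict, capacity: int, device: str = "cuda"):
        self.experts = experts          # expert_id -> nn.Module, stored on CPU
        self.capacity = capacity
        self.device = device
        self.resident = collections.OrderedDict()  # expert_id -> module on GPU

    def get(self, expert_id) -> nn.Module:
        # Cache hit: mark as most recently used and reuse the GPU copy.
        if expert_id in self.resident:
            self.resident.move_to_end(expert_id)
            return self.resident[expert_id]
        # Cache miss: evict the least recently used expert back to CPU RAM...
        if len(self.resident) >= self.capacity:
            _evicted_id, evicted = self.resident.popitem(last=False)
            evicted.to("cpu")
        # ...and bring the requested expert to the GPU when it is needed.
        module = self.experts[expert_id].to(self.device)
        self.resident[expert_id] = module
        return module


if __name__ == "__main__":
    device = "cuda" if torch.cuda.is_available() else "cpu"
    # Eight tiny stand-in "experts"; real Mixtral experts are large MLP blocks.
    experts = {i: nn.Linear(16, 16) for i in range(8)}
    cache = ExpertLRUCache(experts, capacity=2, device=device)
    x = torch.randn(1, 16, device=device)
    for expert_id in [0, 1, 0, 3]:  # adjacent tokens tend to reuse experts
        y = cache.get(expert_id)(x)
    print("ok", y.shape)
```

Because adjacent tokens tend to route to overlapping sets of experts, a hit in such a cache avoids a CPU-to-GPU transfer, which is the communication saving the README bullet refers to.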