
Commit

clean readme
wnma3mz committed Feb 9, 2025
1 parent fa6fcb4 commit 229243c
Showing 6 changed files with 2 additions and 232 deletions.
106 changes: 0 additions & 106 deletions Performance.md

This file was deleted.

19 changes: 0 additions & 19 deletions README.md
@@ -74,10 +74,6 @@
- [X] Engine
- [X] mlx
- [X] torch
- [ ] tinygrad
- [ ] Multi-Request
- [ ] Jit
- [ ] Pipeline
- [X] Communication
- [X] grpc
- [X] Auto Find Node
@@ -95,20 +91,5 @@ In Mac Mini M4

| | `mlx-community/Llama-3.2-1B-Instruct-4bit` | `mlx-community/Llama-3.2-1B-Instruct` | `mlx-community/Meta-Llama-3.1-8B-Instruct-4bit` | `mlx-community/Meta-Llama-3.1-8B-Instruct-bf16` |
| ------------------------------------ | -------------------------------------------- | --------------------------------------- | ------------------------------------------------- | ------------------------------------------------- |
| Mac Mini M4 (16G) (Engine, Baseline) | 98.10 tok/s | 35.45 tok/s | 20.68 tok/s | Out of memory |
| Mac Mini M4 (16G) (Local) | 45.36 tok/s | 23.60 tok/s | 15.80 tok/s | Out of memory |
| Mac Mini M4 (16G) (Server+Client) | 61.83 tok/s | 34.54 tok/s | 14.91 tok/s | Out of memory |
| Mac Mini M4 (16G) + M3 Pro (18G) | | 16.33 tok/s | 11.06 tok/s | 5.64 tok/s |

Q: Why is Local slower than Server+Client?

A:

- Local runs as a single process: the HTTP server, Engine, and Model all live in one process.
- Server+Client uses two processes: the Server contains the HTTP server and Engine, plus the Embedding and LM Head; the Client contains only the Model.

It is unclear why `mlx-community/Meta-Llama-3.1-8B-Instruct-4bit` does not follow the same pattern; for now this is attributed to memory pressure.

Q: Why is the Mac Mini M4 (16G) + M3 Pro (18G) row slow?

A: Ideally it would match Mac Mini M4 (16G) (Server+Client), but communication overhead dominates: latency adds a fixed cost to every generated token, even on a local network.
19 changes: 0 additions & 19 deletions README_EN.md
@@ -73,10 +73,6 @@ For multi-machine deployment, the default part of the port will be used for runn
- [X] Engine
- [X] mlx
- [X] torch
- [ ] tinygrad
- [ ] Multi-Request
- [ ] Jit
- [ ] Pipeline
- [X] Communication
- [X] grpc
- [X] Auto Find Node
@@ -94,20 +90,5 @@ In Mac Mini M4

| | `mlx-community/Llama-3.2-1B-Instruct-4bit` | `mlx-community/Llama-3.2-1B-Instruct` | `mlx-community/Meta-Llama-3.1-8B-Instruct-4bit` | `mlx-community/Meta-Llama-3.1-8B-Instruct-bf16` |
| ------------------------------------ | -------------------------------------------- | --------------------------------------- | ------------------------------------------------- | ------------------------------------------------- |
| Mac Mini M4 (16G) (Engine, Baseline) | 98.10 tok/s | 35.45 tok/s | 20.68 tok/s | Out of memory |
| Mac Mini M4 (16G) (Local) | 45.36 tok/s | 23.60 tok/s | 15.80 tok/s | Out of memory |
| Mac Mini M4 (16G) (Server+Client) | 61.83 tok/s | 34.54 tok/s | 14.91 tok/s | Out of memory |
| Mac Mini M4 (16G) + M3 Pro (18G) | | 16.33 tok/s | 11.06 tok/s | 5.64 tok/s |

Q: Why is Local slower than Server+Client?

A:

- Local runs as a single process: the HTTP server, Engine, and Model all live in one process.
- Server+Client uses two processes: the Server contains the HTTP server and Engine, plus the Embedding and LM Head; the Client contains only the Model.

It is unclear why `mlx-community/Meta-Llama-3.1-8B-Instruct-4bit` does not follow the same pattern; for now this is attributed to memory pressure.

Q: Why is the performance of Mac Mini M4 (16G) + M3 Pro (18G) slow?

A: Ideally it would match Mac Mini M4 (16G) (Server+Client), but communication overhead dominates: latency adds a fixed cost to every generated token, even on a local network.
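
To make the split concrete, below is a toy sketch of the Server+Client layout described above. The `multiprocessing` queues stand in for the real gRPC transport, and the embedding, LM head, and decoder blocks are trivial placeholders; none of these names come from the tllm codebase.

```python
# Toy sketch only: two processes, queues instead of gRPC, placeholder "model" parts.
import multiprocessing as mp


def client_proc(in_q: mp.Queue, out_q: mp.Queue) -> None:
    # Client process: only the decoder blocks (the "Model") live here.
    decoder_blocks = lambda hidden: [v * 2 for v in hidden]   # placeholder forward pass
    while (hidden := in_q.get()) is not None:                 # None signals shutdown
        out_q.put(decoder_blocks(hidden))


def server_loop(in_q: mp.Queue, out_q: mp.Queue, prompt_ids, max_new_tokens=4):
    # Server process: HTTP serving (omitted), Engine, Embedding, and LM Head.
    embedding = lambda ids: [float(i) for i in ids]           # placeholder embedding
    lm_head = lambda hidden: int(sum(hidden)) % 100           # placeholder LM head
    token_ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        in_q.put(embedding(token_ids))                        # ship hidden states to the Client
        hidden = out_q.get()                                  # decoder output comes back
        token_ids.append(lm_head(hidden))                     # greedy "sampling"
    in_q.put(None)                                            # stop the Client
    return token_ids


if __name__ == "__main__":
    in_q, out_q = mp.Queue(), mp.Queue()
    client = mp.Process(target=client_proc, args=(in_q, out_q))
    client.start()
    print(server_loop(in_q, out_q, prompt_ids=[1, 2, 3]))
    client.join()
```

The extra hop per token is what the tables above reflect: between local processes it is cheap, but across machines the round trip per generated token becomes the dominant cost.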
86 changes: 0 additions & 86 deletions RoadMap.md

This file was deleted.

2 changes: 1 addition & 1 deletion tllm/__init__.py
@@ -9,7 +9,7 @@ class BackendEnum(Enum):


ENABLE_PREFILL_CACHE = os.environ.get("TLLM_ENABLE_PREFILL_CACHE", "true").lower() == "true"

ENABLE_PREFILL_CACHE = False
if importlib.util.find_spec("mlx"):
BACKEND = BackendEnum.MLX
elif importlib.util.find_spec("torch"):
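
As a usage note, the flag above is read from the `TLLM_ENABLE_PREFILL_CACHE` environment variable when `tllm/__init__.py` is imported, so it must be set beforehand. A minimal sketch, assuming the hard-coded `ENABLE_PREFILL_CACHE = False` override shown in the diff is absent and that an mlx or torch backend is installed:

```python
import os

# The flag is evaluated at import time in tllm/__init__.py, so set it first.
os.environ["TLLM_ENABLE_PREFILL_CACHE"] = "false"

import tllm

# Without the hard-coded override, this mirrors the environment variable.
print(tllm.ENABLE_PREFILL_CACHE)  # False
```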
2 changes: 1 addition & 1 deletion tllm/models/mlx/layers.py
@@ -284,7 +284,7 @@ def __call__(self, x: mx.array, mask, cache) -> mx.array:
r = self.self_attn(self.input_layernorm(x), mask, cache)
h = x + r
# do not skip the first few tokens; only skip middle blocks, https://arxiv.org/abs/2404.03865
# if 20 <= self.layer_idx <= 24 and x.shape[0] == 1:
# if 24 <= self.layer_idx <= 28 and x.shape[0] == 1:
# return h
r = self.mlp(self.post_attention_layernorm(h))
out = h + r
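
For reference, the commented-out lines above sketch middle-block skipping (the `x.shape[0] == 1` check restricts it to single-token decode steps, so prefill always runs the full depth). Below is a toy numpy version of the same idea; `ToyDecoderLayer` and its weights are illustrative stand-ins, not the real tllm/mlx classes.

```python
import numpy as np


class ToyDecoderLayer:
    """Stand-in decoder layer showing middle-block skipping during decode."""

    def __init__(self, layer_idx: int, dim: int = 8, skip_range=(24, 28)):
        self.layer_idx = layer_idx
        self.skip_range = skip_range                      # layers whose MLP is skipped
        rng = np.random.default_rng(layer_idx)
        self.w_attn = rng.normal(size=(dim, dim)) * 0.01  # placeholder "attention"
        self.w_mlp = rng.normal(size=(dim, dim)) * 0.01   # placeholder "MLP"

    def __call__(self, x: np.ndarray) -> np.ndarray:
        h = x + x @ self.w_attn                           # attention + residual
        lo, hi = self.skip_range
        # Skip this layer's MLP only for single-token decode steps (x.shape[0] == 1),
        # never during prefill, so the prompt still passes through every block.
        if lo <= self.layer_idx <= hi and x.shape[0] == 1:
            return h
        return h + h @ self.w_mlp                         # MLP + residual


layers = [ToyDecoderLayer(i) for i in range(32)]
hidden = np.ones((1, 8))                                  # a single decode-step token
for layer in layers:
    hidden = layer(hidden)
print(hidden.shape)                                       # (1, 8)
```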
