
Commit

clean readme
wnma3mz committed Feb 9, 2025
1 parent fa6fcb4 commit 229243c
Showing 6 changed files with 2 additions and 232 deletions.
106 changes: 0 additions & 106 deletions Performance.md

This file was deleted.

19 changes: 0 additions & 19 deletions README.md
@@ -74,10 +74,6 @@
- [X] Engine
- [X] mlx
- [X] torch
- [ ] tinygrad
- [ ] Multi-Request
- [ ] Jit
- [ ] Pipeline
- [X] Communication
- [X] grpc
- [X] Auto Find Node
@@ -95,20 +91,5 @@ In Mac Mini M4

| | `mlx-community/Llama-3.2-1B-Instruct-4bit` | `mlx-community/Llama-3.2-1B-Instruct` | `mlx-community/Meta-Llama-3.1-8B-Instruct-4bit` | `mlx-community/Meta-Llama-3.1-8B-Instruct-bf16` |
| ------------------------------------ | -------------------------------------------- | --------------------------------------- | ------------------------------------------------- | ------------------------------------------------- |
| Mac Mini M4 (16G) (Engine, Baseline) | 98.10 tok/s | 35.45 tok/s | 20.68 tok/s | Out of memory |
| Mac Mini M4 (16G) (Local) | 45.36 tok/s | 23.60 tok/s | 15.80 tok/s | Out of memory |
| Mac Mini M4 (16G) (Server+Client) | 61.83 tok/s | 34.54 tok/s | 14.91 tok/s | Out of memory |
| Mac Mini M4 (16G) + M3 Pro (18G) | | 16.33 tok/s | 11.06 tok/s | 5.64 tok/s |

Q: Why is Local slower than Server+Client?

A:

- Local runs as a single process: the HTTP server, Engine, and Model all live in one process.
- Server+Client uses two processes: the Server contains the HTTP server and Engine, plus the Embedding and LM Head; the Client contains only the Model.

It is unclear why `mlx-community/Meta-Llama-3.1-8B-Instruct-4bit` does not follow the same pattern; for now this is attributed to memory pressure.

Q: Why is the Mac Mini M4 (16G) + M3 Pro (18G) row slow?

A: Ideally it would match Mac Mini M4 (16G) (Server+Client), but communication overhead dominates: latency adds a fixed cost to every generated token, even on a local network.
19 changes: 0 additions & 19 deletions README_EN.md
@@ -73,10 +73,6 @@ For multi-machine deployment, the default part of the port will be used for runn
- [X] Engine
- [X] mlx
- [X] torch
- [ ] tinygrad
- [ ] Multi-Request
- [ ] Jit
- [ ] Pipeline
- [X] Communication
- [X] grpc
- [X] Auto Find Node
@@ -94,20 +90,5 @@ In Mac Mini M4

| | `mlx-community/Llama-3.2-1B-Instruct-4bit` | `mlx-community/Llama-3.2-1B-Instruct` | `mlx-community/Meta-Llama-3.1-8B-Instruct-4bit` | `mlx-community/Meta-Llama-3.1-8B-Instruct-bf16` |
| ------------------------------------ | -------------------------------------------- | --------------------------------------- | ------------------------------------------------- | ------------------------------------------------- |
| Mac Mini M4 (16G) (Engine, Baseline) | 98.10 tok/s | 35.45 tok/s | 20.68 tok/s | Out of memory |
| Mac Mini M4 (16G) (Local) | 45.36 tok/s | 23.60 tok/s | 15.80 tok/s | Out of memory |
| Mac Mini M4 (16G) (Server+Client) | 61.83 tok/s | 34.54 tok/s | 14.91 tok/s | Out of memory |
| Mac Mini M4 (16G) + M3 Pro (18G) | | 16.33 tok/s | 11.06 tok/s | 5.64 tok/s |

Q: Why is Local slower than Server+Client?

A:

- Local runs as a single process: the HTTP server, Engine, and Model all live in one process.
- Server+Client uses two processes: the Server contains the HTTP server and Engine, plus the Embedding and LM Head; the Client contains only the Model.

It is unclear why `mlx-community/Meta-Llama-3.1-8B-Instruct-4bit` does not follow the same pattern; for now this is attributed to memory pressure.

Q: Why is the performance of Mac Mini M4 (16G) + M3 Pro (18G) slow?

A: Ideally it would match Mac Mini M4 (16G) (Server+Client), but communication overhead dominates: latency adds a fixed cost to every generated token, even on a local network.
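
To make the split concrete, below is a toy sketch of the Server+Client layout described above. The `multiprocessing` queues stand in for the real gRPC transport, and the embedding, LM head, and decoder blocks are trivial placeholders; none of these names come from the tllm codebase.

```python
# Toy sketch only: two processes, queues instead of gRPC, placeholder "model" parts.
import multiprocessing as mp


def client_proc(in_q: mp.Queue, out_q: mp.Queue) -> None:
    # Client process: only the decoder blocks (the "Model") live here.
    decoder_blocks = lambda hidden: [v * 2 for v in hidden]   # placeholder forward pass
    while (hidden := in_q.get()) is not None:                 # None signals shutdown
        out_q.put(decoder_blocks(hidden))


def server_loop(in_q: mp.Queue, out_q: mp.Queue, prompt_ids, max_new_tokens=4):
    # Server process: HTTP serving (omitted), Engine, Embedding, and LM Head.
    embedding = lambda ids: [float(i) for i in ids]           # placeholder embedding
    lm_head = lambda hidden: int(sum(hidden)) % 100           # placeholder LM head
    token_ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        in_q.put(embedding(token_ids))                        # ship hidden states to the Client
        hidden = out_q.get()                                  # decoder output comes back
        token_ids.append(lm_head(hidden))                     # greedy "sampling"
    in_q.put(None)                                            # stop the Client
    return token_ids


if __name__ == "__main__":
    in_q, out_q = mp.Queue(), mp.Queue()
    client = mp.Process(target=client_proc, args=(in_q, out_q))
    client.start()
    print(server_loop(in_q, out_q, prompt_ids=[1, 2, 3]))
    client.join()
```

The extra hop per token is what the tables above reflect: between local processes it is cheap, but across machines the round trip per generated token becomes the dominant cost.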
86 changes: 0 additions & 86 deletions RoadMap.md

This file was deleted.

2 changes: 1 addition & 1 deletion tllm/__init__.py
@@ -9,7 +9,7 @@ class BackendEnum(Enum):


ENABLE_PREFILL_CACHE = os.environ.get("TLLM_ENABLE_PREFILL_CACHE", "true").lower() == "true"

ENABLE_PREFILL_CACHE = False
if importlib.util.find_spec("mlx"):
BACKEND = BackendEnum.MLX
elif importlib.util.find_spec("torch"):
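
As a usage note, the flag above is read from the `TLLM_ENABLE_PREFILL_CACHE` environment variable when `tllm/__init__.py` is imported, so it must be set beforehand. A minimal sketch, assuming the hard-coded `ENABLE_PREFILL_CACHE = False` override shown in the diff is absent and that an mlx or torch backend is installed:

```python
import os

# The flag is evaluated at import time in tllm/__init__.py, so set it first.
os.environ["TLLM_ENABLE_PREFILL_CACHE"] = "false"

import tllm

# Without the hard-coded override, this mirrors the environment variable.
print(tllm.ENABLE_PREFILL_CACHE)  # False
```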
2 changes: 1 addition & 1 deletion tllm/models/mlx/layers.py
@@ -284,7 +284,7 @@ def __call__(self, x: mx.array, mask, cache) -> mx.array:
r = self.self_attn(self.input_layernorm(x), mask, cache)
h = x + r
# do not skip the first few tokens; only skip middle blocks, https://arxiv.org/abs/2404.03865
# if 20 <= self.layer_idx <= 24 and x.shape[0] == 1:
# if 24 <= self.layer_idx <= 28 and x.shape[0] == 1:
# return h
r = self.mlp(self.post_attention_layernorm(h))
out = h + r
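
For reference, the commented-out lines above sketch middle-block skipping (the `x.shape[0] == 1` check restricts it to single-token decode steps, so prefill always runs the full depth). Below is a toy numpy version of the same idea; `ToyDecoderLayer` and its weights are illustrative stand-ins, not the real tllm/mlx classes.

```python
import numpy as np


class ToyDecoderLayer:
    """Stand-in decoder layer showing middle-block skipping during decode."""

    def __init__(self, layer_idx: int, dim: int = 8, skip_range=(24, 28)):
        self.layer_idx = layer_idx
        self.skip_range = skip_range                      # layers whose MLP is skipped
        rng = np.random.default_rng(layer_idx)
        self.w_attn = rng.normal(size=(dim, dim)) * 0.01  # placeholder "attention"
        self.w_mlp = rng.normal(size=(dim, dim)) * 0.01   # placeholder "MLP"

    def __call__(self, x: np.ndarray) -> np.ndarray:
        h = x + x @ self.w_attn                           # attention + residual
        lo, hi = self.skip_range
        # Skip this layer's MLP only for single-token decode steps (x.shape[0] == 1),
        # never during prefill, so the prompt still passes through every block.
        if lo <= self.layer_idx <= hi and x.shape[0] == 1:
            return h
        return h + h @ self.w_mlp                         # MLP + residual


layers = [ToyDecoderLayer(i) for i in range(32)]
hidden = np.ones((1, 8))                                  # a single decode-step token
for layer in layers:
    hidden = layer(hidden)
print(hidden.shape)                                       # (1, 8)
```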
