
Commit

update en readme
wnma3mz committed Feb 2, 2025
1 parent cf7fbfb commit 868d4e5
Showing 3 changed files with 47 additions and 21 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -1,4 +1,4 @@
## together-LLM
## Together-LLM

[English](README_EN.md) | [中文](README.md)

56 changes: 41 additions & 15 deletions README_EN.md
@@ -1,10 +1,10 @@
## together-LLM
## Together-LLM

[English](README_EN.md) | [中文](README.md)

Cross-Machine Inference LLM Framework

### Begin quickly!
### Quick Start

1. Install dependencies

@@ -16,10 +16,10 @@ This machine is running: `python3 ./run_engine.py --model_path mlx-community/Lla
2. Start HTTP service

- Single machine: `tllm.server --model_path mlx-community/Llama-3.2-1B-Instruct-4bit`
-
- Multiple machines:
  - Start a server for a service: `tllm.server --model_path mlx-community/Llama-3.2-1B-Instruct-4bit --hostname $YOUR_IP`
  - In another terminal, start the client: `tllm.client --hostname http://$YOUR_IP:8022`

- Multi-machine:
  - Start a server in a terminal: `tllm.server --model_path mlx-community/Llama-3.2-1B-Instruct-4bit --hostname $YOUR_IP`
  - Start a client in another terminal: `tllm.client --hostname http://$YOUR_IP:8022`

3. Test HTTP service
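
As an illustration of step 3, here is a minimal request sketch in Python. It assumes the server exposes an OpenAI-style `/v1/chat/completions` endpoint on the HTTP port (8022 by default); the endpoint path and payload shape are assumptions not confirmed by this README, so adjust them to whatever the service actually serves.

```python
# Minimal test sketch for step 3. The endpoint path and payload shape are
# assumptions (OpenAI-style chat completions); adjust to the API the server
# actually exposes.
import requests

BASE_URL = "http://localhost:8022"  # server.http_port from examples/config.json

payload = {
    "model": "mlx-community/Llama-3.2-1B-Instruct-4bit",
    "messages": [{"role": "user", "content": "Hello, who are you?"}],
}

resp = requests.post(f"{BASE_URL}/v1/chat/completions", json=payload, timeout=60)
resp.raise_for_status()
print(resp.json())
```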

@@ -29,17 +29,43 @@ This machine is running: `python3 ./run_engine.py --model_path mlx-community/Lla

- Llama
- Qwen
- Janus Pro: Only supports MacOS platform
- Janus Pro: Currently only supports MacOS platform
  - Text to Text: `PYTHONPATH="./" python3 run_janus_pro.py --model_path wnma3mz/Janus-Pro-1B-4bit --message_type llm`
  - Image to Text: `PYTHONPATH="./" python3 run_janus_pro.py --model_path wnma3mz/Janus-Pro-1B-4bit --message_type mllm`
  - Text to Image: `PYTHONPATH="./" python3 run_janus_pro.py --model_path wnma3mz/Janus-Pro-1B-4bit --message_type image`
  - On MacOS, you need to install `pip install mlx-vlm==0.1.12`.
- Flux is currently only supported on MacOS. To use Flux, you will need to install `pip install mflux==0.4.1`.
- Qwen-VL: On MacOS platform, additional installation is required: `pip install mlx-vlm==0.1.12`.
- Flux: Currently only supports the MacOS platform; requires an additional installation: `pip install mflux==0.4.1`.

### Advanced

For multi-machine deployment, the default ports are used for running. If special requirements are needed, the configuration file `examples/config.json` can be modified.

For multi-machine deployment, default ports are used. If you have special requirements, you can change them in the configuration file `examples/config.json`.

```json
{
    "server": {
        "grpc_port": 25001,
        "http_port": 8022,
        "hostname": "mac-mini"
    },
    "client": [
        {
            "grpc_port": 25002,
            "hostname": "m3pro"
        },
        {
            "grpc_port": 25003,
            "hostname": "m3"
        }
    ]
}
```

- The number of clients determines the number of model splits.
- `server.grpc_port`: the server's gRPC port, used by each client to send status data and by the last client to send the computed result.
- `server.http_port`: the server's HTTP port, serving the API as well as the WebSocket service.
- `server.hostname`: the server's hostname; an IP address such as 192.168.1.10 can be used instead, as long as clients can reach it.
- `client.grpc_port`: the client's gRPC port.
- `client.hostname`: the client's hostname; ensure the server and other clients can reach it (see the sketch below).
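
As a rough sketch of how these fields fit together (this helper is not part of tllm; it only reads the `examples/config.json` shown above and prints the addresses each process would use):

```python
# Hedged helper (not part of tllm): read the examples/config.json shown above
# and print the addresses each process would use.
import json

with open("examples/config.json") as f:
    cfg = json.load(f)

server = cfg["server"]
print(f"HTTP API / WebSocket: http://{server['hostname']}:{server['http_port']}")
print(f"Server gRPC (status + final result): {server['hostname']}:{server['grpc_port']}")

# One model split per client entry.
for i, client in enumerate(cfg["client"]):
    print(f"Client {i} gRPC: {client['hostname']}:{client['grpc_port']}")
print(f"The model would be split into {len(cfg['client'])} parts.")
```

Run against the example config, this would report one HTTP endpoint, one server gRPC port, and two client gRPC ports, i.e. the model would be split into two parts.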

### Features

@@ -77,11 +103,11 @@ Q: Why is Local slower than Server+Client?

A:

- Local 只有一个进程,启动了 HTTP Serve, Engine Model 都在一个进程中
- Server+Client 是两个进程,Server 中包含了 HTTP Serve Engine,以及 Embedding LM HEADClient 中只有 Model
- Local runs as a single process: the HTTP server, the Engine, and the Model all live in one process.
- Server+Client runs as two processes: the Server contains the HTTP server and the Engine, as well as the Embedding and LM Head; the Client contains only the Model.

但不清楚,为什么 `mlx-community/Meta-Llama-3.1-8B-Instruct-4bit` 这个不大一样,暂时归因到内存压力上。
It is unclear why `mlx-community/Meta-Llama-3.1-8B-Instruct-4bit` behaves differently; for now this is attributed to memory pressure.

Q: Mac Mini M4 (16G) + M3 Pro (18G) 这一列速度为什么慢?
Q: Why is the performance of Mac Mini M4 (16G) + M3 Pro (18G) slow?

A: In an ideal scenario it would be equivalent to a Mac Mini M4 (16G) running Server+Client, but communication overhead accounts for a significant portion of the total cost: each generated token incurs some latency, even within a local network.
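
A back-of-the-envelope sketch of that overhead, with purely illustrative numbers (neither value is a measurement from this repository):

```python
# Back-of-the-envelope sketch of per-token communication overhead.
# Both numbers below are illustrative assumptions, not measurements.
per_token_compute_ms = 30.0   # assumed compute time per token on one machine
per_token_network_ms = 15.0   # assumed LAN round-trip cost per generated token
tokens = 256

single_machine_s = tokens * per_token_compute_ms / 1000
two_machines_s = tokens * (per_token_compute_ms + per_token_network_ms) / 1000
print(f"single machine: {tokens / single_machine_s:.1f} tok/s")
print(f"two machines  : {tokens / two_machines_s:.1f} tok/s")
```

Even a modest per-token round trip cuts throughput noticeably, which matches the observation that communication dominates once every generated token has to cross the network.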
10 changes: 5 additions & 5 deletions benchmarks/run_async_requests.py
@@ -26,10 +26,10 @@ async def requests_func(messages: List[Dict[str, Any]]):


async def main(messages_list: List[List[Dict[str, Any]]]):
# print("异步并发请求结果")
# s1 = time.time()
# await asyncio.gather(*[requests_func(messages) for messages in messages_list])
# print(f"time cost: {time.time() - s1:.4f} s")
print("异步并发请求结果")
s1 = time.time()
await asyncio.gather(*[requests_func(messages) for messages in messages_list])
print(f"time cost: {time.time() - s1:.4f} s")

print("单独请求结果")
s1 = time.time()
@@ -40,7 +40,7 @@ async def main(messages_list: List[List[Dict[str, Any]]]):


def load_message():
with open("asserts/debug_messages.json", "r") as f:
with open("asserts/messages.json", "r") as f:
messages_dict = json.load(f)
return messages_dict

