diff --git a/.gitignore b/.gitignore
index 6dc2abb..c2cdf6c 100644
--- a/.gitignore
+++ b/.gitignore
@@ -8,4 +8,6 @@ __pycache__
 weights/
 .DS_Store
 *.png
-*.pt
\ No newline at end of file
+*.pt
+build/
+*.egg-info/
\ No newline at end of file
diff --git a/README.md b/README.md
index 0e67aa5..3e87ad6 100644
--- a/README.md
+++ b/README.md
@@ -4,45 +4,35 @@
 
 ### QuickStart
 
-1. download model from: https://huggingface.co/mlx-community/Llama-3.2-1B-Instruct-bf16
+1. install dependencies
 
-2. install dependencies
+- for mlx (macOS ARM): `pip install -e ".[mlx]"`
+- for nvidia: `pip install -e ".[torch]"`
 
-- for mlx: `pip install -r requirements-mlx.txt`
-- for nvidia: `pip install -r requirements-cuda.txt`
-- for intel: `pip install -r requirements.txt`
+2. run server
 
-3. run server
+   2.1 (no communication)
 
-   3.1 (no communication)
+   ```bash
+   tllm.server --model_path mlx-community/Llama-3.2-1B-Instruct-4bit --is_local
+   ```
 
-   - edit `examples/run_single_server.sh`
+   2.2 (with communication)
 
-   ```bash
-   bash examples/run_single_server.sh
-   ```
+   ```bash
+   # first in one terminal
+   tllm.server --model_path mlx-community/Llama-3.2-1B-Instruct-4bit --hostname $YOUR_IP
 
-   3.2 (with communication)
+   # in another terminal
+   tllm.client --hostname $YOUR_IP
+   ```
+3. testing
 
-   - edit `examples/run_client.sh`
-
-   - edit `examples/run_server.sh`
-
-   ```bash
-   # first in one terminal
-   bash examples/run_server.sh
-
-   # in another terminal
-   bash examples/run_client.sh
-   ```
-
-4. testing
-
-```python
-python benchmarks/run_async_requests.py
+```bash
+python3 benchmarks/run_async_requests.py
 ```
 
-### Config
+### More Details
 
 In `examples/config.json`
 
@@ -69,34 +59,55 @@ In `examples/config.json`
 
 ### Features
 
-- [x] Support Multi-Requests
-- [x] Engine
-  - [x] mlx
-  - [x] torch
-  - [ ] tinygrad
-  - [ ] Multi-Request
-  - [ ] Jit
-  - [ ] Pipeline
-- [x] Communication
-  - [x] grpc
-  - [x] Auto Find Node
-  - [x] Simple Get Ip
-  - [x] Test Ping
-- [x] Attention
-  - [x] xformers
-  - [x] flash-attn
-  - [ ] PageAttention
+- [X] Support Multi-Requests
+- [X] Engine
+  - [X] mlx
+  - [X] torch
+  - [ ] tinygrad
+  - [ ] Multi-Request
+  - [ ] Jit
+  - [ ] Pipeline
+- [X] Communication
+  - [X] grpc
+  - [X] Auto Find Node
+  - [X] Simple Get Ip
+  - [X] Test Ping
+- [X] Attention
+  - [X] xformers
+  - [X] flash-attn
+  - [ ] PageAttention
 
 ### Performance
 
-For 1b
+On Mac Mini M4
 
-- mac mini m2
-![alt text](asserts/image.png)
+| Device | `mlx-community/Llama-3.2-1B-Instruct-4bit` | `mlx-community/Llama-3.2-1B-Instruct` | `mlx-community/Meta-Llama-3.1-8B-Instruct-4bit` |
+| -------------------- | -------------------------------------------- | --------------------------------------- | ------------------------------------------------- |
+| Mac Mini M4 | 98.10 tok/s | 35.45 tok/s | 20.68 tok/s |
+| Mac Mini M4 + M3 Pro | | | |
+
+For `mlx-community/Llama-3.2-1B-Instruct-4bit`:
+
+![1734779816425](image/README/1734779816425.png)
+For `mlx-community/Llama-3.2-1B-Instruct`:
+
+![1734779931105](image/README/1734779931105.png)
+
+For `mlx-community/Meta-Llama-3.1-8B-Instruct-4bit`:
+
+![1734779890405](image/README/1734779890405.png)
+
+Old version
+
+For `mlx-community/Llama-3.2-1B-Instruct`
+
+- mac mini m2
+  ![alt text](asserts/image.png)
 - m3 pro
-![alt text](asserts/image-1.png)
+  ![alt text](asserts/image-1.png)
 
 for 8b
 
-- m3 pro (layer=8) + mac mini m2 (layer=24)
-![alt text](asserts/image-2.png)
\ No newline at end of file
+
+- m3 pro (layer=8) + mac mini m2 (layer=24)
+  ![alt text](asserts/image-2.png)
diff --git a/examples/run_client.sh b/examples/run_client.sh
index 7d7e916..504428b 100644
--- a/examples/run_client.sh
+++ b/examples/run_client.sh
@@ -2,8 +2,5 @@
 # Address of the master, the node that handles model assignment requests
 MASTER_URL=http://mac-mini:8022
 
-export OMP_NUM_THREADS=8;
-export PYTHONPATH="./":$PYTHONPATH;
-
-python3 -m tllm.entrypoints.handler.handler --master_addr $MASTER_URL --is_debug
-# python3 -m tllm.entrypoints.handler.handler --master_addr $MASTER_URL --is_debug --config examples/config_one.json --client_idx 0
\ No newline at end of file
+tllm.client --master_addr $MASTER_URL --is_debug
+# tllm.client --master_addr $MASTER_URL --is_debug --config examples/config_one.json --client_idx 0
\ No newline at end of file
diff --git a/examples/run_engine.py b/examples/run_engine.py
index 8e22feb..9d18e73 100644
--- a/examples/run_engine.py
+++ b/examples/run_engine.py
@@ -28,9 +28,9 @@ def parse_args():
 
 @dataclass
 class Args:
-    # model_path: str = "/Users/lujianghu/Documents/Llama-3.2-3B-Instruct"
+    model_path: str = "/Users/lujianghu/Documents/Llama-3.2-1B-Instruct"
     # model_path: str = "mlx-community/Meta-Llama-3.1-8B-Instruct-4bit"
-    model_path: str = "mlx-community/Llama-3.2-1B-Instruct-4bit"
+    # model_path: str = "mlx-community/Llama-3.2-1B-Instruct-4bit"
     # model_path: str = "/Users/lujianghu/Documents/flux/schnell_4bit"
     # model_path: str = "Qwen/Qwen2.5-0.5B-Instruct"
     # model_path: str = "Qwen/Qwen2-VL-2B-Instruct"
@@ -77,7 +77,7 @@ async def llm_generate(args, messages):
     messages = [{"role": "user", "content": "Hello, how are you?"}]
 
     openai_serving_chat = OpenAIServing(engine, args)
-    request = ChatCompletionRequest(model="test", messages=messages)
+    request = ChatCompletionRequest(model="test", messages=messages, max_tokens=100)
     response = await openai_serving_chat.create_chat_completion(request, None)
     print(response)
 
diff --git a/examples/run_server.sh b/examples/run_server.sh
index 38545a5..b42db38 100644
--- a/examples/run_server.sh
+++ b/examples/run_server.sh
@@ -4,7 +4,5 @@
 MODEL_PATH=Qwen/Qwen2-VL-2B-Instruct
 MASTER_HOSTNAME=mac-mini
 
-export PYTHONPATH="./":$PYTHONPATH;
-
-python3 -m tllm.entrypoints.api_server --hostname $MASTER_HOSTNAME --model_path $MODEL_PATH --is_debug
-# python3 -m tllm.entrypoints.api_server --hostname $MASTER_HOSTNAME --model_path $MODEL_PATH --is_debug --config examples/config_one.json
\ No newline at end of file
+tllm.server --model_path $MODEL_PATH --hostname $MASTER_HOSTNAME --is_debug
+# tllm.server --hostname $MASTER_HOSTNAME --model_path $MODEL_PATH --is_debug --config examples/config_one.json
\ No newline at end of file
diff --git a/examples/run_single_server.sh b/examples/run_single_server.sh
index 606bd92..77383cd 100644
--- a/examples/run_single_server.sh
+++ b/examples/run_single_server.sh
@@ -3,8 +3,4 @@
 MODEL_PATH=/Users/lujianghu/Documents/Llama-3.2-1B-Instruct
 # MODEL_PATH=Qwen/Qwen2-VL-2B-Instruct
 # MODEL_PATH=mlx-community/Meta-Llama-3.1-8B-Instruct-4bit
-export PYTHONPATH="./":$PYTHONPATH;
-
-python3 -m tllm.entrypoints.api_server --model_path $MODEL_PATH --is_local --is_debug
-
-
+tllm.server --model_path $MODEL_PATH --is_local --is_debug
\ No newline at end of file
diff --git a/flux_examples/run_client.sh b/flux_examples/run_client.sh
index e45433a..1ef9451 100644
--- a/flux_examples/run_client.sh
+++ b/flux_examples/run_client.sh
@@ -2,7 +2,4 @@
 # Address of the master, the node that handles model assignment requests
 MASTER_URL=http://mac-mini:8022
 
-export OMP_NUM_THREADS=8;
-export PYTHONPATH="./":$PYTHONPATH;
-
-python3 -m tllm.entrypoints.handler.handler --master_addr $MASTER_URL --is_debug
+tllm.client --master_addr $MASTER_URL --is_debug
diff --git a/flux_examples/run_server.sh b/flux_examples/run_server.sh
index 2b8ef87..1702dfa 100644
--- a/flux_examples/run_server.sh
+++ b/flux_examples/run_server.sh
@@ -3,6 +3,4 @@
 MODEL_PATH=/Users/lujianghu/Documents/flux/schnell_4bit
 MASTER_HOSTNAME=mac-mini
 
-export PYTHONPATH="./":$PYTHONPATH;
-
-python3 -m tllm.entrypoints.api_server --hostname $MASTER_HOSTNAME --model_path $MODEL_PATH --client_size 1 --is_debug
\ No newline at end of file
+tllm.server --model_path $MODEL_PATH --hostname $MASTER_HOSTNAME --client_size 1 --is_debug
\ No newline at end of file
diff --git a/flux_examples/run_single_server.sh b/flux_examples/run_single_server.sh
index ae6ddae..a54fe5d 100644
--- a/flux_examples/run_single_server.sh
+++ b/flux_examples/run_single_server.sh
@@ -1,7 +1,6 @@
 #!/bin/bash
 MODEL_PATH=/Users/lujianghu/Documents/flux/schnell_4bit
 
-export PYTHONPATH="./":$PYTHONPATH;
-python3 -m tllm.entrypoints.api_server --model_path $MODEL_PATH --client_size 1 --is_local --is_debug --is_image
+tllm.server --model_path $MODEL_PATH --client_size 1 --is_local --is_debug --is_image
 
 
diff --git a/image/README/1734779816425.png b/image/README/1734779816425.png
new file mode 100644
index 0000000..99c3a96
Binary files /dev/null and b/image/README/1734779816425.png differ
diff --git a/image/README/1734779890405.png b/image/README/1734779890405.png
new file mode 100644
index 0000000..5b2741b
Binary files /dev/null and b/image/README/1734779890405.png differ
diff --git a/image/README/1734779931105.png b/image/README/1734779931105.png
new file mode 100644
index 0000000..553430f
Binary files /dev/null and b/image/README/1734779931105.png differ
diff --git a/requirements-mlx.txt b/requirements-mlx.txt
index ac3d0d4..7a17cd2 100644
--- a/requirements-mlx.txt
+++ b/requirements-mlx.txt
@@ -13,8 +13,6 @@ gradio
 psutil
 grpcio==1.68.1
 lz4==4.3.3
-mlx
-mlx_lm==0.19.2
 protobuf==5.28.3
 pydantic==2.9.2
 transformers==4.46.0
\ No newline at end of file
diff --git a/requirements/base.txt b/requirements/base.txt
new file mode 100644
index 0000000..19b2341
--- /dev/null
+++ b/requirements/base.txt
@@ -0,0 +1,18 @@
+aiohttp
+fastapi
+numpy
+requests
+tabulate
+tqdm
+typing_extensions
+uvicorn
+websockets
+pillow
+huggingface_hub
+psutil
+gradio==5.4.0
+grpcio==1.68.1
+lz4==4.3.3
+protobuf==5.28.3
+pydantic==2.9.2
+transformers==4.46.0
\ No newline at end of file
diff --git a/requirements/mlx.txt b/requirements/mlx.txt
new file mode 100644
index 0000000..d747310
--- /dev/null
+++ b/requirements/mlx.txt
@@ -0,0 +1,2 @@
+mlx
+mlx_lm==0.19.2
\ No newline at end of file
diff --git a/requirements/torch.txt b/requirements/torch.txt
new file mode 100644
index 0000000..76f11f1
--- /dev/null
+++ b/requirements/torch.txt
@@ -0,0 +1 @@
+vllm
\ No newline at end of file
diff --git a/setup.py b/setup.py
new file mode 100644
index 0000000..34cfecf
--- /dev/null
+++ b/setup.py
@@ -0,0 +1,68 @@
+from setuptools import find_packages, setup
+
+# Base dependencies
+install_requires = [
+    "aiohttp",
+    "fastapi",
+    "numpy",
+    "requests",
+    "tabulate",
+    "tqdm",
+    "typing_extensions",
+    "uvicorn",
+    "websockets",
+    "pillow",
+    "huggingface_hub",
+    "gradio",
+    "psutil",
+    "grpcio==1.68.1",
+    "lz4==4.3.3",
+    "protobuf==5.28.3",
+    "pydantic==2.9.2",
+    "transformers==4.46.0",
+]
+
+# Platform-specific dependencies
+mlx_requires = ["mlx", "mlx_lm==0.19.2"]
+
+tinygrad_requires = [
+    "tinygrad",
+]
+
+torch_requires = [
+    "vllm",
+]
+
+# Optional feature dependencies
+extras_require = {
+    "mlx": mlx_requires,
+    # 'tinygrad': tinygrad_requires,
+    "torch": torch_requires,
+    "all": mlx_requires + torch_requires,  # install everything (may not work on some platforms)
+    "dev": [
+        "black",
+        "isort",
+    ],
+}
+
+setup(
+    name="tllm",
+    version="0.1.0",
+    packages=find_packages(),
+    install_requires=install_requires,
+    extras_require=extras_require,
+    python_requires=">=3.9",  # minimum required Python version
+    classifiers=[
+        "Programming Language :: Python :: 3",
+        "Programming Language :: Python :: 3.9",
+        "Programming Language :: Python :: 3.10",
+        "Programming Language :: Python :: 3.11",
+        "Programming Language :: Python :: 3.12",
+    ],
+    entry_points={
+        "console_scripts": [
+            "tllm.server=tllm.entrypoints.api_server:main",
+            "tllm.client=tllm.entrypoints.handler.handler:main",
+        ],
+    },
+)
diff --git a/tllm/entrypoints/api_server.py b/tllm/entrypoints/api_server.py
index 8235969..cac8b31 100644
--- a/tllm/entrypoints/api_server.py
+++ b/tllm/entrypoints/api_server.py
@@ -220,6 +220,11 @@ async def run_server(args) -> None:
     await shutdown_task
 
 
-if __name__ == "__main__":
+def main():
+    global args
     args = parse_master_args()
     asyncio.run(run_server(args))
+
+
+if __name__ == "__main__":
+    main()
diff --git a/tllm/entrypoints/handler/handler.py b/tllm/entrypoints/handler/handler.py
index 58a70ee..9a3b17f 100644
--- a/tllm/entrypoints/handler/handler.py
+++ b/tllm/entrypoints/handler/handler.py
@@ -184,6 +184,10 @@ async def run(args):
     await rpc_servicer.stop()
 
 
-if __name__ == "__main__":
+def main():
     args = parse_handler_args()
     asyncio.run(run(args))
+
+
+if __name__ == "__main__":
+    main()
diff --git a/tllm/entrypoints/utils.py b/tllm/entrypoints/utils.py
index 95910b3..e7a85c9 100644
--- a/tllm/entrypoints/utils.py
+++ b/tllm/entrypoints/utils.py
@@ -13,20 +13,52 @@ def parse_master_args():
     parser = argparse.ArgumentParser()
 
-    parser.add_argument("--model_path", type=str, required=True)
-    parser.add_argument("--hostname", type=str, required=False)
-    parser.add_argument("--grpc_port", type=int, default=None)
-    parser.add_argument("--http_port", type=int, default=8022)
-    parser.add_argument("--config", type=str, default=None, help="config file path")
+    parser.add_argument(
+        "--model_path",
+        type=str,
+        required=True,
+        help="Path of the model file or Hugging Face repo, e.g. mlx-community/Llama-3.2-1B-Instruct-bf16",
+    )
+    parser.add_argument("--hostname", type=str, help="The address used for client connections.")
+    parser.add_argument(
+        "--grpc_port",
+        type=int,
+        default=None,
+        help="Port used by the gRPC service. If not provided, it defaults to None and the actual port is chosen by the program at runtime.",
+    )
+    parser.add_argument(
+        "--http_port",
+        type=int,
+        default=8022,
+        help="Port used by the HTTP service. Defaults to 8022.",
+    )
+    parser.add_argument(
+        "--config",
+        type=str,
+        default=None,
+        help="Path of an optional configuration file that controls the program's behavior. Not set by default.",
+    )
     parser.add_argument(
         "--client_size",
         type=int,
         default=None,
-        help="the number of the client, if not provided, will be parsed from the model path and auto calculated",
+        help="The number of clients. If not provided, the value is automatically calculated from the model path.",
+    )
+    parser.add_argument(
+        "--is_local",
+        action="store_true",
+        help="A boolean flag. If specified, the model runs locally only.",
+    )
+    parser.add_argument(
+        "--is_debug",
+        action="store_true",
+        help="A boolean flag that turns on debug mode. If specified, the program prints more logs.",
+    )
+    parser.add_argument(
+        "--is_image",
+        action="store_true",
+        help="A boolean flag. If specified, the text-to-image (image generation) service is started.",
     )
-    parser.add_argument("--is_local", action="store_true")
-    parser.add_argument("--is_debug", action="store_true")
-    parser.add_argument("--is_image", action="store_true")
 
     return parser.parse_args()
 
 