# ScaleLLM: An efficient LLM Inference solution
[Apache 2.0 License](https://opensource.org/licenses/Apache-2.0) [GitHub Stars](https://github.com/vectorch-ai/ScaleLLM/stargazers) [Build and Test](https://github.com/vectorch-ai/ScaleLLM/actions/workflows/build.yml)
[ScaleLLM]() is a cutting-edge inference system engineered for large language models (LLMs), meticulously designed to meet the demands of production environments. It extends its support to a wide range of popular open-source models, including [Llama3](https://github.com/meta-llama/llama3), [Gemma](https://github.com/google-deepmind/gemma), Bloom, GPT-NeoX, and more.
ScaleLLM is currently undergoing active development. We are fully committed to consistently enhancing its efficiency while also incorporating additional features. Feel free to explore our [**_Roadmap_**](https://github.com/vectorch-ai/ScaleLLM/issues/84) for more details.
In the coming weeks, we have exciting plans to focus on [**_speculative decoding_**](https://github.com/orgs/vectorch-ai/projects/1) and [**_stateful conversation_**](https://github.com/orgs/vectorch-ai/projects/2), alongside further kernel optimizations. We appreciate your understanding and look forward to delivering an even better solution.
## News:
* [03/2024] - [Advanced feature](https://github.com/vectorch-ai/ScaleLLM/releases/tag/v0.0.7) support for CUDA graph, [dynamic prefix cache](), [dynamic chunked prefill](), and [speculative decoding]().
* [11/2023] - [First release](https://github.com/vectorch-ai/ScaleLLM/releases/tag/v0.0.1) with support for popular [open-source models](#supported-models).
## Key Features
- [High Efficiency](): Excels in high-performance LLM inference, leveraging state-of-the-art techniques and technologies like [Flash Attention](https://github.com/Dao-AILab/flash-attention), [Paged Attention](https://github.com/vllm-project/vllm), [Continuous batching](https://www.anyscale.com/blog/continuous-batching-llm-inference), and more.
- [Tensor Parallelism](): Utilizes tensor parallelism for efficient model execution.
- [OpenAI-compatible API](): An efficient [golang](https://en.wikipedia.org/wiki/Go_(programming_language)) REST API server that is compatible with the OpenAI API.
- [Huggingface models](): Seamless integration with most popular [HF models](#supported-models), supporting safetensors.
- [Customizable](): Offers flexibility for customization to meet your specific needs, and provides an easy way to add new models.
- [Production Ready](): Engineered with production environments in mind, ScaleLLM is equipped with robust system monitoring and management features to ensure a seamless deployment experience.
## Table of contents
- [Supported Models](#supported-models)
- [Get Started](#get-started)
- [ScaleLLM server](#scalellm-server)
- [Acknowledgements](#acknowledgements)
- [License](#license)
## Supported Models
Please note that in order to use Yi models, you need to add `--model_type=Yi` to the command line. For example:
```bash
# The model ID below is a placeholder; substitute the Yi model you want to serve.
docker run -it --gpus=all --net=host --shm-size=1g \
  -e HF_MODEL_ID=01-ai/Yi-34B-Chat \
  docker.io/vectorchai/scalellm:latest --model_type=Yi
```
| Models | Tensor Parallel | Quantization | Chat API | HF models examples |
| :----: | :-------------: | :----------: | :------: | :----------------: |
|  Phi2  |       Yes       |     Yes      |    No    | [microsoft/phi-2](https://huggingface.co/microsoft/phi-2) |
You can download and install Docker from the official website: [Docker Installation](https://docs.docker.com/engine/install/).
Once you have Docker installed, you can run the ScaleLLM Docker container with the [latest image](https://hub.docker.com/r/vectorchai/scalellm/tags) using the following command:
```bash
docker pull docker.io/vectorchai/scalellm:latest
# The model ID and device below are illustrative values; adjust them to your setup.
docker run -it --gpus=all --net=host --shm-size=1g \
  -e HF_MODEL_ID=meta-llama/Meta-Llama-3-8B-Instruct \
  -e DEVICE=auto \
  docker.io/vectorchai/scalellm:latest
```

This command starts the Docker container with GPU support and various configuration options:
- `HF_MODEL_REVISION` specifies which Hugging Face model revision you want to run. By default, it is set to `"main"`.
- `DEVICE` specifies the device on which this model should run. By default, it is set to `"auto"`, using all available GPUs. You can also select particular GPUs with `"cuda:0,cuda:1"`, or run on CPU with `"cpu"`.
- `HF_MODEL_ALLOW_PATTERN` specifies which types of files are allowed to be downloaded. By default, it is configured automatically based on tensor type. Only use this option if the default configuration does not work for you.
- `HUGGING_FACE_HUB_TOKEN` specifies the token from [huggingface](https://huggingface.co/settings/tokens) for gated models: `-e HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN`.
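As an illustration, a run that pins a model revision, selects a single GPU, and passes a token for a gated model could look like the following (a minimal sketch; `HF_MODEL_ID` and its value are placeholders, and the remaining variables mirror the options above):

```bash
# Placeholder model ID; the env vars follow the configuration options listed above.
docker run -it --gpus=all --net=host --shm-size=1g \
  -e HF_MODEL_ID=meta-llama/Meta-Llama-3-8B-Instruct \
  -e HF_MODEL_REVISION=main \
  -e DEVICE=cuda:0 \
  -e HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN \
  docker.io/vectorchai/scalellm:latest
```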
> **Warning**<br />
> * The docker image with tag '[latest](https://hub.docker.com/r/vectorchai/scalellm/tags)' may be updated to a new version upon each release. To use a fixed version, pull the image with a specific tag.
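For example, to pin the v0.0.7 release mentioned in the news above (the exact tag name is an assumption; check the [tags page](https://hub.docker.com/r/vectorchai/scalellm/tags) for available tags):

```bash
# Pull a pinned release tag instead of the moving 'latest' tag.
docker pull docker.io/vectorchai/scalellm:v0.0.7
```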
After running the Docker container, two ports are exposed.
You can also start a REST API gateway with the [latest image](https://hub.docker.com/r/vectorchai/scalellm-gateway/tags) using the following command:
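The command below is a minimal sketch: the image name comes from the link above, while the `docker pull`/`docker run` pair and the `--net=host` flag are assumptions modeled on the server command earlier.

```bash
# Start the REST API gateway alongside the ScaleLLM server.
docker pull docker.io/vectorchai/scalellm-gateway:latest
docker run -it --net=host docker.io/vectorchai/scalellm-gateway:latest
```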
The REST API Server is available on `localhost:8080`. You can use REST API requests to interact with the system.
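For example, an OpenAI-style chat completion request could look like this (a sketch; the `/v1/chat/completions` path and the model name are assumptions based on the OpenAI-compatible API listed under Key Features):

```bash
# Send a chat completion request to the local REST API gateway.
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3-8B-Instruct",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```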
A local Chatbot UI is also available on [localhost:3000](http://localhost:3000). You can start it with the [latest image](https://hub.docker.com/r/vectorchai/chatbot-ui/tags) using the following command:
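A minimal sketch of that command, assuming the UI is pointed at the local REST API gateway on port 8080 (the `OPENAI_API_HOST` variable and its value are assumptions):

```bash
docker pull docker.io/vectorchai/chatbot-ui:latest
# Point the UI at the REST API gateway started above.
docker run -it --net=host \
  -e OPENAI_API_HOST=http://127.0.0.1:8080 \
  docker.io/vectorchai/chatbot-ui:latest
```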