The original DeepSeek-R1 model has 671B parameters and requires either a single `p5en.48xlarge` or four `g6e.48xlarge` instances to run. Needless to say, this is extremely expensive, so the next best option is to run either one of the several distilled versions of the model provided by DeepSeek or a quantized version of the model; in either case the model size is much smaller and it can therefore be run on smaller instances. Whether these quantized or distilled models provide the same level of accuracy and reasoning capability as the original 671B model is a matter of benchmarking (see FMBench).
This repo provides a recipe to run distilled as well as quantized versions of these models on Amazon EC2. While we have used the `g6e` instance types, which have the NVIDIA L40S GPU (48 GB of GPU memory), conceptually the instructions should work on other instance types as well (adjust for tensor parallelism and how many layers to offload to the GPU).
Here are the steps for running DeepSeek-R1 models with dynamic quantization. The material here is based on steps provided by unsloth.ai. Dynamic quantization keeps most layers at 1.5-bit and selectively quantizes certain layers to higher bits (like 5-bit); this provides faster tokens/sec without, hopefully, sacrificing much accuracy.
- Create an EC2 instance with the `Deep Learning OSS Nvidia Driver AMI GPU PyTorch` AMI and the `g6e.12xlarge` instance type. You can see step-by-step instructions here.

- SSH to your EC2 instance and clone this repo.

```bash
git clone https://github.com/aarora79/deepseek-r1-ec2
```
- Add your Hugging Face token as an environment variable.

```bash
export HF_TOKEN=your_hf_token
```
- Clone and build `llama.cpp`. This step takes about 10 minutes.

```bash
apt-get update
apt-get install build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggerganov/llama.cpp
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first \
    --target llama-quantize llama-cli llama-gguf-split llama-server
cp llama.cpp/build/bin/llama-* llama.cpp
```
- Download the snapshot of DeepSeek-R1 from Hugging Face. Unsloth.ai provides four dynamically quantized versions:

| MoE Bits | Disk Size | Type |
|---|---|---|
| 1.58-bit | 131GB | IQ1_S |
| 1.73-bit | 158GB | IQ1_M |
| 2.22-bit | 183GB | IQ2_XXS |
| 2.51-bit | 212GB | Q2_K_XL |

The following CLI commands download the 1.58-bit quantized version; adjust appropriately to try out the other versions (a Python equivalent is sketched after this step).

```bash
pip install hf_transfer
pip install huggingface_hub[hf_transfer]
export HF_HUB_ENABLE_HF_TRANSFER=1
huggingface-cli download unsloth/DeepSeek-R1-GGUF \
  --local-dir DeepSeek-R1-GGUF \
  --include "*UD-IQ1_S*"
```
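For reference, the same download can be done from Python with `huggingface_hub` (a minimal sketch; assumes the package is installed):

```python
# Download only the 1.58-bit (UD-IQ1_S) GGUF shards from the unsloth repo.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="unsloth/DeepSeek-R1-GGUF",
    local_dir="DeepSeek-R1-GGUF",
    allow_patterns=["*UD-IQ1_S*"],
)
```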
- Run the `llama-server` to serve the model with an API endpoint. See the documentation here on calculating how many layers to offload to the GPU. Specifically, for the `g6e.12xlarge`, which has 4 GPUs with 48 GB of memory each (192 GB total), the 1.58-bit quantized model, which is 131 GB on disk, and DeepSeek-R1's 62 layers in total, the formula is $n_{offload} = \frac{VRAM(GB)}{Filesize(GB)} \times n_{layers} - 4$, which gives $n_{offload} = \frac{192}{131} \times 62 - 4 \approx 86$. Since 86 > 62 (the total number of layers in the model), we can simply offload all layers, and `--n-gpu-layers` is therefore set to 62 below (the same calculation is also sketched in Python after the command).

```bash
./llama.cpp/llama-server \
    --model ${HOME}/DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
    --cache-type-k q4_0 \
    --threads 48 \
    --prio 3 \
    --temp 0.6 \
    --ctx-size 8192 \
    --seed 3407 \
    --n-gpu-layers 62 \
    -np 4 \
    --batch-size 4096 \
    --numa distribute
```
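Here is the layer-offload estimate written as a short Python helper (a minimal sketch; the helper name and the cap at the model's layer count are only illustrative):

```python
# Illustrative helper for the layer-offload estimate described above.
def layers_to_offload(vram_gb: float, file_size_gb: float, n_layers: int) -> int:
    estimate = int(vram_gb / file_size_gb * n_layers - 4)
    # never offload more layers than the model actually has
    return min(estimate, n_layers)

# g6e.12xlarge: 4 x 48 GB GPUs = 192 GB VRAM; 1.58-bit GGUF is ~131 GB; 62 layers
print(layers_to_offload(192, 131, 62))  # prints 62, hence --n-gpu-layers 62
```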
Once the server starts successfully you should see the following messages printed on the console:

```
main: model loaded main: chat template, chat_template: {% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}{% endif %}{% set ns = namespace(is_first=false, is_tool=false, is_output_first=true, system_prompt='', is_first_sp=true) %}{%- for message in messages %}{%- if message['role'] == 'system' %}{%- if ns.is_first_sp %}{% set ns.system_prompt = ns.system_prompt + message['content'] %}{% set ns.is_first_sp = false %}{%- else %}{% set ns.system_prompt = ns.system_prompt + '\n\n' + message['content'] %}{%- endif %}{%- endif %}{%- endfor %}{{ bos_token }}{{ ns.system_prompt }}{%- for message in messages %}{%- if message['role'] == 'user' %}{%- set ns.is_tool = false -%}{{'<|User|>' + message['content']}}{%- endif %}{%- if message['role'] == 'assistant' and 'tool_calls' in message %}{%- set ns.is_tool = false -%}{%- for tool in message['tool_calls'] %}{%- if not ns.is_first %}{%- if message['content'] is none %}{{'<|Assistant|><|tool▁calls▁begin|><|tool▁call▁begin|>' + tool['type'] + '<|tool▁sep|>' + tool['function']['name'] + '\n' + '```json' + '\n' + tool['function']['arguments'] + '\n' + '```' + '<|tool▁call▁end|>'}}{%- else %}{{'<|Assistant|>' + message['content'] + '<|tool▁calls▁begin|><|tool▁call▁begin|>' + tool['type'] + '<|tool▁sep|>' + tool['function']['name'] + '\n' + '```json' + '\n' + tool['function']['arguments'] + '\n' + '```' + '<|tool▁call▁end|>'}}{%- endif %}{%- set ns.is_first = true -%}{%- else %}{{'\n' + '<|tool▁call▁begin|>' + tool['type'] + '<|tool▁sep|>' + tool['function']['name'] + '\n' + '```json' + '\n' + tool['function']['arguments'] + '\n' + '```' + '<|tool▁call▁end|>'}}{%- endif %}{%- endfor %}{{'<|tool▁calls▁end|><|end▁of▁sentence|>'}}{%- endif %}{%- if message['role'] == 'assistant' and 'tool_calls' not in message %}{%- if ns.is_tool %}{{'<|tool▁outputs▁end|>' + message['content'] + '<|end▁of▁sentence|>'}}{%- set ns.is_tool = false -%}{%- else %}{% set content = message['content'] %}{% if '</think>' in content %}{% set content = content.split('</think>')[-1] %}{% endif %}{{'<|Assistant|>' + content + '<|end▁of▁sentence|>'}}{%- endif %}{%- endif %}{%- if message['role'] == 'tool' %}{%- set ns.is_tool = true -%}{%- if ns.is_output_first %}{{'<|tool▁outputs▁begin|><|tool▁output▁begin|>' + message['content'] + '<|tool▁output▁end|>'}}{%- set ns.is_output_first = false %}{%- else %}{{'<|tool▁output▁begin|>' + message['content'] + '<|tool▁output▁end|>'}}{%- endif %}{%- endif %}{%- endfor -%}{% if ns.is_tool %}{{'<|tool▁outputs▁end|>'}}{% endif %}{% if add_generation_prompt and not ns.is_tool %}{{'<|Assistant|>'}}{% endif %}, example_format: 'You are a helpful assistant <|User|>Hello<|Assistant|>Hi there<|end▁of▁sentence|><|User|>How are you?<|Assistant|>' main: server is listening on http://127.0.0.1:8080 - starting the main loop
```

- We should now have an API endpoint we can use. We can send it a `cURL` command in the following format (a Python equivalent of this call is sketched at the end of this step):

```bash
MODEL_ID=DeepSeek-R1-UD-IQ1_S
curl -X POST -H "Content-Type: application/json" \
  -d "{\"prompt\": \"Is gravity a particle?\", \"max_tokens\": 500, \"model\": \"$MODEL_ID\"}" \
  http://localhost:8080/completion | jq
```
You should now see a response such as the one given below. Note the `predicted_per_second`
field in the response, this gives the token/sec value which in this example is ~13 tokens/second.{ "index": 0, "content": " Well, this is a question that has been debated among scientists for many years. Some believe that gravity is a force that is transmitted by particles, while others think it is simply a curvature in spacetime. In this article, we will explore both sides of the argument and see if we can come to a conclusion about what gravity really is. Stay tuned!\n\n## What is gravity?\n\nGravity is a force that exists between any two masses. It is the force that keeps us grounded on Earth, and it’s what keeps the planets in orbit around the sun. Gravity is also responsible for the tides and the formation of galaxies. But what exactly is gravity? Is it a particle or something else?\n\nThere are two main theories about gravity: the classical theory and the quantum theory. The classical theory, developed by Isaac Newton, states that gravity is a force that acts between masses. This theory works well for large objects like planets and stars, but it doesn’t explain how gravity works on a smaller scale, like between atoms.\n\nThe quantum theory, developed by Albert Einstein, states that gravity is not a force but rather a curvature of spacetime. This theory explains how gravity works on both large and small scales. However, it’s still not fully understood how this curvature happens.\n\nSo, is gravity a particle? The answer isn’t clear. Some scientists believe that gravity is a particle, while others think it’s something else. Until we have a better understanding of how gravity works, we won’t know for sure.\n\n## How does gravity work?\n\nGravity is one of the most fundamental forces in the universe. It’s what keeps us grounded on Earth, and it’s what keeps the planets in orbit around the sun. But how does gravity actually work?\n\nThe force of gravity is created by the curvature of spacetime. When an object with mass moves through spacetime, it creates a gravitational field. The more mass an object has, the greater the curvature of spacetime around it. This curvature then affects the motion of other objects, causing them to accelerate toward the original object.\n\nSo, in essence, gravity is a force that arises from the interaction between masses and spacetime. It’s not a particle, but rather a result of the way spacetime is curved by mass. This theory has been supported by observations of gravitational waves, which are ripples in spacetime caused by massive objects moving through it.\n\nWhile there is still much to learn about gravity, this explanation provides a good foundation for understanding how it works. And as we continue to study gravity, we may find even more fascinating ways that it shapes our universe.\n\n## The history of gravity\n\nThe history of gravity is a long and complicated one. For centuries, scientists have been trying to understand this force that seems to hold everything in the universe together. The first real breakthrough came in the 1600s when Isaac Newton developed his theory of gravity. 
This theory explained how gravity works on a large scale, like between planets and stars.", "tokens": [], "id_slot": 1, "stop": true, "model": "ds", "tokens_predicted": 600, "tokens_evaluated": 6, "generation_settings": { "n_predict": 600, "seed": 3407, "temperature": 0.6000000238418579, "dynatemp_range": 0, "dynatemp_exponent": 1, "top_k": 40, "top_p": 0.949999988079071, "min_p": 0.05000000074505806, "xtc_probability": 0, "xtc_threshold": 0.10000000149011612, "typical_p": 1, "repeat_last_n": 64, "repeat_penalty": 1, "presence_penalty": 0, "frequency_penalty": 0, "dry_multiplier": 0, "dry_base": 1.75, "dry_allowed_length": 2, "dry_penalty_last_n": 8192, "dry_sequence_breakers": [ "\n", ":", "\"", "*" ], "mirostat": 0, "mirostat_tau": 5, "mirostat_eta": 0.10000000149011612, "stop": [], "max_tokens": 600, "n_keep": 0, "n_discard": 0, "ignore_eos": false, "stream": false, "logit_bias": [], "n_probs": 0, "min_keep": 0, "grammar": "", "grammar_trigger_words": [], "grammar_trigger_tokens": [], "preserved_tokens": [], "samplers": [ "penalties", "dry", "top_k", "typ_p", "top_p", "min_p", "xtc", "temperature" ], "speculative.n_max": 16, "speculative.n_min": 5, "speculative.p_min": 0.8999999761581421, "timings_per_token": false, "post_sampling_probs": false, "lora": [] }, "prompt": "<|begin▁of▁sentence|>Is gravity a particle?", "has_new_line": true, "truncated": false, "stop_type": "limit", "stopping_word": "", "tokens_cached": 605, "timings": { "prompt_n": 6, "prompt_ms": 201.665, "prompt_per_token_ms": 33.61083333333333, "prompt_per_second": 29.752312002578535, "predicted_n": 600, "predicted_ms": 47212.818, "predicted_per_token_ms": 78.68803, "predicted_per_second": 12.708413211005539 } }
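The same request can also be made from Python; here is a minimal sketch (assumes the `requests` package is installed and that `llama-server` is listening on the default port 8080):

```python
# Minimal sketch of the cURL call above using the requests package.
import requests

payload = {
    "prompt": "Is gravity a particle?",
    "max_tokens": 500,
    "model": "DeepSeek-R1-UD-IQ1_S",
}
resp = requests.post("http://localhost:8080/completion", json=payload, timeout=600)
resp.raise_for_status()
data = resp.json()
print(data["content"])
# tokens/sec as reported by the server
print(data["timings"]["predicted_per_second"])
```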
- To run the `llama-server` in the background, use the `run_llama_server.sh` script provided in this repo. This script starts the `llama-server` if it is not running and restarts it if the CUDA memory utilization becomes greater than 95% (a sketch of this watchdog idea follows this step). You can use this script to keep the `llama-server` running at all times so that you have a reliable endpoint for use with applications (such as a benchmarking task).

```bash
# run ./run_llama_server.sh without any parameters to see all command line options
./run_llama_server.sh -m ${HOME}/DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf --mem-threshold 99
```
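For illustration only, the core idea behind such a watchdog can be sketched in a few lines of Python (this is not the actual `run_llama_server.sh` logic; the threshold and polling interval are arbitrary):

```python
# Illustrative GPU-memory watchdog sketch (the real logic lives in run_llama_server.sh).
import subprocess
import time

MEM_THRESHOLD = 0.95   # restart when any GPU crosses 95% memory utilization
POLL_SECONDS = 60

def max_gpu_memory_utilization() -> float:
    # nvidia-smi reports "used, total" in MiB, one line per GPU, with these flags
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        text=True,
    )
    ratios = [
        float(used) / float(total)
        for used, total in (line.split(",") for line in out.strip().splitlines())
    ]
    return max(ratios)

while True:
    if max_gpu_memory_utilization() > MEM_THRESHOLD:
        print("memory threshold exceeded -- restart llama-server here")
    time.sleep(POLL_SECONDS)
```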
- Now we are ready to run a simple conversational AI app. Run the following commands in a new terminal.

```bash
cd deepseek-r1-ec2
curl -LsSf https://astral.sh/uv/install.sh | sh
source $HOME/.local/bin/env
uv venv && source .venv/bin/activate && uv pip sync pyproject.toml
```
- Run the chat app and you should be all set!

```bash
chainlit run app_llama_server.py
```
- We can also run inferences using the two CLI programs provided in this repo, using either the `/chat/completions` endpoint or the `/completion` endpoint (a raw-HTTP sketch of the chat endpoint follows this step).

```bash
# run python chat_completions.py --help to see command line arguments
python chat_completions.py --stream yes --input "What is 2+2?"
```

Note that with the `/completion` endpoint we have to wrap the prompt in `<|User|>` and `<|Assistant|>` tags for the model to `<think>`; otherwise the model is likely to skip the reasoning process, leading to less accurate results.

```bash
# run python completion.py --help to see command line arguments
python completion.py --prompt "<|User|>What is 2+2?<|Assistant|>"
```
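For reference, here is a minimal sketch of the same kind of request made directly against the `/chat/completions` endpoint over HTTP (assumes the `requests` package is installed and `llama-server` is listening on port 8080; depending on the llama.cpp build the OpenAI-style route may also be exposed as `/v1/chat/completions`). With the chat endpoint the server applies the chat template for you, so the `<|User|>`/`<|Assistant|>` tags are not needed in the request:

```python
# Minimal sketch: OpenAI-style chat request to llama-server's /chat/completions endpoint.
import requests

payload = {
    "messages": [{"role": "user", "content": "What is 2+2?"}],
    "max_tokens": 256,
}
resp = requests.post("http://localhost:8080/chat/completions", json=payload, timeout=300)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```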
Here are the steps to serve `deepseek-ai/DeepSeek-R1-Distill-Qwen-32B` and other models from the `deepseek-r1` family on Amazon EC2 using `vllm`. vLLM is a fast and easy-to-use library for LLM inference and serving.
The steps provided below are for the 32B distilled version but apply to the other variants as well. To use `deepseek-ai/DeepSeek-R1-Distill-Qwen-32B`, for example, run this code on a `g6e.12xlarge` Amazon EC2 instance.
- Create an EC2 instance with the `Deep Learning OSS Nvidia Driver AMI GPU PyTorch` AMI and the `g6e.12xlarge` instance type. You can see step-by-step instructions here. At the time of this writing the AMI used for `us-east-1` was `ami-067bd563cecc90173`.

- SSH to your EC2 instance and install docker.

```bash
sudo apt-get update
sudo apt-get install --reinstall docker.io -y
sudo apt-get install -y docker-compose
docker compose version
```
- Clone this repo on your EC2 instance.

```bash
git clone https://github.com/aarora79/deepseek-r1-ec2
```
- Add your Hugging Face token as an environment variable.

```bash
export HF_TOKEN=your_hf_token
```
- SSH to your instance and run the following commands. Running `deploy_model.sh` does the following:

  - Downloads the vllm container from Docker Hub.
  - Starts the container, which downloads the model from the Hugging Face hub.
  - Creates an endpoint accessible at `localhost:8000/v1/completions` to serve the model.

```bash
cd deepseek-r1-ec2
chmod +x deploy_model.sh
# the container takes about 10 minutes to start
# takes two command line arguments, model id and tensor parallel degree
./deploy_model.sh deepseek-ai/DeepSeek-R1-Distill-Qwen-32B 4
```
- Wait for about 10 minutes and then verify that the container is running (or poll the endpoint as sketched after this step).

```bash
docker ps
```

You should see an output similar to the following:

```
CONTAINER ID   IMAGE                           COMMAND                  CREATED          STATUS          PORTS                                       NAMES
e21df06d8a7a   vllm/vllm-openai:v0.6.6.post1   "python3 -m vllm.ent…"   10 minutes ago   Up 10 minutes   0.0.0.0:8000->8000/tcp, :::8000->8000/tcp   fmbench_model_container
```
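Instead of waiting a fixed amount of time, you can also poll the endpoint until it starts responding. Here is a minimal sketch (assumes the `requests` package is installed and the default port 8000):

```python
# Poll the vLLM server until its OpenAI-compatible API starts responding.
import time
import requests

url = "http://localhost:8000/v1/models"
while True:
    try:
        if requests.get(url, timeout=5).status_code == 200:
            print("model server is ready")
            break
    except requests.exceptions.ConnectionError:
        print("server not up yet, waiting ...")
    time.sleep(30)
```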
- Now you are ready to run a `cURL` command to get an inference from the model. Here is an example for text inference (an equivalent call using the `openai` Python SDK is sketched at the end of this step).

```bash
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B", "prompt": "What is the theory of quantum gravity?", "temperature": 0.6, "max_tokens": 512}' | jq
```

The above command will generate an output similar to the following:

```json
{ "id": "cmpl-2be25863e97347f7b6d1999a50409add", "object": "text_completion", "created": 1738033499, "model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B", "choices": [ { "index": 0, "text": " Explain it in simple terms.\n\n</think>\n\nQuantum gravity is a theoretical framework that seeks to reconcile the two major pillars of modern physics: quantum mechanics and general relativity. While quantum mechanics describes the behavior of particles at the smallest scales, general relativity explains the nature of gravity and the large-scale structure of the universe. The challenge is that these two theories are fundamentally incompatible, especially when it comes to understanding phenomena where both quantum effects and strong gravitational fields are significant, such as at the center of black holes or during the early moments of the Big Bang.\n\nThe theory of quantum gravity aims to unify these two perspectives into a single, coherent framework. One of the most promising approaches is string theory, which suggests that particles are not point-like but are instead tiny, vibrating strings. Another approach is loop quantum gravity, which attempts to quantize space itself by describing it in terms of discrete loops or \"spin networks.\" Both of these approaches, and others like them, are still under active research and have yet to be experimentally verified.\n\nIn simple terms, quantum gravity is an attempt to understand how the tiny, strange world of quantum mechanics interacts with the vast, curved spacetime of gravity. It's a quest to find a theory that can describe the universe at its most fundamental level, from the tiniest particles to the largest structures.", "logprobs": null, "finish_reason": "length", "stop_reason": null, "prompt_logprobs": null } ], "usage": { "prompt_tokens": 9, "total_tokens": 521, "completion_tokens": 512, "prompt_tokens_details": null } }
- You can see traces from the serving container by running the following command:

```bash
docker logs -f fmbench_model_container
```

You should see an output similar to this:

```
INFO: Started server process [1] INFO: Waiting for application startup. INFO: Application startup complete. INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit) INFO 01-27 19:04:07 logger.py:37] Received request cmpl-4ad5d844ee8342ebb26e4be1770005ca-0: prompt: 'What is the theory of quantum gravity?', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.6, top_p=1.0, top_k=-1, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=512, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None), prompt_token_ids: [151646, 3838, 374, 279, 10126, 315, 30128, 23249, 30], lora_request: None, prompt_adapter_request: None. INFO 01-27 19:04:07 engine.py:267] Added request cmpl-4ad5d844ee8342ebb26e4be1770005ca-0. INFO 01-27 19:04:07 metrics.py:467] Avg prompt throughput: 0.9 tokens/s, Avg generation throughput: 0.1 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%. INFO 01-27 19:04:12 metrics.py:467] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 33.5 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%. INFO: 172.17.0.1:33604 - "POST /v1/completions HTTP/1.1" 200 OK INFO 01-27 19:04:24 metrics.py:467] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 6.9 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%. INFO 01-27 19:04:34 metrics.py:467] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
```

- You can use `FMBench` for benchmarking the performance of this model.
The following table lists the instance types to use with the different DeepSeek-R1 variants. Support for the non-distilled model, i.e. DeepSeek-R1 with 671B parameters, is expected soon; stay tuned for updates.
| Model | Recommended EC2 instance type | TP Degree | Download |
|---|---|---|---|
| DeepSeek-R1-Distill-Qwen-1.5B | `g6e.xlarge` | 1 | 🤗 HuggingFace |
| DeepSeek-R1-Distill-Qwen-7B | `g6e.2xlarge` | 1 | 🤗 HuggingFace |
| DeepSeek-R1-Distill-Llama-8B | `g6e.2xlarge` | 1 | 🤗 HuggingFace |
| DeepSeek-R1-Distill-Qwen-14B | `g6e.12xlarge` | 4 | 🤗 HuggingFace |
| DeepSeek-R1-Distill-Qwen-32B | `g6e.12xlarge` | 4 | 🤗 HuggingFace |
| DeepSeek-R1-Distill-Llama-70B | `g6e.48xlarge` | 8 | 🤗 HuggingFace |
You can run a simple conversational AI app included in this repo; follow the steps below.
- Create a new Python venv and install the dependencies for the application.

```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
uv venv && source .venv/bin/activate && uv pip sync pyproject.toml
```
- Start the `vllm` model server.

```bash
# set your HF token
export HF_TOKEN=your_hf_token
# change the model id and tensor parallel degree as appropriate
MODEL_ID=deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
TP_DEGREE=1
./deploy_model.sh $MODEL_ID $TP_DEGREE
```
- Start the `Chainlit` app.

```bash
chainlit run app.py --port 8001
```
- At this point your browser should open automatically (or you can click on http://localhost:8001 to open it) and you should see a browser window with the Chainlit app.
- Enjoy! Add a GitHub star to this repo if you found it useful 🙏.