Added examples for vllm speculative decoding with n-gram and eagle methods #332

# Online inference using vLLM with speculative decoding and GPUs on Google Kubernetes Engine (GKE)

This document shows how to implement online inference using GPUs on Google Kubernetes Engine (GKE) with vLLM and speculative decoding enabled.

Speculative decoding is a powerful optimization technique that enhances LLM inference speed without compromising output quality. It uses a smaller, faster "draft" model or method to generate candidate tokens, which are then validated by the main, larger "target" model in a single, efficient step. This reduces computational overhead and improves both throughput and inter-token latency.

vLLM supports several speculative decoding methods, each tailored to different use cases and performance requirements. See the [Speculative Decoding guide](https://docs.vllm.ai/en/v0.11.0/features/spec_decode.html) in the official vLLM docs for in-depth concepts and examples. This guide walks you through the implementation of the following speculative decoding methods with vLLM on GKE:

- [N-gram Based Speculative Decoding](https://docs.vllm.ai/en/v0.11.0/features/spec_decode.html#speculating-by-matching-n-grams-in-the-prompt)

  This method is particularly effective for tasks where the output is likely to contain sequences from the input prompt, such as summarization or question answering. Instead of a draft model, it uses n-grams from the prompt to generate token proposals.

- [EAGLE Based Draft Models](https://docs.vllm.ai/en/v0.11.0/features/spec_decode.html#speculating-using-eagle-based-draft-models)

  [EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency)](https://arxiv.org/pdf/2401.15077) is a state-of-the-art speculative decoding method that uses a lightweight draft model to generate multiple candidate tokens in parallel. A minimal configuration sketch for both methods follows this list.
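
Both methods are enabled through vLLM's speculative decoding configuration. The following is a minimal sketch of how such a configuration is typically passed to `vllm serve`. It assumes the `--speculative-config` JSON flag documented for recent vLLM releases, and the EAGLE draft model name is a placeholder; the manifests in this repository define the exact values used on GKE.

```shell
# Sketch only: not the exact commands used by the Kubernetes manifests in this
# repository; field names follow the vLLM speculative decoding docs.

# N-gram: no draft model; proposals come from n-gram lookups in the prompt.
vllm serve google/gemma-3-27b-it \
  --speculative-config '{"method": "ngram", "num_speculative_tokens": 5, "prompt_lookup_max": 4}'

# EAGLE: a lightweight draft model proposes tokens for the larger target model.
# "<eagle-draft-model>" is a placeholder, not a real model ID.
vllm serve meta-llama/Llama-3.3-70B-Instruct \
  --speculative-config '{"method": "eagle", "model": "<eagle-draft-model>", "num_speculative_tokens": 3}'
```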

This example is built on top of the
[GKE Inference reference architecture](/docs/platforms/gke/base/use-cases/inference-ref-arch/README.md).

## Before you begin

- The
  [GKE Inference reference implementation](/platforms/gke/base/use-cases/inference-ref-arch/terraform/README.md)
  is deployed and configured.

- Get access to the models.

  - For Gemma:

    - Consent to the license on [Kaggle](https://www.kaggle.com/) using a
      Hugging Face account.
    - [**google/gemma**](https://www.kaggle.com/models/google/gemma)

  - For Llama:

    - Accept the terms of the license on the Hugging Face model page.
    - [**meta-llama/Llama-4-Scout-17B-16E-Instruct**](https://huggingface.co/meta-llama/Llama-4-Scout-17B-16E-Instruct)
    - [**meta-llama/Llama-3.3-70B-Instruct**](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct)

- Ensure your
  [Hugging Face Hub **Read** access token](/platforms/gke/base/core/huggingface/initialize/README.md)
  has been added to Secret Manager. To optionally confirm the secret exists,
  see the sketch after this list.
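
You can confirm that the token secret exists before continuing. This is a sketch; the secret name below is a placeholder, so use the name created by the Hugging Face initialization step of the reference implementation.

```shell
# Sketch only: "<huggingface-hub-token-secret>" is a placeholder secret name.
# Listing at least one version confirms the token has been added.
gcloud secrets versions list "<huggingface-hub-token-secret>" --limit=1
```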

## Create and configure the Google Cloud resources

- Deploy the online GPU resources.

  ```shell
  cd ${ACP_REPO_DIR}/platforms/gke/base/use-cases/inference-ref-arch/terraform/online_gpu && \
  rm -rf .terraform/ terraform.tfstate* && \
  terraform init && \
  terraform plan -input=false -out=tfplan && \
  terraform apply -input=false tfplan && \
  rm tfplan
  ```

## Download the models to Cloud Storage

- Choose the model.

  - **Gemma 3 27B Instruction-Tuned**:

    ```shell
    export HF_MODEL_ID="google/gemma-3-27b-it"
    ```

  - **Llama 3.3 70B Instruction-Tuned**:

    ```shell
    export HF_MODEL_ID="meta-llama/llama-3.3-70b-instruct"
    ```

- Source the environment configuration.

  ```shell
  source "${ACP_REPO_DIR}/platforms/gke/base/use-cases/inference-ref-arch/terraform/_shared_config/scripts/set_environment_variables.sh"
  ```

- Configure the model download job.

  ```shell
  "${ACP_REPO_DIR}/platforms/gke/base/use-cases/inference-ref-arch/kubernetes-manifests/model-download/configure_huggingface.sh"
  ```

- Deploy the model download job.

  ```shell
  kubectl apply --kustomize "${ACP_REPO_DIR}/platforms/gke/base/use-cases/inference-ref-arch/kubernetes-manifests/model-download/huggingface"
  ```

- Watch the model download job until it is complete.

  ```shell
  watch --color --interval 5 --no-title \
  "kubectl --namespace=${huggingface_hub_downloader_kubernetes_namespace_name} get job/${HF_MODEL_ID_HASH}-hf-model-to-gcs | GREP_COLORS='mt=01;92' egrep --color=always -e '^' -e 'Complete'
  echo '\nLogs(last 10 lines):'
  kubectl --namespace=${huggingface_hub_downloader_kubernetes_namespace_name} logs job/${HF_MODEL_ID_HASH}-hf-model-to-gcs --all-containers --tail 10"
  ```

  When the job is complete, you will see the following:

  ```text
  NAME                       STATUS     COMPLETIONS   DURATION   AGE
  XXXXXXXX-hf-model-to-gcs   Complete   1/1           ###        ###
  ```

  You can press `CTRL`+`c` to terminate the watch. To optionally confirm that
  the model files are in Cloud Storage, see the sketch after this list.

- Delete the model download job.

  ```shell
  kubectl delete --ignore-not-found --kustomize "${ACP_REPO_DIR}/platforms/gke/base/use-cases/inference-ref-arch/kubernetes-manifests/model-download/huggingface"
  ```
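
You can optionally confirm that the model files were uploaded to the Cloud Storage bucket created by the reference implementation. This is a sketch; the bucket name below is a placeholder, so substitute the model bucket provisioned by the Terraform configuration.

```shell
# Sketch only: "<model-bucket>" is a placeholder for the Cloud Storage bucket
# that the model download job writes to.
gcloud storage ls --recursive "gs://<model-bucket>/${HF_MODEL_ID}/" | head
```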

## Deploy the inference workload

- Source the environment configuration.

  ```shell
  source "${ACP_REPO_DIR}/platforms/gke/base/use-cases/inference-ref-arch/terraform/_shared_config/scripts/set_environment_variables.sh"
  ```

- Configure the deployment.

  ```shell
  "${ACP_REPO_DIR}/platforms/gke/base/use-cases/inference-ref-arch/kubernetes-manifests/online-inference-gpu/vllm-spec-decoding/configure_vllm_spec_decoding.sh"
  ```

- Set the environment variables for the workload.

- Check the model name.

  ```shell
  echo "HF_MODEL_NAME=${HF_MODEL_NAME}"
  ```

  > If the `HF_MODEL_NAME` variable is not set, ensure that `HF_MODEL_ID` is
  > set and source the `set_environment_variables.sh` script:
  >
  > ```shell
  > source "${ACP_REPO_DIR}/platforms/gke/base/use-cases/inference-ref-arch/terraform/_shared_config/scripts/set_environment_variables.sh"
  > ```

- Select an accelerator.

  | Model                  | h100 | h200 |
  | ---------------------- | ---- | ---- |
  | gemma-3-27b-it         | ✅   | ✅   |
  | llama-3.3-70b-instruct | ✅   | ✅   |

  - **NVIDIA H100 80GB**:

    ```shell
    export ACCELERATOR_TYPE="h100"
    ```

  - **NVIDIA H200 141GB**:

    ```shell
    export ACCELERATOR_TYPE="h200"
    ```

  Ensure that you have enough quota in your project to provision the selected
  accelerator type. For more information about viewing GPU quotas, see
  [Allocation quotas: GPU quota](https://cloud.google.com/compute/resource-usage#gpu_quota).
  A sketch for checking the quota follows this list.
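
The following sketch shows one way to check the GPU quota in a region. It is an assumption for illustration: the region and the quota metric filter are placeholders, and the exact metric names for GPU quotas vary, so adjust them to your project and accelerator.

```shell
# Sketch only: list region quotas and filter for H100/H200 GPU metrics.
# Replace us-central1 with the region used by the reference implementation.
gcloud compute regions describe us-central1 --format="json(quotas)" \
  | jq '.quotas[] | select(.metric | test("H100|H200"))'
```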

The Kubernetes manifests invoked below are based on the
[Inference Quickstart recommendations](https://cloud.google.com/kubernetes-engine/docs/how-to/machine-learning/inference-quickstart).

### Speculative Decoding with N-gram
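
The commands in this section use the `h100` accelerator and the `gemma-3-27b-it` model explicitly. If the repository provides a kustomize directory for each accelerator and model combination (an assumption based on the `ACCELERATOR_TYPE` and `HF_MODEL_NAME` variables set above), the same commands can be parameterized as in this sketch; verify the directory names that actually exist before using it.

```shell
# Sketch only: assumes kustomize directories named
# "${ACCELERATOR_TYPE}-${HF_MODEL_NAME}-sd-ngram" exist in the repository.
# The EAGLE section below follows the same pattern with an "-sd-eagle" suffix.
export WORKLOAD_NAME="${ACCELERATOR_TYPE}-${HF_MODEL_NAME}-sd-ngram"
kubectl apply --kustomize "${ACP_REPO_DIR}/platforms/gke/base/use-cases/inference-ref-arch/kubernetes-manifests/online-inference-gpu/vllm-spec-decoding/${WORKLOAD_NAME}"
```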

- Deploy the inference workload.

  ```shell
  kubectl apply --kustomize "${ACP_REPO_DIR}/platforms/gke/base/use-cases/inference-ref-arch/kubernetes-manifests/online-inference-gpu/vllm-spec-decoding/h100-gemma-3-27b-it-sd-ngram"
  ```

- Watch the deployment until it is ready.

  ```shell
  watch --color --interval 5 --no-title "kubectl --namespace=${ira_online_gpu_kubernetes_namespace_name} get deployment/vllm-h100-gemma-3-27b-it-sd-ngram | GREP_COLORS='mt=01;92' egrep --color=always -e '^' -e '1/1 1 1'
  echo '\nLogs(last 10 lines):'
  kubectl --namespace=${ira_online_gpu_kubernetes_namespace_name} logs deployment/vllm-h100-gemma-3-27b-it-sd-ngram --all-containers --tail 10"
  ```

- When the deployment is ready, you will see the following:

  ```text
  NAME                                READY   UP-TO-DATE   AVAILABLE   AGE
  vllm-h100-gemma-3-27b-it-sd-ngram   1/1     1            1           ###
  ```

  You can press `CTRL`+`c` to terminate the watch.

- Send a test request to the model.

  Start a port forward to the model service.

  ```shell
  kubectl --namespace=${ira_online_gpu_kubernetes_namespace_name} port-forward service/vllm-h100-gemma-3-27b-it-sd-ngram 8000:8000 >/dev/null & \
  PF_PID=$!
  ```

  Send a test request. While the port forward is active, you can also check
  the speculative decoding metrics using the sketch after this list.

  ```shell
  curl http://127.0.0.1:8000/v1/chat/completions \
  --data '{
  "model": "/gcs/'${HF_MODEL_ID}'",
  "messages": [ { "role": "user", "content": "Why is the sky blue?" } ]
  }' \
  --header "Content-Type: application/json" \
  --request POST \
  --show-error \
  --silent | jq
  ```

  Stop the port forward.

  ```shell
  kill -9 ${PF_PID}
  ```

- Delete the workload.

  ```shell
  kubectl delete --ignore-not-found --kustomize "${ACP_REPO_DIR}/platforms/gke/base/use-cases/inference-ref-arch/kubernetes-manifests/online-inference-gpu/vllm-spec-decoding/h100-gemma-3-27b-it-sd-ngram"
  ```
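
For either deployment in this guide, while a port forward to the service is active you can optionally confirm that speculative decoding is in effect by inspecting the Prometheus metrics that vLLM exposes. This is a sketch; the exact speculative decoding metric names vary across vLLM versions, so the filter below is intentionally broad.

```shell
# Sketch only: look for speculative decoding counters (for example, draft and
# accepted token totals) in the vLLM metrics endpoint.
curl --show-error --silent http://127.0.0.1:8000/metrics | grep -i "spec_decode"
```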

### Speculative Decoding with EAGLE

- Deploy the inference workload.

  ```shell
  kubectl apply --kustomize "${ACP_REPO_DIR}/platforms/gke/base/use-cases/inference-ref-arch/kubernetes-manifests/online-inference-gpu/vllm-spec-decoding/h100-llama-3-70b-it-sd-eagle"
  ```

- Watch the deployment until it is ready.

  ```shell
  watch --color --interval 5 --no-title "kubectl --namespace=${ira_online_gpu_kubernetes_namespace_name} get deployment/vllm-h100-llama-3-70b-it-sd-eagle | GREP_COLORS='mt=01;92' egrep --color=always -e '^' -e '1/1 1 1'
  echo '\nLogs(last 10 lines):'
  kubectl --namespace=${ira_online_gpu_kubernetes_namespace_name} logs deployment/vllm-h100-llama-3-70b-it-sd-eagle --all-containers --tail 10"
  ```

  When the deployment is ready, you will see the following:

  ```text
  NAME                                READY   UP-TO-DATE   AVAILABLE   AGE
  vllm-h100-llama-3-70b-it-sd-eagle   1/1     1            1           ###
  ```

  You can press `CTRL`+`c` to terminate the watch.

- Send a test request to the model.

  Start a port forward to the model service.

  ```shell
  kubectl --namespace=${ira_online_gpu_kubernetes_namespace_name} port-forward service/vllm-h100-llama-3-70b-it-sd-eagle 8000:8000 >/dev/null & \
  PF_PID=$!
  ```

  Send a test request. If the request fails because the model name is not
  recognized, see the sketch after this list.

  ```shell
  curl http://127.0.0.1:8000/v1/chat/completions \
  --data '{
  "model": "/gcs/'${HF_MODEL_ID}'",
  "messages": [ { "role": "user", "content": "Why is the sky blue?" } ]
  }' \
  --header "Content-Type: application/json" \
  --request POST \
  --show-error \
  --silent | jq
  ```

  Stop the port forward.

  ```shell
  kill -9 ${PF_PID}
  ```

- Delete the workload.

  ```shell
  kubectl delete --ignore-not-found --kustomize "${ACP_REPO_DIR}/platforms/gke/base/use-cases/inference-ref-arch/kubernetes-manifests/online-inference-gpu/vllm-spec-decoding/h100-llama-3-70b-it-sd-eagle"
  ```
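
If a test request fails because the `"model"` value is not recognized, you can list the model names the server is actually serving while a port forward is active. The `/v1/models` endpoint is part of the OpenAI-compatible API that vLLM exposes.

```shell
# List the model IDs served by the vLLM deployment; the "model" field in the
# chat completions request must match one of these IDs.
curl --show-error --silent http://127.0.0.1:8000/v1/models | jq '.data[].id'
```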

## Troubleshooting

If you experience any issue while deploying the workload, see the
[Online inference with GPUs Troubleshooting](/docs/platforms/gke/base/use-cases/inference-ref-arch/online-inference-gpu/troubleshooting.md)
guide.

## Clean up

- Destroy the online GPU resources.

  ```shell
  cd ${ACP_REPO_DIR}/platforms/gke/base/use-cases/inference-ref-arch/terraform/online_gpu && \
  rm -rf .terraform/ terraform.tfstate* && \
  terraform init && \
  terraform destroy -auto-approve
  ```