# Online inference using vLLM with speculative decoding and GPUs on Google Kubernetes Engine (GKE)

This document shows you how to run online inference using GPUs on Google
Kubernetes Engine (GKE) with vLLM and speculative decoding enabled.

Speculative decoding is a powerful optimization technique that improves LLM inference speed without compromising output quality. It uses a smaller, faster "draft" model or method to generate candidate tokens, which are then validated by the main, larger "target" model in a single, efficient step. This reduces computational overhead and improves both throughput and inter-token latency.
vLLM supports several speculative decoding methods, each tailored to different use cases and performance requirements. See the [Speculative Decoding guide](https://docs.vllm.ai/en/stable/features/spec_decode.html) in the official vLLM docs for in-depth concepts and examples. This guide walks you through implementing the following speculative decoding methods with vLLM on GKE:

- [N-gram Based Speculative Decoding](https://docs.vllm.ai/en/stable/features/spec_decode.html#speculating-by-matching-n-grams-in-the-prompt)

  This method is particularly effective for tasks where the output is likely to contain sequences from the input prompt, such as summarization or question-answering. Instead of a draft model, it uses n-grams from the prompt to generate token proposals.

- [EAGLE Based Draft Models](https://docs.vllm.ai/en/stable/features/spec_decode.html#speculating-using-eagle-based-draft-models)

  [EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency)](https://arxiv.org/pdf/2401.15077) is a state-of-the-art speculative decoding method that uses a lightweight draft model to generate multiple candidate tokens in parallel.
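
As a rough illustration of how these methods are typically enabled, recent vLLM releases accept a `--speculative-config` JSON argument on the server command line. The snippet below is a minimal sketch, not the exact contents of the manifests used later in this guide; `<eagle-draft-model>` is a placeholder for an EAGLE draft model that matches the target model, and flag names can vary between vLLM versions.

```shell
# Illustrative sketch only; the deployed arguments are defined in the
# Kustomize manifests referenced later in this guide.

# n-gram: no draft model, token proposals are matched from the prompt itself.
vllm serve "/gcs/${HF_MODEL_ID}" \
  --speculative-config '{"method": "ngram", "num_speculative_tokens": 5, "prompt_lookup_max": 4}'

# EAGLE: a lightweight draft model proposes tokens that the target model verifies.
# <eagle-draft-model> is a placeholder, not a real model reference.
vllm serve "/gcs/${HF_MODEL_ID}" \
  --speculative-config '{"method": "eagle", "model": "<eagle-draft-model>", "num_speculative_tokens": 3}'
```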

This example is built on top of the
[GKE Inference reference architecture](/docs/platforms/gke/base/use-cases/inference-ref-arch/README.md).

## Before you begin

- The
[GKE Inference reference implementation](/platforms/gke/base/use-cases/inference-ref-arch/terraform/README.md)
is deployed and configured.

- Get access to the models.

- For Gemma:

- Consent to the license on [Kaggle](https://www.kaggle.com/) using a Hugging Face account.
- [**google/gemma**](https://www.kaggle.com/models/google/gemma).

- For Llama:
- Accept the terms of the license on the Hugging Face model page.
- [**meta-llama/Llama-4-Scout-17B-16E-Instruct**](https://huggingface.co/meta-llama/Llama-4-Scout-17B-16E-Instruct)
- [**meta-llama/Llama-3.3-70B-Instruct**](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct)

- Ensure your
[Hugging Face Hub **Read** access token](/platforms/gke/base/core/huggingface/initialize/README.md)
has been added to Secret Manager.

## Create and configure the Google Cloud resources

- Deploy the online GPU resources.

```shell
cd ${ACP_REPO_DIR}/platforms/gke/base/use-cases/inference-ref-arch/terraform/online_gpu && \
rm -rf .terraform/ terraform.tfstate* && \
terraform init && \
terraform plan -input=false -out=tfplan && \
terraform apply -input=false tfplan && \
rm tfplan
```
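
Optionally, confirm that the cluster is ready for GPU workloads. Depending on how the reference architecture provisions node pools, GPU nodes might only be created when a workload is scheduled, so an empty result here is not necessarily an error:

```shell
# List nodes and show the GKE accelerator label for any GPU nodes that exist yet.
kubectl get nodes --label-columns cloud.google.com/gke-accelerator
```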

## Download the models to Cloud Storage

- Choose the model.

- **Gemma 3 27B Instruction-Tuned**:

```shell
export HF_MODEL_ID="google/gemma-3-27b-it"
```

- **Llama 3.3 70B Instruction-Tuned**:

```shell
export HF_MODEL_ID="meta-llama/llama-3.3-70b-instruct"
```

- Source the environment configuration.

```shell
source "${ACP_REPO_DIR}/platforms/gke/base/use-cases/inference-ref-arch/terraform/_shared_config/scripts/set_environment_variables.sh"
```

- Configure the model download job.

```shell
"${ACP_REPO_DIR}/platforms/gke/base/use-cases/inference-ref-arch/kubernetes-manifests/model-download/configure_huggingface.sh"
```

- Deploy the model download job.

```shell
kubectl apply --kustomize "${ACP_REPO_DIR}/platforms/gke/base/use-cases/inference-ref-arch/kubernetes-manifests/model-download/huggingface"
```

- Watch the model download job until it is complete.

```shell
watch --color --interval 5 --no-title \
"kubectl --namespace=${huggingface_hub_downloader_kubernetes_namespace_name} get job/${HF_MODEL_ID_HASH}-hf-model-to-gcs | GREP_COLORS='mt=01;92' egrep --color=always -e '^' -e 'Complete'
echo '\nLogs(last 10 lines):'
kubectl --namespace=${huggingface_hub_downloader_kubernetes_namespace_name} logs job/${HF_MODEL_ID_HASH}-hf-model-to-gcs --all-containers --tail 10"
```

When the job is complete, you will see the following:

```text
NAME STATUS COMPLETIONS DURATION AGE
XXXXXXXX-hf-model-to-gcs Complete 1/1 ### ###
```

You can press `CTRL`+`c` to terminate the watch.
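
Optionally, verify that the model files were written to Cloud Storage. The bucket name below is a placeholder for the model bucket created by the reference implementation, and the object prefix is assumed to match the model ID (the model is referenced later as `/gcs/${HF_MODEL_ID}`):

```shell
# <huggingface-models-bucket> is a placeholder for the bucket created by the
# reference implementation; substitute the actual bucket name.
gcloud storage ls "gs://<huggingface-models-bucket>/${HF_MODEL_ID}/"
```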

- Delete the model download job.

```shell
kubectl delete --ignore-not-found --kustomize "${ACP_REPO_DIR}/platforms/gke/base/use-cases/inference-ref-arch/kubernetes-manifests/model-download/huggingface"
```

## Deploy the inference workload

- Source the environment configuration.

```shell
source "${ACP_REPO_DIR}/platforms/gke/base/use-cases/inference-ref-arch/terraform/_shared_config/scripts/set_environment_variables.sh"
```

- Configure the deployment.

```shell
"${ACP_REPO_DIR}/platforms/gke/base/use-cases/inference-ref-arch/kubernetes-manifests/online-inference-gpu/vllm-spec-decoding/configure_vllm_spec_decoding.sh"
```

- Set the environment variables for the workload.

- Check the model name.

```shell
echo "HF_MODEL_NAME=${HF_MODEL_NAME}"
```

> If the `HF_MODEL_NAME` variable is not set, ensure that `HF_MODEL_ID` is
> set and source the `set_environment_variables.sh` script:
>
> ```shell
> source "${ACP_REPO_DIR}/platforms/gke/base/use-cases/inference-ref-arch/terraform/_shared_config/scripts/set_environment_variables.sh"
> ```
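
For example, with `HF_MODEL_ID=google/gemma-3-27b-it`, the output should look similar to the following, assuming the configuration derives `HF_MODEL_NAME` from the last segment of the model ID:

```text
HF_MODEL_NAME=gemma-3-27b-it
```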

- Select an accelerator.

| Model | h100 | h200 |
| ---------------------- | ---- | ---- |
| gemma-3-27b-it | ✅ | ✅ |
| llama-3.3-70b-instruct | ✅ | ✅ |

- **NVIDIA H100 80GB**:

```shell
export ACCELERATOR_TYPE="h100"
```

- **NVIDIA H200 141GB**:

```shell
export ACCELERATOR_TYPE="h200"
```

Ensure that you have enough quota in your project to provision the selected
accelerator type. For more information about viewing GPU quotas, see
[Allocation quotas: GPU quota](https://cloud.google.com/compute/resource-usage#gpu_quota).
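
For example, you can list the GPU-related quotas for a region with `gcloud` (the region below is only an example; substitute the region used by your deployment):

```shell
# Show GPU quota metrics, limits, and current usage for the region.
gcloud compute regions describe us-central1 --format=json \
  | jq '.quotas[] | select(.metric | test("H100|H200"))'
```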

The Kubernetes manifests invoked below are based on the
[Inference Quickstart recommendations](https://cloud.google.com/kubernetes-engine/docs/how-to/machine-learning/inference-quickstart).

### Speculative decoding with n-gram

- Deploy the inference workload.

```shell
kubectl apply --kustomize "${ACP_REPO_DIR}/platforms/gke/base/use-cases/inference-ref-arch/kubernetes-manifests/online-inference-gpu/vllm-spec-decoding/h100-gemma-3-27b-it-sd-ngram"
```

- Watch the deployment until it is ready.

```shell
watch --color --interval 5 --no-title "kubectl --namespace=${ira_online_gpu_kubernetes_namespace_name} get deployment/vllm-h100-gemma-3-27b-it-sd-ngram | GREP_COLORS='mt=01;92' egrep --color=always -e '^' -e '1/1 1 1'
echo '\nLogs(last 10 lines):'
kubectl --namespace=${ira_online_gpu_kubernetes_namespace_name} logs deployment/vllm-h100-gemma-3-27b-it-sd-ngram --all-containers --tail 10"
```

- When the deployment is ready, you will see the following:

```text
NAME READY UP-TO-DATE AVAILABLE AGE
vllm-h100-gemma-3-27b-it-sd-ngram 1/1 1 1 ###
```

You can press `CTRL`+`c` to terminate the watch.

- Send a test request to the model.

Start a port forward to the model service.

```shell
kubectl --namespace=${ira_online_gpu_kubernetes_namespace_name} port-forward service/vllm-h100-gemma-3-27b-it-sd-ngram 8000:8000 >/dev/null & \
PF_PID=$!
```

Send a test request.

```shell
curl http://127.0.0.1:8000/v1/chat/completions \
--data '{
"model": "/gcs/'${HF_MODEL_ID}'",
"messages": [ { "role": "user", "content": "Why is the sky blue?" } ]
}' \
--header "Content-Type: application/json" \
--request POST \
--show-error \
--silent | jq
```
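
While the port forward is still active, you can optionally check whether speculative decoding counters show up on vLLM's Prometheus metrics endpoint. Metric names differ between vLLM versions, so this simply filters for anything related to speculative decoding; no output just means the counters are named differently or have not been emitted yet.

```shell
curl --silent http://127.0.0.1:8000/metrics | grep -i spec_decode
```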

Stop the port forward.

```shell
kill -9 ${PF_PID}
```

- Delete the workload.

```shell
kubectl delete --ignore-not-found --kustomize "${ACP_REPO_DIR}/platforms/gke/base/use-cases/inference-ref-arch/kubernetes-manifests/online-inference-gpu/vllm-spec-decoding/h100-gemma-3-27b-it-sd-ngram"
```

### Speculative decoding with EAGLE

- Deploy the inference workload.

```shell
kubectl apply --kustomize "${ACP_REPO_DIR}/platforms/gke/base/use-cases/inference-ref-arch/kubernetes-manifests/online-inference-gpu/vllm-spec-decoding/h100-llama-3-70b-it-sd-eagle"
```

- Watch the deployment until it is ready.

```shell
watch --color --interval 5 --no-title "kubectl --namespace=${ira_online_gpu_kubernetes_namespace_name} get deployment/vllm-h100-llama-3-70b-it-sd-eagle | GREP_COLORS='mt=01;92' egrep --color=always -e '^' -e '1/1 1 1'
echo '\nLogs(last 10 lines):'
kubectl --namespace=${ira_online_gpu_kubernetes_namespace_name} logs deployment/vllm-h100-llama-3-70b-it-sd-eagle --all-containers --tail 10"
```

When the deployment is ready, you will see the following:

```text
NAME READY UP-TO-DATE AVAILABLE AGE
vllm-h100-llama-3-70b-it-sd-eagle 1/1 1 1 ###
```

You can press `CTRL`+`c` to terminate the watch.

- Send a test request to the model.

Start a port forward to the model service.

```shell
kubectl --namespace=${ira_online_gpu_kubernetes_namespace_name} port-forward service/vllm-h100-llama-3-70b-it-sd-eagle 8000:8000 >/dev/null & \
PF_PID=$!
```

Send a test request.

```shell
curl http://127.0.0.1:8000/v1/chat/completions \
--data '{
"model": "/gcs/'${HF_MODEL_ID}'",
"messages": [ { "role": "user", "content": "Why is the sky blue?" } ]
}' \
--header "Content-Type: application/json" \
--request POST \
--show-error \
--silent | jq
```

Stop the port forward.

```shell
kill -9 ${PF_PID}
```

- Delete the workload.

```shell
kubectl delete --ignore-not-found --kustomize "${ACP_REPO_DIR}/platforms/gke/base/use-cases/inference-ref-arch/kubernetes-manifests/online-inference-gpu/vllm-spec-decoding/h100-llama-3-70b-it-sd-eagle"
```

## Troubleshooting

If you experience any issue while deploying the workload, see the
[Online inference with GPUs Troubleshooting](/docs/platforms/gke/base/use-cases/inference-ref-arch/online-inference-gpu/troubleshooting.md)
guide.

## Clean up

- Destroy the online GPU resources.

```shell
cd ${ACP_REPO_DIR}/platforms/gke/base/use-cases/inference-ref-arch/terraform/online_gpu && \
rm -rf .terraform/ terraform.tfstate* && \
terraform init && \
terraform destroy -auto-approve
```