diff --git a/README.md b/README.md index 43da7e9fe..f5d1dec83 100644 --- a/README.md +++ b/README.md @@ -206,6 +206,7 @@ Practical deployment and model usage guides for Nemotron models. |-------|----------|--------------|-----------| | [**Nemotron 3 Super 120B A12B**](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16) | Production deployments needing strong reasoning | 1M context, in NVFP4 single B200, RAG & tool calling | [Cookbooks](./usage-cookbook/Nemotron-3-Super) | | [**Nemotron 3 Nano 30B A3B**](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16) | Resource-constrained environments | 1M context, sparse MoE hybrid Mamba-2, controllable reasoning | [Cookbooks](./usage-cookbook/Nemotron-3-Nano) | +| [**Llama-3.1-Nemotron-Nano-8B-v1**](https://huggingface.co/nvidia/Llama-3.1-Nemotron-Nano-8B-v1) | Small-footprint OCI deployments | Validated on private OKE in Phoenix with `vLLM`, OCI Bastion service, tool calling, and OpenAI-compatible `/v1` inference; provides a reproducible OCI path comparable to common AWS GPU/Kubernetes deployment patterns | [Cookbooks](./usage-cookbook/Llama-3.1-Nemotron-Nano-8B-v1) | | [**NVIDIA-Nemotron-Nano-12B-v2-VL**](https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL) | Document intelligence and video understanding | 12B VLM, video reasoning, Efficient Video Sampling | [Cookbooks](./usage-cookbook/Nemotron-Nano2-VL/) | | [**Llama-3.1-Nemotron-Safety-Guard-8B-v3**](https://huggingface.co/nvidia/Llama-3.1-Nemotron-Safety-Guard-8B-v3) | Multilingual content moderation | 9 languages, 23 safety categories | [Cookbooks](./usage-cookbook/Llama-3.1-Nemotron-Safety-Guard-V3/) | | **Nemotron-Parse** | Document parsing for RAG and AI agents | Table extraction, semantic segmentation | [Cookbooks](./usage-cookbook/Nemotron-Parse-v1.1/) | diff --git a/usage-cookbook/Llama-3.1-Nemotron-Nano-8B-v1/README.md b/usage-cookbook/Llama-3.1-Nemotron-Nano-8B-v1/README.md new file mode 100644 index 
000000000..9f669737b --- /dev/null +++ b/usage-cookbook/Llama-3.1-Nemotron-Nano-8B-v1/README.md @@ -0,0 +1,874 @@ +# Llama-3.1-Nemotron-Nano-8B-v1 on OCI OKE (Private Deployment) + +This cookbook documents a validated private deployment of +`nvidia/Llama-3.1-Nemotron-Nano-8B-v1` on **Oracle Cloud Infrastructure (OCI)** +using a private OKE cluster, a single `VM.GPU.A10.1` worker, and `vLLM` with an +OpenAI-compatible `/v1` endpoint. + +Based on the [Deploy OpenAI vLLM Production Stack on OKE](https://docs.oracle.com/en/learn/deploy-vllm-production-stack-oke/index.html) +guide, customized for the Nemotron model with tool calling support. + +## Tested environment + +- Region: `us-phoenix-1` +- Kubernetes: OKE v1.31.10, enhanced cluster +- GPU shape: `VM.GPU.A10.1` (NVIDIA A10, 24 GB) +- CPU shape: `VM.Standard.E5.Flex` +- Model: `nvidia/Llama-3.1-Nemotron-Nano-8B-v1` +- Serving stack: `vLLM v0.19.0` +- Helm chart: `vllm/vllm-stack` 0.1.10 +- Inference API: OpenAI-compatible `/v1` + +## Validated capabilities + +- Chat completion +- Tool / function calling +- Streaming +- Async / concurrent requests +- OpenAI-compatible model discovery via `/v1/models` + +## Prerequisites + +- OCI tenancy with GPU capacity (`VM.GPU.A10.1`) +- `oci` CLI configured with a valid profile +- `kubectl`, `helm`, `ssh`, `jq` +- An SSH key pair (e.g., `~/.ssh/id_ed25519`) + +**Note:** The NVIDIA device plugin is pre-installed on OKE enhanced clusters. +No manual installation is required. 
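
A quick preflight check for the tools listed above can catch a missing binary before any OCI resources are created. The sketch below is an illustration only, not part of the validated runbook; `require_tools` is a hypothetical helper name.

```bash
# Hypothetical helper: report each missing command on stderr and
# return non-zero if any of the named tools is absent from PATH.
require_tools() {
  local missing=0 tool
  for tool in "$@"; do
    if ! command -v "${tool}" >/dev/null 2>&1; then
      echo "missing: ${tool}" >&2
      missing=1
    fi
  done
  return "${missing}"
}

# Tools named in the prerequisites list.
require_tools oci kubectl helm ssh jq \
  || echo "Install the missing tools before continuing."
```

The check only confirms that the binaries exist; it does not validate the OCI CLI profile or the SSH key pair.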
+ +## Architecture + +``` + ┌─────────────────────────────────────────────────┐ + │ VCN 10.0.0.0/16 │ + You ──SSH tunnel──► │ │ + (localhost:6443) │ ┌──────────┐ ┌──────────────────────────┐ │ + │ │ │ Bastion │ │ API subnet (private) │ │ + │ │ │ subnet │────►│ OKE control plane │ │ + │ │ │ (public) │ │ :6443 │ │ + ▼ │ └──────────┘ └──────────────────────────┘ │ + kubectl / curl │ │ + │ ┌──────────────────────────────────────────┐ │ + │ │ Worker subnet (private) │ │ + │ │ │ │ + │ │ ┌─────────────┐ ┌──────────────────┐ │ │ + │ │ │ CPU node │ │ GPU node (A10) │ │ │ + │ │ │ router pod │ │ Nemotron engine │ │ │ + │ │ └─────────────┘ └──────────────────┘ │ │ + │ └──────────────────────────────────────────┘ │ + └─────────────────────────────────────────────────┘ +``` + +## Step 1: Set environment variables + +```bash +export OCI_COMPARTMENT_ID="" +export OCI_REGION="us-phoenix-1" +export OCI_PROFILE="DEFAULT" # adjust to your OCI CLI profile +export CLUSTER_NAME="nemotron-phx" +export KUBERNETES_VERSION="v1.31.10" +``` + +## Step 2: Create VCN and networking + +```bash +VCN_ID=$(oci network vcn create \ + --compartment-id "${OCI_COMPARTMENT_ID}" \ + --display-name "${CLUSTER_NAME}-vcn" \ + --cidr-blocks '["10.0.0.0/16"]' \ + --dns-label "nemotron" \ + --profile "${OCI_PROFILE}" --region "${OCI_REGION}" \ + --query "data.id" --raw-output) + +IGW_ID=$(oci network internet-gateway create \ + --compartment-id "${OCI_COMPARTMENT_ID}" \ + --vcn-id "${VCN_ID}" \ + --display-name "${CLUSTER_NAME}-igw" \ + --is-enabled true \ + --profile "${OCI_PROFILE}" --region "${OCI_REGION}" \ + --query "data.id" --raw-output) + +NAT_ID=$(oci network nat-gateway create \ + --compartment-id "${OCI_COMPARTMENT_ID}" \ + --vcn-id "${VCN_ID}" \ + --display-name "${CLUSTER_NAME}-nat" \ + --profile "${OCI_PROFILE}" --region "${OCI_REGION}" \ + --query "data.id" --raw-output) + +SGW_SERVICE_ID=$(oci network service list \ + --profile "${OCI_PROFILE}" --region "${OCI_REGION}" \ + --query 
"data[?contains(name, 'All') && contains(name, 'Services')].id | [0]" \ + --raw-output) + +SGW_SERVICE_NAME=$(oci network service list \ + --profile "${OCI_PROFILE}" --region "${OCI_REGION}" \ + --query "data[?contains(name, 'All') && contains(name, 'Services')].\"cidr-block\" | [0]" \ + --raw-output) + +SGW_ID=$(oci network service-gateway create \ + --compartment-id "${OCI_COMPARTMENT_ID}" \ + --vcn-id "${VCN_ID}" \ + --display-name "${CLUSTER_NAME}-sgw" \ + --services "[{\"serviceId\": \"${SGW_SERVICE_ID}\"}]" \ + --profile "${OCI_PROFILE}" --region "${OCI_REGION}" \ + --query "data.id" --raw-output) + +PRIVATE_RT_ID=$(oci network route-table create \ + --compartment-id "${OCI_COMPARTMENT_ID}" \ + --vcn-id "${VCN_ID}" \ + --display-name "${CLUSTER_NAME}-private-rt" \ + --route-rules "[ + {\"cidrBlock\": \"0.0.0.0/0\", \"networkEntityId\": \"${NAT_ID}\"}, + {\"destination\": \"${SGW_SERVICE_NAME}\", \"destinationType\": \"SERVICE_CIDR_BLOCK\", \"networkEntityId\": \"${SGW_ID}\"} + ]" \ + --profile "${OCI_PROFILE}" --region "${OCI_REGION}" \ + --query "data.id" --raw-output) + +PUBLIC_RT_ID=$(oci network route-table create \ + --compartment-id "${OCI_COMPARTMENT_ID}" \ + --vcn-id "${VCN_ID}" \ + --display-name "${CLUSTER_NAME}-public-rt" \ + --route-rules "[{\"cidrBlock\": \"0.0.0.0/0\", \"networkEntityId\": \"${IGW_ID}\"}]" \ + --profile "${OCI_PROFILE}" --region "${OCI_REGION}" \ + --query "data.id" --raw-output) + +SL_ID=$(oci network security-list create \ + --compartment-id "${OCI_COMPARTMENT_ID}" \ + --vcn-id "${VCN_ID}" \ + --display-name "${CLUSTER_NAME}-sl" \ + --egress-security-rules '[{"destination": "0.0.0.0/0", "protocol": "all", "isStateless": false}]' \ + --ingress-security-rules '[ + {"source": "0.0.0.0/0", "protocol": "6", "isStateless": false, "tcpOptions": {"destinationPortRange": {"min": 22, "max": 22}}}, + {"source": "10.0.0.0/16", "protocol": "all", "isStateless": false}, + {"source": "10.244.0.0/16", "protocol": "all", "isStateless": false}, 
+ {"source": "10.96.0.0/16", "protocol": "all", "isStateless": false}, + {"source": "0.0.0.0/0", "protocol": "1", "isStateless": false, "icmpOptions": {"type": 3, "code": 4}} + ]' \ + --profile "${OCI_PROFILE}" --region "${OCI_REGION}" \ + --query "data.id" --raw-output) +``` + +Create four subnets: + +```bash +API_SUBNET_ID=$(oci network subnet create \ + --compartment-id "${OCI_COMPARTMENT_ID}" \ + --vcn-id "${VCN_ID}" \ + --display-name "${CLUSTER_NAME}-api-subnet" \ + --cidr-block "10.0.0.0/28" \ + --route-table-id "${PRIVATE_RT_ID}" \ + --security-list-ids "[\"${SL_ID}\"]" \ + --dns-label "kubeapi" \ + --prohibit-public-ip-on-vnic true \ + --profile "${OCI_PROFILE}" --region "${OCI_REGION}" \ + --query "data.id" --raw-output) + +WORKER_SUBNET_ID=$(oci network subnet create \ + --compartment-id "${OCI_COMPARTMENT_ID}" \ + --vcn-id "${VCN_ID}" \ + --display-name "${CLUSTER_NAME}-worker-subnet" \ + --cidr-block "10.0.10.0/24" \ + --route-table-id "${PRIVATE_RT_ID}" \ + --security-list-ids "[\"${SL_ID}\"]" \ + --dns-label "workers" \ + --prohibit-public-ip-on-vnic true \ + --profile "${OCI_PROFILE}" --region "${OCI_REGION}" \ + --query "data.id" --raw-output) + +LB_SUBNET_ID=$(oci network subnet create \ + --compartment-id "${OCI_COMPARTMENT_ID}" \ + --vcn-id "${VCN_ID}" \ + --display-name "${CLUSTER_NAME}-lb-subnet" \ + --cidr-block "10.0.20.0/24" \ + --route-table-id "${PUBLIC_RT_ID}" \ + --security-list-ids "[\"${SL_ID}\"]" \ + --dns-label "loadbalancers" \ + --profile "${OCI_PROFILE}" --region "${OCI_REGION}" \ + --query "data.id" --raw-output) + +BASTION_SUBNET_ID=$(oci network subnet create \ + --compartment-id "${OCI_COMPARTMENT_ID}" \ + --vcn-id "${VCN_ID}" \ + --display-name "${CLUSTER_NAME}-bastion-subnet" \ + --cidr-block "10.0.30.0/24" \ + --route-table-id "${PUBLIC_RT_ID}" \ + --security-list-ids "[\"${SL_ID}\"]" \ + --dns-label "bastion" \ + --profile "${OCI_PROFILE}" --region "${OCI_REGION}" \ + --query "data.id" --raw-output) +``` + +## Step 3: 
Create private OKE cluster + +```bash +oci ce cluster create \ + --compartment-id "${OCI_COMPARTMENT_ID}" \ + --name "${CLUSTER_NAME}" \ + --vcn-id "${VCN_ID}" \ + --kubernetes-version "${KUBERNETES_VERSION}" \ + --endpoint-subnet-id "${API_SUBNET_ID}" \ + --service-lb-subnet-ids "[\"${LB_SUBNET_ID}\"]" \ + --endpoint-public-ip-enabled false \ + --profile "${OCI_PROFILE}" --region "${OCI_REGION}" +``` + +Wait for the cluster to become ACTIVE (~10 minutes): + +```bash +# Poll until ACTIVE +CLUSTER_ID=$(oci ce cluster list \ + --compartment-id "${OCI_COMPARTMENT_ID}" \ + --name "${CLUSTER_NAME}" \ + --profile "${OCI_PROFILE}" --region "${OCI_REGION}" \ + --query 'data[0].id' --raw-output) + +watch -n 30 "oci ce cluster get --cluster-id ${CLUSTER_ID} \ + --profile ${OCI_PROFILE} --region ${OCI_REGION} \ + --query 'data.\"lifecycle-state\"' --raw-output" +``` + +**Do not proceed to Step 5 until the cluster is ACTIVE.** + +## Step 4: Create OCI Bastion + +The bastion is placed on the public bastion subnet so the OCI Bastion managed +service can accept inbound SSH connections. The port-forwarding session then +tunnels traffic to the private API endpoint over VCN-internal routing. + +> Known issue on OpenSSH 10.x (macOS 15+, some recent Linux): port-forwarding +> sessions close immediately after auth. If `ssh -V` reports 10.x, use +> [Appendix A](#appendix-a-jump-host-vm-alternative-openssh-10x) instead. 
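
The version check mentioned in the note can be scripted. This is a minimal sketch that assumes the usual `OpenSSH_<major>.<minor>...` banner format; `openssh_major` is a hypothetical helper, and an unrecognized banner is treated as "not 10.x".

```bash
# Hypothetical helper: extract the major version from an `ssh -V` banner,
# e.g. "OpenSSH_9.6p1, LibreSSL 3.3.6" -> "9". Prints nothing if no match.
openssh_major() {
  printf '%s\n' "$1" | sed -n 's/^OpenSSH_\([0-9][0-9]*\)\..*/\1/p'
}

# ssh prints its version banner on stderr, so redirect it to stdout.
banner="$(ssh -V 2>&1 || true)"
major="$(openssh_major "${banner}")"

if [ "${major:-0}" -eq 10 ]; then
  echo "OpenSSH 10.x detected: use the Appendix A jump host instead of Step 4."
else
  echo "OpenSSH ${major:-unknown}: continue with the OCI Bastion path."
fi
```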
+ +```bash +BASTION_ID=$(oci bastion bastion create \ + --compartment-id "${OCI_COMPARTMENT_ID}" \ + --bastion-type STANDARD \ + --target-subnet-id "${BASTION_SUBNET_ID}" \ + --name "${CLUSTER_NAME}-bastion" \ + --client-cidr-list "[\"$(curl -s https://ifconfig.me)/32\"]" \ + --profile "${OCI_PROFILE}" --region "${OCI_REGION}" \ + --query "data.id" --raw-output) +``` + +## Step 5: Create node pools + +Find the GPU-compatible node image and create both pools: + +```bash +GPU_IMAGE_ID=$(oci ce node-pool-options get \ + --node-pool-option-id all \ + --compartment-id "${OCI_COMPARTMENT_ID}" \ + --profile "${OCI_PROFILE}" --region "${OCI_REGION}" \ + --query "data.sources[?contains(\"source-name\", 'GPU') && \ + contains(\"source-name\", 'OKE-${KUBERNETES_VERSION#v}')].\"image-id\" | [0]" \ + --raw-output) + +CPU_IMAGE_ID=$(oci ce node-pool-options get \ + --node-pool-option-id all \ + --compartment-id "${OCI_COMPARTMENT_ID}" \ + --profile "${OCI_PROFILE}" --region "${OCI_REGION}" \ + --query "data.sources[?contains(\"source-name\", 'OKE-${KUBERNETES_VERSION#v}') && \ + !contains(\"source-name\", 'GPU') && \ + contains(\"source-name\", 'aarch64')==\`false\`].\"image-id\" | [0]" \ + --raw-output) + +# Verify both image IDs were found +echo "GPU image: ${GPU_IMAGE_ID}" +echo "CPU image: ${CPU_IMAGE_ID}" +# If either is empty, list available images and pick manually: +# oci ce node-pool-options get --node-pool-option-id all \ +# --compartment-id "${OCI_COMPARTMENT_ID}" \ +# --profile "${OCI_PROFILE}" --region "${OCI_REGION}" \ +# --query "data.sources[?contains(\"source-name\", 'OKE-${KUBERNETES_VERSION#v}')].{name:\"source-name\",id:\"image-id\"}" \ +# --output table + +# Pick an availability domain with A10 capacity. +# Iterate through ADs and use the first one with capacity available. 
+AD="" +for CANDIDATE in $(oci iam availability-domain list \ + --compartment-id "${OCI_COMPARTMENT_ID}" \ + --profile "${OCI_PROFILE}" --region "${OCI_REGION}" \ + --query 'data[].name' --raw-output | jq -r '.[]'); do + AVAIL=$(oci limits resource-availability get \ + --compartment-id "${OCI_COMPARTMENT_ID}" \ + --service-name compute --limit-name gpu-a10-count \ + --availability-domain "${CANDIDATE}" \ + --profile "${OCI_PROFILE}" --region "${OCI_REGION}" \ + --query 'data.available' --raw-output 2>/dev/null) + if [[ "${AVAIL}" =~ ^[0-9]+$ ]] && (( AVAIL > 0 )); then + AD="${CANDIDATE}" + echo "Selected AD with ${AVAIL} A10s available: ${AD}" + break + fi +done +[[ -z "${AD}" ]] && { echo "No AD with A10 capacity in ${OCI_REGION}"; exit 1; } + +# CPU node pool (boot volume >= 100 GB for the router image) +oci ce node-pool create \ + --compartment-id "${OCI_COMPARTMENT_ID}" \ + --cluster-id "${CLUSTER_ID}" \ + --name "cpu-pool" \ + --kubernetes-version "${KUBERNETES_VERSION}" \ + --node-shape "VM.Standard.E5.Flex" \ + --node-shape-config '{"ocpus": 2, "memoryInGBs": 16}' \ + --node-image-id "${CPU_IMAGE_ID}" \ + --node-boot-volume-size-in-gbs 100 \ + --size 1 \ + --placement-configs "[{\"availabilityDomain\": \"${AD}\", \"subnetId\": \"${WORKER_SUBNET_ID}\"}]" \ + --profile "${OCI_PROFILE}" --region "${OCI_REGION}" + +# GPU node pool (boot volume 200 GB) +oci ce node-pool create \ + --compartment-id "${OCI_COMPARTMENT_ID}" \ + --cluster-id "${CLUSTER_ID}" \ + --name "gpu-pool" \ + --kubernetes-version "${KUBERNETES_VERSION}" \ + --node-shape "VM.GPU.A10.1" \ + --node-image-id "${GPU_IMAGE_ID}" \ + --node-boot-volume-size-in-gbs 200 \ + --size 1 \ + --placement-configs "[{\"availabilityDomain\": \"${AD}\", \"subnetId\": \"${WORKER_SUBNET_ID}\"}]" \ + --initial-node-labels '[{"key": "app", "value": "gpu"}, {"key": "nvidia.com/gpu", "value": "true"}]' \ + --profile "${OCI_PROFILE}" --region "${OCI_REGION}" +``` + +Wait for both node pools to show nodes as ACTIVE (~10 
minutes): + +```bash +watch -n 30 "oci ce node-pool list \ + --compartment-id ${OCI_COMPARTMENT_ID} \ + --cluster-id ${CLUSTER_ID} \ + --profile ${OCI_PROFILE} --region ${OCI_REGION} \ + --query 'data[].{name:name,nodes:nodes[].{ip:\"private-ip\",state:\"lifecycle-state\"}}'" +``` + +**Do not proceed to Step 6 until both node pools show nodes as ACTIVE.** + +**Important:** The CPU boot volume must be at least **100 GB**. The vLLM router +image is ~10.5 GB and the default 47 GB boot volume causes pod eviction. + +## Step 6: Connect to the private cluster + +Download kubeconfig and configure for tunnel access: + +```bash +oci ce cluster create-kubeconfig \ + --cluster-id "${CLUSTER_ID}" \ + --file ~/.kube/config-nemotron \ + --region "${OCI_REGION}" \ + --token-version 2.0.0 \ + --kube-endpoint PRIVATE_ENDPOINT \ + --profile "${OCI_PROFILE}" --overwrite + +export KUBECONFIG=~/.kube/config-nemotron + +# Get the private endpoint IP +PRIVATE_IP=$(oci ce cluster get --cluster-id "${CLUSTER_ID}" \ + --profile "${OCI_PROFILE}" --region "${OCI_REGION}" \ + --query 'data.endpoints."private-endpoint"' --raw-output | cut -d: -f1) + +# Update kubeconfig to use localhost tunnel +CLUSTER_CTX=$(kubectl config view --minify -o jsonpath='{.clusters[0].name}') +kubectl config set-cluster "${CLUSTER_CTX}" \ + --server=https://127.0.0.1:6443 \ + --insecure-skip-tls-verify=true +``` + +If your OCI CLI profile is not `DEFAULT`, add it to the kubeconfig: + +```yaml +# In the users[].user.exec section, replace env: [] with: +env: + - name: OCI_CLI_PROFILE + value: YOUR_PROFILE +``` + +Create a Bastion session and start the SSH tunnel: + +```bash +SESSION_ID=$(oci bastion session create-port-forwarding \ + --bastion-id "${BASTION_ID}" \ + --target-private-ip "${PRIVATE_IP}" \ + --target-port 6443 \ + --session-ttl 10800 \ + --display-name "nemotron-kubectl" \ + --ssh-public-key-file ~/.ssh/id_ed25519.pub \ + --profile "${OCI_PROFILE}" --region "${OCI_REGION}" \ + --query "data.id" 
--raw-output) + +# Wait for session to become ACTIVE, then start tunnel +ssh -i ~/.ssh/id_ed25519 -N -L 6443:${PRIVATE_IP}:6443 \ + -p 22 -o StrictHostKeyChecking=no -o ServerAliveInterval=30 \ + ${SESSION_ID}@host.bastion.${OCI_REGION}.oci.oraclecloud.com & + +# Verify +kubectl get nodes +``` + +**Note:** Bastion sessions expire after the TTL (3 hours here, set by +`--session-ttl 10800`). Create a new session and restart the tunnel when +access drops. + +## Step 7: Expand boot volume filesystems + +The node root filesystem is sized for the default ~47 GB boot volume, which +leaves only ~36 GiB usable regardless of the requested boot volume size. +Both nodes must be expanded. + +**Why this matters:** The vLLM engine image is ~10 GB, the router image is +~10.5 GB, and the model weights are ~16 GB. Without expansion, pods get +evicted for low ephemeral storage. + +For each node, run the following (use a unique pod name per node): + +```bash +NODE_IP= +POD_NAME=expand-$(echo $NODE_IP | tr '.' '-') + +kubectl run ${POD_NAME} --restart=Never \ + --image=busybox:latest \ + --overrides="{ + \"spec\":{ + \"nodeName\":\"${NODE_IP}\", + \"tolerations\":[{\"operator\":\"Exists\"}], + \"containers\":[{ + \"name\":\"expand\", + \"image\":\"busybox:latest\", + \"command\":[\"sleep\",\"600\"], + \"securityContext\":{\"privileged\":true}, + \"volumeMounts\":[{\"name\":\"host\",\"mountPath\":\"/host\"}] + }], + \"volumes\":[{\"name\":\"host\",\"hostPath\":{\"path\":\"/\"}}] + } + }" + +kubectl wait --for=condition=Ready pod/${POD_NAME} --timeout=60s + +kubectl exec ${POD_NAME} -- chroot /host bash -c ' + growpart /dev/sda 3 + sleep 3 + pvresize /dev/sda3 + lvextend -l +100%FREE /dev/ocivolume/root + xfs_growfs / + df -h / +' + +kubectl delete pod ${POD_NAME} --force +``` + +Repeat for each node. Expected results: + +- GPU node (200 GB boot volume): ~36 GiB → ~189 GiB usable +- CPU node (100 GB boot volume): ~36 GiB → ~89 GiB usable + +Kubelet caches capacity at startup; an in-place `systemctl restart kubelet` +does not refresh it. See Step 7b. 
+ +## Step 7b: Soft-reset each node so kubelet re-reads disk capacity + +Drain each node, soft-reset the VM, wait for Ready, uncordon: + +```bash +for NODE_IP in ; do + # Resolve the OCI instance OCID via the node's providerID, which OKE + # sets to oci://. (`oci ce node-pool list` does not + # populate the nested `nodes` array, so a list-based lookup returns + # null; `get` per pool also works but is noisier.) + INSTANCE_ID=$(kubectl get node "${NODE_IP}" \ + -o jsonpath='{.spec.providerID}' | sed 's|^oci://||') + + kubectl cordon "${NODE_IP}" + kubectl drain "${NODE_IP}" --ignore-daemonsets --delete-emptydir-data \ + --force --grace-period=30 --timeout=120s || true + + oci compute instance action \ + --instance-id "${INSTANCE_ID}" --action SOFTRESET \ + --profile "${OCI_PROFILE}" --region "${OCI_REGION}" + + # Wait for VM RUNNING, then for node Ready + until [[ "$(oci compute instance get --instance-id "${INSTANCE_ID}" \ + --profile "${OCI_PROFILE}" --region "${OCI_REGION}" \ + --query 'data."lifecycle-state"' --raw-output)" == "RUNNING" ]]; do + sleep 15 + done + until kubectl get node "${NODE_IP}" \ + -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}' \ + | grep -q True; do + sleep 15 + done + + kubectl uncordon "${NODE_IP}" +done +``` + +Verify kubelet picked up the expanded capacity before continuing: + +```bash +for NODE in $(kubectl get nodes -o jsonpath='{.items[*].metadata.name}'); do + CAP=$(kubectl get node "${NODE}" \ + -o jsonpath='{.status.capacity.ephemeral-storage}') + echo "${NODE}: ${CAP}" +done +``` + +Expected: CPU node ~`93476416Ki` (~89 GiB), GPU node ~`198056192Ki` (~189 GiB). +If either still shows ~`37206272Ki`, rerun the soft-reset for that node. 
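
Kubelet reports capacity in `Ki`, which is awkward to compare against the GiB figures quoted above. The helper below is a small illustrative sketch (`ki_to_gib` is a hypothetical name) that converts a `Ki` quantity to GiB with one decimal place.

```bash
# Hypothetical helper: convert a Kubernetes quantity such as "93476416Ki"
# to GiB. Strips the "Ki" suffix, then divides by 1024 * 1024 (Ki per GiB).
ki_to_gib() {
  awk -v ki="${1%Ki}" 'BEGIN { printf "%.1f\n", ki / 1048576 }'
}

ki_to_gib 93476416Ki    # CPU node after expansion, prints 89.1
ki_to_gib 198056192Ki   # GPU node after expansion, prints 188.9
ki_to_gib 37206272Ki    # stale pre-expansion value, prints 35.5
```

The helper only handles the `Ki` suffix, which is what the capacity values quoted in this step use.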
+ +## Step 8: Create StorageClasses + +```bash +kubectl apply -f - <<'EOF' +apiVersion: storage.k8s.io/v1 +kind: StorageClass +metadata: + name: oci-block-storage-enc +provisioner: blockvolume.csi.oraclecloud.com +parameters: + vpusPerGB: "10" +reclaimPolicy: Delete +volumeBindingMode: WaitForFirstConsumer +allowVolumeExpansion: true +EOF +``` + +## Step 9: Patch CoreDNS for GPU tolerations + +```bash +kubectl patch deployment coredns -n kube-system --type='json' \ + -p='[{"op":"add","path":"/spec/template/spec/tolerations/-", + "value":{"key":"nvidia.com/gpu","operator":"Exists","effect":"NoSchedule"}}]' + +kubectl patch deployment kube-dns-autoscaler -n kube-system --type='json' \ + -p='[{"op":"add","path":"/spec/template/spec/tolerations/-", + "value":{"key":"nvidia.com/gpu","operator":"Exists","effect":"NoSchedule"}}]' +``` + +## Step 10: Create the templates PVC + +The `vllm-stack` chart (0.1.10) mounts a `vllm-templates-pvc` volume in every +engine pod. This PVC must exist before deploying: + +```bash +kubectl apply -f - <<'EOF' +apiVersion: v1 +kind: PersistentVolumeClaim +metadata: + name: vllm-templates-pvc + namespace: default +spec: + accessModes: + - ReadWriteOnce + storageClassName: oci-block-storage-enc + resources: + requests: + storage: 1Gi +EOF +``` + +## Step 11: Deploy vLLM + +The checked-in values file +[`vllm_oke_phoenix_private_values.yaml`](./vllm_oke_phoenix_private_values.yaml) +contains the validated configuration for this deployment. + +```bash +helm repo add vllm https://vllm-project.github.io/production-stack +helm repo update + +helm upgrade --install vllm vllm/vllm-stack \ + -n default \ + -f vllm_oke_phoenix_private_values.yaml +``` + +**Do not** pass `--wait` to Helm. The engine pod takes several minutes to pull +the image (~10 GB) and download the model. 
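
Because `--wait` is not used, a bounded polling loop can stand in for it. The sketch below is an assumption-laden illustration, not part of the validated steps: `all_pods_ready` and `wait_for_pods` are hypothetical helpers, the 15-second interval and 80-attempt default are arbitrary, and the check treats every pod in the `default` namespace as part of the release.

```bash
# Hypothetical helper: read `kubectl get pods --no-headers` output on stdin
# (columns: NAME READY STATUS ...) and succeed only if at least one pod is
# listed and every pod is Running with all of its containers ready.
all_pods_ready() {
  awk '
    { split($2, r, "/"); seen = 1 }
    $3 != "Running" || r[1] != r[2] { bad = 1 }
    END { exit ((seen && !bad) ? 0 : 1) }
  '
}

# Hypothetical helper: poll every 15 s, up to $1 attempts (default 80,
# roughly 20 minutes). Returns 0 once all pods are ready, 1 on timeout.
wait_for_pods() {
  attempts="${1:-80}"
  while [ "${attempts}" -gt 0 ]; do
    kubectl get pods -n default --no-headers 2>/dev/null \
      | all_pods_ready && return 0
    attempts=$((attempts - 1))
    sleep 15
  done
  return 1
}
```

Calling `wait_for_pods || kubectl get pods -n default` after the `helm upgrade --install` command prints the pod list if the timeout is reached.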
+ +Monitor progress: + +```bash +kubectl get pods -n default -w +``` + +Wait for both pods to show `1/1 Running`: + +- `vllm-deployment-router-*` — request router (CPU node) +- `vllm-llama31-nemotron-nano-8b-deployment-vllm-*` — model engine (GPU node) + +## Step 12: Validate + +```bash +kubectl -n default port-forward svc/vllm-router-service 8080:80 +``` + +Health check: + +```bash +curl -s http://127.0.0.1:8080/health +# {"status":"healthy"} +``` + +Model discovery: + +```bash +curl -s http://127.0.0.1:8080/v1/models | jq . +``` + +Chat completion: + +```bash +curl -s http://127.0.0.1:8080/v1/chat/completions \ + -H 'Content-Type: application/json' \ + -d '{ + "model": "nvidia/Llama-3.1-Nemotron-Nano-8B-v1", + "messages": [{"role": "user", "content": "Reply with NEMOTRON_OK"}] + }' +``` + +Tool-calling smoke test: + +```bash +curl -s http://127.0.0.1:8080/v1/chat/completions \ + -H 'Content-Type: application/json' \ + -d '{ + "model": "nvidia/Llama-3.1-Nemotron-Nano-8B-v1", + "messages": [{"role": "user", "content": "What time is it in UTC?"}], + "tools": [{ + "type": "function", + "function": { + "name": "get_utc_time", + "description": "Return the current UTC time", + "parameters": {"type": "object", "properties": {}, "required": []} + } + }] + }' +``` + +Expected: `finish_reason` set to `tool_calls`. 
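
The expected result can be checked mechanically. The sketch below uses a plain text match so it has no dependencies; `is_tool_call` is a hypothetical helper and the sample body is an illustrative fragment, not output captured from this deployment. Since `jq` is already a prerequisite, `jq -r '.choices[0].finish_reason'` is the stricter way to read the same field.

```bash
# Hypothetical helper: succeed if a chat-completions response body reports
# finish_reason "tool_calls". Crude text match; tolerates optional
# whitespace around the colon.
is_tool_call() {
  printf '%s' "$1" \
    | grep -Eq '"finish_reason"[[:space:]]*:[[:space:]]*"tool_calls"'
}

# Illustrative response fragment (an assumption, not real server output).
sample='{"choices":[{"finish_reason":"tool_calls","message":{"tool_calls":[]}}]}'

if is_tool_call "${sample}"; then
  echo "tool call detected"
else
  echo "no tool call"
fi
```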
+ +## Key vLLM settings + +| Setting | Value | Why | +|---------|-------|-----| +| `tag` | `v0.19.0` | Pinned to validated vLLM version | +| `maxModelLen` | `4096` | Conservative context to fit single A10 (24 GB) | +| `gpuMemoryUtilization` | `0.95` | Maximize GPU memory for KV cache | +| `enableTool` | `true` | Enable tool / function calling | +| `toolCallParser` | `llama3_json` | Parser matching Nemotron's tool format | +| `extraArgs` | `--chat-template=...` | Template passed as CLI arg (chart's `chatTemplate` field prepends `/templates/`) | +| `storageClass` | `oci-block-storage-enc` | OCI Block Volume with balanced performance | + +## Troubleshooting + +### Pods evicted for ephemeral storage + +The node root filesystem is sized for the default ~47 GB boot volume and +exposes only ~36 GiB until it is expanded. Follow Step 7 to expand. If the +boot volume itself is too small (default 47 GB), resize it first via the OCI +CLI, then rescan the block device before running `growpart`: + +```bash +echo 1 > /sys/class/block/sda/device/rescan +``` + +### Engine pod evicted mid image pull despite Step 7 reporting success + +Symptoms: engine pod reaches `ContainerCreating`, then kubelet evicts it with +`The node was low on resource: ephemeral-storage` (or `inodes`), and +`FreeDiskSpaceFailed: ... but only found 0 bytes eligible to free`. + +Cause: kubelet's `Node.Capacity.ephemeral-storage` is cached at startup. Even +after Step 7 expands the filesystem to ~189 GiB, kubelet continues to report +the original ~35.5 GiB and applies eviction thresholds against the stale value. +Confirm with: + +```bash +kubectl describe node | grep "ephemeral-storage:" +``` + +If the value is ~`37206272Ki`, apply Step 7b (soft-reset the VM). An in-place +`systemctl restart kubelet` does **not** refresh the capacity. + +### SSH tunnel to OCI Bastion closes immediately after authentication + +Symptoms: `ssh -N -L 6443:... 
@host.bastion..oci.oraclecloud.com` +completes publickey auth, reports +`Local forwarding listening on 127.0.0.1 port 6443`, then: +`Connection to host.bastion..oci.oraclecloud.com closed by remote host.` +Port 6443 never stays open on the client. + +Cause: OpenSSH 10.x (shipped on macOS 15+ and recent Linux distros) is +incompatible with OCI Bastion's Go SSH server implementation for +port-forwarding sessions. + +Workaround: use the jump-host VM path in +[Appendix A](#appendix-a-jump-host-vm-alternative-openssh-10x). Downgrading +the client to OpenSSH 9.x also works but is typically impractical on macOS. + +### Engine pod stays Pending with PVC not found + +The `vllm-stack` chart (0.1.10) requires `vllm-templates-pvc` to exist +before the engine pod can schedule. See Step 10. + +### Engine pod crashes with chat template error + +The chart's `chatTemplate` field prepends `/templates/` to the path. Pass +the template via `vllmConfig.extraArgs` instead: + +```yaml +vllmConfig: + extraArgs: + - "--chat-template=/vllm-workspace/examples/tool_chat_template_llama3.1_json.jinja" +``` + +### Tool calling does not work + +Ensure all of these are set in the values file: + +- `enableTool: true` +- `toolCallParser: llama3_json` +- `--chat-template=...` in `vllmConfig.extraArgs` + +### `kubectl` cannot reach the cluster + +Re-establish the Bastion tunnel. Sessions expire after the configured TTL. + +### Helm upgrade fails with field manager conflict + +Uninstall and reinstall: + +```bash +helm uninstall vllm -n default +helm install vllm vllm/vllm-stack -n default -f vllm_oke_phoenix_private_values.yaml +``` + +## Cleanup + +To tear down all resources: + +```bash +# 1. Uninstall Helm release and PVCs +helm uninstall vllm -n default +kubectl delete pvc --all -n default + +# 2. 
List and delete node pools +oci ce node-pool list --compartment-id "${OCI_COMPARTMENT_ID}" \ + --cluster-id "${CLUSTER_ID}" \ + --profile "${OCI_PROFILE}" --region "${OCI_REGION}" \ + --query 'data[].{name:name,id:id}' --output table + +oci ce node-pool delete --node-pool-id --force \ + --profile "${OCI_PROFILE}" --region "${OCI_REGION}" +oci ce node-pool delete --node-pool-id --force \ + --profile "${OCI_PROFILE}" --region "${OCI_REGION}" + +# 3. Wait for node pools, then delete cluster +oci ce cluster delete --cluster-id "${CLUSTER_ID}" --force \ + --profile "${OCI_PROFILE}" --region "${OCI_REGION}" + +# 4. Delete bastion +oci bastion bastion delete --bastion-id "${BASTION_ID}" --force \ + --profile "${OCI_PROFILE}" --region "${OCI_REGION}" + +# 5. Wait for cluster deletion, then delete networking +# Delete subnets first, then route tables, gateways, and VCN +for SUBNET_ID in "${API_SUBNET_ID}" "${WORKER_SUBNET_ID}" \ + "${LB_SUBNET_ID}" "${BASTION_SUBNET_ID}"; do + oci network subnet delete --subnet-id "${SUBNET_ID}" --force \ + --profile "${OCI_PROFILE}" --region "${OCI_REGION}" +done + +# Delete the non-default route tables and security list, then the gateways, +# then the VCN (assumes the Step 2 variables are still set in this shell) +for RT_ID in "${PRIVATE_RT_ID}" "${PUBLIC_RT_ID}"; do + oci network route-table delete --rt-id "${RT_ID}" --force \ + --profile "${OCI_PROFILE}" --region "${OCI_REGION}" +done +oci network security-list delete --security-list-id "${SL_ID}" --force \ + --profile "${OCI_PROFILE}" --region "${OCI_REGION}" +oci network service-gateway delete --service-gateway-id "${SGW_ID}" --force \ + --profile "${OCI_PROFILE}" --region "${OCI_REGION}" +oci network nat-gateway delete --nat-gateway-id "${NAT_ID}" --force \ + --profile "${OCI_PROFILE}" --region "${OCI_REGION}" +oci network internet-gateway delete --ig-id "${IGW_ID}" --force \ + --profile "${OCI_PROFILE}" --region "${OCI_REGION}" +oci network vcn delete --vcn-id "${VCN_ID}" --force \ + --profile "${OCI_PROFILE}" --region "${OCI_REGION}" +``` + +## Alternative: Terraform + +A Terraform sample using the `oracle-terraform-modules/oke/oci` module is +available in [`terraform/`](./terraform/) for reference. Note that the +module's NSG configuration requires its built-in bastion compute host +(`create_bastion = true`) for OCI Bastion port-forwarding to work. The +manual CLI approach above is recommended for initial deployments. + +## Appendix A: Jump-host VM alternative (OpenSSH 10.x) + +Use this when `ssh -V` reports OpenSSH 10.x. Replaces Step 4 and the +bastion-session block in Step 6. + +Trade-off: this is a public-IP VM, not OCI's managed bastion service. +Terminate it during cleanup. 
+ +### A.1 Launch the jump-host VM (replaces Step 4) + +```bash +OL_IMAGE_ID=$(oci compute image list \ + --compartment-id "${OCI_COMPARTMENT_ID}" \ + --operating-system "Oracle Linux" --operating-system-version "9" \ + --shape "VM.Standard.E5.Flex" \ + --profile "${OCI_PROFILE}" --region "${OCI_REGION}" \ + --query 'data[?"lifecycle-state"==`AVAILABLE`] | sort_by(@, &"time-created") | [-1].id' \ + --raw-output) + +SSH_PUB=$(cat ~/.ssh/id_ed25519.pub) +METADATA=$(jq -cn --arg k "${SSH_PUB}" '{"ssh_authorized_keys": $k}') + +oci compute instance launch \ + --compartment-id "${OCI_COMPARTMENT_ID}" \ + --availability-domain "${AD}" \ + --display-name "${CLUSTER_NAME}-jumphost" \ + --shape "VM.Standard.E5.Flex" \ + --shape-config '{"ocpus":1,"memoryInGBs":8}' \ + --image-id "${OL_IMAGE_ID}" \ + --subnet-id "${BASTION_SUBNET_ID}" \ + --assign-public-ip true \ + --metadata "${METADATA}" \ + --profile "${OCI_PROFILE}" --region "${OCI_REGION}" \ + --wait-for-state RUNNING + +JUMP_HOST_ID=$(oci compute instance list \ + --compartment-id "${OCI_COMPARTMENT_ID}" \ + --display-name "${CLUSTER_NAME}-jumphost" \ + --profile "${OCI_PROFILE}" --region "${OCI_REGION}" \ + --query 'data[?"lifecycle-state"==`RUNNING`] | [0].id' --raw-output) + +VNIC_ID=$(oci compute vnic-attachment list \ + --compartment-id "${OCI_COMPARTMENT_ID}" --instance-id "${JUMP_HOST_ID}" \ + --profile "${OCI_PROFILE}" --region "${OCI_REGION}" \ + --query 'data[0]."vnic-id"' --raw-output) + +JUMP_HOST_IP=$(oci network vnic get --vnic-id "${VNIC_ID}" \ + --profile "${OCI_PROFILE}" --region "${OCI_REGION}" \ + --query 'data."public-ip"' --raw-output) + +echo "Jump host public IP: ${JUMP_HOST_IP}" + +# cloud-init may still be copying authorized_keys when the VM first reports +# RUNNING — wait for port 22 to accept connections before using ssh. 
+until nc -z -G 3 "${JUMP_HOST_IP}" 22 2>/dev/null; do sleep 2; done +``` + +### A.2 Open the tunnel through the jump-host (replaces Step 6 bastion block) + +Run Step 6 up through the `kubectl config set-cluster` server-URL rewrite, +then skip the `oci bastion session` block and tunnel directly: + +```bash +nohup ssh -f -N -L 6443:${PRIVATE_IP}:6443 \ + -i ~/.ssh/id_ed25519 \ + -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null \ + -o IdentitiesOnly=yes -o ServerAliveInterval=30 \ + -o ExitOnForwardFailure=yes \ + opc@${JUMP_HOST_IP} < /dev/null > /tmp/nemotron-ssh-tunnel.log 2>&1 + +nc -z 127.0.0.1 6443 && echo "tunnel up" || echo "tunnel failed" +kubectl get nodes +``` + +No session TTL; restart the tunnel after a laptop sleep or network change. + +### A.3 Cleanup addition + +When running the cleanup steps, also terminate the jump-host: + +```bash +oci compute instance terminate --instance-id "${JUMP_HOST_ID}" --force \ + --preserve-boot-volume false \ + --profile "${OCI_PROFILE}" --region "${OCI_REGION}" +``` diff --git a/usage-cookbook/Llama-3.1-Nemotron-Nano-8B-v1/terraform/.gitignore b/usage-cookbook/Llama-3.1-Nemotron-Nano-8B-v1/terraform/.gitignore new file mode 100644 index 000000000..1a22d40bb --- /dev/null +++ b/usage-cookbook/Llama-3.1-Nemotron-Nano-8B-v1/terraform/.gitignore @@ -0,0 +1,5 @@ +.terraform/ +terraform.tfvars +terraform.tfstate +terraform.tfstate.* +tfplan diff --git a/usage-cookbook/Llama-3.1-Nemotron-Nano-8B-v1/terraform/.terraform.lock.hcl b/usage-cookbook/Llama-3.1-Nemotron-Nano-8B-v1/terraform/.terraform.lock.hcl new file mode 100644 index 000000000..a539a5863 --- /dev/null +++ b/usage-cookbook/Llama-3.1-Nemotron-Nano-8B-v1/terraform/.terraform.lock.hcl @@ -0,0 +1,145 @@ +# This file is maintained automatically by "terraform init". +# Manual edits may be lost in future updates. 
+ +provider "registry.terraform.io/hashicorp/cloudinit" { + version = "2.3.7" + constraints = ">= 2.2.0" + hashes = [ + "h1:M9TpQxKAE/hyOwytdX9MUNZw30HoD/OXqYIug5fkqH8=", + "zh:06f1c54e919425c3139f8aeb8fcf9bceca7e560d48c9f0c1e3bb0a8ad9d9da1e", + "zh:0e1e4cf6fd98b019e764c28586a386dc136129fef50af8c7165a067e7e4a31d5", + "zh:1871f4337c7c57287d4d67396f633d224b8938708b772abfc664d1f80bd67edd", + "zh:2b9269d91b742a71b2248439d5e9824f0447e6d261bfb86a8a88528609b136d1", + "zh:3d8ae039af21426072c66d6a59a467d51f2d9189b8198616888c1b7fc42addc7", + "zh:3ef4e2db5bcf3e2d915921adced43929214e0946a6fb11793085d9a48995ae01", + "zh:42ae54381147437c83cbb8790cc68935d71b6357728a154109d3220b1beb4dc9", + "zh:4496b362605ae4cbc9ef7995d102351e2fe311897586ffc7a4a262ccca0c782a", + "zh:652a2401257a12706d32842f66dac05a735693abcb3e6517d6b5e2573729ba13", + "zh:7406c30806f5979eaed5f50c548eced2ea18ea121e01801d2f0d4d87a04f6a14", + "zh:7848429fd5a5bcf35f6fee8487df0fb64b09ec071330f3ff240c0343fe2a5224", + "zh:78d5eefdd9e494defcb3c68d282b8f96630502cac21d1ea161f53cfe9bb483b3", + ] +} + +provider "registry.terraform.io/hashicorp/helm" { + version = "3.1.1" + constraints = ">= 3.0.1" + hashes = [ + "h1:47CqNwkxctJtL/N/JuEj+8QMg8mRNI/NWeKO5/ydfZU=", + "zh:1a6d5ce931708aec29d1f3d9e360c2a0c35ba5a54d03eeaff0ce3ca597cd0275", + "zh:3411919ba2a5941801e677f0fea08bdd0ae22ba3c9ce3309f55554699e06524a", + "zh:81b36138b8f2320dc7f877b50f9e38f4bc614affe68de885d322629dd0d16a29", + "zh:95a2a0a497a6082ee06f95b38bd0f0d6924a65722892a856cfd914c0d117f104", + "zh:9d3e78c2d1bb46508b972210ad706dd8c8b106f8b206ecf096cd211c54f46990", + "zh:a79139abf687387a6efdbbb04289a0a8e7eaca2bd91cdc0ce68ea4f3286c2c34", + "zh:aaa8784be125fbd50c48d84d6e171d3fb6ef84a221dbc5165c067ce05faab4c8", + "zh:afecd301f469975c9d8f350cc482fe656e082b6ab0f677d1a816c3c615837cc1", + "zh:c54c22b18d48ff9053d899d178d9ffef7d9d19785d9bf310a07d648b7aac075b", + "zh:db2eefd55aea48e73384a555c72bac3f7d428e24147bedb64e1a039398e5b903", + 
"zh:ee61666a233533fd2be971091cecc01650561f1585783c381b6f6e8a390198a4", + "zh:f569b65999264a9416862bca5cd2a6177d94ccb0424f3a4ef424428912b9cb3c", + ] +} + +provider "registry.terraform.io/hashicorp/http" { + version = "3.5.0" + constraints = ">= 3.2.1" + hashes = [ + "h1:dl73+8wzQR++HFGoJgDqY3mj3pm14HUuH/CekVyOj5s=", + "zh:047c5b4920751b13425efe0d011b3a23a3be97d02d9c0e3c60985521c9c456b7", + "zh:157866f700470207561f6d032d344916b82268ecd0cf8174fb11c0674c8d0736", + "zh:1973eb9383b0d83dd4fd5e662f0f16de837d072b64a6b7cd703410d730499476", + "zh:212f833a4e6d020840672f6f88273d62a564f44acb0c857b5961cdb3bbc14c90", + "zh:2c8034bc039fffaa1d4965ca02a8c6d57301e5fa9fff4773e684b46e3f78e76a", + "zh:5df353fc5b2dd31577def9cc1a4ebf0c9a9c2699d223c6b02087a3089c74a1c6", + "zh:672083810d4185076c81b16ad13d1224b9e6ea7f4850951d2ab8d30fa6e41f08", + "zh:78d5eefdd9e494defcb3c68d282b8f96630502cac21d1ea161f53cfe9bb483b3", + "zh:7b4200f18abdbe39904b03537e1a78f21ebafe60f1c861a44387d314fda69da6", + "zh:843feacacd86baed820f81a6c9f7bd32cf302db3d7a0f39e87976ebc7a7cc2ee", + "zh:a9ea5096ab91aab260b22e4251c05f08dad2ed77e43e5e4fadcdfd87f2c78926", + "zh:d02b288922811739059e90184c7f76d45d07d3a77cc48d0b15fd3db14e928623", + ] +} + +provider "registry.terraform.io/hashicorp/null" { + version = "3.2.4" + constraints = ">= 3.2.1" + hashes = [ + "h1:L5V05xwp/Gto1leRryuesxjMfgZwjb7oool4WS1UEFQ=", + "zh:59f6b52ab4ff35739647f9509ee6d93d7c032985d9f8c6237d1f8a59471bbbe2", + "zh:78d5eefdd9e494defcb3c68d282b8f96630502cac21d1ea161f53cfe9bb483b3", + "zh:795c897119ff082133150121d39ff26cb5f89a730a2c8c26f3a9c1abf81a9c43", + "zh:7b9c7b16f118fbc2b05a983817b8ce2f86df125857966ad356353baf4bff5c0a", + "zh:85e33ab43e0e1726e5f97a874b8e24820b6565ff8076523cc2922ba671492991", + "zh:9d32ac3619cfc93eb3c4f423492a8e0f79db05fec58e449dee9b2d5873d5f69f", + "zh:9e15c3c9dd8e0d1e3731841d44c34571b6c97f5b95e8296a45318b94e5287a6e", + "zh:b4c2ab35d1b7696c30b64bf2c0f3a62329107bd1a9121ce70683dec58af19615", + 
"zh:c43723e8cc65bcdf5e0c92581dcbbdcbdcf18b8d2037406a5f2033b1e22de442", + "zh:ceb5495d9c31bfb299d246ab333f08c7fb0d67a4f82681fbf47f2a21c3e11ab5", + "zh:e171026b3659305c558d9804062762d168f50ba02b88b231d20ec99578a6233f", + "zh:ed0fe2acdb61330b01841fa790be00ec6beaac91d41f311fb8254f74eb6a711f", + ] +} + +provider "registry.terraform.io/hashicorp/random" { + version = "3.8.1" + constraints = ">= 3.4.3" + hashes = [ + "h1:u8AKlWVDTH5r9YLSeswoVEjiY72Rt4/ch7U+61ZDkiQ=", + "zh:08dd03b918c7b55713026037c5400c48af5b9f468f483463321bd18e17b907b4", + "zh:0eee654a5542dc1d41920bbf2419032d6f0d5625b03bd81339e5b33394a3e0ae", + "zh:229665ddf060aa0ed315597908483eee5b818a17d09b6417a0f52fd9405c4f57", + "zh:2469d2e48f28076254a2a3fc327f184914566d9e40c5780b8d96ebf7205f8bc0", + "zh:37d7eb334d9561f335e748280f5535a384a88675af9a9eac439d4cfd663bcb66", + "zh:741101426a2f2c52dee37122f0f4a2f2d6af6d852cb1db634480a86398fa3511", + "zh:78d5eefdd9e494defcb3c68d282b8f96630502cac21d1ea161f53cfe9bb483b3", + "zh:a902473f08ef8df62cfe6116bd6c157070a93f66622384300de235a533e9d4a9", + "zh:b85c511a23e57a2147355932b3b6dce2a11e856b941165793a0c3d7578d94d05", + "zh:c5172226d18eaac95b1daac80172287b69d4ce32750c82ad77fa0768be4ea4b8", + "zh:dab4434dba34aad569b0bc243c2d3f3ff86dd7740def373f2a49816bd2ff819b", + "zh:f49fd62aa8c5525a5c17abd51e27ca5e213881d58882fd42fec4a545b53c9699", + ] +} + +provider "registry.terraform.io/hashicorp/time" { + version = "0.13.1" + constraints = ">= 0.9.1" + hashes = [ + "h1:ZT5ppCNIModqk3iOkVt5my8b8yBHmDpl663JtXAIRqM=", + "zh:02cb9aab1002f0f2a94a4f85acec8893297dc75915f7404c165983f720a54b74", + "zh:04429b2b31a492d19e5ecf999b116d396dac0b24bba0d0fb19ecaefe193fdb8f", + "zh:26f8e51bb7c275c404ba6028c1b530312066009194db721a8427a7bc5cdbc83a", + "zh:772ff8dbdbef968651ab3ae76d04afd355c32f8a868d03244db3f8496e462690", + "zh:78d5eefdd9e494defcb3c68d282b8f96630502cac21d1ea161f53cfe9bb483b3", + "zh:898db5d2b6bd6ca5457dccb52eedbc7c5b1a71e4a4658381bcbb38cedbbda328", + 
"zh:8de913bf09a3fa7bedc29fec18c47c571d0c7a3d0644322c46f3aa648cf30cd8", + "zh:9402102c86a87bdfe7e501ffbb9c685c32bbcefcfcf897fd7d53df414c36877b", + "zh:b18b9bb1726bb8cfbefc0a29cf3657c82578001f514bcf4c079839b6776c47f0", + "zh:b9d31fdc4faecb909d7c5ce41d2479dd0536862a963df434be4b16e8e4edc94d", + "zh:c951e9f39cca3446c060bd63933ebb89cedde9523904813973fbc3d11863ba75", + "zh:e5b773c0d07e962291be0e9b413c7a22c044b8c7b58c76e8aa91d1659990dfb5", + ] +} + +provider "registry.terraform.io/oracle/oci" { + version = "8.5.0" + constraints = ">= 4.67.3, >= 7.30.0" + hashes = [ + "h1:YGSTTLRk0vpD4P0dJFt2lZ2XphT2skF9AxBGCkM04z4=", + "zh:0289ba575d3749068fc12fdbfa3f44b9780b21a23315eb2ca5bcf73065cc4fe7", + "zh:1152fd8451c2b74d87594fda1aa69e6a3f772189b902a592e91fcc57dfe3c48f", + "zh:3e4b1a2e345263e48d6be4d6d01fd5976b09af585e4a9314d318ab216304b8f1", + "zh:6b88ebb0ed7de80e324124511251561072c8a5f1ae222aa588063a1652ff72e8", + "zh:8ef61c735f19e1be9abeeb79debbeacd91e5996b4be5719d61323244e19ebe3d", + "zh:8fcdc6701173b59d78f076f8ce4ce01ef127bf5bf65323340e23c0b14da02f9d", + "zh:9b12af85486a96aedd8d7984b0ff811a4b42e3d88dad1a3fb4c0b580d04fa425", + "zh:a03e6f788876b7408d811eb21056986e15c46876983637e7e5e645fff28d0587", + "zh:b1149065247943c0937359e0f2ed5fdce9c2a588e32e90b9c13be64f709f8121", + "zh:b375612ef300e7f53797552521d3ec10f3d9465ccbe6d96519314e32d6611c93", + "zh:daf49947168641d170f59907b2592f020ab17f5443e8f5a96174219112d51fe2", + "zh:e9649887105493b311cbaf180ba635186e1a4c3b5fe7e26ea9bfd06a52aa76f3", + "zh:f593bb15d46c5c998401fea9cc3fdf7950b81a53632ecb1bea8d2cc41971ccca", + "zh:f7f1f4d0c5922bd0403b989ebed168577164dbfc45181b2e19dcb888e1fc9df7", + "zh:fafce2b47e3227dc8068db4f2bf223c4a4b8fefe39f50aeced467eed1bd901e3", + ] +} diff --git a/usage-cookbook/Llama-3.1-Nemotron-Nano-8B-v1/terraform/README.md b/usage-cookbook/Llama-3.1-Nemotron-Nano-8B-v1/terraform/README.md new file mode 100644 index 000000000..9d0fe219f --- /dev/null +++ b/usage-cookbook/Llama-3.1-Nemotron-Nano-8B-v1/terraform/README.md @@ 
-0,0 +1,97 @@ +# Terraform: Private OCI OKE for Llama-3.1-Nemotron-Nano-8B-v1 + +This Terraform example provisions the **private-only** OCI infrastructure for +the validated Phoenix deployment described in the parent cookbook. + +It is intended to give Nemotron users a reproducible OCI path for NVIDIA model +serving that highlights Oracle Cloud's operational strengths: private OKE, +managed Bastion access, and a clean infrastructure-as-code path for GPU-backed +Nemotron deployments. + +It creates: + +- a VCN +- a **private** OKE cluster +- a private CPU node pool +- a private GPU node pool targeting `VM.GPU.A10.1` +- an **OCI Bastion service** resource for private access + +It does **not** create: + +- a public Kubernetes API endpoint +- public worker-node IPs +- a public bastion host +- a public inference endpoint + +## Bastion note + +This sample provisions the **OCI Bastion service** so that private-cluster +access is reproducible from Terraform. + +That is intentionally different from creating a public bastion VM: + +- no public bastion compute instance is created +- no worker node receives a public IP +- the Kubernetes API remains private + +If your environment already manages private-cluster access through a separate +operator workflow, you can remove the `oci_bastion_bastion` resource and keep +the rest of the sample unchanged. 
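
As a rough sketch of how the Bastion service resource is typically used once it exists, the shell fragment below creates a port-forwarding session to the private Kubernetes API and builds the SSH login for the tunnel. The variable names (`BASTION_ID`, `API_PRIVATE_IP`, `OCI_REGION`), the 6443 API port, and the helpers `bastion_ssh_target` and `open_api_tunnel` are illustrative assumptions, not part of this sample. The session TTL stays at the `max_session_ttl_in_seconds` value of 10800 set in `main.tf`. Check the flags against `oci bastion session create-port-forwarding --help` before relying on them.

```bash
#!/usr/bin/env bash
# Sketch only: variable names and both helpers below are hypothetical.

# Build the SSH login for a managed Bastion session:
# <session OCID>@host.bastion.<region>.oci.oraclecloud.com
bastion_ssh_target() {
  local session_id="$1" region="$2"
  printf '%s@host.bastion.%s.oci.oraclecloud.com\n' "${session_id}" "${region}"
}

# Create a port-forwarding session to the private API endpoint, wait until it
# is ACTIVE, then open a local tunnel on 6443. Requires the oci CLI and ssh.
open_api_tunnel() {
  local session_id state attempt
  session_id=$(oci bastion session create-port-forwarding \
    --bastion-id "${BASTION_ID}" \
    --target-private-ip "${API_PRIVATE_IP}" \
    --target-port 6443 \
    --ssh-public-key-file ~/.ssh/id_ed25519.pub \
    --session-ttl 10800 \
    --query 'data.id' --raw-output) || return 1

  # Poll at most 30 times (about 2.5 minutes) instead of looping forever.
  for attempt in $(seq 1 30); do
    state=$(oci bastion session get --session-id "${session_id}" \
      --query 'data."lifecycle-state"' --raw-output)
    [ "${state}" = "ACTIVE" ] && break
    sleep 5
  done
  [ "${state}" = "ACTIVE" ] || return 1

  ssh -f -N -L "6443:${API_PRIVATE_IP}:6443" \
    -i ~/.ssh/id_ed25519 -o ExitOnForwardFailure=yes \
    "$(bastion_ssh_target "${session_id}" "${OCI_REGION}")"
}
```

Nothing runs when the file is sourced; call `open_api_tunnel` after exporting the three variables. The bounded polling loop is a design choice for this sketch so that a session that never becomes active fails visibly.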
+ +## Module choice + +This wrapper intentionally uses Oracle's official OKE Terraform module: + +- `oracle-terraform-modules/oke/oci` + +The Nemotron-specific layer in this directory adds: + +- the Phoenix defaults +- the no-public-IP constraints +- the A10-focused worker pool defaults +- the OCI Bastion service resource required for private access + +## Files + +- [`main.tf`](./main.tf) - private OKE cluster, worker pools, OCI Bastion +- [`variables.tf`](./variables.tf) - deployment inputs +- [`outputs.tf`](./outputs.tf) - useful IDs and private endpoint information +- [`terraform.tfvars.example`](./terraform.tfvars.example) - starting point + +## Usage + +```bash +cp terraform.tfvars.example terraform.tfvars +terraform init +terraform plan +terraform apply +``` + +The validated live run completed successfully in `us-phoenix-1`, including: + +- private OKE cluster creation +- OCI Bastion service creation +- CPU node pool creation +- GPU node pool creation on `VM.GPU.A10.1` in `PHX-AD-2` + +After the infrastructure is ready: + +1. create an OCI Bastion session to reach the private cluster +2. deploy the model with: + - [`../vllm_oke_phoenix_private_values.yaml`](../vllm_oke_phoenix_private_values.yaml) +3. validate: + - `/health` + - `/v1/models` + - chat completion + - tool calling + - streaming + +## Notes + +- The validated live deployment used `us-phoenix-1`. +- The validated GPU pool used Phoenix `AD-2`, exposed as `gpu_placement_ads`. +- The Bastion resource here is the OCI managed Bastion service, not a public + bastion VM. +- `ssh_public_key_path` must point to an actual OpenSSH public key file; the + wrapper reads the file contents with Terraform's `file()` function before + passing it to OKE. 
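
The validation list in step 3 can be sketched as a short shell helper. It assumes the OpenAI-compatible endpoint is reachable at `http://127.0.0.1:8000` (for example through a `kubectl port-forward`, which this Terraform sample does not set up) and that the served model name equals the Hugging Face ID. `BASE_URL`, `MODEL`, `chat_payload`, and `validate_endpoint` are illustrative names, not part of the sample.

```bash
#!/usr/bin/env bash
# Sketch only: BASE_URL, MODEL, and both function names are hypothetical.
BASE_URL="${BASE_URL:-http://127.0.0.1:8000}"
MODEL="${MODEL:-nvidia/Llama-3.1-Nemotron-Nano-8B-v1}"

# Build a minimal chat-completion request body. The prompt is inserted
# verbatim, so it must not contain double quotes or backslashes.
chat_payload() {
  local prompt="$1" stream="${2:-false}"
  printf '{"model":"%s","messages":[{"role":"user","content":"%s"}],"max_tokens":64,"stream":%s}\n' \
    "${MODEL}" "${prompt}" "${stream}"
}

# Run the health, model discovery, chat, and streaming checks in order and
# stop at the first failure (curl -f turns HTTP errors into non-zero exits).
validate_endpoint() {
  curl -fsS "${BASE_URL}/health" >/dev/null || return 1
  echo "health: ok"
  curl -fsS "${BASE_URL}/v1/models" || return 1
  echo
  curl -fsS "${BASE_URL}/v1/chat/completions" \
    -H 'Content-Type: application/json' \
    -d "$(chat_payload 'Reply with the single word ready.')" || return 1
  echo
  # -N disables output buffering so streamed chunks appear as they arrive.
  curl -fsSN "${BASE_URL}/v1/chat/completions" \
    -H 'Content-Type: application/json' \
    -d "$(chat_payload 'Count from one to five.' true)" || return 1
}
```

Run `validate_endpoint` once the tunnel and port-forward are up. A tool-calling check would add a `tools` array to the same request body and is left out here to keep the sketch short.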
diff --git a/usage-cookbook/Llama-3.1-Nemotron-Nano-8B-v1/terraform/main.tf b/usage-cookbook/Llama-3.1-Nemotron-Nano-8B-v1/terraform/main.tf new file mode 100644 index 000000000..e9b070784 --- /dev/null +++ b/usage-cookbook/Llama-3.1-Nemotron-Nano-8B-v1/terraform/main.tf @@ -0,0 +1,112 @@ +provider "oci" { + config_file_profile = var.config_file_profile + tenancy_ocid = var.tenancy_ocid + region = var.region +} + +locals { + common_tags = merge(var.freeform_tags, { + model = "nvidia/Llama-3.1-Nemotron-Nano-8B-v1" + deployment = "private-oke" + region = var.region + }) +} + +module "oke" { + source = "oracle-terraform-modules/oke/oci" + version = "5.4.1" + + providers = { + oci.home = oci + } + + tenancy_id = var.tenancy_ocid + compartment_id = var.compartment_ocid + region = var.region + + cluster_name = var.cluster_name + kubernetes_version = var.kubernetes_version + cluster_type = "enhanced" + cni_type = "flannel" + pods_cidr = var.pods_cidr + services_cidr = var.services_cidr + vcn_cidrs = var.vcn_cidrs + ssh_public_key = file(var.ssh_public_key_path) + output_detail = true + create_vcn = true + create_bastion = false + create_operator = false + control_plane_is_public = false + assign_public_ip_to_control_plane = false + worker_is_public = false + allow_worker_internet_access = true + allow_pod_internet_access = true + allow_worker_ssh_access = false + preferred_load_balancer = "internal" + load_balancers = "internal" + freeform_tags = { all = local.common_tags } + + subnets = { + cp = { + create = "always" + newbits = 13 + netnum = 2 + } + workers = { + create = "always" + newbits = 2 + netnum = 1 + } + pods = { + create = "always" + newbits = 2 + netnum = 2 + } + int_lb = { + create = "always" + newbits = 11 + netnum = 16 + } + pub_lb = { + create = "never" + } + bastion = { + create = "never" + } + operator = { + create = "never" + } + } + + worker_pool_mode = "node-pool" + worker_pool_size = 1 + worker_pools = { + cpu = { + size = var.cpu_pool_size + shape 
= var.cpu_shape + ocpus = var.cpu_ocpus + memory = var.cpu_memory_gbs + boot_volume_size = 100 + assign_public_ip = false + create = true + } + gpu = { + size = var.gpu_pool_size + shape = var.gpu_shape + boot_volume_size = var.gpu_boot_volume_size + assign_public_ip = false + create = true + placement_ads = var.gpu_placement_ads + } + } +} + +resource "oci_bastion_bastion" "oci_bastion" { + compartment_id = var.compartment_ocid + bastion_type = "STANDARD" + target_subnet_id = module.oke.worker_subnet_id + client_cidr_block_allow_list = var.bastion_client_cidrs + max_session_ttl_in_seconds = 10800 + name = "${var.cluster_name}-bastion" + freeform_tags = local.common_tags +} diff --git a/usage-cookbook/Llama-3.1-Nemotron-Nano-8B-v1/terraform/outputs.tf b/usage-cookbook/Llama-3.1-Nemotron-Nano-8B-v1/terraform/outputs.tf new file mode 100644 index 000000000..c39a82eed --- /dev/null +++ b/usage-cookbook/Llama-3.1-Nemotron-Nano-8B-v1/terraform/outputs.tf @@ -0,0 +1,34 @@ +output "cluster_id" { + description = "OKE cluster OCID." + value = module.oke.cluster_id +} + +output "cluster_endpoints" { + description = "Cluster endpoints; private endpoint should be used." + value = module.oke.cluster_endpoints +} + +output "apiserver_private_host" { + description = "Private control-plane host." + value = module.oke.apiserver_private_host +} + +output "vcn_id" { + description = "VCN used by the Nemotron deployment." + value = module.oke.vcn_id +} + +output "control_plane_subnet_id" { + description = "Private control-plane subnet." + value = module.oke.control_plane_subnet_id +} + +output "worker_subnet_id" { + description = "Private worker subnet." + value = module.oke.worker_subnet_id +} + +output "oci_bastion_id" { + description = "OCI Bastion service OCID for creating private sessions." 
+ value = oci_bastion_bastion.oci_bastion.id +} diff --git a/usage-cookbook/Llama-3.1-Nemotron-Nano-8B-v1/terraform/terraform.tfvars.example b/usage-cookbook/Llama-3.1-Nemotron-Nano-8B-v1/terraform/terraform.tfvars.example new file mode 100644 index 000000000..9a2bab0ce --- /dev/null +++ b/usage-cookbook/Llama-3.1-Nemotron-Nano-8B-v1/terraform/terraform.tfvars.example @@ -0,0 +1,12 @@ +tenancy_ocid = "ocid1.tenancy.oc1..exampleuniqueID" +compartment_ocid = "ocid1.compartment.oc1..exampleuniqueID" +config_file_profile = "API_KEY_AUTH" +region = "us-phoenix-1" +cluster_name = "nemotron-phx-private" +ssh_public_key_path = "~/.ssh/id_ed25519.pub" + +# Restrict Bastion session creation to your current client egress CIDR. +bastion_client_cidrs = ["203.0.113.10/32"] + +# The validated deployment used Phoenix AD-2 for the A10 node pool. +gpu_placement_ads = [2] diff --git a/usage-cookbook/Llama-3.1-Nemotron-Nano-8B-v1/terraform/variables.tf b/usage-cookbook/Llama-3.1-Nemotron-Nano-8B-v1/terraform/variables.tf new file mode 100644 index 000000000..165cabf57 --- /dev/null +++ b/usage-cookbook/Llama-3.1-Nemotron-Nano-8B-v1/terraform/variables.tf @@ -0,0 +1,115 @@ +variable "tenancy_ocid" { + description = "OCI tenancy OCID." + type = string +} + +variable "compartment_ocid" { + description = "Compartment where the OKE cluster and Bastion service will be created." + type = string +} + +variable "region" { + description = "OCI region for the deployment." + type = string + default = "us-phoenix-1" +} + +variable "config_file_profile" { + description = "OCI CLI config profile name." + type = string + default = "DEFAULT" +} + +variable "cluster_name" { + description = "Name prefix for the private Nemotron OKE deployment." + type = string + default = "nemotron-oci-phx" +} + +variable "ssh_public_key_path" { + description = "Path to the OpenSSH public key file used for private worker access." 
+ type = string +} + +variable "vcn_cidrs" { + description = "VCN CIDR blocks for the deployment." + type = list(string) + default = ["10.0.0.0/16"] +} + +variable "pods_cidr" { + description = "Kubernetes pods CIDR." + type = string + default = "10.244.0.0/16" +} + +variable "services_cidr" { + description = "Kubernetes services CIDR." + type = string + default = "10.96.0.0/16" +} + +variable "kubernetes_version" { + description = "OKE Kubernetes version. The validated deployment ran v1.31.10; override this default to reproduce it exactly." + type = string + default = "v1.33.1" +} + +variable "cpu_pool_size" { + description = "Number of CPU worker nodes." + type = number + default = 1 +} + +variable "cpu_shape" { + description = "Shape for the CPU worker pool." + type = string + default = "VM.Standard.E5.Flex" +} + +variable "cpu_ocpus" { + description = "OCPUs for each CPU worker if using a flex shape." + type = number + default = 2 +} + +variable "cpu_memory_gbs" { + description = "Memory in GB for each CPU worker if using a flex shape." + type = number + default = 16 +} + +variable "gpu_pool_size" { + description = "Number of GPU worker nodes." + type = number + default = 1 +} + +variable "gpu_shape" { + description = "Shape for the GPU worker pool." + type = string + default = "VM.GPU.A10.1" +} + +variable "gpu_boot_volume_size" { + description = "Boot volume size in GB for GPU workers." + type = number + default = 200 +} + +variable "gpu_placement_ads" { + description = "Availability domains to target for the GPU node pool. Phoenix AD-2 is `[2]`." + type = list(number) + default = [2] +} + +variable "bastion_client_cidrs" { + description = "CIDR blocks allowed to create OCI Bastion sessions." + type = list(string) +} + +variable "freeform_tags" { + description = "Optional freeform tags."
+ type = map(string) + default = {} +} diff --git a/usage-cookbook/Llama-3.1-Nemotron-Nano-8B-v1/terraform/versions.tf b/usage-cookbook/Llama-3.1-Nemotron-Nano-8B-v1/terraform/versions.tf new file mode 100644 index 000000000..1c9c02641 --- /dev/null +++ b/usage-cookbook/Llama-3.1-Nemotron-Nano-8B-v1/terraform/versions.tf @@ -0,0 +1,10 @@ +terraform { + required_version = ">= 1.5.0" + + required_providers { + oci = { + source = "oracle/oci" + version = ">= 7.30.0" + } + } +} diff --git a/usage-cookbook/Llama-3.1-Nemotron-Nano-8B-v1/vllm_oke_phoenix_private_values.yaml b/usage-cookbook/Llama-3.1-Nemotron-Nano-8B-v1/vllm_oke_phoenix_private_values.yaml new file mode 100644 index 000000000..5b4538c37 --- /dev/null +++ b/usage-cookbook/Llama-3.1-Nemotron-Nano-8B-v1/vllm_oke_phoenix_private_values.yaml @@ -0,0 +1,37 @@ +# Validated private OCI OKE deployment values for +# nvidia/Llama-3.1-Nemotron-Nano-8B-v1 on a single VM.GPU.A10.1 node. +# +# Chart: vllm/vllm-stack 0.1.10 +# Validated: 2026-04-15 on OKE v1.31.10, Phoenix (us-phoenix-1) +# +# IMPORTANT: Before deploying, you must create the vllm-templates-pvc +# (see prerequisites in README.md). 
+ +servingEngineSpec: + runtimeClassName: "" + modelSpec: + - name: "llama31-nemotron-nano-8b" + repository: "vllm/vllm-openai" + tag: "v0.19.0" + modelURL: "nvidia/Llama-3.1-Nemotron-Nano-8B-v1" + enableTool: true + toolCallParser: "llama3_json" + replicaCount: 1 + requestCPU: 4 + requestMemory: "24Gi" + requestGPU: 1 + pvcStorage: "120Gi" + pvcAccessMode: + - ReadWriteOnce + storageClass: "oci-block-storage-enc" + nodeSelector: + app: gpu + tolerations: + - key: "nvidia.com/gpu" + operator: "Exists" + effect: "NoSchedule" + vllmConfig: + maxModelLen: 4096 + gpuMemoryUtilization: 0.95 + extraArgs: + - "--chat-template=/vllm-workspace/examples/tool_chat_template_llama3.1_json.jinja" diff --git a/usage-cookbook/README.md b/usage-cookbook/README.md index f7d79b5ca..001121f60 100644 --- a/usage-cookbook/README.md +++ b/usage-cookbook/README.md @@ -13,5 +13,4 @@ This directory contains cookbook-style guides showing how to deploy and use the - **SGLang Deployment** - Tutorials on serving and interacting with Nemotron via SGLang - **NIM Microservice** - Guide to deploying Nemotron as scalable, production-ready endpoints using NVIDIA Inference Microservices (NIM). - **Hugging Face Transformers** - Direct loading and inference of Nemotron models with Hugging Face Transformers - - +- **OCI OKE Private Deployment** - A Phoenix-only private deployment guide for `nvidia/Llama-3.1-Nemotron-Nano-8B-v1` using OKE, OCI Bastion service, and `vLLM`, providing a reproducible OCI path comparable to common AWS GPU/Kubernetes deployment patterns.
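
To make the `vllmConfig` values in `vllm_oke_phoenix_private_values.yaml` easier to reason about, here is a back-of-the-envelope memory estimate in shell arithmetic. It assumes the model keeps the Llama-3.1-8B layout (32 layers, 8 KV heads, head dimension 128), roughly 8 billion parameters, 2-byte BF16 weights and KV cache, and it treats the A10's 24 GB as 24 GiB for simplicity. These figures are assumptions for illustration, not measurements from the validated run, and the function names are hypothetical.

```bash
#!/usr/bin/env bash
# Sketch only: architecture numbers are assumed, not read from the model.

# KV-cache bytes per token: one key and one value vector per layer.
kv_bytes_per_token() {
  local layers="$1" kv_heads="$2" head_dim="$3" bytes_per_elem="$4"
  echo $(( 2 * layers * kv_heads * head_dim * bytes_per_elem ))
}

# KV-cache MiB for one sequence of the given length (integer division).
kv_mib_per_sequence() {
  local per_token="$1" seq_len="$2"
  echo $(( per_token * seq_len / 1024 / 1024 ))
}

per_token=$(kv_bytes_per_token 32 8 128 2)
echo "KV cache per token:       ${per_token} bytes"
echo "KV cache per 4096 tokens: $(kv_mib_per_sequence "${per_token}" 4096) MiB"
# 24 GiB scaled by gpuMemoryUtilization 0.95, expressed in MiB.
echo "vLLM memory budget:       $(( 24 * 1024 * 95 / 100 )) MiB"
# About 8e9 parameters at 2 bytes each, expressed in MiB.
echo "BF16 weights (approx.):   $(( 8000000000 * 2 / 1024 / 1024 )) MiB"
```

Under these assumptions the weights take most of the budget and a full `maxModelLen: 4096` sequence needs about 512 MiB of KV cache, which is one way to read why the values file pairs a short context length with a high `gpuMemoryUtilization` on a single A10. Activations and runtime overhead are not modeled here.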