From 32418835cce2577977ec67e6c0a062b7dab7af00 Mon Sep 17 00:00:00 2001
From: Suraj Deshmukh
Date: Sat, 8 Mar 2025 14:03:41 -0800
Subject: [PATCH 1/2] gpu-cluster: Fix formatting

- Remove trailing whitespaces.
- Remove unused links.

Signed-off-by: Suraj Deshmukh
---
 articles/aks/gpu-cluster.md | 12 +++---------
 1 file changed, 3 insertions(+), 9 deletions(-)

diff --git a/articles/aks/gpu-cluster.md b/articles/aks/gpu-cluster.md
index bceba330f..2eed18011 100644
--- a/articles/aks/gpu-cluster.md
+++ b/articles/aks/gpu-cluster.md
@@ -13,7 +13,7 @@ ms.author: schaffererin
 
 # Use GPUs for compute-intensive workloads on Azure Kubernetes Service (AKS)
 
-Graphical processing units (GPUs) are often used for compute-intensive workloads, such as graphics and visualization workloads. AKS supports GPU-enabled Linux node pools to run compute-intensive Kubernetes workloads. 
+Graphical processing units (GPUs) are often used for compute-intensive workloads, such as graphics and visualization workloads. AKS supports GPU-enabled Linux node pools to run compute-intensive Kubernetes workloads.
 
 This article helps you provision nodes with schedulable GPUs on new and existing AKS clusters.
 
@@ -93,7 +93,7 @@ To use the default OS SKU, you create the node pool without specifying an OS SKU
     * `--max-count`: Configures the cluster autoscaler to maintain a maximum of three nodes in the node pool.
 
     > [!NOTE]
-    > Taints and VM sizes can only be set for node pools during node pool creation, but you can update autoscaler settings at any time. 
+    > Taints and VM sizes can only be set for node pools during node pool creation, but you can update autoscaler settings at any time.
 
 ##### [Azure Linux node pool](#tab/add-azure-linux-gpu-node-pool)
 
@@ -354,7 +354,7 @@ To see the GPU in action, you can schedule a GPU-enabled workload with the appro
 
     ```console
     2019-05-16 16:08:31.258328: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
-    2019-05-16 16:08:31.396846: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties: 
+    2019-05-16 16:08:31.396846: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties:
     name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
     pciBusID: 2fd7:00:00.0
     totalMemory: 11.17GiB freeMemory: 11.10GiB
@@ -436,8 +436,6 @@ To see the GPU in action, you can schedule a GPU-enabled workload with the appro
 
 [nvidia-github]: https://github.com/NVIDIA/k8s-device-plugin/blob/4b3d6b0a6613a3672f71ea4719fd8633eaafb4f3/deployments/static/nvidia-device-plugin.yml
 
-[az-aks-create]: /cli/azure/aks#az_aks_create
-[az-aks-nodepool-update]: /cli/azure/aks/nodepool#az_aks_nodepool_update
 [az-aks-nodepool-add]: /cli/azure/aks/nodepool#az_aks_nodepool_add
 [az-aks-get-credentials]: /cli/azure/aks#az_aks_get_credentials
 [aks-quickstart-cli]: ./learn/quick-kubernetes-deploy-cli.md
@@ -451,10 +449,6 @@ To see the GPU in action, you can schedule a GPU-enabled workload with the appro
 [azureml-triton]: /azure/machine-learning/how-to-deploy-with-triton
 [aks-container-insights]: monitor-aks.md#integrations
 [advanced-scheduler-aks]: operator-best-practices-advanced-scheduler.md
-[az-provider-register]: /cli/azure/provider#az-provider-register
-[az-feature-register]: /cli/azure/feature#az-feature-register
-[az-feature-show]: /cli/azure/feature#az-feature-show
 [az-extension-add]: /cli/azure/extension#az-extension-add
 [az-extension-update]: /cli/azure/extension#az-extension-update
-[NVadsA10]: /azure/virtual-machines/nva10v5-series
 

From df1330f38377f43413ae555379a4bf3343a4990b Mon Sep 17 00:00:00 2001
From: Suraj Deshmukh
Date: Sat, 8 Mar 2025 14:04:26 -0800
Subject: [PATCH 2/2] gpu-cluster: Remove step to create `gpu-operator` ns

The daemonset is deployed in the `kube-system` namespace and not in the
`gpu-operator` namespace. The step to create the `gpu-operator` namespace
is not required.

Signed-off-by: Suraj Deshmukh
---
 articles/aks/gpu-cluster.md | 13 +++----------
 1 file changed, 3 insertions(+), 10 deletions(-)

diff --git a/articles/aks/gpu-cluster.md b/articles/aks/gpu-cluster.md
index 2eed18011..38165d459 100644
--- a/articles/aks/gpu-cluster.md
+++ b/articles/aks/gpu-cluster.md
@@ -128,13 +128,7 @@ To use Azure Linux, you specify the OS SKU by setting `os-sku` to `AzureLinux` d
 
 ---
 
-1. Create a namespace using the [`kubectl create namespace`][kubectl-create] command.
-
-    ```bash
-    kubectl create namespace gpu-operator
-    ```
-
-2. Create a file named *nvidia-device-plugin-ds.yaml* and paste the following YAML manifest provided as part of the [NVIDIA device plugin for Kubernetes project][nvidia-github]:
+1. Create a file named *nvidia-device-plugin-ds.yaml* and paste the following YAML manifest provided as part of the [NVIDIA device plugin for Kubernetes project][nvidia-github]:
 
     ```yaml
     apiVersion: apps/v1
@@ -182,13 +176,13 @@ To use Azure Linux, you specify the OS SKU by setting `os-sku` to `AzureLinux` d
             path: /var/lib/kubelet/device-plugins
     ```
 
-3. Create the DaemonSet and confirm the NVIDIA device plugin is created successfully using the [`kubectl apply`][kubectl-apply] command.
+2. Create the DaemonSet and confirm the NVIDIA device plugin is created successfully using the [`kubectl apply`][kubectl-apply] command.
 
     ```bash
     kubectl apply -f nvidia-device-plugin-ds.yaml
     ```
 
-4. Now that you successfully installed the NVIDIA device plugin, you can check that your [GPUs are schedulable](#confirm-that-gpus-are-schedulable) and [run a GPU workload](#run-a-gpu-enabled-workload).
+3. Now that you successfully installed the NVIDIA device plugin, you can check that your [GPUs are schedulable](#confirm-that-gpus-are-schedulable) and [run a GPU workload](#run-a-gpu-enabled-workload).
 
 ### Skip GPU driver installation (preview)
 
@@ -430,7 +424,6 @@ To see the GPU in action, you can schedule a GPU-enabled workload with the appro
 [kubectl-describe]: https://kubernetes.io/docs/reference/generated/kubectl/kubectl-commands#describe
 [kubectl-logs]: https://kubernetes.io/docs/reference/generated/kubectl/kubectl-commands#logs
 [kubectl delete]: https://kubernetes.io/docs/reference/generated/kubectl/kubectl-commands#delete
-[kubectl-create]: https://kubernetes.io/docs/reference/generated/kubectl/kubectl-commands#create
 [azure-pricing]: https://azure.microsoft.com/pricing/
 [azure-availability]: https://azure.microsoft.com/global-infrastructure/services/
 [nvidia-github]: https://github.com/NVIDIA/k8s-device-plugin/blob/4b3d6b0a6613a3672f71ea4719fd8633eaafb4f3/deployments/static/nvidia-device-plugin.yml
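As a quick sanity check of the reasoning in PATCH 2/2 — that the device plugin DaemonSet lands in `kube-system` rather than `gpu-operator` — one could run something like the following after applying *nvidia-device-plugin-ds.yaml*. This is a minimal sketch, not part of either patch; the DaemonSet name `nvidia-device-plugin-daemonset` and the pod label `name=nvidia-device-plugin-ds` are assumed from the upstream NVIDIA manifest and may differ if the YAML was customized.

```bash
# Sketch: confirm which namespace the NVIDIA device plugin DaemonSet runs in.
# Assumes nvidia-device-plugin-ds.yaml (the upstream NVIDIA manifest) was applied;
# the resource and label names below come from that manifest, not from these patches.

# The DaemonSet should appear under kube-system, not gpu-operator.
kubectl get daemonset --all-namespaces | grep nvidia-device-plugin

# The per-node plugin pods should likewise be running in kube-system.
kubectl get pods -n kube-system -l name=nvidia-device-plugin-ds
```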