Backport revamp changes to gpu-tele
mikemckiernan committed Jun 13, 2023
1 parent 0f7efce commit 6eb8916
Showing 7 changed files with 883 additions and 622 deletions.
39 changes: 39 additions & 0 deletions gpu-telemetry/about-telemetry.rst
@@ -0,0 +1,39 @@
.. license-header
  SPDX-FileCopyrightText: Copyright (c) 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
  SPDX-License-Identifier: Apache-2.0

  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

  http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.

.. headings (h1/h2/h3/h4/h5) are # * = -

###################
About GPU Telemetry
###################

Monitoring stacks usually consist of a collector, a time-series database to store metrics, and a visualization layer.
A popular open-source stack is `Prometheus <https://prometheus.io/>`_, used along with `Grafana <https://grafana.com/>`_ as
the visualization tool to create rich dashboards. The Prometheus project also includes `Alertmanager <https://github.com/prometheus/alertmanager>`_
to create and manage alerts. Prometheus is deployed along with `kube-state-metrics <https://github.com/kubernetes/kube-state-metrics>`_ and
`node_exporter <https://github.com/prometheus/node_exporter>`_ to expose cluster-level metrics for Kubernetes API objects and node-level
metrics such as CPU utilization.

The following figure shows the Prometheus architecture:

.. image:: https://boxboat.com/2019/08/08/monitoring-kubernetes-with-prometheus/prometheus-architecture.png
   :width: 800


To gather GPU telemetry in Kubernetes, the recommended approach is to use DCGM Exporter. DCGM Exporter, which is based on `DCGM <https://developer.nvidia.com/dcgm>`_, exposes
GPU metrics for Prometheus, and the metrics can be visualized using Grafana. DCGM Exporter is designed to take advantage of the
``KubeletPodResources`` `API <https://kubernetes.io/docs/reference/command-line-tools-reference/feature-gates/>`_ and exposes GPU metrics in a format that can be
scraped by Prometheus. A ``ServiceMonitor`` is also included to expose the endpoints to Prometheus.
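
To make "a format that can be scraped by Prometheus" concrete, the following sketch pulls the values of one metric out of Prometheus text-format output. It is an illustration under assumptions: ``metric_values`` is a hypothetical helper, ``DCGM_FI_DEV_GPU_UTIL`` is used only as an example metric name, the labels and numbers in the sample are made up, and the helper assumes that samples carry no trailing timestamp.

```shell
# Hypothetical helper: read Prometheus text-format metrics on stdin and print
# the sample value of every series that belongs to the named metric.
metric_values() {
    metric="$1"
    awk -v m="$metric" '
        /^#/ { next }    # skip HELP and TYPE comment lines
        # Match "name{labels} value" or "name value"; the value is the last field.
        index($0, m "{") == 1 || index($0, m " ") == 1 { print $NF }
    '
}

# Demonstrate with a made-up scrape result.
printf '%s\n' \
    '# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).' \
    '# TYPE DCGM_FI_DEV_GPU_UTIL gauge' \
    'DCGM_FI_DEV_GPU_UTIL{gpu="0",modelName="Tesla T4"} 7' \
    'DCGM_FI_DEV_GPU_TEMP{gpu="0",modelName="Tesla T4"} 34' \
    | metric_values DCGM_FI_DEV_GPU_UTIL
# prints: 7
```

Against a real exporter, the same helper could read the scraped endpoint output on standard input instead of the canned sample.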
670 changes: 48 additions & 622 deletions gpu-telemetry/dcgm-exporter.rst

Large diffs are not rendered by default.

31 changes: 31 additions & 0 deletions gpu-telemetry/index.rst
@@ -0,0 +1,31 @@
.. license-header
  SPDX-FileCopyrightText: Copyright (c) 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
  SPDX-License-Identifier: Apache-2.0

  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

  http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.

.. headings (h1/h2/h3/h4/h5) are # * = -

.. toctree::
   :caption: DCGM Exporter
   :titlesonly:
   :hidden:

   about-telemetry
   dcgm-exporter
   integrating-telemetry-kubernetes
   kube-prometheus


.. include:: about-telemetry.rst
   :start-line: 18
276 changes: 276 additions & 0 deletions gpu-telemetry/integrating-telemetry-kubernetes.rst
@@ -0,0 +1,276 @@
.. license-header
  SPDX-FileCopyrightText: Copyright (c) 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
  SPDX-License-Identifier: Apache-2.0

  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

  http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.

.. headings (h1/h2/h3/h4/h5) are # * = -

#########################################
Integrating GPU Telemetry into Kubernetes
#########################################

.. contents::
   :depth: 5
   :local:
   :backlinks: none


*************************
Benefits of GPU Telemetry
*************************

Understanding GPU usage provides important insights for IT administrators managing a data center.
Trends in GPU metrics correlate with workload behavior and make it possible to optimize resource allocation,
diagnose anomalies, and increase overall data center efficiency. As GPUs become more mainstream in
Kubernetes environments, users want access to GPU metrics to monitor GPU resources, just
as they do today for CPUs.

This document describes an end-to-end (e2e) workflow
for setting up and using `DCGM <https://developer.nvidia.com/dcgm>`_ within a Kubernetes environment.

For simplicity, the base environment used in this guide is Ubuntu 18.04 LTS with
a native installation of the NVIDIA drivers on the GPU-enabled nodes (that is, neither
the `NVIDIA GPU Operator <https://github.com/NVIDIA/gpu-operator>`_ nor containerized drivers are used
in this document).

**************
NVIDIA Drivers
**************
This section provides a summary of the steps for installing the driver using the ``apt`` package manager on Ubuntu LTS.

.. note::

   For complete instructions on setting up NVIDIA drivers, visit the quickstart guide at https://docs.nvidia.com/datacenter/tesla/tesla-installation-notes/index.html.
   The guide covers a number of pre-installation requirements and steps on supported Linux distributions for a successful install of the driver.


Install the kernel headers and development packages for the currently running kernel:

.. code-block:: console

   $ sudo apt-get install linux-headers-$(uname -r)

Set up the CUDA network repository and ensure packages on the CUDA network repository have priority over the Canonical repository:

.. code-block:: console

   $ distribution=$(. /etc/os-release;echo $ID$VERSION_ID | sed -e 's/\.//g') \
      && wget https://developer.download.nvidia.com/compute/cuda/repos/$distribution/x86_64/cuda-$distribution.pin \
      && sudo mv cuda-$distribution.pin /etc/apt/preferences.d/cuda-repository-pin-600

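
The ``distribution`` value used in the repository URL is the OS ``ID`` joined with ``VERSION_ID`` with the dot removed, for example ``ubuntu1804`` on Ubuntu 18.04. The following sketch isolates that step so it can be checked on its own. The function name ``cuda_distribution`` and the sample file are hypothetical and used only for illustration.

```shell
# Hypothetical helper: build the repository "distribution" string from an
# os-release style file (defaults to /etc/os-release).
cuda_distribution() {
    os_release_file="${1:-/etc/os-release}"
    # Source the file in a subshell so ID and VERSION_ID do not leak into the
    # caller, then strip the dot from the version (18.04 -> 1804).
    ( . "$os_release_file" && printf '%s%s\n' "$ID" "$VERSION_ID" | sed -e 's/\.//g' )
}

# Demonstrate with a sample file that mimics Ubuntu 18.04.
sample="$(mktemp)"
printf 'ID=ubuntu\nVERSION_ID="18.04"\n' > "$sample"
cuda_distribution "$sample"
# prints: ubuntu1804
rm -f "$sample"
```

Running ``cuda_distribution`` without an argument reads ``/etc/os-release`` on the current host.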
Install the CUDA repository GPG key:

.. code-block:: console

   $ sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/$distribution/x86_64/7fa2af80.pub \
      && echo "deb http://developer.download.nvidia.com/compute/cuda/repos/$distribution/x86_64 /" | sudo tee /etc/apt/sources.list.d/cuda.list

Update the ``apt`` repository cache and install the driver using the ``cuda-drivers`` meta-package. Optionally, add the ``--no-install-recommends`` option to the ``apt-get install`` command for a lean driver install
without any dependencies on X packages. This is particularly useful for headless installations on cloud instances:

.. code-block:: console

   $ sudo apt-get update \
      && sudo apt-get -y install cuda-drivers

**************
Install Docker
**************

Use the official Docker script to install the latest release of Docker:

.. code-block:: console

   $ curl https://get.docker.com | sh

.. code-block:: console

   $ sudo systemctl --now enable docker

********************************
Install NVIDIA Container Toolkit
********************************

To run GPU-accelerated containers in Docker, NVIDIA Container Toolkit for Docker is required.

Set up the ``stable`` repository and the GPG key:

.. code-block:: console

   $ distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
      && curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add - \
      && curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list

Install the NVIDIA runtime packages (and their dependencies) after updating the package listing:

.. code-block:: console

   $ sudo apt-get update \
      && sudo apt-get install -y nvidia-docker2

Because Kubernetes does not yet support the ``--gpus`` option with Docker, set up the ``nvidia`` runtime as the
default container runtime for Docker on the GPU node. To do so, add the ``default-runtime`` line to the Docker daemon
configuration file, which is usually located at ``/etc/docker/daemon.json``:

.. code-block:: json

   {
       "default-runtime": "nvidia",
       "runtimes": {
           "nvidia": {
               "path": "/usr/bin/nvidia-container-runtime",
               "runtimeArgs": []
           }
       }
   }

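
A typo in ``daemon.json`` is easy to miss, so a quick check of the file can help. The following is a minimal sketch, assuming a plain-text match is good enough; it does not parse JSON, and ``has_nvidia_default_runtime`` is a hypothetical helper name.

```shell
# Hypothetical helper: succeed if the given Docker daemon configuration names
# "nvidia" as the default runtime. This is a text match, not a JSON parser.
has_nvidia_default_runtime() {
    config_file="${1:-/etc/docker/daemon.json}"
    grep -Eq '"default-runtime"[[:space:]]*:[[:space:]]*"nvidia"' "$config_file"
}

# Demonstrate against a throwaway copy of the configuration.
sample="$(mktemp)"
cat > "$sample" <<'EOF'
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
EOF
if has_nvidia_default_runtime "$sample"; then
    echo "default runtime is nvidia"
else
    echo "default runtime is not nvidia"
fi
rm -f "$sample"
```

Called without an argument, the helper inspects ``/etc/docker/daemon.json``.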
Restart the Docker daemon to complete the installation after setting the default runtime:

.. code-block:: console

   $ sudo systemctl restart docker

At this point, a working setup can be tested by running a base CUDA container:

.. code-block:: console

   $ sudo docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi

The output should resemble the following:

.. code-block:: console

   +-----------------------------------------------------------------------------+
   | NVIDIA-SMI 450.51.06    Driver Version: 450.51.06    CUDA Version: 11.0     |
   |-------------------------------+----------------------+----------------------+
   | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
   | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
   |                               |                      |               MIG M. |
   |===============================+======================+======================|
   |   0  Tesla T4            On   | 00000000:00:1E.0 Off |                    0 |
   | N/A   34C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
   |                               |                      |                  N/A |
   +-------------------------------+----------------------+----------------------+

   +-----------------------------------------------------------------------------+
   | Processes:                                                                  |
   |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
   |        ID   ID                                                   Usage      |
   |=============================================================================|
   |  No running processes found                                                 |
   +-----------------------------------------------------------------------------+

****************************
Install NVIDIA Device Plugin
****************************

To use GPUs in Kubernetes, the `NVIDIA Device Plugin <https://github.com/NVIDIA/k8s-device-plugin/>`_ is required.
The NVIDIA Device Plugin is a daemonset that automatically enumerates the number of GPUs on each node of the cluster
and allows pods to be run on GPUs.

The preferred method to deploy the device plugin is as a daemonset using ``helm``. First, install Helm:

.. code-block:: console

   $ curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3 \
      && chmod 700 get_helm.sh \
      && ./get_helm.sh

Add the ``nvidia-device-plugin`` ``helm`` repository:

.. code-block:: console

   $ helm repo add nvdp https://nvidia.github.io/k8s-device-plugin \
      && helm repo update

Deploy the device plugin:

.. code-block:: console

   $ helm install --generate-name nvdp/nvidia-device-plugin

For more user-configurable options for deploying the daemonset, refer to the `documentation <https://github.com/NVIDIA/k8s-device-plugin/#deployment-via-helm>`_.

At this point, all the pods should be deployed:

.. code-block:: console

   $ kubectl get pods -A

.. code-block:: console

   NAMESPACE     NAME                                       READY   STATUS    RESTARTS   AGE
   kube-system   calico-kube-controllers-5fbfc9dfb6-2ttkk   1/1     Running   3          9d
   kube-system   calico-node-5vfcb                          1/1     Running   3          9d
   kube-system   coredns-66bff467f8-jzblc                   1/1     Running   4          9d
   kube-system   coredns-66bff467f8-l85sz                   1/1     Running   3          9d
   kube-system   etcd-ip-172-31-81-185                      1/1     Running   4          9d
   kube-system   kube-apiserver-ip-172-31-81-185            1/1     Running   3          9d
   kube-system   kube-controller-manager-ip-172-31-81-185   1/1     Running   3          9d
   kube-system   kube-proxy-86vlr                           1/1     Running   3          9d
   kube-system   kube-scheduler-ip-172-31-81-185            1/1     Running   4          9d
   kube-system   nvidia-device-plugin-1595448322-42vgf      1/1     Running   2          9d

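
On a larger cluster, scanning the listing by eye gets tedious. As a rough sketch, the filter below reads ``kubectl get pods -A`` output on standard input and prints only the pods that are not yet ``Running`` or ``Completed``. The helper name ``not_ready_pods`` and the sample rows are hypothetical.

```shell
# Hypothetical helper: read `kubectl get pods -A` output on stdin and print
# namespace/name for every pod whose STATUS is neither Running nor Completed.
not_ready_pods() {
    # Column 4 is STATUS when the listing includes the NAMESPACE column.
    awk 'NR > 1 && $4 != "Running" && $4 != "Completed" { print $1 "/" $2 }'
}

# Demonstrate with a canned listing; the pod names are made up for illustration.
printf '%s\n' \
    'NAMESPACE     NAME                READY   STATUS    RESTARTS   AGE' \
    'kube-system   coredns-example     1/1     Running   3          9d' \
    'kube-system   pending-example     0/1     Pending   0          1m' \
    | not_ready_pods
# prints: kube-system/pending-example
```

On a live cluster, the same filter can be fed directly: ``kubectl get pods -A | not_ready_pods``. No output means every listed pod is running or has completed.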
To test whether CUDA jobs can be deployed, run a sample CUDA ``vectorAdd`` application.

The following pod spec, shown for reference, requests one GPU:

.. code-block:: yaml

   apiVersion: v1
   kind: Pod
   metadata:
     name: gpu-operator-test
   spec:
     restartPolicy: OnFailure
     containers:
     - name: cuda-vector-add
       image: "nvidia/samples:vectoradd-cuda10.2"
       resources:
         limits:
           nvidia.com/gpu: 1

Save this podspec as ``gpu-pod.yaml``. Now, deploy the application:

.. code-block:: console

   $ kubectl apply -f gpu-pod.yaml

Check the pod status to confirm that the application ran to completion:

.. code-block:: console

   $ kubectl get pods gpu-operator-test

.. code-block:: console

   NAME                READY   STATUS      RESTARTS   AGE
   gpu-operator-test   0/1     Completed   0          9d

Then check the logs of the ``gpu-operator-test`` pod:

.. code-block:: console

   $ kubectl logs gpu-operator-test

.. code-block:: console

   [Vector addition of 50000 elements]
   Copy input data from the host memory to the CUDA device
   CUDA kernel launch with 196 blocks of 256 threads
   Copy output data from the CUDA device to the host memory
   Test PASSED
   Done

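
For scripted checks, the log inspection can be reduced to an exit status. This is a minimal sketch, assuming that the ``Test PASSED`` line shown in the logs is the success marker; ``vectoradd_passed`` is a hypothetical helper name.

```shell
# Hypothetical helper: read pod logs on stdin and succeed only if the
# vectorAdd sample reported success.
vectoradd_passed() {
    grep -q 'Test PASSED'
}

# Demonstrate with an excerpt of the sample logs.
printf '%s\n' '[Vector addition of 50000 elements]' 'Test PASSED' 'Done' \
    | vectoradd_passed && echo "vectorAdd succeeded"
```

On a cluster, the logs can be piped in directly: ``kubectl logs gpu-operator-test | vectoradd_passed``.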