Kommander GPU configuration
Configure GPU for Kommander clusters
Prerequisites
Before you begin, you must:
Ensure nodes provide an Nvidia GPU.
For AWS, select a GPU instance type from the Accelerated Computing section of the AWS instance types page.
Run nodes on CentOS 7.
Perform the Node Deployment procedure.
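If the cluster is already up, you can sanity-check these prerequisites with standard kubectl output. The commands below are only a convenience check; the instance-type column relies on the well-known node.kubernetes.io/instance-type label, which AWS-provisioned nodes carry.

    # Show each node's OS image (should report CentOS 7) along with other node details
    kubectl get nodes -o wide

    # On AWS, show the instance type as an extra column so you can confirm it is a
    # GPU (Accelerated Computing) instance type
    kubectl get nodes -L node.kubernetes.io/instance-type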
Enable Nvidia Platform Service on Kommander
To enable Nvidia GPU support when installing Kommander, perform the following steps:
Create an installation configuration file:
    dkp install kommander --init > install.yaml
Append the following to the apps section in the install.yaml file to enable Nvidia platform services:

    apps:
      nvidia:
        enabled: true
Install Kommander, using the configuration file you created:
    dkp install kommander --installer-config ./install.yaml
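Once the installation completes, a quick way to confirm that GPUs are schedulable is to run a throwaway pod that requests one GPU and prints nvidia-smi output. This is a minimal smoke test, not part of the Kommander installation; the pod name and CUDA image tag are examples you can adjust.

    # gpu-smoke-test.yaml -- requests one GPU and runs nvidia-smi once
    apiVersion: v1
    kind: Pod
    metadata:
      name: gpu-smoke-test
    spec:
      restartPolicy: Never
      containers:
      - name: nvidia-smi
        image: nvidia/cuda:11.0.3-base-ubuntu20.04   # example tag; match your driver version
        command: ["nvidia-smi"]
        resources:
          limits:
            nvidia.com/gpu: 1

Apply it with kubectl apply -f gpu-smoke-test.yaml, check the result with kubectl logs gpu-smoke-test, and delete the pod afterwards.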
Disable Nvidia Platform Service on Kommander
Delete all GPU workloads on the GPU nodes where the Nvidia platform service needs to be disabled (a sketch for identifying GPU workloads follows these steps).
Delete the existing Nvidia platform service.
Wait for all Nvidia-related resources in the Terminating state to be cleaned up. You can check pod status with:

    kubectl get pods -A | grep nvidia
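The documentation does not prescribe how to find the GPU workloads mentioned in the first step. As a sketch, assuming GPU workloads are pods that request the nvidia.com/gpu resource and that jq is available, you can list them and then remove each one with kubectl delete pod:

    # Print namespace/name for every pod that requests an Nvidia GPU
    kubectl get pods -A -o json | jq -r '
      .items[]
      | select(any(.spec.containers[]; .resources.requests["nvidia.com/gpu"] != null))
      | .metadata.namespace + "/" + .metadata.name'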
Upgrade Nvidia Platform Service on Kommander
Kommander can automatically upgrade the Nvidia GPU platform service. However, GPU workloads must be drained before the Nvidia platform service can be upgraded.
To upgrade, first follow the instructions above to disable the service, and then follow the instructions to enable it again.
Nvidia GPU Monitoring
Kommander uses the NVIDIA Data Center GPU Manager (DCGM) to export GPU metrics to Prometheus. By default, Kommander includes a Grafana dashboard called NVIDIA DCGM Exporter Dashboard to monitor GPU metrics. This dashboard is available in Kommander's Grafana UI.
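If the dashboard shows no data, a useful first check is whether Prometheus is actually scraping the exporter. The command below assumes the Prometheus Operator's ServiceMonitor CRD is present in the cluster and that the exporter's ServiceMonitor name contains "dcgm"; adjust the filter to your deployment.

    # Verify a ServiceMonitor exists for the DCGM exporter
    kubectl get servicemonitors -A | grep -i dcgm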
Troubleshooting
Determine if all Nvidia pods are in the Running state, as expected:

    kubectl get pods -A | grep nvidia
Collect the logs for any problematic Nvidia pods that are crashing, returning errors, or flapping. For example:

    kubectl logs -n kube-system nvidia-nvidia-device-plugin-rpdwj
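If the logs are not conclusive, the pod description and its recent events often reveal scheduling or image-pull issues. The namespace and pod name below reuse the example above:

    # Inspect the pod's status, conditions, and container states
    kubectl describe pod -n kube-system nvidia-nvidia-device-plugin-rpdwj

    # Show recent events associated with that pod
    kubectl get events -n kube-system --field-selector involvedObject.name=nvidia-nvidia-device-plugin-rpdwj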
To recover from this problem, you must restart all Nvidia platform service pods that are running on the SAME host. In the example below, all Nvidia resources are restarted on the node ip-10-0-101-65.us-west-2.compute.internal:

    $ kubectl get pod -A -o wide | grep nvidia
    kommander     nvidia-nvidia-dcgm-exporter-s26r7   1/1   Running   0   51m   192.168.167.174   ip-10-0-101-65.us-west-2.compute.internal   <none>   <none>
    kommander     nvidia-nvidia-dcgm-exporter-w7lf4   1/1   Running   0   51m   192.168.111.173   ip-10-0-75-212.us-west-2.compute.internal   <none>   <none>
    kube-system   nvidia-nvidia-device-plugin-rpdwj   1/1   Running   0   51m   192.168.167.175   ip-10-0-101-65.us-west-2.compute.internal   <none>   <none>
    kube-system   nvidia-nvidia-device-plugin-z7m2s   1/1   Running   0   51m   192.168.111.172   ip-10-0-75-212.us-west-2.compute.internal   <none>   <none>

    $ kubectl delete pod -n kommander nvidia-nvidia-dcgm-exporter-s26r7
    pod "nvidia-nvidia-dcgm-exporter-s26r7" deleted

    $ kubectl delete pod -n kube-system nvidia-nvidia-device-plugin-rpdwj
    pod "nvidia-nvidia-device-plugin-rpdwj" deleted
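When several Nvidia pods land on the same node, the deletions can also be scripted. This is only a convenience sketch using standard awk and xargs; NODE is the node name from the example above.

    # Delete every Nvidia platform service pod scheduled on one node
    NODE=ip-10-0-101-65.us-west-2.compute.internal
    kubectl get pods -A -o wide | grep nvidia | grep "$NODE" \
      | awk '{print "-n " $1 " " $2}' | xargs -L1 kubectl delete pod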
To collect more debug information on the Nvidia platform service, run:
    helm get all nvidia -n kommander
To validate that metrics are being produced and exported by the Nvidia DCGM exporter on a GPU node, run:

    kubectl exec -n kommander nvidia-nvidia-dcgm-exporter-s26r7 --tty -- wget -nv -O- localhost:9400/metrics
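The full metrics dump is long. To spot-check a few well-known gauges, filter the output; the metric names below follow the standard dcgm-exporter naming and may vary between exporter versions.

    kubectl exec -n kommander nvidia-nvidia-dcgm-exporter-s26r7 --tty -- \
      wget -nv -O- localhost:9400/metrics | grep -E 'DCGM_FI_DEV_(GPU_UTIL|GPU_TEMP|FB_USED)'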