Kommander GPU configuration

Configure GPU for Kommander clusters

Prerequisites

Before you begin, you must:

  • Ensure nodes provide an Nvidia GPU (a quick verification sketch follows this list).

  • For AWS, select a GPU instance type from the Accelerated Computing section of the AWS instance types.

  • Run nodes on CentOS 7.

  • Perform the Node Deployment procedure.
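
To spot-check the GPU and OS prerequisites on a node, a quick manual check can look like the following sketch (run directly on the node; assumes pciutils is installed so lspci is available):

CODE
# Confirm the node exposes an Nvidia GPU on the PCI bus
lspci | grep -i nvidia
# Confirm the node is running CentOS 7
cat /etc/centos-release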

Enable Nvidia Platform Service on Kommander

To enable Nvidia GPU support when installing Kommander, perform the following steps:

  1. Create an installation configuration file:

    CODE
    dkp install kommander --init > install.yaml
  2. Append the following to the apps section in the install.yaml file to enable Nvidia platform services.

    CODE
    apps:
      nvidia:
        enabled: true
  3. Install Kommander using the configuration file you created:

    CODE
    dkp install kommander --installer-config ./install.yaml
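
After the installation finishes, you can confirm that the Nvidia platform service pods have started (the same check is used in the Troubleshooting section below):

CODE
kubectl get pods -A | grep nvidia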

Disable Nvidia Platform Service on Kommander

  1. Delete all GPU workloads on the GPU nodes where the Nvidia platform service is being disabled or upgraded (one way to list GPU workloads is sketched after this procedure).

  2. Delete the existing Nvidia platform service (one possible approach is sketched after this procedure).

  3. Wait for all Nvidia-related resources in the Terminating state to be cleaned up. You can check pod status with:

    CODE
    kubectl get pods -A | grep nvidia
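
The first step requires knowing which pods actually consume GPUs. A minimal sketch for listing pods that request the nvidia.com/gpu resource (assumes jq is available; adapt the filter to how your workloads request GPUs):

CODE
kubectl get pods -A -o json \
  | jq -r '.items[]
      | select(any(.spec.containers[]; .resources.limits["nvidia.com/gpu"] != null))
      | "\(.metadata.namespace)/\(.metadata.name)"'

For the second step, if the service was enabled through the installer configuration shown above, one possible approach (an assumption, not the only way to delete the service) is to set the flag back to false and re-run the installer with the same configuration file:

CODE
# In install.yaml, change the Nvidia app back to:
#   apps:
#     nvidia:
#       enabled: false
# then re-run the installer (assumption: re-running the installer reconciles
# the disabled app and removes the Nvidia platform service)
dkp install kommander --installer-config ./install.yaml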

Upgrade Nvidia Platform Service on Kommander

Kommander can automatically upgrade the Nvidia GPU platform service. However, GPU workloads must be drained before the Nvidia platform service can be upgraded.

To upgrade, follow the instructions to disable the service, and then the instructions to enable the service.

Nvidia GPU Monitoring

Kommander uses the NVIDIA Data Center GPU Manager (DCGM) to export GPU metrics to Prometheus. By default, Kommander provides a Grafana dashboard called NVIDIA DCGM Exporter Dashboard to monitor GPU metrics. This dashboard is available in Kommander’s Grafana UI.
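
To look at the raw metrics behind this dashboard, you can query a DCGM exporter pod directly. The pod name below is only an example; find yours with kubectl get pods -A | grep dcgm-exporter. DCGM_FI_DEV_GPU_UTIL is one of the standard GPU utilization metrics the exporter exposes:

CODE
# Substitute a DCGM exporter pod name from your cluster
kubectl exec -n kommander nvidia-nvidia-dcgm-exporter-s26r7 --tty -- \
  wget -nv -O- localhost:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL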

Troubleshooting

  1. Determine whether all Nvidia pods are in the Running state, as expected:

    CODE
    kubectl get pods -A | grep nvidia
  2. Collect the logs of any problematic Nvidia pods, for example pods that are crashing, returning errors, or flapping:

    CODE
    kubectl logs -n kube-system nvidia-nvidia-device-plugin-rpdwj
  3. To recover, restart all Nvidia platform service pods running on the same host as the problematic pod. In the example below, all Nvidia resources are restarted on the node ip-10-0-101-65.us-west-2.compute.internal:

    CODE
    $ kubectl get pod -A -o wide | grep nvidia
    kommander                           nvidia-nvidia-dcgm-exporter-s26r7                                    1/1     Running     0          51m     192.168.167.174   ip-10-0-101-65.us-west-2.compute.internal    <none>           <none>
    kommander                           nvidia-nvidia-dcgm-exporter-w7lf4                                    1/1     Running     0          51m     192.168.111.173   ip-10-0-75-212.us-west-2.compute.internal    <none>           <none>
    kube-system                         nvidia-nvidia-device-plugin-rpdwj                                    1/1     Running     0          51m     192.168.167.175   ip-10-0-101-65.us-west-2.compute.internal    <none>           <none>
    kube-system                         nvidia-nvidia-device-plugin-z7m2s                                    1/1     Running     0          51m     192.168.111.172   ip-10-0-75-212.us-west-2.compute.internal    <none>           <none>
    $ kubectl delete pod -n kommander nvidia-nvidia-dcgm-exporter-s26r7
    pod "nvidia-nvidia-dcgm-exporter-s26r7" deleted
    $ kubectl delete pod -n kube-system nvidia-nvidia-device-plugin-rpdwj
    pod "nvidia-nvidia-device-plugin-rpdwj" deleted
  4. To collect more debug information on the Nvidia platform service, run:

    CODE
    helm get all nvidia -n kommander
  5. To validate the metrics produced and exported by the Nvidia DCGM exporter on a GPU node, run:

    CODE
    kubectl exec -n kommander nvidia-nvidia-dcgm-exporter-s26r7 --tty -- wget -nv -O- localhost:9400/metrics
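
As an additional check, you can confirm that the Nvidia device plugin has advertised GPUs to the scheduler by inspecting node allocatable resources (nvidia.com/gpu is the standard resource name registered by the Nvidia device plugin):

CODE
kubectl get nodes -o 'custom-columns=NODE:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu'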