
Kommander GPU configuration

Configure GPU for Kommander clusters

Prerequisites

Before you begin, you must:

  • Ensure nodes provide an NVIDIA GPU.

  • If you are using a public cloud service such as AWS, create an AMI with KIB using the instructions on the KIB for GPU page.

  • If you are deploying in a pre-provisioned environment, ensure that you have created the appropriate secret for your GPU nodepool and have uploaded the appropriate artifacts to each node. See the GPU only steps section on the Pre-provisioned Prerequisites Air-gapped page for additional information.

Specific instructions must be followed to enable nvidia-gpu-operator, depending on whether you want to deploy the app on a Management cluster or on an Attached or Managed cluster.

Once nvidia-gpu-operator has been enabled for your cluster type, proceed to the Select the correct Toolkit version for your NVIDIA GPU Operator section.

Enable NVIDIA Platform Application on Kommander (Management Cluster)

If you intend to run applications that make use of GPUs on your cluster, you should install the NVIDIA GPU operator. To enable NVIDIA GPU support when installing Kommander on a Management cluster, perform the following steps:

  1. Create an installation configuration file:

    CODE
    dkp install kommander --init > install.yaml
  2. Append the following to the apps section in the install.yaml file to enable the NVIDIA platform services:

    CODE
    apps:
      nvidia-gpu-operator:
        enabled: true
  3. Install Kommander using the configuration file you created:

    CODE
    dkp install kommander --installer-config ./install.yaml
  4. Proceed to the Select the correct Toolkit version for your NVIDIA GPU Operator section.

Enable NVIDIA Platform Application on Attached or Managed Clusters

If you intend to run applications that utilize GPUs on Attached or Managed clusters, you must enable the nvidia-gpu-operator platform application in the workspace.

To use the UI to enable the application, refer to the Customize a Workspace's Applications section of the Platform Applications page.

To use the CLI, refer to the Deploy Platform Applications via CLI page.
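
For reference, enabling the app through the CLI amounts to creating an AppDeployment for nvidia-gpu-operator in the workspace namespace on the management cluster. The following is a minimal sketch that assumes ${WORKSPACE_NAMESPACE} is set to your workspace namespace and reuses the ClusterApp version shown in the troubleshooting section below; refer to the linked pages for the authoritative steps:

CODE
cat <<EOF | kubectl apply -f -
apiVersion: apps.kommander.d2iq.io/v1alpha3
kind: AppDeployment
metadata:
  name: nvidia-gpu-operator
  namespace: ${WORKSPACE_NAMESPACE}
spec:
  appRef:
    kind: ClusterApp
    name: nvidia-gpu-operator-1.11.1   # the exact version depends on your DKP release
EOF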

If only a subset of the attached or managed clusters in the workspace are utilizing GPUs, refer to Enable an Application per Cluster for how to enable the nvidia-gpu-operator only on specific clusters.

After you have enabled the nvidia-gpu-operator app in the workspace on the necessary clusters, proceed to the next section.

Select the Correct Toolkit Version for your NVIDIA GPU Operator

The NVIDIA Container Toolkit allows users to run GPU-accelerated containers. The toolkit includes a container runtime library and utilities that automatically configure containers to leverage NVIDIA GPUs, and it must be configured correctly for your base operating system.
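
If you are not sure which base operating system your GPU nodes are running, the OS-IMAGE column of the node listing shows it:

CODE
kubectl get nodes -o wide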

Kommander (Management Cluster) Customization

  1. Select the correct Toolkit version based on your OS:

    CentOS 7.9/RHEL 7.9:
    If you’re using CentOS 7.9 or RHEL 7.9 as the base operating system for your GPU enabled nodes, set the toolkit.version parameter in your install.yaml to the following:

    CODE
    kind: Installation
    apps:
      nvidia-gpu-operator:
        values: |
          toolkit:
            version: v1.10.0-centos7


    RHEL 8.4/8.6 and SLES 15 SP3:
    If you’re using RHEL 8.4/8.6 or SLES 15 SP3 as the base operating system for your GPU enabled nodes, set the toolkit.version parameter in your install.yaml to the following:

    CODE
    kind: Installation
    apps:
      nvidia-gpu-operator:
        values: |
          toolkit:
            version: v1.10.0-ubi8


    Ubuntu 18.04 and 20.04:
    If you’re using Ubuntu 18.04 or 20.04 as the base operating system for your GPU enabled nodes, set the toolkit.version parameter in your install.yaml to the following:

    CODE
    kind: Installation
    apps:
      nvidia-gpu-operator:
        values: |
          toolkit:
            version: v1.11.0-ubuntu20.04
  2. Install Kommander using the configuration file you created:

    CODE
    dkp install kommander --installer-config ./install.yaml
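
After the installation completes, you can check that the toolkit DaemonSet is using the version you configured. This is a quick sketch; the image tag reported in the IMAGES column should match the toolkit.version value you set:

CODE
# The image tag should match the toolkit.version you configured
kubectl get daemonsets -A -o wide | grep nvidia-container-toolkit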

Workspace (Attached and Managed clusters) Customization

Refer to AppDeployment resources for how to use the CLI to customize the platform application on a workspace.

If specific attached/managed clusters in the workspace require different configurations, refer to Customize an Application per Cluster for how to do this.

  1. Select the correct Toolkit version based on your OS and create a ConfigMap with these configuration override values:

    CentOS 7.9/RHEL 7.9:
    If you’re using CentOS 7.9 or RHEL 7.9 as the base operating system for your GPU enabled nodes, set the toolkit.version value in the ConfigMap to the following:

    CODE
    cat <<EOF | kubectl apply -f -
    apiVersion: v1
    kind: ConfigMap
    metadata:
      namespace: ${WORKSPACE_NAMESPACE}
      name: nvidia-gpu-operator-overrides-attached
    data:
      values.yaml: |
        toolkit:
          version: v1.10.0-centos7
    EOF


    RHEL 8.4/8.6 and SLES 15 SP3:
    If you’re using RHEL 8.4/8.6 or SLES 15 SP3 as the base operating system for your GPU enabled nodes, set the toolkit.version value in the ConfigMap to the following:

    CODE
    cat <<EOF | kubectl apply -f -
    apiVersion: v1
    kind: ConfigMap
    metadata:
      namespace: ${WORKSPACE_NAMESPACE}
      name: nvidia-gpu-operator-overrides-attached
    data:
      values.yaml: |
        toolkit:
          version: v1.10.0-ubi8
    EOF

    Ubuntu 18.04 and 20.04:
    If you’re using Ubuntu 18.04 or 20.04 as the base operating system for your GPU enabled nodes, set the toolkit.version value in the ConfigMap to the following:

    CODE
    cat <<EOF | kubectl apply -f -
    apiVersion: v1
    kind: ConfigMap
    metadata:
      namespace: ${WORKSPACE_NAMESPACE}
      name: nvidia-gpu-operator-overrides-attached
    data:
      values.yaml: |
        toolkit:
          version: v1.11.0-ubuntu20.04
    EOF
  2. Note the name of this ConfigMap (nvidia-gpu-operator-overrides-attached) and use it to set the necessary nvidia-gpu-operator AppDeployment spec fields depending on the scope of the override, as in the sketch below. Alternatively, you can use the UI to pass in the configuration overrides for the app per workspace or per cluster.
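
    For example, to apply the override workspace-wide, reference the ConfigMap from the AppDeployment in the workspace namespace. This is a sketch that reuses the AppDeployment fields shown in the troubleshooting section below; the exact ClusterApp version depends on your DKP release:

    CODE
    cat <<EOF | kubectl apply -f -
    apiVersion: apps.kommander.d2iq.io/v1alpha3
    kind: AppDeployment
    metadata:
      name: nvidia-gpu-operator
      namespace: ${WORKSPACE_NAMESPACE}
    spec:
      appRef:
        kind: ClusterApp
        name: nvidia-gpu-operator-1.11.1   # the exact version depends on your DKP release
      configOverrides:
        name: nvidia-gpu-operator-overrides-attached
    EOF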

Validate that the Application has Started Correctly

Run the following command to validate that your application has started correctly:

CODE
kubectl get pods -A | grep nvidia

The output should be similar to the following:

CODE
nvidia-container-toolkit-daemonset-7h2l5   1/1 	Running 	0          	   150m
nvidia-container-toolkit-daemonset-mm65g   1/1 	Running 	0          	   150m
nvidia-container-toolkit-daemonset-mv7xj   1/1 	Running 	0          	   150m
nvidia-cuda-validator-pdlz8            	0/1 	Completed   0          	   150m
nvidia-cuda-validator-r7qc4            	0/1 	Completed   0          	   150m
nvidia-cuda-validator-xvtqm            	0/1 	Completed   0          	   150m
nvidia-dcgm-exporter-9r6rl             	1/1 	Running 	1 (149m ago)   150m
nvidia-dcgm-exporter-hn6hn             	1/1 	Running 	1 (149m ago)   150m
nvidia-dcgm-exporter-j7g7g             	1/1 	Running 	0          	   150m
nvidia-dcgm-jpr57                      	1/1 	Running 	0          	   150m
nvidia-dcgm-jwldh                      	1/1 	Running 	0          	   150m
nvidia-dcgm-qg2vc                      	1/1 	Running 	0          	   150m
nvidia-device-plugin-daemonset-2gv8h   	1/1 	Running 	0          	   150m
nvidia-device-plugin-daemonset-tcmgk   	1/1 	Running 	0          	   150m
nvidia-device-plugin-daemonset-vqj88   	1/1 	Running 	0          	   150m
nvidia-device-plugin-validator-9xdqr   	0/1 	Completed   0          	   149m
nvidia-device-plugin-validator-jjhdr   	0/1 	Completed   0          	   149m
nvidia-device-plugin-validator-llxjk   	0/1 	Completed   0          	   149m
nvidia-operator-validator-9kzv4        	1/1 	Running 	0          	   150m
nvidia-operator-validator-fvsr7        	1/1 	Running 	0          	   150m
nvidia-operator-validator-qr9cj        	1/1 	Running 	0          	   150m

If you are seeing errors, ensure that you set the container toolkit version appropriately based on your OS, as described in the previous section.
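
To see why a particular pod is failing, describe it and check its container logs. Substitute the namespace and pod name reported by the previous command:

CODE
# Inspect events and container states for a failing pod
kubectl describe pod -n <namespace> <nvidia-pod-name>
# Fetch logs from the pod's containers
kubectl logs -n <namespace> <nvidia-pod-name> --all-containers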

NVIDIA GPU Monitoring

Kommander uses the NVIDIA Data Center GPU Manager (DCGM) to export GPU metrics to Prometheus. By default, Kommander includes a Grafana dashboard called NVIDIA DCGM Exporter Dashboard for monitoring GPU metrics. This GPU dashboard is available in Kommander’s Grafana UI.
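
If you want to inspect the raw metrics behind this dashboard, you can port-forward the DCGM exporter service and query it directly. This is a sketch; the service name (nvidia-dcgm-exporter), port (9400), and namespace are the usual defaults and depend on where the GPU operator components run in your cluster:

CODE
# Forward the DCGM exporter service locally (adjust namespace, service name, and port as needed)
kubectl port-forward -n <gpu-operator-namespace> svc/nvidia-dcgm-exporter 9400:9400 &
# Query a GPU utilization metric exported by DCGM
curl -s localhost:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL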

NVIDIA MIG Settings

MIG stands for Multi-Instance GPU. It is a mode of operation on supported NVIDIA GPUs that allows the user to partition a GPU into a set of MIG devices. Each MIG device appears to the software consuming it as a mini-GPU with a fixed partition of memory and a fixed partition of compute resources.

NOTE: MIG is only available for the following NVIDIA devices: H100, A100, and A30.

To Configure MIG

  1. Set the MIG strategy according to your GPU topology:
    Set mig.strategy to mixed when MIG mode is not enabled on all GPUs on a node.
    Set mig.strategy to single when MIG mode is enabled on all GPUs on a node and they expose the same MIG device types.

    For the Management Cluster, this can be set at install time by modifying the Kommander configuration file to add configuration for the nvidia-gpu-operator application:

    CODE
    apiVersion: config.kommander.mesosphere.io/v1alpha1
    kind: Installation
    apps:
      nvidia-gpu-operator:
        values: |
          mig:
            strategy: single
    ...

    Or by modifying the clusterPolicy object for the GPU operator once it has already been installed.
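
    For example, here is a sketch of patching the existing ClusterPolicy object (the GPU operator typically names it cluster-policy; confirm with kubectl get clusterpolicies):

    CODE
    # Switch the MIG strategy on an already-installed GPU operator
    # (the object name may differ in your cluster)
    kubectl patch clusterpolicy/cluster-policy \
      --type merge \
      -p '{"spec":{"mig":{"strategy":"single"}}}'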

  2. Set the MIG profile for the GPU you are using. In this example, we are using the A30 GPU, which supports the following MIG profiles:

    CODE
    4 GPU instances @ 6GB each
    2 GPU instances @ 12GB each
    1 GPU instance @ 24GB

    Set the MIG profile by labeling the node ${NODE} with the profile, as in the example below:

    CODE
    kubectl label nodes ${NODE} nvidia.com/mig.config=all-1g.6gb --overwrite

  3. Check the node labels to verify that the changes were applied to your MIG-enabled GPU node:

    CODE
    kubectl get no -o json | jq .items[0].metadata.labels


    The output should include labels similar to the following:

    CODE
    "nvidia.com/mig.config": "all-1g.6gb",
    "nvidia.com/mig.config.state": "success",
    "nvidia.com/mig.strategy": "single"
  4. Deploy a sample workload:

    CODE
    apiVersion: v1
    kind: Pod
    metadata:
      name: cuda-vector-add
    spec:
      restartPolicy: OnFailure
      containers:
        - name: cuda-vectoradd
          image: "nvidia/samples:vectoradd-cuda11.2.1"
          resources:
            limits:
              nvidia.com/gpu: 1
      nodeSelector:
        "nvidia.com/gpu.product": NVIDIA-A30-MIG-1g.6gb


    If the workload successfully finishes, then your GPU has been properly MIG partitioned.
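
    You can also confirm that the MIG devices are advertised to Kubernetes by inspecting the node’s allocatable resources. With the single strategy they appear as nvidia.com/gpu; with the all-1g.6gb profile on an A30 you should see a count of 4:

    CODE
    kubectl get node ${NODE} -o json | jq .status.allocatable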

Troubleshooting NVIDIA GPU Operator on Kommander

If you run into any errors with the NVIDIA GPU Operator, here are a couple of commands you can run to troubleshoot:

  1. Connect (using SSH or similar) to your GPU enabled nodes and run the nvidia-smi command. Your output should be similar to the following example:

    CODE
      [ec2-user@ip-10-0-0-241 ~]$ nvidia-smi
      Thu Nov  3 22:52:59 2022       
      +-----------------------------------------------------------------------------+
      | NVIDIA-SMI 470.82.01    Driver Version: 470.82.01    CUDA Version: 11.4     |
      |-------------------------------+----------------------+----------------------+
      | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
      | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
      |                               |                      |               MIG M. |
      |===============================+======================+======================|
      |   0  Tesla T4            On   | 00000000:00:1E.0 Off |                    0 |
      | N/A   54C    P8    11W /  70W |      0MiB / 15109MiB |      0%      Default |
      |                               |                      |                  N/A |
      +-------------------------------+----------------------+----------------------+                                                                        
      +-----------------------------------------------------------------------------+
      | Processes:                                                                  |
      |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
      |        ID   ID                                                   Usage      |
      |=============================================================================|
      |  No running processes found                                                 |
      +-----------------------------------------------------------------------------+
  2. Another common issue is a misconfigured toolkit version, which results in NVIDIA pods entering a bad state.
    For example:

    CODE
    nvidia-container-toolkit-daemonset-jrqt2                          1/1     Running                 0             29s
    nvidia-dcgm-exporter-b4mww                                        0/1     Error                   1 (9s ago)    16s
    nvidia-dcgm-pqsz8                                                 0/1     CrashLoopBackOff        1 (13s ago)   27s
    nvidia-device-plugin-daemonset-7fkzr                              0/1     Init:0/1                0             14s
    nvidia-operator-validator-zxn4w                                   0/1     Init:CrashLoopBackOff   1 (7s ago)    11s

    To modify the toolkit version, run the following commands to update the AppDeployment for the nvidia-gpu-operator application:

    • Provide the name of a ConfigMap with the custom configuration in the AppDeployment:

    CODE
    cat <<EOF | kubectl apply -f -
    apiVersion: apps.kommander.d2iq.io/v1alpha3
    kind: AppDeployment
    metadata:
      name: nvidia-gpu-operator
      namespace: kommander
    spec:
      appRef:
        kind: ClusterApp
        name: nvidia-gpu-operator-1.11.1
      configOverrides:
        name: nvidia-gpu-operator-overrides
    EOF
    


    • Create the ConfigMap with the name provided in the previous step. This ConfigMap provides the custom configuration on top of the default configuration; set the toolkit version appropriately:

    CODE
    cat <<EOF | kubectl apply -f -
    apiVersion: v1
    kind: ConfigMap
    metadata:
      namespace: kommander
      name: nvidia-gpu-operator-overrides
    data:
      values.yaml: |
        toolkit:
          version: v1.10.0-centos7
    EOF
  3. If a node has an NVIDIA GPU installed and the nvidia-gpu-operator application is enabled on the cluster, but the node is still not accepting GPU workloads, it's possible that the nodes do not have a label that indicates there is an NVIDIA GPU present.
    By default, the GPU operator attempts to configure nodes that have the following labels present, which are usually applied by the node feature discovery component:

    CODE
    	"feature.node.kubernetes.io/pci-10de.present":      "true",
    	"feature.node.kubernetes.io/pci-0302_10de.present": "true",
    	"feature.node.kubernetes.io/pci-0300_10de.present": "true",

    If these labels are not present on a node that you know contains an NVIDIA GPU, you can manually label the node using the following command:

    CODE
    kubectl label node ${NODE} feature.node.kubernetes.io/pci-0302_10de.present=true
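
    To confirm which nodes carry the label, and will therefore be picked up by the operator, list the nodes filtered by it:

    CODE
    kubectl get nodes -l feature.node.kubernetes.io/pci-0302_10de.present=true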


Disable NVIDIA GPU Operator Platform Application on Kommander

  1. Delete all GPU workloads on the GPU nodes where the NVIDIA GPU Operator platform application is present.

  2. Delete the existing NVIDIA GPU Operator AppDeployment using the following command:

    CODE
    kubectl delete appdeployment -n kommander nvidia-gpu-operator

  3. Wait for all NVIDIA related resources in the Terminating state to be cleaned up. You can check pod status with the following command:

    CODE
    kubectl get pods -A | grep nvidia

For information on how to delete node pools, refer to Pre-provisioned Create and Delete Node Pools.
