DKP 2.5.0 Known Issues and Limitations
The following items are known issues with this release.
AWS additionalTags Cannot Contain Spaces
Due to an upstream bug in the cluster-api-provider-aws component, it is not possible to specify tags whose names contain spaces in the additionalTags section of an AWSCluster. If any such tags are present during an upgrade of the capi-components, you may receive a validation error and will need to remove those tags. This issue will be corrected in a future DKP release. An illustrative excerpt is shown below.
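The following is a minimal, hypothetical AWSCluster excerpt that only illustrates which kind of tag key triggers the error; the cluster name, tag keys, and values are placeholders, and the apiVersion varies by cluster-api-provider-aws version:

# Hypothetical AWSCluster excerpt (placeholder names)
kind: AWSCluster
metadata:
  name: example-cluster
spec:
  additionalTags:
    owner: platform-team       # keys without spaces are accepted
    # "cost center": finance   # a key containing a space triggers the capi-components validation error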
Use Static Credentials to Provision an Azure Cluster
Only static credentials can be used when provisioning an Azure cluster.
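As an illustration, static credentials for the Azure service principal are typically supplied as environment variables before creating the cluster. The variable names below are the ones commonly used by Cluster API Provider Azure; the values are placeholders, and you should confirm the exact variables in the DKP Azure installation documentation for your version:

export AZURE_SUBSCRIPTION_ID="<subscription-id>"
export AZURE_TENANT_ID="<tenant-id>"
export AZURE_CLIENT_ID="<service-principal-app-id>"
export AZURE_CLIENT_SECRET="<service-principal-secret>"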
Containerd 1.4.13 File Limit Issue
In this version of DKP, we introduced containerd 1.6.17. The systemd unit for containerd 1.6.17 provided upstream removes all file number limits (LimitNOFILE=infinity). In our testing, we found that removing these limits broke some IO-sensitive applications such as Rook Ceph and HAProxy. Because of this, the KIB version included in this release sets the LimitNOFILE value in the containerd systemd unit to the value (1048576) used in previous containerd 1.4.13 releases.
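Images built with the included KIB version already carry this setting. If you manage containerd on a node outside of KIB-built images and want to replicate it, a systemd drop-in along the following lines would pin the limit; the drop-in path is illustrative:

# /etc/systemd/system/containerd.service.d/99-limit-nofile.conf (illustrative path)
# [Service]
# LimitNOFILE=1048576
#
# Then reload systemd and restart containerd:
sudo systemctl daemon-reload
sudo systemctl restart containerd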
Intermittent Error Status when Creating EKS Clusters in the UI
When provisioning an EKS cluster through the UI, you may receive a brief error state because the EKS cluster may sporadically lose connectivity with the management cluster which results in the following symptoms:
The UI shows the cluster is in an error state.
The kubeconfig generated and retrieved from Kommander ceases to work.
Applications created on the management cluster may not be immediately federated to managed EKS clusters.
After a few moments, the error resolves without any action on your part. A new kubeconfig generated and retrieved from Kommander then works properly, and the UI shows the cluster is working again. In the meantime, you can continue to use the UI to work with the cluster, for example to deploy applications, create projects, and add roles.
Installation Issue in Pre-provisioned Environments
An issue with Rook Ceph’s deployment prevents pre-provisioned environments from installing this DKP version. To work around this issue, you must set up a minimum of 40 GB of raw storage for your worker nodes and customize your Rook Ceph installation as described in Install Kommander in a Pre-provisioned Environment.
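As a quick check (a suggestion, not part of the documented procedure), you can verify on each worker node that an unused raw block device of at least 40 GB is available for Ceph; device names vary by environment:

lsblk -o NAME,SIZE,TYPE,FSTYPE,MOUNTPOINT   # a suitable raw device shows no FSTYPE and no MOUNTPOINT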
Resolve issues with failed HelmReleases
An issue with the Flux helm-controller can cause HelmReleases to fail with the error message Helm upgrade failed: another operation (install/upgrade/rollback) is in progress. This can happen when the helm-controller is restarted while a HelmRelease is still upgrading or installing.
Workaround
To confirm that the HelmRelease error was caused by the helm-controller restarting, first suspend and resume the HelmRelease:
kubectl -n <namespace> patch helmrelease <HELMRELEASE_NAME> --type='json' -p='[{"op": "replace", "path": "/spec/suspend", "value": true}]'
kubectl -n <namespace> patch helmrelease <HELMRELEASE_NAME> --type='json' -p='[{"op": "replace", "path": "/spec/suspend", "value": false}]'
You should see the HelmRelease attempt to reconcile and then either succeed (with status Release reconciliation succeeded) or fail with the same error as before; you can watch its status as shown below. This might resolve the issue. If the HelmRelease is still in the failed state, the failure is likely related to the helm-controller restarting. The steps that follow use the reloader HelmRelease as the example of a stuck release.
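To watch the HelmRelease status while it reconciles, a standard kubectl query such as the following works; the namespace and release name are placeholders:

kubectl -n <namespace> get helmrelease <HELMRELEASE_NAME> --watch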
To resolve the issue, follow these steps:
List secrets containing the affected HelmRelease name:
kubectl get secrets -n ${NAMESPACE} | grep reloader
The output should look like this:
kommander-reloader-reloader-token-9qd8b    kubernetes.io/service-account-token   3   171m
sh.helm.release.v1.kommander-reloader.v1   helm.sh/release.v1                    1   171m
sh.helm.release.v1.kommander-reloader.v2   helm.sh/release.v1                    1   117m
In this example, sh.helm.release.v1.kommander-reloader.v2 is the most recent revision. Find and delete the most recent revision secret, for example, sh.helm.release.v1.*.<revision>:
kubectl delete secret -n <namespace> <most recent helm revision secret name>
Suspend and resume the HelmRelease to trigger a reconciliation:
kubectl -n <namespace> patch helmrelease <HELMRELEASE_NAME> --type='json' -p='[{"op": "replace", "path": "/spec/suspend", "value": true}]'
kubectl -n <namespace> patch helmrelease <HELMRELEASE_NAME> --type='json' -p='[{"op": "replace", "path": "/spec/suspend", "value": false}]'
You should see the HelmRelease reconcile, and the upgrade or install eventually succeeds.
Limitations to Disk Resizing in vSphere
The DKP CLI flags --control-plane-disk-size and --worker-disk-size are unable to resize the root file system of VMs created using OS images. The flags work by resizing the primary disk of the VM. When the VM boots, the root file system is expanded to fill the disk, but that expansion does not work for some file systems, for example, file systems contained in an LVM Logical Volume. To ensure your root file system has the size you expect, see Create a vSphere Base OS Image | Disk-Size.
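For context, automatic growth at boot typically stops at the partition or physical volume when the root file system lives on LVM, so expanding it afterward requires the usual LVM steps. The following is a minimal sketch only, with assumed device and volume names (/dev/sda3, vg0/root) and an ext4 file system; it is not a DKP-specific procedure:

growpart /dev/sda 3                     # grow the partition backing the LVM physical volume
pvresize /dev/sda3                      # make LVM aware of the larger physical volume
lvextend -l +100%FREE /dev/vg0/root     # extend the root logical volume into the free space
resize2fs /dev/vg0/root                 # grow the ext4 file system (use xfs_growfs for XFS)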
Error Status in Grafana Logging Dashboard with EKS Clusters
Currently, it is not possible to use FluentBit to collect Admin-level logs on a managed EKS cluster.
If you have these logs enabled, the following message appears when you access the Kubernetes Audit Dashboard in the Grafana Logging Dashboard:
Cannot read properties of undefined (reading '0')
Logging Operator Upgrade Error
There is a race condition that could result in logging-operator-logging-fluentd using the incorrect image tag during the upgrade from DKP 2.4.0 to DKP 2.5.0. The image tag is eventually corrected by the logging-operator; however, due to the nature of StatefulSets, the failing pod needs to be removed for the StatefulSet to continue rolling out the required updates.
Run this command to find out whether any Fluentd pods are in the ImagePullBackOff state post-upgrade:
kubectl get pod -l app.kubernetes.io/name=fluentd,app.kubernetes.io/managed-by=logging-operator-logging,app.kubernetes.io/component=fluentd -n kommander
If ImagePullBackOff is present in the output, as in the example below, continue with these steps to resolve the issue:
NAME                                 READY   STATUS             RESTARTS   AGE
logging-operator-logging-fluentd-0   3/3     Running            0          6m41s
logging-operator-logging-fluentd-1   2/3     ImagePullBackOff   0          2m33s
Delete the Fluentd pod that is in an ImagePullBackOff state. In this case, it is logging-operator-logging-fluentd-1:
kubectl delete pod -n kommander logging-operator-logging-fluentd-1
The upgrade of the logging-operator-logging-fluentd StatefulSet now proceeds as normal.
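If you want to confirm that the StatefulSet finishes rolling out after the pod is deleted, a standard rollout check can be used; the StatefulSet name is taken from the steps above:

kubectl rollout status statefulset -n kommander logging-operator-logging-fluentd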
Nodepools Update Error with Knative
The following only applies to your environment if you have Knative installed and its deployment is scaled to fewer than 5 pods.
An issue with the PodDisruptionBudget resource blocks the deletion of old nodes when upgrading from DKP 2.4.0 to DKP 2.5.0, which results in a failure of the DKP nodepools upgrade.
If the dkp update nodepool command fails, check whether a PodDisruptionBudget with ALLOWED DISRUPTIONS equal to 0 exists using the following command:
kubectl get pdb -n knative-serving

NAME            MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
activator-pdb   80%             N/A               0                     22h
webhook-pdb     80%             N/A               0                     22h
Obtain the list of pods in the knative-serving namespace covered by these PodDisruptionBudgets with the following command:
kubectl get pods -n knative-serving -l 'app in (webhook, activator)'

The output should look similar to the following:
NAME                       READY   STATUS    RESTARTS   AGE
webhook-XXXXXXXXX-XXXX     2/2     Running   0          5d21h
activator-XXXXXXXXX-XXXX   2/2     Running   0          4d23h
Delete the pods covered by the PodDisruptionBudget resources:
kubectl delete pod -n knative-serving activator-XXXX-XXX webhook-XXXXX-XXXX
The upgrade of DKP and Knative now proceeds as normal. Re-run the dkp update nodepool command.
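Before re-running the command, you can optionally confirm that the PodDisruptionBudgets report allowed disruptions greater than zero once the replacement pods are Running; this check is a suggestion, not part of the documented procedure:

kubectl get pdb -n knative-serving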
Rook Ceph Install Error
An issue may emerge when installing rook-ceph on vSphere clusters that use RHEL operating systems. This issue occurs during the initial installation of rook-ceph, causing the object store used by Velero and Grafana Loki to be unavailable. If the installation of the Kommander component of DKP is unsuccessful due to rook-ceph failing, you might need to apply the following workaround.
Run the following command to see if the cluster is affected by this issue:
kubectl describe CephObjectStores dkp-object-store -n kommander
If the following output appears, this workaround needs to be applied:
Name:         dkp-object-store
Namespace:    kommander
...
Warning  ReconcileFailed  7m55s (x19 over 52m)  rook-ceph-object-controller  failed to reconcile CephObjectStore "kommander/dkp-object-store". failed to create object store deployments: failed to configure multisite for object store: failed create ceph multisite for object-store ["dkp-object-store"]: failed to commit config changes after creating multisite config for CephObjectStore "kommander/dkp-object-store": failed to commit RGW configuration period changes%!(EXTRA []string=[]): signal: interrupt
Kubectl exec into the rook-ceph-tools pod:
export WORKSPACE_NAMESPACE=<workspace namespace>
CEPH_TOOLS_POD=$(kubectl get pods -l app=rook-ceph-tools -n ${WORKSPACE_NAMESPACE} -o name)
kubectl exec -it -n ${WORKSPACE_NAMESPACE} $CEPH_TOOLS_POD bash
Run the following commands to set dkp-object-store as the default zonegroup. NOTE: The period update command may take a few minutes to complete.
radosgw-admin zonegroup default --rgw-zonegroup=dkp-object-store
radosgw-admin period update --commit
Next, restart the rook-ceph-operator deployment for the CephObjectStore to be reconciled:
kubectl rollout restart deploy -n ${WORKSPACE_NAMESPACE} rook-ceph-operator
After running the commands above, the CephObjectStore should be Connected once the rook-ceph operator reconciles the object (this may take some time):
kubectl wait CephObjectStore --for=jsonpath='{.status.phase}'=Connected dkp-object-store -n ${WORKSPACE_NAMESPACE} --timeout 10m
Post Upgrade, Volume Cannot Attach to a Node Already Attached to Another Node
Due to an upstream issue, when you bring a new node up during a Kubernetes version upgrade and then delete the old node, an existing volume might not attach to the new node.
You will see this when a new pod that uses a volume does not become ready on the new node, together with an event similar to Volume <pvc/pv-id> is already exclusively attached to one node and can’t be attached to another.
This will be fixed in a future Kubernetes release; for example, the vSphere case is described here. Different methods might be needed to resolve this manually, including this method for vSphere.
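One manual remediation often used in this situation, offered here as a general Kubernetes technique rather than a DKP-documented procedure (check your CSI driver's guidance first), is to locate and delete the stale VolumeAttachment that still binds the volume to the deleted node; the attachment name below is a placeholder:

kubectl get volumeattachments                              # find the attachment that still references the old node
kubectl delete volumeattachment <stale-attachment-name>    # lets the volume be attached to the new node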
DKP 2.4.0 to DKP 2.5.0 rook-ceph-cluster Helm Release Upgrade Error
If you see the rook-ceph-cluster HelmRelease with an error similar to this:

status:
  conditions:
  - lastTransitionTime: "2023-05-10T13:56:28Z"
    message: "Helm rollback failed: cannot patch \"dkp-ceph-cluster\" with kind CephCluster: Internal error occurred: failed calling webhook \"cephcluster-wh-rook-ceph-admission-controller-kommander.rook.io\": failed to call webhook: Post \"https://rook-ceph-admission-controller.kommander.svc:443/validate-ceph-rook-io-v1-cephcluster?timeout=5s\": dial tcp 10.99.16.90:443: connect: connection refused && cannot patch \"dkp-object-store\" with kind CephObjectStore: Internal error occurred: failed calling webhook \"cephobjectstore-wh-rook-ceph-admission-controller-kommander.rook.io\": failed to call webhook: Post \"https://rook-ceph-admission-controller.kommander.svc:443/validate-ceph-rook-io-v1-cephobjectstore?timeout=5s\": dial tcp 10.99.16.90:443: connect: connection refused\n\nLast Helm logs:\n\nPatch CephCluster \"dkp-ceph-cluster\" in namespace kommander\nerror updating the resource \"dkp-ceph-cluster\":\n\t cannot patch \"dkp-ceph-cluster\" with kind CephCluster: Internal error occurred: failed calling webhook \"cephcluster-wh-rook-ceph-admission-controller-kommander.rook.io\": failed to call webhook: Post \"https://rook-ceph-admission-controller.kommander.svc:443/validate-ceph-rook-io-v1-cephcluster?timeout=5s\": dial tcp 10.99.16.90:443: connect: connection refused\nPatch CephObjectStore \"dkp-object-store\" in namespace kommander\nerror updating the resource \"dkp-object-store\":\n\t cannot patch \"dkp-object-store\" with kind CephObjectStore: Internal error occurred: failed calling webhook \"cephobjectstore-wh-rook-ceph-admission-controller-kommander.rook.io\": failed to call webhook: Post \"https://rook-ceph-admission-controller.kommander.svc:443/validate-ceph-rook-io-v1-cephobjectstore?timeout=5s\": dial tcp 10.99.16.90:443: connect: connection refused\nwarning: Rollback \"rook-ceph-cluster\" failed: cannot patch \"dkp-ceph-cluster\" with kind CephCluster: Internal error occurred: failed calling webhook \"cephcluster-wh-rook-ceph-admission-controller-kommander.rook.io\": failed to call webhook: Post \"https://rook-ceph-admission-controller.kommander.svc:443/validate-ceph-rook-io-v1-cephcluster?timeout=5s\": dial tcp 10.99.16.90:443: connect: connection refused && cannot patch \"dkp-object-store\" with kind CephObjectStore: Internal error occurred: failed calling webhook \"cephobjectstore-wh-rook-ceph-admission-controller-kommander.rook.io\": failed to call webhook: Post \"https://rook-ceph-admission-controller.kommander.svc:443/validate-ceph-rook-io-v1-cephobjectstore?timeout=5s\": dial tcp 10.99.16.90:443: connect: connection refused"
    reason: RollbackFailed
    status: "False"
    type: Ready
  - lastTransitionTime: "2023-05-10T13:56:26Z"
    message: "Helm upgrade failed: cannot patch \"dkp-object-store\" with kind CephObjectStore: Internal error occurred: failed calling webhook \"cephobjectstore-wh-rook-ceph-admission-controller-kommander.rook.io\": failed to call webhook: Post \"https://rook-ceph-admission-controller.kommander.svc:443/validate-ceph-rook-io-v1-cephobjectstore?timeout=5s\": dial tcp 10.99.16.90:443: connect: connection refused\n\nLast Helm logs:\n\nPatch Ingress \"dkp-ceph-cluster-dashboard\" in namespace kommander\nPatch CephCluster \"dkp-ceph-cluster\" in namespace kommander\nPatch CephObjectStore \"dkp-object-store\" in namespace kommander\nerror updating the resource \"dkp-object-store\":\n\t cannot patch \"dkp-object-store\" with kind CephObjectStore: Internal error occurred: failed calling webhook \"cephobjectstore-wh-rook-ceph-admission-controller-kommander.rook.io\": failed to call webhook: Post \"https://rook-ceph-admission-controller.kommander.svc:443/validate-ceph-rook-io-v1-cephobjectstore?timeout=5s\": dial tcp 10.99.16.90:443: connect: connection refused\nwarning: Upgrade \"rook-ceph-cluster\" failed: cannot patch \"dkp-object-store\" with kind CephObjectStore: Internal error occurred: failed calling webhook \"cephobjectstore-wh-rook-ceph-admission-controller-kommander.rook.io\": failed to call webhook: Post \"https://rook-ceph-admission-controller.kommander.svc:443/validate-ceph-rook-io-v1-cephobjectstore?timeout=5s\": dial tcp 10.99.16.90:443: connect: connection refused"
    reason: UpgradeFailed
    status: "False"
    type: Released
  - lastTransitionTime: "2023-05-10T13:56:28Z"
    message: "Helm rollback failed: cannot patch \"dkp-ceph-cluster\" with kind CephCluster: Internal error occurred: failed calling webhook \"cephcluster-wh-rook-ceph-admission-controller-kommander.rook.io\": failed to call webhook: Post \"https://rook-ceph-admission-controller.kommander.svc:443/validate-ceph-rook-io-v1-cephcluster?timeout=5s\": dial tcp 10.99.16.90:443: connect: connection refused && cannot patch \"dkp-object-store\" with kind CephObjectStore: Internal error occurred: failed calling webhook \"cephobjectstore-wh-rook-ceph-admission-controller-kommander.rook.io\": failed to call webhook: Post \"https://rook-ceph-admission-controller.kommander.svc:443/validate-ceph-rook-io-v1-cephobjectstore?timeout=5s\": dial tcp 10.99.16.90:443: connect: connection refused\n\nLast Helm logs:\n\nPatch CephCluster \"dkp-ceph-cluster\" in namespace kommander\nerror updating the resource \"dkp-ceph-cluster\":\n\t cannot patch \"dkp-ceph-cluster\" with kind CephCluster: Internal error occurred: failed calling webhook \"cephcluster-wh-rook-ceph-admission-controller-kommander.rook.io\": failed to call webhook: Post \"https://rook-ceph-admission-controller.kommander.svc:443/validate-ceph-rook-io-v1-cephcluster?timeout=5s\": dial tcp 10.99.16.90:443: connect: connection refused\nPatch CephObjectStore \"dkp-object-store\" in namespace kommander\nerror updating the resource \"dkp-object-store\":\n\t cannot patch \"dkp-object-store\" with kind CephObjectStore: Internal error occurred: failed calling webhook \"cephobjectstore-wh-rook-ceph-admission-controller-kommander.rook.io\": failed to call webhook: Post \"https://rook-ceph-admission-controller.kommander.svc:443/validate-ceph-rook-io-v1-cephobjectstore?timeout=5s\": dial tcp 10.99.16.90:443: connect: connection refused\nwarning: Rollback \"rook-ceph-cluster\" failed: cannot patch \"dkp-ceph-cluster\" with kind CephCluster: Internal error occurred: failed calling webhook \"cephcluster-wh-rook-ceph-admission-controller-kommander.rook.io\": failed to call webhook: Post \"https://rook-ceph-admission-controller.kommander.svc:443/validate-ceph-rook-io-v1-cephcluster?timeout=5s\": dial tcp 10.99.16.90:443: connect: connection refused && cannot patch \"dkp-object-store\" with kind CephObjectStore: Internal error occurred: failed calling webhook \"cephobjectstore-wh-rook-ceph-admission-controller-kommander.rook.io\": failed to call webhook: Post \"https://rook-ceph-admission-controller.kommander.svc:443/validate-ceph-rook-io-v1-cephobjectstore?timeout=5s\": dial tcp 10.99.16.90:443: connect: connection refused"
    reason: RollbackFailed
    status: "False"
    type: Remediated
Reconcile the rook-ceph-cluster HelmRelease once the rook-ceph HelmRelease becomes ready, using the commands below. The upgrade then proceeds as normal.
export WORKSPACE_NAMESPACE=<workspace namespace>
kubectl -n ${WORKSPACE_NAMESPACE} patch helmrelease rook-ceph-cluster --type='json' -p='[{"op": "replace", "path": "/spec/suspend", "value": true}]'
kubectl -n ${WORKSPACE_NAMESPACE} patch helmrelease rook-ceph-cluster --type='json' -p='[{"op": "replace", "path": "/spec/suspend", "value": false}]'
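To know when the rook-ceph HelmRelease has become ready before reconciling rook-ceph-cluster, you can wait on its Ready condition; this is a suggested check rather than part of the documented workaround:

kubectl -n ${WORKSPACE_NAMESPACE} wait helmrelease rook-ceph --for=condition=Ready --timeout=10m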