DKP 2.5.1 Known Issues and Limitations
The following items are known issues with this release.
AWS additionalTags cannot contain spaces
Due to an upstream bug in the cluster-api-provider-aws component, it is not possible to specify tags with spaces in their name in the additionalTags section of an AWSCluster. If you have any tags like this during an upgrade of the capi-components, you may receive a validation error, and will need to remove any such tags. This issue will be corrected in a future DKP release.
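Before upgrading, it can help to look for tag names that contain spaces. The following is a minimal sketch rather than a DKP command: keys_with_spaces is a hypothetical helper that reads one tag name per line and prints the names containing a space. The commented kubectl call assumes the tags are stored under .spec.additionalTags and uses placeholder names.

```shell
# Hypothetical helper (not part of DKP): reads tag names, one per line,
# and prints every name that contains at least one space character.
keys_with_spaces() {
  awk 'index($0, " ") > 0 { print }'
}

# Possible way to feed it from a cluster (assumption: tags are stored under
# .spec.additionalTags; <cluster-name> and <namespace> are placeholders):
#   kubectl get awscluster <cluster-name> -n <namespace> \
#     -o go-template='{{range $k, $v := .spec.additionalTags}}{{$k}}{{"\n"}}{{end}}' \
#     | keys_with_spaces
```

Any name the helper prints would need to be removed or renamed before the capi-components upgrade.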
Use Static Credentials to Provision an Azure Cluster
Only static credentials can be used when provisioning an Azure cluster.
Containerd 1.4.13 File Limit Issue
In this version of DKP, we introduced containerd 1.6.17. The systemd unit for containerd 1.6.17 provided upstream removes all file number limits (LimitNOFILE=infinity). In our testing, we found that removing these limits broke some IO sensitive applications like Rook Ceph and HAProxy. Because of this, the KIB version included in this release sets the LimitNOFILE value in the containerd systemd unit to the value (1048576) that was used in previous containerd 1.4.13 version releases.
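To confirm which limit a node actually runs with, the value can be read back from systemd. This is a sketch under assumptions: check_nofile_limit is a hypothetical helper, and the unit is assumed to be named containerd.

```shell
# Hypothetical helper: reads a "LimitNOFILE=<value>" line on stdin and reports
# whether the value matches the 1048576 limit described above.
check_nofile_limit() {
  awk -F= '$1 == "LimitNOFILE" {
    if ($2 == "1048576") { print "ok: " $2 } else { print "unexpected: " $2 }
  }'
}

# On a node (assumption: the systemd unit is named "containerd"):
#   systemctl show containerd --property=LimitNOFILE | check_nofile_limit
```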
Intermittent Error Status when Creating EKS Clusters in the UI
When provisioning an EKS cluster through the UI, you may see a brief error state because the EKS cluster can sporadically lose connectivity with the management cluster. This results in the following symptoms:
The UI shows the cluster is in an error state.
The kubeconfig generated and retrieved from Kommander ceases to work.
Applications created on the management cluster may not be immediately federated to managed EKS clusters.
After a few moments, the error resolves without any action on your part. A new kubeconfig generated and retrieved from Kommander then works properly, and the UI shows that the cluster is working again. In the meantime, you can continue to use the UI to work on the cluster, for example to deploy applications, create projects, and add roles.
Installation Issue in Pre-provisioned Environments
An issue with Rook Ceph’s deployment prevents pre-provisioned environments from installing this DKP version. To solve this issue, you must set up a minimum of 40 GB of raw storage for your worker nodes and customize your Rook Ceph installation as indicated in Install Kommander in a Pre-provisioned Environment.
Resolve issues with failed HelmReleases
An issue with the Flux helm-controller can cause HelmReleases to fail with the error message "Helm upgrade failed: another operation (install/upgrade/rollback) is in progress". This can happen when the helm-controller is restarted while a HelmRelease is still upgrading or installing.
Workaround
To ensure the HelmRelease error was caused by the helm-controller restarting, first try to suspend/resume the HelmRelease:
kubectl -n <namespace> patch helmrelease <HELMRELEASE_NAME> --type='json' -p='[{"op": "replace", "path": "/spec/suspend", "value": true}]'
kubectl -n <namespace> patch helmrelease <HELMRELEASE_NAME> --type='json' -p='[{"op": "replace", "path": "/spec/suspend", "value": false}]'
You should see the HelmRelease attempt to reconcile, after which it either succeeds (with status: Release reconciliation succeeded) or fails with the same error as before. If it succeeds, no further action is needed.
If the HelmRelease is still in the failed state, the failure is likely related to the helm-controller restarting. The steps below use the 'reloader' HelmRelease as the example of a stuck release.
To resolve the issue, follow these steps:
List secrets containing the affected HelmRelease name:
kubectl get secrets -n ${NAMESPACE} | grep reloader
The output should look like this:
kommander-reloader-reloader-token-9qd8b    kubernetes.io/service-account-token   3   171m
sh.helm.release.v1.kommander-reloader.v1   helm.sh/release.v1                    1   171m
sh.helm.release.v1.kommander-reloader.v2   helm.sh/release.v1                    1   117m
In this example, sh.helm.release.v1.kommander-reloader.v2 is the most recent revision.
Find and delete the most recent revision secret, for example, sh.helm.release.v1.*.<revision>:
kubectl delete secret -n <namespace> <most recent helm revision secret name>
Suspend and resume the HelmRelease to trigger a reconciliation:
kubectl -n <namespace> patch helmrelease <HELMRELEASE_NAME> --type='json' -p='[{"op": "replace", "path": "/spec/suspend", "value": true}]'
kubectl -n <namespace> patch helmrelease <HELMRELEASE_NAME> --type='json' -p='[{"op": "replace", "path": "/spec/suspend", "value": false}]'
You should see the HelmRelease reconcile, and the upgrade or install eventually succeeds.
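When a release has many revisions, picking the most recent secret by eye is error-prone, because v10 sorts before v2 alphabetically. The sketch below compares the revision suffix numerically. latest_helm_revision_secret is a hypothetical helper, and it assumes the secret names follow the sh.helm.release.v1.<release>.v<revision> pattern shown in the example output above.

```shell
# Hypothetical helper: reads `kubectl get secrets` output on stdin, keeps the
# rows whose name is sh.helm.release.v1.<release>.v<revision>, and prints the
# name with the highest numeric revision.
latest_helm_revision_secret() {
  awk -v prefix="sh.helm.release.v1.$1.v" '
    BEGIN { max = 0 }
    index($1, prefix) == 1 {
      rev = substr($1, length(prefix) + 1)
      # Compare revisions as numbers so that v10 ranks above v9.
      if (rev ~ /^[0-9]+$/ && rev + 0 > max) { max = rev + 0; name = $1 }
    }
    END { if (name != "") print name }
  '
}

# Possible usage with the example release from this section:
#   kubectl get secrets -n ${NAMESPACE} | latest_helm_revision_secret kommander-reloader
```

The helper only prints a name. Deleting the secret remains a separate, deliberate step.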
Limitations to Disk Resizing in vSphere
The DKP CLI flags --control-plane-disk-size and --worker-disk-size are unable to resize the root file system of VMs created using OS images. The flags work by resizing the primary disk of the VM. When the VM boots, the root file system is expanded to fill the disk, but that expansion does not work for some file systems, for example, for file systems contained in an LVM Logical Volume. To ensure your root file system has the size you expect, please see Create a vSphere Base OS Image | Disk-Size.
Error Status in Grafana Logging Dashboard with EKS Clusters
Currently, it is not possible to use FluentBit to collect Admin-level logs on a managed EKS cluster.
If you have these logs enabled, the following message appears when you access the Kubernetes Audit Dashboard in the Grafana Logging Dashboard:
Cannot read properties of undefined (reading '0')
Logging Operator Upgrade Error
There is a race condition that could result in the logging-operator-logging-fluentd using the incorrect image tag during upgrade from DKP 2.4.0 to DKP 2.5.0.
The image tag is eventually corrected by the logging-operator; however, due to the nature of StatefulSets, the failing pod needs to be removed in order for the StatefulSet to continue rolling out the required updates.
Run this command to find if any Fluentd pods are in the ImagePullBackOff state post-upgrade:
kubectl get pod -l app.kubernetes.io/name=fluentd,app.kubernetes.io/managed-by=logging-operator-logging,app.kubernetes.io/component=fluentd -n kommander
If ImagePullBackOff is present in the output like in the example below, you will need to continue with these steps to resolve the issue.
NAME                                 READY   STATUS             RESTARTS   AGE
logging-operator-logging-fluentd-0   3/3     Running            0          6m41s
logging-operator-logging-fluentd-1   2/3     ImagePullBackOff   0          2m33s
Delete the Fluentd pod that is in an ImagePullBackOff state. In this case, it is logging-operator-logging-fluentd-1:
kubectl delete pod -n kommander logging-operator-logging-fluentd-1
The upgrade of the logging-operator-logging-fluentd StatefulSet now proceeds as normal.
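The check in the first step can be narrowed to just the affected pod names. A minimal sketch, assuming the default kubectl get pod column layout (NAME, READY, STATUS, RESTARTS, AGE); pods_in_image_pull_backoff is a hypothetical helper name.

```shell
# Hypothetical helper: reads default `kubectl get pod` output on stdin,
# skips the header row, and prints the name of every pod whose STATUS
# column is ImagePullBackOff.
pods_in_image_pull_backoff() {
  awk 'NR > 1 && $3 == "ImagePullBackOff" { print $1 }'
}

# Possible usage with the selector from this section:
#   kubectl get pod -n kommander \
#     -l app.kubernetes.io/name=fluentd,app.kubernetes.io/managed-by=logging-operator-logging,app.kubernetes.io/component=fluentd \
#     | pods_in_image_pull_backoff
```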
Nodepools Update Error with Knative
The following only applies to your environment if you have Knative installed and if its deployment is scaled to fewer than 5 pods.
An issue with the PodDisruptionBudget resource blocks the deletion of old nodes when upgrading from DKP 2.4.0 to DKP 2.5.0, which results in a failure of the DKP nodepools upgrade.
If the dkp update nodepool command fails, check whether any PodDisruptionBudget shows ALLOWED DISRUPTIONS equal to 0 by using the following command:
kubectl get pdb -n knative-serving
NAME            MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
activator-pdb   80%             N/A               0                     22h
webhook-pdb     80%             N/A               0                     22h
Obtain the list of pods in the knative-serving namespace that are covered by these PDBs with the following command:
kubectl get pods -n knative-serving -l 'app in (webhook, activator)'
The output should look similar to the following:
NAME                       READY   STATUS    RESTARTS   AGE
webhook-XXXXXXXXX-XXXX     2/2     Running   0          5d21h
activator-XXXXXXXXX-XXXX   2/2     Running   0          4d23h
Delete the pods that are covered by the PodDisruptionBudget resources:
kubectl delete pod -n knative-serving activator-XXXX-XXX webhook-XXXXX-XXXX
The upgrade of DKP and Knative now proceeds as normal. Re-run the dkp update nodepool command.
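The PodDisruptionBudget check above can be reduced to the names of the budgets that currently allow no disruptions. This is a sketch, assuming the default kubectl get pdb column layout shown in this section; blocked_pdbs is a hypothetical helper name.

```shell
# Hypothetical helper: reads default `kubectl get pdb` output on stdin.
# Data rows have five whitespace-separated fields:
#   NAME, MIN AVAILABLE, MAX UNAVAILABLE, ALLOWED DISRUPTIONS, AGE
# so the fourth field holds the allowed disruptions. The header row is skipped.
blocked_pdbs() {
  awk 'NR > 1 && $4 == "0" { print $1 }'
}

# Possible usage:
#   kubectl get pdb -n knative-serving | blocked_pdbs
```

An empty result suggests that the nodepool update failure has a different cause.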
Rook Ceph Install Error
An issue may emerge when installing rook-ceph on vSphere clusters using RHEL operating systems.
This issue occurs during the initial installation of rook-ceph and causes the object store used by Velero and Grafana Loki to be unavailable. If the installation of the Kommander component of DKP is unsuccessful due to rook-ceph failing, you might need to apply the following workaround.
Run the following command to see if the cluster is affected by this issue.
kubectl describe CephObjectStores dkp-object-store -n kommander
If the following output appears, this workaround needs to be applied:
Name:         dkp-object-store
Namespace:    kommander
...
Warning  ReconcileFailed  7m55s (x19 over 52m)  rook-ceph-object-controller  failed to reconcile CephObjectStore "kommander/dkp-object-store". failed to create object store deployments: failed to configure multisite for object store: failed create ceph multisite for object-store ["dkp-object-store"]: failed to commit config changes after creating multisite config for CephObjectStore "kommander/dkp-object-store": failed to commit RGW configuration period changes%!(EXTRA []string=[]): signal: interrupt
Use kubectl exec to open a shell in the rook-ceph-tools pod:
export WORKSPACE_NAMESPACE=<workspace namespace>
CEPH_TOOLS_POD=$(kubectl get pods -l app=rook-ceph-tools -n ${WORKSPACE_NAMESPACE} -o name)
kubectl exec -it -n ${WORKSPACE_NAMESPACE} $CEPH_TOOLS_POD -- bash
Run the following commands to set dkp-object-store as the default zonegroup.
NOTE: The period update command may take a few minutes to complete.
radosgw-admin zonegroup default --rgw-zonegroup=dkp-object-store
radosgw-admin period update --commit
Next, restart the rook-ceph-operator deployment for the CephObjectStore to be reconciled.
kubectl rollout restart deploy -n ${WORKSPACE_NAMESPACE} rook-ceph-operator
After running the commands above, the CephObjectStore should be Connected once the rook-ceph operator reconciles the object (this may take some time).
kubectl wait CephObjectStore --for=jsonpath='{.status.phase}'=Connected dkp-object-store -n ${WORKSPACE_NAMESPACE} --timeout 10m
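The detection step at the start of this procedure can be scripted by searching the describe output for the failure text. A minimal sketch: needs_rgw_workaround is a hypothetical helper, and the search string is taken from the example event shown in this section.

```shell
# Hypothetical helper: reads `kubectl describe` output on stdin and exits
# with status 0 when the RGW period-commit failure from this section is
# present, and with a non-zero status otherwise.
needs_rgw_workaround() {
  grep -q 'failed to commit RGW configuration period changes'
}

# Possible usage:
#   if kubectl describe CephObjectStores dkp-object-store -n kommander | needs_rgw_workaround; then
#     echo "CephObjectStore shows the known failure; apply the workaround."
#   fi
```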
Post Upgrade, Volume Cannot Attach to a Node Already Attached to Another Node
Due to an upstream issue, when you bring a new node up during a Kubernetes version upgrade and then delete the old node, an existing volume might not attach to the new node.
The symptom is a new pod that uses a volume and does not become ready on the new node, accompanied by an event similar to: Volume <pvc/pv-id> is already exclusively attached to one node and can’t be attached to another.
This will be fixed in a future Kubernetes release; for example, the issue is described for vSphere here. Different methods might be needed to resolve this manually, including this method to resolve on vSphere.
Control Plane Upgrade - Oracle Linux node-feature-discovery Fails
During an upgrade on Oracle Linux, node-feature-discovery fails to terminate, so the control-plane upgrade process can freeze and then fail after the timeout.
Follow these steps to avoid this issue:
View the machines for your cluster:
kubectl get machines
Identify the machine that has the faulty node-feature-discovery pod, and then delete that machine:
kubectl delete machine <faulty-machine-name-here>
A replacement machine is created and reconciled on your cluster.
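Identifying the machine usually means mapping the node that hosts the stuck pod back to a Machine object. The sketch below shows one way to do that lookup. It assumes the Machine records its node under .status.nodeRef.name, as Cluster API Machines generally do. machine_for_node is a hypothetical helper, and the names in the usage comment are placeholders.

```shell
# Hypothetical helper: reads a two-column table (machine name, node name)
# with a header row on stdin and prints the machine whose node matches $1.
machine_for_node() {
  awk -v node="$1" 'NR > 1 && $2 == node { print $1 }'
}

# Possible usage (assumption: the node name is taken from the NODE column of
# `kubectl get pods -A -o wide` for the stuck node-feature-discovery pod):
#   kubectl get machines \
#     -o custom-columns=MACHINE:.metadata.name,NODE:.status.nodeRef.name \
#     | machine_for_node <node-name>
```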