Scaling the Logging Stack
Introduction
Depending on the application workloads you run on your clusters, you may find that the default settings for the DKP logging stack do not meet your needs. In particular, if your workloads produce lots of log traffic, you may find you need to adjust the logging stack components to properly capture all the log traffic. Follow the suggestions below to tune the logging stack components as needed.
Logging Operator
In a high log traffic environment, fluentd usually becomes the bottleneck of the logging stack. According to https://banzaicloud.com/docs/one-eye/logging-operator/operation/scaling/:
The typical sign of this is when fluentd cannot handle its buffer directory size growth for more than the configured or calculated (timekey + timekey_wait) flush interval.
For metrics to monitor, please refer to https://docs.fluentd.org/monitoring-fluentd/monitoring-prometheus#metrics-to-monitor.
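The sign described above can be checked against a series of buffer size samples. The following is a minimal sketch, not part of DKP or fluentd: the function names and the sample format (timestamp in seconds, buffer size in bytes) are hypothetical, and it assumes you already collect the buffer directory size at regular intervals.

```python
from typing import List, Tuple

# Each sample is (timestamp_seconds, buffer_size_bytes), oldest first.
Sample = Tuple[float, int]


def growth_streak_seconds(samples: List[Sample]) -> float:
    """Return how long the buffer has been strictly growing, measured
    backwards from the most recent sample."""
    if len(samples) < 2:
        return 0.0
    end_time = samples[-1][0]
    start_time = end_time
    # Walk backwards while each sample is larger than the one before it.
    for i in range(len(samples) - 1, 0, -1):
        prev_time, prev_size = samples[i - 1]
        _, size = samples[i]
        if size <= prev_size:
            break
        start_time = prev_time
    return end_time - start_time


def fluentd_is_falling_behind(
    samples: List[Sample], timekey_s: float, timekey_wait_s: float
) -> bool:
    """True if the buffer has grown for longer than the flush interval
    (timekey + timekey_wait)."""
    flush_interval = timekey_s + timekey_wait_s
    return growth_streak_seconds(samples) > flush_interval
```

For example, with a timekey of 60 seconds and a timekey_wait of 30 seconds, a buffer that has grown without interruption for three minutes would be flagged, while one that shrank at some point within the last 90 seconds would not.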
Grafana dashboard
In DKP, if the Prometheus Monitoring (kube-prometheus-stack) platform application is enabled, you can view the Logging Operator dashboard in the Grafana UI.
You can also improve fluentd throughput by disabling buffering for the loki clusterOutput.
Example Configuration
You can see an example configuration of the logging operator in Logging Stack Application Sizing Recommendations.
Grafana Loki
DKP deploys Loki in microservices mode, which provides the highest flexibility in terms of scaling.
In a high log traffic environment, we recommend:
Consider the Ingester first when scaling up.
Scale up the Distributor only when the existing Distributor is under stress due to high computing resource usage.
Usually, the number of Distributor pods should be much lower than the number of Ingester pods.
Grafana dashboard
In DKP, if the Prometheus Monitoring (kube-prometheus-stack) platform application is enabled, you can view the Loki dashboards in the Grafana UI.
Example Configuration
You can see an example config of Loki at Logging Stack Application Sizing Recommendations.
For more information, refer to:
https://grafana.com/docs/loki/latest/fundamentals/architecture/components/
https://grafana.com/docs/loki/latest/operations/scalability/
Rook Ceph
Ceph is the default S3 storage provider. In DKP, a Rook Ceph Operator and a Rook Ceph Cluster are deployed together to provide a Ceph Cluster.
Storage
The default configuration of the Rook Ceph Cluster in DKP has a 33% overhead in data storage for redundancy. For example, if the data disks allocated to your Rook Ceph Cluster total 1000 GB, about 750 GB is available to store your data. Account for this overhead when planning the capacity of your data disks, so that you do not run out of space.
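The capacity arithmetic can be sketched as follows. This is an illustration only: the function names are hypothetical, and it assumes the 33% overhead is measured relative to the stored data (one unit of redundancy for every three units of data), which is the reading that matches the 1000-to-750 example above.

```python
def usable_capacity_gb(raw_gb: float, overhead_ratio: float = 1 / 3) -> float:
    """Approximate space left for data on disks of the given raw size,
    assuming redundancy costs overhead_ratio times the stored data."""
    return raw_gb / (1 + overhead_ratio)


def raw_capacity_needed_gb(data_gb: float, overhead_ratio: float = 1 / 3) -> float:
    """Approximate raw disk size needed to store data_gb of data,
    under the same overhead assumption."""
    return data_gb * (1 + overhead_ratio)
```

Under this assumption, usable_capacity_gb(1000) gives about 750, and storing 1500 GB of logs would need about 2000 GB of raw disk.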
ObjectBucketClaim storage limit
ObjectBucketClaim has a storage limit option to prevent your S3 bucket from growing over a limit. In DKP this is enabled by default.
Thus, after you size up your Rook Ceph Cluster for more storage, it is important to also increase the storage limit of the ObjectBucketClaims for grafana-loki and/or project-grafana-loki.
To change it for grafana-loki, provide an override configmap in the rook-ceph-cluster platform app to override dkp.grafana-loki.maxSize.
To change it for project-grafana-loki, provide an override configmap in the project-grafana-loki platform app to override dkp.project-grafana-loki.maxSize.
Example Configuration
You can see an example config at Rook Ceph Cluster Sizing Recommendations.
Ceph OSD CPU considerations
ceph-osd is the object storage daemon for the Ceph distributed file system. It is responsible for storing objects on a local file system and providing access to them over the network.
If you determine that the Ceph OSD component is the bottleneck, then you may wish to consider increasing the CPU allocated to it.
See this page for more info: https://ceph.io/en/news/blog/2022/ceph-osd-cpu-scaling/
Grafana dashboard
In DKP, if the Prometheus Monitoring (kube-prometheus-stack) platform application is enabled, you can view the Ceph dashboards in the Grafana UI.
Audit Log
Overhead
Enabling audit logging requires additional computing and storage resources.
When you enable audit logging by enabling the kommander-fluent-bit AppDeployment, inbound log traffic to the logging stack increases approximately 3 to 4 times.
Thus, when enabling the audit log, consider scaling up all components in the logging stack mentioned above.
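For rough planning, the expected traffic range can be estimated from your current ingest rate. The sketch below is illustrative only: the function name and the MB/s unit are hypothetical, and it assumes the figure above means total inbound traffic becomes roughly 3 to 4 times the baseline.

```python
from typing import Tuple


def audit_log_traffic_range(
    baseline_mb_per_s: float,
    low_factor: float = 3.0,
    high_factor: float = 4.0,
) -> Tuple[float, float]:
    """Return the (low, high) estimate of inbound log traffic after
    enabling audit logging, given the current baseline rate."""
    return baseline_mb_per_s * low_factor, baseline_mb_per_s * high_factor
```

With a baseline of 2 MB/s, this gives a range of 6 to 8 MB/s, which you can compare against the headroom of fluentd, Loki, and Ceph before enabling the AppDeployment.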
Fine-tuning audit log Fluent Bit
If you only need a subset of the logs that the kommander-fluent-bit pods collect under the default configuration, you can add your own override configmap to kommander-fluent-bit with suitable Fluent Bit INPUT, FILTER, and OUTPUT settings. This helps reduce the audit log traffic.
To see the default configuration of Fluent Bit, see the Release Notes > Components and Applications.