Expected Behavior
The Cruise Control metrics collector should collect and publish metrics about the Kafka brokers.
Actual Behavior
The Cruise Control metrics collector crashes. The following appears once per minute in the logs of every broker:
[2023-08-16 14:28:31,040] WARN Failed reporting CPU util. (com.linkedin.kafka.cruisecontrol.metricsreporter.CruiseControlMetricsReporter)
java.io.FileNotFoundException: /sys/fs/cgroup/cpu/cpu.cfs_quota_us (No such file or directory)
    at java.base/java.io.FileInputStream.open0(Native Method)
    at java.base/java.io.FileInputStream.open(FileInputStream.java:219)
    at java.base/java.io.FileInputStream.<init>(FileInputStream.java:157)
    at java.base/java.io.FileInputStream.<init>(FileInputStream.java:112)
    at com.linkedin.kafka.cruisecontrol.metricsreporter.metric.ContainerMetricUtils.readFile(ContainerMetricUtils.java:62)
    at com.linkedin.kafka.cruisecontrol.metricsreporter.metric.ContainerMetricUtils.getCpuQuota(ContainerMetricUtils.java:42)
    at com.linkedin.kafka.cruisecontrol.metricsreporter.metric.ContainerMetricUtils.getContainerProcessCpuLoad(ContainerMetricUtils.java:92)
    at com.linkedin.kafka.cruisecontrol.metricsreporter.metric.MetricsUtils.getCpuMetric(MetricsUtils.java:409)
    at com.linkedin.kafka.cruisecontrol.metricsreporter.CruiseControlMetricsReporter.reportCpuUtils(CruiseControlMetricsReporter.java:449)
    at com.linkedin.kafka.cruisecontrol.metricsreporter.CruiseControlMetricsReporter.run(CruiseControlMetricsReporter.java:367)
    at java.base/java.lang.Thread.run(Thread.java:829)
This also has a side effect: Cruise Control doesn't seem to be able to deal with the fact that it is not getting these metrics. Its memory usage grows until it is eventually OOM-killed.
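For context, the path in the exception, /sys/fs/cgroup/cpu/cpu.cfs_quota_us, belongs to the cgroup v1 layout. On cgroup v2 the CPU quota and period live together in a single cpu.max file. The following is a minimal Python sketch of a reader that handles both layouts. It is not Cruise Control code and not a proposed patch; the function name read_cpu_quota and the cgroup_root parameter are made up for illustration, and it assumes the container's own cgroup is mounted at /sys/fs/cgroup.

```python
from pathlib import Path
from typing import Optional, Tuple


def read_cpu_quota(cgroup_root: str = "/sys/fs/cgroup") -> Optional[Tuple[int, int]]:
    """Return (quota_us, period_us), or None when no CPU limit can be read.

    Hypothetical helper for illustration only; not part of Cruise Control.
    """
    root = Path(cgroup_root)

    # cgroup v2: one file containing "<quota|max> <period>".
    v2_file = root / "cpu.max"
    if v2_file.is_file():
        quota, period = v2_file.read_text().split()
        return None if quota == "max" else (int(quota), int(period))

    # cgroup v1: the layout the stack trace above expects.
    v1_quota = root / "cpu" / "cpu.cfs_quota_us"
    v1_period = root / "cpu" / "cpu.cfs_period_us"
    if v1_quota.is_file() and v1_period.is_file():
        quota = int(v1_quota.read_text())
        period = int(v1_period.read_text())
        # A negative quota (-1) means "no limit" on cgroup v1.
        return None if quota < 0 else (quota, period)

    # Neither layout is present; report "unknown" instead of raising.
    return None
```

Running read_cpu_quota() inside a broker pod should show whether a quota is visible at all on that node. Returning None for a missing file, rather than raising, is a design choice of this sketch so that a caller can skip the metric.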
Affected Version
Seen on version 0.24.1.
Though this will be a problem on all versions where cruise.control.metrics.reporter.kubernetes.mode gets set to true.
Steps to Reproduce
Deploy a Kubernetes cluster with nodes that have cgroup v2.
Deploy koperator.
Deploy a basic KafkaCluster. Any configuration that also causes Cruise Control to be deployed should work.
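To confirm the first step, that a node (or a pod running on it) really uses cgroup v2, one option is to look for the cgroup.controllers file, which exists at the root of a cgroup v2 unified hierarchy. The Python sketch below does that. The function name cgroup_version is hypothetical, and treating a cpu directory under the root as the sign of cgroup v1 is an assumption of this sketch.

```python
from pathlib import Path
from typing import Optional


def cgroup_version(cgroup_root: str = "/sys/fs/cgroup") -> Optional[int]:
    """Best-effort guess of the cgroup version mounted at cgroup_root.

    Hypothetical helper for illustration only.
    """
    root = Path(cgroup_root)
    # The unified (v2) hierarchy exposes cgroup.controllers at its root.
    if (root / "cgroup.controllers").is_file():
        return 2
    # Assumption: v1 mounts each controller in its own directory, e.g. "cpu".
    if (root / "cpu").is_dir():
        return 1
    return None


if __name__ == "__main__":
    print(f"detected cgroup version: {cgroup_version()}")
```

If this reports 2 on the nodes running the brokers, the setup matches the conditions described in this issue.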
Thanks for reporting this, @robinvanderstraeten-klarrio! We've seen this behavior internally but didn't get the chance to create a dedicated GitHub issue.
Reading through the Cruise Control issue, it seems that simply removing the cruise.control.metrics.reporter.kubernetes.mode setting would fix this, but I'm not too knowledgeable about Cruise Control in general or about the impact this would have on a production deployment.
If this would be a good solution, I'd be happy to contribute it.
I don't think we should remove the cruise.control.metrics.reporter.kubernetes.mode configuration. It was added to resolve a CPU utilization reporting issue; see #463.
Perhaps the best way is to wait for upstream CC to fix their issue with cgroup v2 so we can adapt in Koperator.
Description
Cruise Control currently does not support running on a cluster with cgroup v2 when the configuration cruise.control.metrics.reporter.kubernetes.mode is set to true (see linkedin/cruise-control#1873).
Koperator always sets this to true (https://github.com/banzaicloud/koperator/blob/v0.25.1/pkg/resources/kafka/configmap.go#L105) and AFAIK, there is currently no way to override this configuration.