Missing ContainerInsights Metrics from EKS Cluster when collection_interval is set to 300s #2887

Open
kaveri-s opened this issue Nov 8, 2024 · 0 comments

kaveri-s commented Nov 8, 2024

The Bug
As the title mentions, we are missing ContainerInsights metrics when collection_interval is set to 300s, particularly the CPU metrics.

We followed the instructions outlined in the documentation, resulting in a ConfigMap that looks similar to the EKS example infra file. Attaching only the ConfigMap here for reference:

apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-agent-conf
  namespace: aws-otel-eks
  labels:
    app: opentelemetry
    component: otel-agent-conf
data:
  otel-agent-config: |
    extensions:
      health_check:

    receivers:
      awscontainerinsightreceiver:
        collection_interval: 60s
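        # 60s is the working baseline; the missing CPU metrics appear when this is raised to 300s (see steps below)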

    processors:
      batch/metrics:
        timeout: 60s

    exporters:
      awsemf:
        namespace: ContainerInsights
        log_group_name: '/aws/containerinsights/{ClusterName}/performance'
        log_stream_name: '{NodeName}'
        resource_to_telemetry_conversion:
          enabled: true
        dimension_rollup_option: NoDimensionRollup
        parse_json_encoded_attr_values: [Sources, kubernetes]
        metric_declarations:
          # node metrics
          - dimensions: [[NodeName, InstanceId, ClusterName]]
            metric_name_selectors:
              - node_cpu_utilization
              - node_memory_utilization
              - node_network_total_bytes
              - node_cpu_reserved_capacity
              - node_memory_reserved_capacity
              - node_number_of_running_pods
              - node_number_of_running_containers
          - dimensions: [[ClusterName]]
            metric_name_selectors:
              - node_cpu_utilization
              - node_memory_utilization
              - node_network_total_bytes
              - node_cpu_reserved_capacity
              - node_memory_reserved_capacity
              - node_number_of_running_pods
              - node_number_of_running_containers
              - node_cpu_usage_total
              - node_cpu_limit
              - node_memory_working_set
              - node_memory_limit

          # pod metrics
          - dimensions: [[PodName, Namespace, ClusterName], [Service, Namespace, ClusterName], [Namespace, ClusterName], [ClusterName]]
            metric_name_selectors:
              - pod_cpu_utilization
              - pod_memory_utilization
              - pod_network_rx_bytes
              - pod_network_tx_bytes
              - pod_cpu_utilization_over_pod_limit
              - pod_memory_utilization_over_pod_limit
          - dimensions: [[PodName, Namespace, ClusterName], [ClusterName]]
            metric_name_selectors:
              - pod_cpu_reserved_capacity
              - pod_memory_reserved_capacity
          - dimensions: [[PodName, Namespace, ClusterName]]
            metric_name_selectors:
              - pod_number_of_container_restarts

          # cluster metrics
          - dimensions: [[ClusterName]]
            metric_name_selectors:
              - cluster_node_count
              - cluster_failed_node_count

          # service metrics
          - dimensions: [[Service, Namespace, ClusterName], [ClusterName]]
            metric_name_selectors:
              - service_number_of_running_pods

          # node fs metrics
          - dimensions: [[NodeName, InstanceId, ClusterName], [ClusterName]]
            metric_name_selectors:
              - node_filesystem_utilization

          # namespace metrics
          - dimensions: [[Namespace, ClusterName], [ClusterName]]
            metric_name_selectors:
              - namespace_number_of_running_pods

    service:
      pipelines:
        metrics:
          receivers: [awscontainerinsightreceiver]
          processors: [batch/metrics]
          exporters: [awsemf]

      extensions: [health_check]

Steps to reproduce
Set collection_interval to 60s and record all the metrics for at least 15 minutes.
Set collection_interval to 300s and record all the metrics for at least 15 minutes (the only change between runs is shown in the snippet below).
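
For clarity, the only difference between the two runs is the receiver's collection_interval; everything else in the ConfigMap above stays the same:

    receivers:
      awscontainerinsightreceiver:
        collection_interval: 300s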

Result
At an interval of 60 seconds, the number of CPU-related metrics and memory-related metrics is on par.
However, when the collection interval is set to 300s, there is a sharp decline in the number of CPU-related metrics compared to the memory-related metrics, sometimes down to 1 datapoint every 15 minutes.
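
One way to quantify the gap (this was not part of the original test runs; the cluster name and time window below are placeholders) is to compare the SampleCount of a CPU metric against a memory metric over the same window, since node_cpu_utilization is published with a [ClusterName] dimension set per the metric_declarations above:

aws cloudwatch get-metric-statistics \
  --namespace ContainerInsights \
  --metric-name node_cpu_utilization \
  --dimensions Name=ClusterName,Value=<cluster-name> \
  --start-time 2024-11-08T00:00:00Z --end-time 2024-11-08T00:15:00Z \
  --period 900 --statistics SampleCount

Re-running the same query with node_memory_utilization as the metric name gives the baseline to compare against.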

Additional context
We have worked with AWS Premium Support, and their team verified that they see the same issue on their test EKS cluster as well. We are creating an issue in this GitHub repo based on their recommendation.
