feat(prometheus): Removing prometheus recording rule metrics #3211

Merged · 3 commits · Sep 18, 2023
1 change: 1 addition & 0 deletions .changelog/3211.changed.txt
@@ -0,0 +1 @@
feat(prometheus): Removing prometheus recording rules
1 change: 0 additions & 1 deletion deploy/helm/sumologic/README.md
@@ -205,7 +205,6 @@ The following table lists the configurable parameters of the Sumo Logic chart an
| `kube-prometheus-stack.prometheus-node-exporter.nodeSelector` | Node selector for prometheus node exporter. [See docs/best-practices.md for more information.](/docs/best-practices.md) | `{}` |
| `kube-prometheus-stack.kube-state-metrics.nodeSelector` | Node selector for kube-state-metrics. [See docs/best-practices.md for more information.](/docs/best-practices.md) | `{}` |
| `kube-prometheus-stack.kube-state-metrics.image.tag` | Tag for kube-state-metrics Docker image. | `v2.7.0` |
- | `kube-prometheus-stack.additionalPrometheusRulesMap` | Custom recording or alerting rules to be deployed into the cluster | See [values.yaml] |
| `kube-prometheus-stack.commonLabels` | Labels to apply to all Kube Prometheus Stack resources | `{}` |
| `kube-prometheus-stack.coreDns.serviceMonitor.interval` | Core DNS metrics scrape interval. If not set, the Prometheus default scrape interval is used. | `Nil` |
| `kube-prometheus-stack.coreDns.serviceMonitor.metricRelabelings` | Core DNS MetricRelabelConfigs | See [values.yaml] |
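
The removed row documented `kube-prometheus-stack.additionalPrometheusRulesMap`, which this PR stops populating with default rules. The upstream kube-prometheus-stack chart still accepts the key, so custom recording or alerting rules can still be supplied from a user values override. A minimal sketch, assuming the upstream schema (the group and rule names here are hypothetical):

kube-prometheus-stack:
  additionalPrometheusRulesMap:
    my-custom-rules:
      groups:
        - name: my-node.rules
          rules:
            - record: "node:node_load1:avg"
              expr: avg by (node) (node_load1{job="node-exporter"})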
244 changes: 4 additions & 240 deletions deploy/helm/sumologic/values.yaml
@@ -741,7 +741,7 @@ kube-prometheus-stack:
kubelet: false
kubeProxy: false
kubePrometheusGeneral: false
- kubePrometheusNodeRecording: true
+ kubePrometheusNodeRecording: false
kubernetesApps: false
kubernetesResources: false
kubernetesStorage: false
@@ -750,205 +750,13 @@
kubeSchedulerRecording: false
kubeStateMetrics: false
network: false
- node: true
+ node: false
nodeExporterAlerting: false
nodeExporterRecording: false
prometheus: false
prometheusOperator: false
windows: false
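
These defaultRules toggles switch off the chart's built-in recording and alerting rule groups. Anyone still depending on Prometheus evaluating them can flip the flags back on from a user values file, as in this minimal sketch (assuming the standard kube-prometheus-stack defaultRules layout):

kube-prometheus-stack:
  defaultRules:
    rules:
      kubePrometheusNodeRecording: true
      node: true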

## k8s pre-1.14 prometheus recording rules
additionalPrometheusRulesMap:
pre-1.14-node-rules:
groups:
- name: node-pre-1.14.rules
rules:
- expr: sum(min(kube_pod_info) by (node))
record: ":kube_pod_info_node_count:"
- expr: 1 - avg(rate(node_cpu_seconds_total{job="node-exporter",mode="idle"}[1m]))
record: :node_cpu_utilisation:avg1m
- expr: |-
1 - avg by (node) (
rate(node_cpu_seconds_total{job="node-exporter",mode="idle"}[1m])
* on (namespace, pod) group_left(node)
node_namespace_pod:kube_pod_info:)
record: node:node_cpu_utilisation:avg1m
- expr: |-
1 -
sum(
node_memory_MemFree_bytes{job="node-exporter"} +
node_memory_Cached_bytes{job="node-exporter"} +
node_memory_Buffers_bytes{job="node-exporter"}
)
/
sum(node_memory_MemTotal_bytes{job="node-exporter"})
record: ":node_memory_utilisation:"
- expr: |-
sum by (node) (
(
node_memory_MemFree_bytes{job="node-exporter"} +
node_memory_Cached_bytes{job="node-exporter"} +
node_memory_Buffers_bytes{job="node-exporter"}
)
* on (namespace, pod) group_left(node)
node_namespace_pod:kube_pod_info:
)
record: node:node_memory_bytes_available:sum
- expr: |-
(node:node_memory_bytes_total:sum - node:node_memory_bytes_available:sum)
/
node:node_memory_bytes_total:sum
record: node:node_memory_utilisation:ratio
- expr: |-
1 -
sum by (node) (
(
node_memory_MemFree_bytes{job="node-exporter"} +
node_memory_Cached_bytes{job="node-exporter"} +
node_memory_Buffers_bytes{job="node-exporter"}
)
* on (namespace, pod) group_left(node)
node_namespace_pod:kube_pod_info:
)
/
sum by (node) (
node_memory_MemTotal_bytes{job="node-exporter"}
* on (namespace, pod) group_left(node)
node_namespace_pod:kube_pod_info:
)
record: "node:node_memory_utilisation:"
- expr: 1 - (node:node_memory_bytes_available:sum / node:node_memory_bytes_total:sum)
record: "node:node_memory_utilisation_2:"
- expr: |-
max by (instance, namespace, pod, device) ((node_filesystem_size_bytes{fstype=~"ext[234]|btrfs|xfs|zfs"}
- node_filesystem_avail_bytes{fstype=~"ext[234]|btrfs|xfs|zfs"})
/ node_filesystem_size_bytes{fstype=~"ext[234]|btrfs|xfs|zfs"})
record: "node:node_filesystem_usage:"
- expr: |-
sum by (node) (
node_memory_MemTotal_bytes{job="node-exporter"}
* on (namespace, pod) group_left(node)
node_namespace_pod:kube_pod_info:
)
record: node:node_memory_bytes_total:sum
- expr: |-
sum(irate(node_network_receive_bytes_total{job="node-exporter",device!~"veth.+"}[1m])) +
sum(irate(node_network_transmit_bytes_total{job="node-exporter",device!~"veth.+"}[1m]))
record: :node_net_utilisation:sum_irate
- expr: |-
sum by (node) (
(irate(node_network_receive_bytes_total{job="node-exporter",device!~"veth.+"}[1m]) +
irate(node_network_transmit_bytes_total{job="node-exporter",device!~"veth.+"}[1m]))
* on (namespace, pod) group_left(node)
node_namespace_pod:kube_pod_info:
)
record: node:node_net_utilisation:sum_irate
- expr: |-
sum(irate(node_network_receive_drop_total{job="node-exporter",device!~"veth.+"}[1m])) +
sum(irate(node_network_transmit_drop_total{job="node-exporter",device!~"veth.+"}[1m]))
record: :node_net_saturation:sum_irate
- expr: |-
sum by (node) (
(irate(node_network_receive_drop_total{job="node-exporter",device!~"veth.+"}[1m]) +
irate(node_network_transmit_drop_total{job="node-exporter",device!~"veth.+"}[1m]))
* on (namespace, pod) group_left(node)
node_namespace_pod:kube_pod_info:
)
record: node:node_net_saturation:sum_irate
- expr: |-
sum(node_load1{job="node-exporter"})
/
sum(node:node_num_cpu:sum)
record: ":node_cpu_saturation_load1:"
- expr: avg(irate(node_disk_io_time_weighted_seconds_total{job="node-exporter",device=~"nvme.+|rbd.+|sd.+|vd.+|xvd.+|dm-.+"}[1m]))
record: :node_disk_saturation:avg_irate
- expr: |-
avg by (node) (
irate(node_disk_io_time_weighted_seconds_total{job="node-exporter",device=~"nvme.+|rbd.+|sd.+|vd.+|xvd.+|dm-.+"}[1m])
* on (namespace, pod) group_left(node)
node_namespace_pod:kube_pod_info:
)
record: node:node_disk_saturation:avg_irate
- expr: avg(irate(node_disk_io_time_seconds_total{job="node-exporter",device=~"nvme.+|rbd.+|sd.+|vd.+|xvd.+|dm-.+"}[1m]))
record: :node_disk_utilisation:avg_irate
- expr: |-
avg by (node) (
irate(node_disk_io_time_seconds_total{job="node-exporter",device=~"nvme.+|rbd.+|sd.+|vd.+|xvd.+|dm-.+"}[1m])
* on (namespace, pod) group_left(node)
node_namespace_pod:kube_pod_info:
)
record: node:node_disk_utilisation:avg_irate
- expr: |-
1e3 * sum(
(rate(node_vmstat_pgpgin{job="node-exporter"}[1m])
+ rate(node_vmstat_pgpgout{job="node-exporter"}[1m]))
)
record: :node_memory_swap_io_bytes:sum_rate
- expr: |-
1e3 * sum by (node) (
(rate(node_vmstat_pgpgin{job="node-exporter"}[1m])
+ rate(node_vmstat_pgpgout{job="node-exporter"}[1m]))
* on (namespace, pod) group_left(node)
node_namespace_pod:kube_pod_info:
)
record: node:node_memory_swap_io_bytes:sum_rate
- expr: |-
node:node_cpu_utilisation:avg1m
*
node:node_num_cpu:sum
/
scalar(sum(node:node_num_cpu:sum))
record: node:cluster_cpu_utilisation:ratio
- expr: |-
(node:node_memory_bytes_total:sum - node:node_memory_bytes_available:sum)
/
scalar(sum(node:node_memory_bytes_total:sum))
record: node:cluster_memory_utilisation:ratio
- expr: |-
sum by (node) (
node_load1{job="node-exporter"}
* on (namespace, pod) group_left(node)
node_namespace_pod:kube_pod_info:
)
/
node:node_num_cpu:sum
record: "node:node_cpu_saturation_load1:"
- expr: |-
max by (instance, namespace, pod, device) (
node_filesystem_avail_bytes{fstype=~"ext[234]|btrfs|xfs|zfs"}
/
node_filesystem_size_bytes{fstype=~"ext[234]|btrfs|xfs|zfs"}
)
record: "node:node_filesystem_avail:"
- expr: |-
max(
max(
kube_pod_info{job="kube-state-metrics", host_ip!=""}
) by (node, host_ip)
* on (host_ip) group_right (node)
label_replace(
(
max(node_filesystem_files{job="node-exporter", mountpoint="/"})
by (instance)
), "host_ip", "$1", "instance", "(.*):.*"
)
) by (node)
record: "node:node_inodes_total:"
- expr: |-
max(
max(
kube_pod_info{job="kube-state-metrics", host_ip!=""}
) by (node, host_ip)
* on (host_ip) group_right (node)
label_replace(
(
max(node_filesystem_files_free{job="node-exporter", mountpoint="/"})
by (instance)
), "host_ip", "$1", "instance", "(.*):.*"
)
) by (node)
record: "node:node_inodes_free:"

## NOTE changing the serviceMonitor scrape interval to be >1m can result in missing metrics from
## recording rules and in empty panels in Sumo Logic Kubernetes apps.
kubeApiServer:
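
The deleted additionalPrometheusRulesMap block above held the pre-1.14 node recording rules. Because the raw node-exporter series are now forwarded instead (see the expanded keep regex below), the same aggregations can be computed at query time. For example, the former :node_cpu_utilisation:avg1m rule can be reproduced as an ad-hoc query using its own expression:

1 - avg(rate(node_cpu_seconds_total{job="node-exporter",mode="idle"}[1m]))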
@@ -1412,53 +1220,9 @@ kube-prometheus-stack:
remoteTimeout: 5s
writeRelabelConfigs:
- action: keep
- regex: node-exporter;(?:node_load1|node_load5|node_load15|node_cpu_seconds_total)
+ regex: node-exporter;(?:node_load1|node_load5|node_load15|node_cpu_seconds_total|node_disk_io_time_weighted_seconds_total|node_disk_io_time_seconds_total|node_vmstat_pgpgin|node_vmstat_pgpgout|node_memory_MemFree_bytes|node_memory_Cached_bytes|node_memory_Buffers_bytes|node_memory_MemTotal_bytes|node_network_receive_drop_total|node_network_transmit_drop_total|node_network_receive_bytes_total|node_network_transmit_bytes_total|node_filesystem_avail_bytes|node_filesystem_size_bytes)
sourceLabels: [job, __name__]
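
For context on how this keep rule works: Prometheus joins the sourceLabels values with the default ";" separator and keeps a sample only when the joined string matches regex in full (the match is anchored). A minimal sketch of the matching, assuming the default separator:

# job="node-exporter", __name__="node_load1"
#   joined "node-exporter;node_load1"              -> matches the regex, sample is kept
# job="node-exporter", __name__="node_boot_time_seconds"
#   joined "node-exporter;node_boot_time_seconds"  -> no match, sample is dropped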
## prometheus operator rules
## :kube_pod_info_node_count:
## :node_cpu_saturation_load1:
## :node_cpu_utilisation:avg1m
## :node_disk_saturation:avg_irate
## :node_disk_utilisation:avg_irate
## :node_memory_swap_io_bytes:sum_rate
## :node_memory_utilisation:
## :node_net_saturation:sum_irate
## :node_net_utilisation:sum_irate
## cluster_quantile:apiserver_request_duration_seconds:histogram_quantile
## cluster_quantile:scheduler_binding_duration_seconds:histogram_quantile
## cluster_quantile:scheduler_framework_extension_point_duration_seconds:histogram_quantile
## cluster_quantile:scheduler_e2e_scheduling_duration_seconds:histogram_quantile
## cluster_quantile:scheduler_scheduling_algorithm_duration_seconds:histogram_quantile
## instance:node_filesystem_usage:sum # no rules definition found
## instance:node_network_receive_bytes:rate:sum
## node:cluster_cpu_utilisation:ratio
## node:cluster_memory_utilisation:ratio
## node:node_cpu_saturation_load1:
## node:node_cpu_utilisation:avg1m
## node:node_disk_saturation:avg_irate
## node:node_disk_utilisation:avg_irate
## node:node_filesystem_avail:
## node:node_filesystem_usage:
## node:node_inodes_free:
## node:node_inodes_total:
## node:node_memory_bytes_total:sum
## node:node_memory_swap_io_bytes:sum_rate
## node:node_memory_utilisation:
## node:node_memory_utilisation:ratio
## node:node_memory_utilisation_2:
## node:node_net_saturation:sum_irate
## node:node_net_utilisation:sum_irate
## node:node_num_cpu:sum
## node_namespace_pod:kube_pod_info:
- url: http://$(METADATA_METRICS_SVC).$(NAMESPACE).svc.cluster.local.:9888/prometheus.metrics.operator.rule
remoteTimeout: 5s
writeRelabelConfigs:
- action: drop
regex: ^true$
sourceLabels: [_sumo_forward_]
- action: keep
regex: "cluster_quantile:apiserver_request_duration_seconds:histogram_quantile|instance:node_filesystem_usage:sum|instance:node_network_receive_bytes:rate:sum|cluster_quantile:scheduler_e2e_scheduling_duration_seconds:histogram_quantile|cluster_quantile:scheduler_scheduling_algorithm_duration_seconds:histogram_quantile|cluster_quantile:scheduler_binding_duration_seconds:histogram_quantile|cluster_quantile:scheduler_framework_extension_point_duration_seconds:histogram_quantile|node_namespace_pod:kube_pod_info:|:kube_pod_info_node_count:|node:node_num_cpu:sum|:node_cpu_utilisation:avg1m|node:node_cpu_utilisation:avg1m|node:cluster_cpu_utilisation:ratio|:node_cpu_saturation_load1:|node:node_cpu_saturation_load1:|:node_memory_utilisation:|node:node_memory_bytes_total:sum|node:node_memory_utilisation:ratio|node:cluster_memory_utilisation:ratio|:node_memory_swap_io_bytes:sum_rate|node:node_memory_utilisation:|node:node_memory_utilisation_2:|node:node_memory_swap_io_bytes:sum_rate|:node_disk_utilisation:avg_irate|node:node_disk_utilisation:avg_irate|:node_disk_saturation:avg_irate|node:node_disk_saturation:avg_irate|node:node_filesystem_usage:|node:node_filesystem_avail:|:node_net_utilisation:sum_irate|node:node_net_utilisation:sum_irate|:node_net_saturation:sum_irate|node:node_net_saturation:sum_irate|node:node_inodes_total:|node:node_inodes_free:"
sourceLabels: [__name__]
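
Users who still need these operator recording rule metrics after upgrading can re-enable the relevant rule groups (see the defaultRules sketch above) and re-add a remote write for them. A minimal sketch, assuming kube-prometheus-stack's prometheusSpec.additionalRemoteWrite field and that the $(METADATA_METRICS_SVC) and $(NAMESPACE) substitutions remain configured; the regex lists only two rule names as an illustration:

kube-prometheus-stack:
  prometheus:
    prometheusSpec:
      additionalRemoteWrite:
        - url: http://$(METADATA_METRICS_SVC).$(NAMESPACE).svc.cluster.local.:9888/prometheus.metrics.operator.rule
          remoteTimeout: 5s
          writeRelabelConfigs:
            - action: keep
              regex: "node:node_cpu_utilisation:avg1m|node:node_num_cpu:sum"
              sourceLabels: [__name__]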

## Nginx ingress controller metrics
## rel: https://docs.nginx.com/nginx-ingress-controller/logging-and-monitoring/prometheus/#available-metrics
## nginx_ingress_controller_ingress_resources_total