Adds (Prometheus) ServiceMonitor integration #16

Merged on Mar 13, 2024 (7 commits)

@mshivanna
Contributor, commented Feb 20, 2024

This adds (Prometheus) ServiceMonitor integration via values, notably serviceMonitor.enabled.

Here are example values, integration tested via CI:

serviceMonitor:
  enabled: true
  interval: 1s
  scrapeTimeout: 1s
  namespace: ci-monitoring

This was very tricky to test due to:

  • using resources created out-of-band and in a different namespace (kube-prometheus-stack)
  • indirection between service monitor and values in prometheus, such as target scrapePool
  • unpredictable amount of time between creating k8s configuration and it converging.

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| serviceMonitor.enabled | bool | false | Creates a ServiceMonitor to scrape /prometheus. Requires prometheus-operator |
| serviceMonitor.namespace | string | override or release namespace | Namespace to create the service monitor in |
| serviceMonitor.labels | object | {} | Additional metadata labels |
| serviceMonitor.interval | string | Prometheus global scrape interval | How often to scrape /prometheus. e.g. '5s' |
| serviceMonitor.scrapeTimeout | string | Prometheus global scrape timeout | Timeout for scraping metrics. e.g. '10s' |
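
Since the description notes that the time between creating k8s configuration and it converging is unpredictable, a fixed sleep in a test can be fragile. The following is a minimal sketch of the usual alternative, polling against a deadline. The names `wait_until` and `check` are hypothetical and are not part of this chart or its CI.

```python
import time
from typing import Callable


def wait_until(check: Callable[[], bool], timeout_s: float = 60.0,
               interval_s: float = 1.0,
               sleep: Callable[[float], None] = time.sleep,
               clock: Callable[[], float] = time.monotonic) -> bool:
    """Poll check() until it returns True or timeout_s elapses.

    Returns True on success and False on timeout. sleep and clock are
    injectable so the loop can be exercised without real waiting.
    """
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

A test could pass a `check` that queries Prometheus and returns True once a zipkin series appears, so the test finishes as soon as the scrape converges instead of waiting a fixed time.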

@codefromthecrypt
Member

thanks for the start!

CI says

Error: INSTALLATION FAILED: unable to build kubernetes objects from release manifest: resource mapping not found for name: "zipkin-4rrm0lckxg" namespace: "zipkin-4rrm0lckxg" from "": no matches for kind "ServiceMonitor" in version "monitoring.coreos.com/v1"

along with the change here, we need corresponding bits in the schema file and README
to make sure it works, we should also have ci/serviceMonitor-values.yaml

@codefromthecrypt changed the title from "[zipkin-helm] option to enable serviceMonitor for zipkin" to "option to enable serviceMonitor" on Feb 20, 2024
@mshivanna
Contributor Author

ok will fix it

Signed-off-by: Adrian Cole <[email protected]>
@codefromthecrypt
Member

ok I fixed the things I mentioned and pushed

@codefromthecrypt
Member

tests pass, but I want to see if there's any way to actually test it (vs normal helm chart tests which just make sure it doesn't crash)

@codefromthecrypt
Member

OK, so the current status is that ct install passes unless you actually try to use this. This is one reason why I wanted to make sure there is an integration test. @mshivanna can you take a look and see what might be the issue? Basically, the test can reach the prometheus query endpoint, but there is no data in it from zipkin even after you wait.

# This uses prometheus-operated in the ci-monitoring namespace, from helmfile.yaml.
# Note: The query API returns HTTP 200 on empty, so we grep to ensure something returned.
# See https://prometheus.io/docs/prometheus/latest/querying/api/
args: [ 'sleep 5 && wget -q -O - http://prometheus-operated.ci-monitoring.svc.cluster.local:9090/api/v1/query?query=http_server_requests_seconds_max | grep zipkin' ]
Member

you can take out the '|grep zipkin' temporarily here to see that the query works, but no data is returned. You might also want to check '/api/v1/targets?scrapePool=zipkin'
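
Because the query API answers HTTP 200 with `"status":"success"` even when nothing matched, a check has to look inside the result. Here is a small sketch of that idea as an alternative to piping through grep. The function name `has_series_for` is hypothetical, and it assumes the standard instant-query response shape in which each result carries a `metric` label map.

```python
import json


def has_series_for(body: str, needle: str = "zipkin") -> bool:
    """Return True if an instant-query response holds at least one series
    with a label value containing needle.

    An empty result still has status "success", so the status alone
    does not prove that anything was scraped.
    """
    doc = json.loads(body)
    if doc.get("status") != "success":
        return False
    results = doc.get("data", {}).get("result", [])
    return any(
        needle in str(value)
        for series in results
        for value in series.get("metric", {}).values()
    )
```

Unlike grep over the raw body, this only matches label values, so a stray "zipkin" elsewhere in the payload would not count as data.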

@codefromthecrypt
Member

so this is the part that fails because no data is returned. it didn't fail due to incorrect endpoint, as that would show something in the console. What failed was the 'grep'

   get-prometheus-query:
    Container ID:  containerd://5fa1db9ae46769445d59949f3c925a424e9f4f35fc8195ebd8fa2120f7140486
    Image:         ghcr.io/openzipkin/alpine:3.19.1
    Image ID:      ghcr.io/openzipkin/alpine@sha256:0269536c808330211eeb9d952ecfc262699038e90162fcb412d7c9ae102061a9
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/sh
      -c
    Args:
      sleep 5 && wget -q -O - http://prometheus-operated.ci-monitoring.svc.cluster.local:9090/api/v1/query?query=http_server_requests_seconds_max | grep zipkin
    State:          Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Mon, 11 Mar 2024 09:41:04 +0000
      Finished:     Mon, 11 Mar 2024 09:41:09 +0000

@codefromthecrypt
Member

pushed a commit that will pass only because it no longer validates.. to help someone with fresh eyes have a look at what might be up.

@codefromthecrypt
Member

so you can see here that there is data in the prom endpoint on zipkin, but it isn't being scraped for some reason.. or made available to prom. That's the problem to solve! Details in the last workflow run

 ==> Logs of container zipkin-ljnno241yb-test-connection
------------------------------------------------------------------------------------------------------------------------
--snip--
http_server_requests_seconds_max{method="GET",status="200",uri="/api/v2/services",} 0.022805119
# HELP jvm_threads_peak_threads The peak live thread count since the Java virtual machine started or peak was reset
# TYPE jvm_threads_peak_threads gauge
jvm_threads_peak_threads 13.0
--snip--
------------------------------------------------------------------------------------------------------------------------
<== Logs of container zipkin-ljnno241yb-test-connection
------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------------------
==> Logs of container zipkin-ljnno241yb-test-connection
------------------------------------------------------------------------------------------------------------------------
{"status":"success","data":{"resultType":"vector","result":[]}}
------------------------------------------------------------------------------------------------------------------------
<== Logs of container zipkin-ljnno241yb-test-connection
------------------------------------------------------------------------------------------------------------------------
========================================================================================================================

@codefromthecrypt
Member

and in case it helps, here's the yaml produced by helm install zipkin charts/zipkin --values charts/zipkin/ci/serviceMonitor-values.yaml

# Source: zipkin/templates/servicemonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: zipkin
  namespace: default
  labels:
    helm.sh/chart: zipkin-0.2.2
    app.kubernetes.io/name: zipkin
    app.kubernetes.io/instance: zipkin
    app.kubernetes.io/version: "3.1.1"
    app.kubernetes.io/managed-by: Helm
spec:
  endpoints:
  - port: http-query
    path: '/prometheus'
    interval: 1s
    scrapeTimeout: 2s
  selector:
    matchLabels:
        app.kubernetes.io/name: zipkin
        app.kubernetes.io/instance: zipkin
  namespaceSelector:
    matchNames:
      - default

and after port forwarding like kubectl port-forward service/prometheus-operated 9090:9090 -n ci-monitoring

zipkin doesn't show up in the scrape pool

curl -s localhost:9090/api/v1/targets?scrapePool=zipkin|jq .
{
  "status": "success",
  "data": {
    "activeTargets": [],
    "droppedTargets": [],
    "droppedTargetCounts": {
      "serviceMonitor/ci-monitoring/prometheus-stack-kube-prom-apiserver/0": 0,
      "serviceMonitor/ci-monitoring/prometheus-stack-kube-prom-coredns/0": 9,
      "serviceMonitor/ci-monitoring/prometheus-stack-kube-prom-kube-controller-manager/0": 10,
      "serviceMonitor/ci-monitoring/prometheus-stack-kube-prom-kube-etcd/0": 10,
      "serviceMonitor/ci-monitoring/prometheus-stack-kube-prom-kube-proxy/0": 10,
      "serviceMonitor/ci-monitoring/prometheus-stack-kube-prom-kube-scheduler/0": 10,
      "serviceMonitor/ci-monitoring/prometheus-stack-kube-prom-kubelet/0": 9,
      "serviceMonitor/ci-monitoring/prometheus-stack-kube-prom-kubelet/1": 9,
      "serviceMonitor/ci-monitoring/prometheus-stack-kube-prom-kubelet/2": 9,
      "serviceMonitor/ci-monitoring/prometheus-stack-kube-prom-operator/0": 5,
      "serviceMonitor/ci-monitoring/prometheus-stack-kube-prom-prometheus/0": 5,
      "serviceMonitor/ci-monitoring/prometheus-stack-kube-prom-prometheus/1": 5,
      "serviceMonitor/ci-monitoring/prometheus-stack-kube-state-metrics/0": 5
    }
  }
}
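
One way to read a targets response like the one above programmatically is sketched below. The function name `scrape_pool_summary` is hypothetical. The interpretation is an assumption: since `droppedTargetCounts` lists pools even with a count of 0, a pool name that never appears probably means Prometheus has no scrape config for it at all, rather than having dropped its targets.

```python
import json


def scrape_pool_summary(body: str, needle: str = "zipkin") -> dict:
    """Summarize where needle appears in an /api/v1/targets response."""
    data = json.loads(body).get("data", {})
    # Active targets carry the name of the scrape pool they belong to.
    active = [
        target.get("scrapePool", "")
        for target in data.get("activeTargets", [])
        if needle in target.get("scrapePool", "")
    ]
    # droppedTargetCounts is keyed by scrape pool name.
    pools = sorted(
        pool for pool in data.get("droppedTargetCounts", {}) if needle in pool
    )
    return {"active_targets": len(active), "pools_with_drop_counts": pools}
```

For the response shown in this comment, both fields come back empty for "zipkin", which points at the ServiceMonitor not being picked up rather than at a failing scrape.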

@codefromthecrypt
Member

thought I got it, but I didn't. I noticed prometheus .spec.serviceMonitorSelector setup in helmfile.yaml is 'release: prometheus-stack' and added that label, but yeah didn't work anyway.
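
For context on the selector mentioned here, this sketch shows Kubernetes `matchLabels` semantics: every key/value pair in the selector must be present on the object. The function name `matches_labels` is hypothetical. As the comment says, adding the label did not fix the problem on its own, so this illustrates one necessary condition, not the root cause.

```python
def matches_labels(match_labels: dict, labels: dict) -> bool:
    """Return True if every selector pair is present in labels with an
    equal value (Kubernetes matchLabels semantics)."""
    return all(labels.get(key) == value for key, value in match_labels.items())


# Labels from the rendered ServiceMonitor earlier in this thread.
rendered = {
    "helm.sh/chart": "zipkin-0.2.2",
    "app.kubernetes.io/name": "zipkin",
    "app.kubernetes.io/instance": "zipkin",
    "app.kubernetes.io/version": "3.1.1",
    "app.kubernetes.io/managed-by": "Helm",
}
selector = {"release": "prometheus-stack"}

without_label = matches_labels(selector, rendered)
with_label = matches_labels(selector, {**rendered, "release": "prometheus-stack"})
```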

help wanted!

@codefromthecrypt added the "help wanted" label on Mar 11, 2024
@codefromthecrypt removed the "help wanted" label on Mar 12, 2024
@codefromthecrypt self-assigned this on Mar 12, 2024
Signed-off-by: Adrian Cole <[email protected]>
@codefromthecrypt changed the title from "option to enable serviceMonitor" to "Adds service" on Mar 12, 2024
@codefromthecrypt changed the title from "Adds service" to "Adds (Prometheus) ServiceMonitor integration" on Mar 12, 2024
Member

@codefromthecrypt left a comment

This is ready to go. PTAL @anuraaga @reta to see if you understand my notes

@codefromthecrypt requested a review from @reta on March 12, 2024 08:12
@@ -0,0 +1,47 @@
{{- /*
Copyright 2024 The OpenZipkin Authors
Member

note: later we can switch everything to SPDX, so I didn't do it in this PR

Signed-off-by: Adrian Cole <[email protected]>
serviceMonitor:
  enabled: true
  interval: 1s
  scrapeTimeout: 1s
Member

P.S. I don't know if this is normal in k8s, but if you make an invalid config, like scrapeTimeout > interval, the ServiceMonitor will be created but will never be processed. You end up having to look at the prometheus-operator pod logs to figure it out. I don't know if this is a bug or the norm. If someone thinks this is a bug, it probably needs to be raised upstream, as hours were lost over this.
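
Since an invalid combination fails silently, a pre-flight check on the values can save time. The sketch below is hypothetical (the names `duration_ms` and `scrape_settings_valid` are not part of the chart) and assumes durations are written in the Prometheus style, such as '1s' or '1m30s'. Note that the rendered manifest earlier in this thread shows interval 1s with scrapeTimeout 2s, which is exactly the case this would flag.

```python
import re

# Milliseconds per unit; a year is treated as 365 days.
_UNITS_MS = {
    "ms": 1, "s": 1_000, "m": 60_000, "h": 3_600_000,
    "d": 86_400_000, "w": 604_800_000, "y": 31_536_000_000,
}
_TOKEN = re.compile(r"(\d+)(ms|s|m|h|d|w|y)")


def duration_ms(text: str) -> int:
    """Parse a duration such as '1s' or '1m30s' into milliseconds."""
    pos, total = 0, 0
    for match in _TOKEN.finditer(text):
        if match.start() != pos:  # unparseable characters between tokens
            raise ValueError(f"invalid duration: {text!r}")
        total += int(match.group(1)) * _UNITS_MS[match.group(2)]
        pos = match.end()
    if not text or pos != len(text):  # empty input or trailing garbage
        raise ValueError(f"invalid duration: {text!r}")
    return total


def scrape_settings_valid(interval: str, scrape_timeout: str) -> bool:
    """Return True when the scrape timeout does not exceed the interval."""
    return duration_ms(scrape_timeout) <= duration_ms(interval)
```

A check like this could run in CI against ci/serviceMonitor-values.yaml before installing, so the failure shows up in the test output rather than only in operator logs.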

- name: prometheus-community
  url: https://prometheus-community.github.io/helm-charts

# Prometheus is too much to configure manually in a test yaml. We need the CRD

in a test yaml so we use helm.

Maybe, I wasn't quite sure what the intention is of this comment, made a guess

Member

good guess

Member

thanks, rewrote!

Signed-off-by: Adrian Cole <[email protected]>
@codefromthecrypt codefromthecrypt merged commit 67d383f into openzipkin:master Mar 13, 2024
1 check passed
@codefromthecrypt
Member

thanks for the idea and initial commit @mshivanna! thanks for the review help here and behind the curtain @anuraaga!
