gotk_resource_info metric does not work with kube-prometheus-stack > 55.x #32

Closed
cdenneen opened this issue Apr 9, 2024 · 10 comments · Fixed by #33


cdenneen commented Apr 9, 2024

Currently, this works fine when deployed with 55.7.0, but when running 57.x or 58.x the only resource that shows up under gotk_resource_info is HelmRelease. (As far as I can tell, 57.2.1 contained a fix for Info that was broken in 57.2.0.)

kingdonb self-assigned this Apr 10, 2024
Member

kingdonb commented Apr 10, 2024

It looks like kube-prometheus-stack 56.x actually works as well. I just did a little bisect and I see what you're saying: a lot of metrics drop out with the upgrade to 57.x. I also tested 58.x, which doesn't fix it, for what it's worth. So there is something for us to work out here.

I will do a little more bisecting and try to find the specific changes in the 57.x chart that caused the breakage. ...

Here is a screenshot of the issue on the cluster stats dashboard... this one is 56.x working fine (57.0.0 also working fine, 57.1.0 also working fine, 57.1.1 also working fine):

Screenshot 2024-04-10 at 10 21 04 AM

then after upgrading to 57.2.0, clearly something is not working anymore:

Screenshot 2024-04-10 at 10 42 50 AM

I had a look in the chart sources to see what changed between 57.1.2 and 57.2.0. It was a bump of the kube-state-metrics chart dependency:

   - name: kube-state-metrics
-    version: "5.16.*"
+    version: "5.18.*"

So the changes that impacted us were in the kube-state-metrics chart, version 5.17 or 5.18.

To be a bit more specific, the kube-state-metrics chart was upgraded from 5.16.4 to 5.18.0.

Member

kingdonb commented Apr 10, 2024

I found that another Flux user has already reported this issue. It's a problem with kube-state-metrics breaking backwards compatibility, apparently unintentionally:

kubernetes/kube-state-metrics#2366

We should perhaps follow this issue:
kubernetes/kube-state-metrics#1973

But it looks like you already got some helpful suggestions here, from a kube-state-metrics maintainer:

prometheus-community/helm-charts#4410 (comment)

Please cross-reference any helpful notes in the future when reporting issues like this. We spend time reproducing reports, and knowing that the issue had already been acknowledged upstream would have saved the time spent bisecting and all that! ... So I followed the advice there, and found...

(Note that you should use v2.12.0, not 2.12.0, as the tag, or you'll get ImagePullBackOff errors, since 2.12.0 is not the name of a published tag.) I'm not recommending that you follow these steps in production, but you can confirm whether the issue is fixed this way rather than waiting for the next release only to find out it doesn't fix anything... I tested it locally and found that it seems to fix the issue, as suggested. I got some metrics back and the Flux Cluster Stats dashboard came back to life. But the issue is not totally fixed.

Member

stefanprodan commented Apr 10, 2024

@kingdonb can you confirm that using the latest release works?

kube-state-metrics:
  image:
    tag: "v2.12.0"

@kingdonb
Member

No, I tested with v2.12.0 last, and I found it might not work.

We are only reading the metrics for HelmReleases, for some reason, which is the issue as @cdenneen described it in his initial report. When I tested earlier versions, we didn't see any metrics at all for reconcilers.

stefanprodan changed the title from "Needs to be updated to support kube-prometheus-stack > 55.x" to "gotk_resource_info metric does not work with kube-prometheus-stack > 55.x" on Apr 10, 2024
@kingdonb
Member

The issue as described in the report:

Screenshot 2024-04-10 at 11 14 21 AM

Only HelmRelease metrics are scraped. So there is still some issue here in the latest kube-state-metrics; maybe we shouldn't declare victory yet. 😞

@kingdonb
Member

This looks wrong in the kube-state-metrics logs:

I0410 15:09:03.442832       1 builder.go:282] "Active resources" activeStoreNames="helm.toolkit.fluxcd.io/v2beta2, Resource=helmreleases,image.toolkit.fluxcd.io/v1beta1, Resource=imageupdateautomations,image.toolkit.fluxcd.io/v1beta2, Resource=imagepolicies,image.toolkit.fluxcd.io/v1beta2, Resource=imagerepositories,kustomize.toolkit.fluxcd.io/v1, Resource=kustomizations,notification.toolkit.fluxcd.io/v1, Resource=receivers,notification.toolkit.fluxcd.io/v1beta3, Resource=alerts,notification.toolkit.fluxcd.io/v1beta3, Resource=providers,source.toolkit.fluxcd.io/v1, Resource=gitrepositories,source.toolkit.fluxcd.io/v1beta2, Resource=buckets,source.toolkit.fluxcd.io/v1beta2, Resource=helmcharts,source.toolkit.fluxcd.io/v1beta2, Resource=helmrepositories,source.toolkit.fluxcd.io/v1beta2, Resource=ocirepositories"

I'll compare this to the output from an older version, but it looks like these resource kinds have all been mixed into the activeStoreNames for helm.toolkit.fluxcd.io/v2beta2.

@kingdonb
Member

That doesn't appear to be any different from the old output with kube-state-metrics v2.10.1, which worked...

I0410 15:29:08.729802       1 builder.go:271] "Active resources" activeStoreNames="helm.toolkit.fluxcd.io/v2beta2, Resource=helmreleases,image.toolkit.fluxcd.io/v1beta1, Resource=imageupdateautomations,image.toolkit.fluxcd.io/v1beta2, Resource=imagepolicies,image.toolkit.fluxcd.io/v1beta2, Resource=imagerepositories,kustomize.toolkit.fluxcd.io/v1, Resource=kustomizations,notification.toolkit.fluxcd.io/v1, Resource=receivers,notification.toolkit.fluxcd.io/v1beta3, Resource=alerts,notification.toolkit.fluxcd.io/v1beta3, Resource=providers,source.toolkit.fluxcd.io/v1, Resource=gitrepositories,source.toolkit.fluxcd.io/v1beta2, Resource=buckets,source.toolkit.fluxcd.io/v1beta2, Resource=helmcharts,source.toolkit.fluxcd.io/v1beta2, Resource=helmrepositories,source.toolkit.fluxcd.io/v1beta2, Resource=ocirepositories"

I can tell you that kube-prometheus-stack chart 57.1.1 with kube-state-metrics chart 5.16.4 was the last one that worked, but that's about as much as I can ascertain from where I'm sitting. Going further would mean spending a bunch more time delving into prom-stack, which I don't have right now, unfortunately.

Contributor

speer commented Apr 16, 2024

We worked around this issue by upgrading kube-state-metrics to v2.12.0 and by changing the "help" text in the customResourceState config so that each kind has a unique text, for example:
from "The current state of a GitOps Toolkit resource."
to "The current state of a Flux Kustomization resource.", "The current state of a Flux HelmRelease resource.", ...

See also: kubernetes/kube-state-metrics#2366 (comment)
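
For readers who want to try the workaround described above, here is a minimal sketch of how it might look in the kube-prometheus-stack values. It assumes gotk_resource_info is defined once per kind under kube-state-metrics.customResourceState.config. Only the help strings and the image tag come from this thread. The two kinds shown, the API versions (taken from the log output earlier in this issue) and the label paths are illustrative assumptions, not a copy of the repository's actual config.

```yaml
kube-state-metrics:
  image:
    # Use the published tag name, with the "v" prefix.
    tag: "v2.12.0"
  customResourceState:
    enabled: true
    config:
      spec:
        resources:
          - groupVersionKind:
              group: kustomize.toolkit.fluxcd.io
              version: v1
              kind: Kustomization
            metricNamePrefix: gotk
            metrics:
              - name: "resource_info"
                # Unique per kind, instead of the shared
                # "The current state of a GitOps Toolkit resource."
                help: "The current state of a Flux Kustomization resource."
                each:
                  type: Info
                  info:
                    labelsFromPath:
                      name: [ metadata, name ]
                # Illustrative labels only (assumption).
                labelsFromPath:
                  exported_namespace: [ metadata, namespace ]
                  ready: [ status, conditions, "[type=Ready]", status ]
          - groupVersionKind:
              group: helm.toolkit.fluxcd.io
              version: v2beta2
              kind: HelmRelease
            metricNamePrefix: gotk
            metrics:
              - name: "resource_info"
                help: "The current state of a Flux HelmRelease resource."
                each:
                  type: Info
                  info:
                    labelsFromPath:
                      name: [ metadata, name ]
                labelsFromPath:
                  exported_namespace: [ metadata, namespace ]
                  ready: [ status, conditions, "[type=Ready]", status ]
```

The same pattern would be repeated for every other Flux kind, each with its own help string. The metric name stays gotk_resource_info for all of them, so existing dashboards and queries should not need to change.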

@stefanprodan
Member

I see no issue with making the help text specific to each kind; it's actually better. @speer feel free to open a PR with your fix.

@kingdonb
Member

The changes in #33 are good and can be merged (but I don't have permission to merge them).
