You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
We have a setup where a central admin cluster running Argo CD is managing Applications on a fleet of workload clusters. Our central admin cluster connects to the worker clusters through Anthos Fleet and the Connect Gateway. The worker clusters are a mix of Anthos Bare Metal and GKE.
We ran into an issue where we hit the default Connect Gateway API quota with only two registered workload clusters and a handful of deployed Applications. Investigation showed that Argo CD was performing full API discovery requests on registered workload clusters multiple times per minute. Further investigation led us to this event loop in gitops-engine, where c.startMissingWatches() performs a non-cached discovery of the target cluster each time a CRD changes.
This turns out to be problematic for GKE clusters with Backup for GKE enabled, because the system-provided addon-manager will patch its CRDs very often:
Looking at the Connect Gateway API traffic you can see the sharp drop when we added a resource exclusion on gkebackup.gke.io/*
(PS: The last sudden spike was caused by us temporarily removing the resource exclusion)
For our use case, having gkebackup.gke.io/* excluded is totally fine. We contacted Google Support about the issue, and the rapidly patched CRDs is intended behaviour. Their immediate response to the chatty nature of Argo CD was to just raise the quota for affected customers.
Writing up this issue because it might not be very evident for people running Argo or Flux targeting GKE clusters unless they have good visibility into their API server traffic.
Hi!
We have a setup where a central admin cluster running Argo CD is managing Applications on a fleet of workload clusters. Our central admin cluster connects to the worker clusters through Anthos Fleet and the Connect Gateway. The worker clusters are a mix of Anthos Bare Metal and GKE.
We ran into an issue where we hit the default Connect Gateway API quota with only two registered workload clusters and a handful of deployed Applications. Investigation showed that Argo CD was performing full API discovery requests on registered workload clusters multiple times per minute. Further investigation led us to this event loop in gitops-engine, where
c.startMissingWatches()
performs a non-cached discovery of the target cluster each time a CRD changes.This turns out to be problematic for GKE clusters with Backup for GKE enabled, because the system-provided
addon-manager
will patch its CRDs very often:Looking at the Connect Gateway API traffic you can see the sharp drop when we added a resource exclusion on
gkebackup.gke.io/*
(PS: The last sudden spike was caused by us temporarily removing the resource exclusion)
For our use case, having
gkebackup.gke.io/*
excluded is totally fine. We contacted Google Support about the issue, and the rapidly patched CRDs is intended behaviour. Their immediate response to the chatty nature of Argo CD was to just raise the quota for affected customers.Writing up this issue because it might not be very evident for people running Argo or Flux targeting GKE clusters unless they have good visibility into their API server traffic.
Possibly related:
Update 05-31-2023:
Just heard back from the product team for Backup for GKE and a fix for the rapidly patched CRDs will be rolled out next week.
The text was updated successfully, but these errors were encountered: