Use better defaults for poll interval and max reconcile rate #203
Conversation
Some metrics from the experiments:

- 1m poll interval, max reconcile rate 10: (1k Objects, 10k Objects)
- 1m poll interval, max reconcile rate 100: (1k Objects, 10k Objects)
- 10m poll interval, max reconcile rate 100: (1k Objects, 10k Objects)

One thing to note here is that it looks like we don't get clear signals on responsiveness from metrics like `workqueue_depth`. I suspect that crossplane-runtime's rate-limiting reconciler drops objects from the queue without really processing them, which prevents observing the number of items waiting to be processed (I'll create a separate issue for that).
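For illustration, the rate-limited reconciler pattern referred to here looks roughly like the sketch below. This is a sketch of the pattern only, not the exact crossplane-runtime source, and the type names are made up:

```go
package ratelimited

import (
	"context"

	"k8s.io/client-go/util/workqueue"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// Reconciler wraps an inner reconciler with a rate limiter. When an item
// arrives faster than the limiter allows, Reconcile returns immediately with
// RequeueAfter, so the item leaves the workqueue without being processed and
// workqueue_depth never reflects the real backlog.
type Reconciler struct {
	inner   reconcile.Reconciler
	limiter workqueue.RateLimiter
}

func (r *Reconciler) Reconcile(ctx context.Context, req reconcile.Request) (reconcile.Result, error) {
	// Ask the limiter how long this request should wait before running.
	if wait := r.limiter.When(req); wait > 0 {
		// Rate limited: drop out of the queue now and come back later.
		return reconcile.Result{RequeueAfter: wait}, nil
	}
	r.limiter.Forget(req)
	return r.inner.Reconcile(ctx, req)
}
```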
Thanks @turkenh for the in-depth performance checks. This change seems sensible to me based on the results.
> Using the setup linked below, I created 1000 and 10000 Objects with different poll interval and max-reconcile-rate settings. After giving some time for system to stabilize

I have a question about this: "After giving some time for system to stabilize". Did you also notice differences in the time needed for the system to stabilize? I would assume it would be faster with the new values, but maybe we would also be hitting k8s api-server throttling due to more requests?
I realized you can see this on the last graphs, so nvm :)
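As an aside on the throttling question: before the API server itself pushes back, providers usually hit client-go's client-side limits, which default to roughly 5 QPS with a burst of 10. A minimal sketch of raising them, assuming controller-runtime's standard config helper (the values below are illustrative):

```go
package main

import (
	"k8s.io/client-go/rest"
	ctrl "sigs.k8s.io/controller-runtime"
)

// tunedConfig raises client-go's client-side throttling limits so a higher
// reconcile rate isn't bottlenecked by the client before requests ever
// reach the API server. The QPS/Burst values are illustrative.
func tunedConfig() (*rest.Config, error) {
	cfg, err := ctrl.GetConfig()
	if err != nil {
		return nil, err
	}
	cfg.QPS = 50
	cfg.Burst = 100
	return cfg, nil
}
```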
After further testing, I found a problem with just bumping the poll interval. It works fine for the default, but with a longer interval a resource that is not yet ready also waits the full interval before it is reconciled again. I believe we need to find a way to differentiate the poll interval from initial readiness. Maybe something to implement in the managed reconciler, e.g. if the object is not ready, requeue with a shorter delay; otherwise fall back to the poll interval.
@turkenh the managed resource reconciler itself has no concept of MR readiness. That's implemented inside each controller. For example:

```go
WithPollIntervalHook(func(mg resource.Managed, interval time.Duration) time.Duration {
	if mg.GetCondition(xpv1.TypeReady).Status == corev1.ConditionTrue {
		return interval // If it's ready, use the poll interval from the flag.
	}
	return 30 * time.Second // If it's not, requeue sooner.
})
```

For this provider I think it would be ideal to support watches rather than relying on polling. I'd certainly feel better about a long poll interval like 10m if I knew it was really just a backup to a watch. That said, I imagine it won't be straightforward to build, given that right now our Kubernetes clients don't persist across reconciles. I imagine we'd have to switch to using a pool of clients.
Thanks @negz, this is exactly where I found myself after checking the managed reconciler code. Great that we have this flexibility already. The only thing I didn't like is that the pollJitter implementation already uses that function, so I had to leak that code here.
internal/controller/object/object.go (Outdated)
```go
if mg.GetCondition(xpv1.TypeReady).Status == v1.ConditionTrue {
	// This is the same as runtime default poll interval with jitter, see:
	// https://github.com/crossplane/crossplane-runtime/blob/7fcb8c5cad6fc4abb6649813b92ab92e1832d368/pkg/reconciler/managed/reconciler.go#L573
	return pollInterval + time.Duration((rand.Float64()-0.5)*2*float64(pollJitter)) //#nosec G404 -- no need for secure randomness
```
```suggestion
	return pollInterval + time.Duration((rand.Float64()-0.5)*2*float64(pollJitter)) //nolint:gosec // No need for secure randomness
```
Nit: let's use golint style disable comments here (I know this was just copied from upstream).
Done. Also applied jitter to the 30-second requeue as well.
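For reference, the resulting hook presumably looks roughly like the sketch below, with the jitter expression from the diff applied on both branches. This is an illustration, not the exact merged code, and it assumes the provider's existing pollJitter value:

```go
// Assumed imports:
//
//	"math/rand"
//	"time"
//
//	corev1 "k8s.io/api/core/v1"
//	xpv1 "github.com/crossplane/crossplane-runtime/apis/common/v1"
//	"github.com/crossplane/crossplane-runtime/pkg/reconciler/managed"
//	"github.com/crossplane/crossplane-runtime/pkg/resource"
var pollIntervalHook = managed.WithPollIntervalHook(func(mg resource.Managed, pollInterval time.Duration) time.Duration {
	jitter := func(d time.Duration) time.Duration {
		return d + time.Duration((rand.Float64()-0.5)*2*float64(pollJitter)) //nolint:gosec // No need for secure randomness.
	}
	if mg.GetCondition(xpv1.TypeReady).Status == corev1.ConditionTrue {
		return jitter(pollInterval) // Ready: poll at the configured interval.
	}
	return jitter(30 * time.Second) // Not ready yet: requeue sooner.
})
```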
I think that's fine.
Description of your changes
We observe that this provider struggles to process managed resources beyond a certain (not that high) scale, no matter how much CPU or memory we allocate to the process. After some investigation, we noticed two primary reasons for this:

- The `--max-reconcile-rate` parameter, which is used to configure some other internal parameters, prevents processing more than ~10 resources per second when set to 10.
- The `--poll` parameter, with 1m as the default, adds all the resources back to the queue for drift detection after 1m.

We never ran scaling experiments for this provider; rather, we just started with some numbers. This PR tweaks the defaults for the poll interval and max reconcile rate parameters based on the scaling experiments below (see the sketch after this paragraph for how the two flags are typically wired into the controller options).
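A minimal sketch of that wiring, following the pattern common in Crossplane providers' main.go (names follow crossplane-runtime; the exact wiring in this provider may differ slightly):

```go
package main

import (
	"time"

	"github.com/crossplane/crossplane-runtime/pkg/controller"
	"github.com/crossplane/crossplane-runtime/pkg/logging"
	"github.com/crossplane/crossplane-runtime/pkg/ratelimiter"
)

// buildOptions shows the typical mapping from the two CLI flags to the
// controller options: --max-reconcile-rate bounds both concurrency and the
// global reconciles-per-second budget, while --poll sets the drift-detection
// requeue period.
func buildOptions(log logging.Logger, pollInterval time.Duration, maxReconcileRate int) controller.Options {
	return controller.Options{
		Logger:                  log,
		PollInterval:            pollInterval,
		MaxConcurrentReconciles: maxReconcileRate,
		GlobalRateLimiter:       ratelimiter.NewGlobal(maxReconcileRate),
	}
}
```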
Using the setup linked below, I created 1000 and 10000 Objects with different poll interval and max-reconcile-rate settings. After giving some time for the system to stabilize, I created one more Object and recorded the time until it was first processed, i.e. Time until First Reconcile, as a measure of system responsiveness.

Please note, the proposed defaults for the two configurations are also aligned with Upbound's official providers, e.g. provider-aws.
I have:

- Run `make reviewable test` to ensure this PR is ready for review.

How has this code been tested
See the experiment setup here: https://github.com/turkenh/provider-kubernetes-scalability