Improve scalability of the EC2/Route53 controllers #2029

justinmir · 2024-04-01T18:04:04Z

Reddit uses provider-aws primarily to manage EC2 and Route53 resources. In the future we may adopt it for other AWS managed services. Each of these provider-aws controllers currently manages several thousand resources: 3000+ EC2 instances and 6000+ Route53 records.

We are currently running provider-aws version 0.46.

What problem are you facing?

At this scale we run into issues with high queue depth in our controllers, even with 20+ active workers (provider-aws --max-reconcile-rate=20).

Figure 1: Crossplane controller is unable to work through queue of reconcile requests even with 20 active workers at our scale.

Reconciles for EC2 instances typically take more than one second at the median and up to 6 seconds at p99.

Figure 2: Reconcile time for instance resources

Without any jitter, ResourceRecordSet observations can cause a backlog that takes up to an hour to resolve. AWS rate limits Route53 API requests to 5 requests per second per account, which makes it extremely easy to hit the limit when observing Route53 resources. These queue depths persist even with the poll interval set to 30 minutes (up from 1 minute) in our fork.

Figure 3: Resource record set controller is backlogged during instance observation.

How could Crossplane help solve your problem?

Allow configuring per-resource poll interval / jitter
Introduce jitter for resource record sets. Jitter smooths the request rate caused by observations and minimizes the impact of the rate limit. We currently hard-code jitter in our provider-aws fork.

Reduce reconcile time for EC2 instance resources
Reduce the time spent on an EC2 instance observation by (1) removing unnecessary API calls that fetch duplicate data and (2) parallelizing API calls where possible.

davimmt · 2024-06-07T14:43:50Z

Mainly looking forward to the Route53 Record observation fixes. API calls should be merged into large ResourceChangeBatches and sent together.

Using xpkg.upbound.io/upbound/provider-aws-route53:v1.2.1 while dealing with hundreds of applications produces: failed to observe the resource: [{0 reading Route 53 Record (ZONEID_DOMAINNAME_A): Throttling: Rate exceeded status code: 400, request id: RID []}].

github-actions · 2024-09-06T02:18:12Z

Crossplane does not currently have enough maintainers to address every issue and pull request. This issue has been automatically marked as stale because it has had no activity in the last 90 days. It will be closed in 14 days if no further activity occurs. Leaving a comment starting with /fresh will mark this issue as not stale.

justinmir · 2024-09-13T21:57:09Z

/fresh

github-actions · 2024-12-13T02:43:05Z

Crossplane does not currently have enough maintainers to address every issue and pull request. This issue has been automatically marked as stale because it has had no activity in the last 90 days. It will be closed in 14 days if no further activity occurs. Leaving a comment starting with /fresh will mark this issue as not stale.

justinmir added the enhancement ("New feature or request") label Apr 1, 2024

justinmir mentioned this issue Apr 1, 2024: Reduce ec2 instance controller API calls for an observation #2028 (Merged)

max-melentyev mentioned this issue Apr 4, 2024: Allow to override options for specific controllers #2030 (Closed)

max-melentyev mentioned this issue Apr 23, 2024: Add monitor for event-based reconciliation #2048 (Draft)

github-actions bot added the stale label Sep 6, 2024

github-actions bot removed the stale label Sep 14, 2024

github-actions bot added the stale label Dec 13, 2024
