Kodiak missing from checks #874

Kodiak is missing from the checks on this PR: vercel/vercel#10133

Even though the automerge label is applied, it seems like it will never merge (see https://github.com/vercel/vercel/blob/main/.kodiak.toml#L4).

Is Kodiak down right now?

Comments
I've been looking at this for a while and I think the issue is the number of webhook events in the installation's queue. Kodiak has a single queue per installation to process all webhooks. Whenever a PR is merged, Kodiak enqueues all open PRs (up to 100 per repo) to recheck their merge status. So if you have repos with a large number of open PRs, a large number of PRs gets added to the queue. There are currently 321 PRs in the queue for evaluation, so Kodiak will take some time to work through them given GitHub API rate limits. I wonder if we can filter out more PRs here or increase concurrency (see lines 214 to 215 at d6f1720).
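For illustration, here is a minimal sketch of the fan-out described above, assuming a per-installation asyncio queue; the `client.list_open_pull_requests` helper is made up and this is not Kodiak's actual code:

```python
import asyncio
from dataclasses import dataclass


@dataclass(frozen=True)
class PullRequestRef:
    owner: str
    repo: str
    number: int


# One queue per GitHub App installation; every webhook handler feeds it.
installation_queue: "asyncio.Queue[PullRequestRef]" = asyncio.Queue()


async def on_pull_request_merged(client, owner: str, repos: list) -> None:
    """When a PR merges, re-enqueue every open PR (up to 100 per repo) for re-evaluation."""
    for repo in repos:
        # Hypothetical helper wrapping GET /repos/{owner}/{repo}/pulls?state=open&per_page=100
        open_prs = await client.list_open_pull_requests(owner, repo, per_page=100)
        for pr in open_prs:
            await installation_queue.put(PullRequestRef(owner, repo, pr["number"]))
```

With a few hundred open PRs across an organization's repos, a single merge can push hundreds of entries onto that queue, which matches the 321-PR backlog seen here.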
I think the move going forward is to update Kodiak to have more tasks working on the webhook queue. If we have rate limit issues, we can figure something else out.
Is it per repo or per org? The repo only has 56 open PRs, which is less than the "100 per repo".
I tested it and confirmed that all 56 are returned in a single API response.
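For reference, that check can be reproduced with a single paginated request; the `GITHUB_TOKEN` environment variable here is just a placeholder for whatever credential you use:

```python
import os

import requests

# GET /repos/{owner}/{repo}/pulls returns at most 100 items per page,
# so 56 open PRs come back in one response.
resp = requests.get(
    "https://api.github.com/repos/vercel/vercel/pulls",
    params={"state": "open", "per_page": 100},
    headers={"Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}"},
)
resp.raise_for_status()
print(len(resp.json()))  # 56 open PRs at the time of this comment
```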
It's per organization, so it can add up across repos.
Since PRs are technically Issues, you could change the API call to use the Issues endpoint with a labels filter. For example, this returns the 1 open PR that has the automerge label: https://api.github.com/repos/vercel/vercel/issues?per_page=100&labels=pr%3A+automerge Now that only works when the repo is configured to use an automerge label.
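Here is a sketch of that suggestion, using the Issues endpoint's server-side `labels` filter with the same label as the URL above (illustrative only; this is not necessarily how Kodiak queries GitHub):

```python
import os

import requests

# Pull requests also appear in the Issues API, and issues can be filtered by
# label server-side, so this returns only the open PRs carrying "pr: automerge".
resp = requests.get(
    "https://api.github.com/repos/vercel/vercel/issues",
    params={"state": "open", "per_page": 100, "labels": "pr: automerge"},
    headers={"Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}"},
)
resp.raise_for_status()
# Items representing PRs carry a "pull_request" key; plain issues do not.
prs = [item for item in resp.json() if "pull_request" in item]
print(len(prs))  # 1 at the time of this comment
```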
My plan is to increase the concurrency for webhook ingestion.
I've deployed a change to increase the concurrency for your organization to 5 ingest workers. I'm going to monitor the GitHub API rate limits and increase the worker count further if rate limiting isn't an issue.
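Roughly, more ingest workers just means more consumers draining the same per-installation queue; a minimal sketch, not the deployed code:

```python
import asyncio


async def ingest_worker(queue: asyncio.Queue, evaluate_pr) -> None:
    # Each worker takes the next queued PR reference and re-evaluates its merge status.
    while True:
        pr = await queue.get()
        try:
            await evaluate_pr(pr)
        finally:
            queue.task_done()


async def run_ingest_workers(queue: asyncio.Queue, evaluate_pr, worker_count: int = 5) -> None:
    # worker_count=1 is the single-consumer behavior; 5 matches the bump described above.
    workers = [asyncio.create_task(ingest_worker(queue, evaluate_pr)) for _ in range(worker_count)]
    await asyncio.gather(*workers)
```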
I realized yesterday that Kodiak throttles API requests to 5,000/hour by default. Since your organization is on GitHub Enterprise, Kodiak has a 15,000/hour rate limit available. I've deployed a change that enables your organization to utilize the full 15,000/hour limit.
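A minimal sketch of an hourly request budget like the one described (5,000/hour by default, 15,000/hour for the Enterprise installation); Kodiak's real throttle is certainly more involved than this:

```python
import asyncio
import time


class HourlyThrottle:
    """Cap outbound API calls to `limit` requests per hour (simplified)."""

    def __init__(self, limit: int = 5000) -> None:  # 15000 for GitHub Enterprise
        self.limit = limit
        self.window_start = time.monotonic()
        self.used = 0
        self._lock = asyncio.Lock()

    async def acquire(self) -> None:
        # Call before every API request; blocks if the hourly budget is spent.
        async with self._lock:
            now = time.monotonic()
            if now - self.window_start >= 3600:
                self.window_start, self.used = now, 0
            if self.used >= self.limit:
                await asyncio.sleep(3600 - (now - self.window_start))
                self.window_start, self.used = time.monotonic(), 0
            self.used += 1
```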
Wonderful, thanks! I'll close this issue for now and re-open if it happens again 👍
Re-opening since the problem is happening again. Kodiak shows nothing in the queue, but there is a PR that says it's 3rd in the queue, and it has been stuck there for about 12 minutes now.
It might be the rate limit. The logging around rate limits is lacking, so I'll turn down the internal Kodiak limit for now.
I added some better logging, and it seems like we're hitting a different rate limit. I'm seeing errors for GraphQL queries.
I think we can update our throttling logic to wait for the Retry-After time when we hit a secondary rate limit.
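That could look roughly like the following, assuming an httpx-style async client (a sketch, not Kodiak's implementation):

```python
import asyncio

import httpx


async def request_with_backoff(client: httpx.AsyncClient, method: str, url: str, **kwargs) -> httpx.Response:
    # On a secondary rate limit GitHub responds with 403 or 429 and a Retry-After
    # header; sleep for that long and retry instead of dropping the evaluation.
    while True:
        resp = await client.request(method, url, **kwargs)
        if resp.status_code in (403, 429) and "retry-after" in resp.headers:
            await asyncio.sleep(int(resp.headers["retry-after"]))
            continue
        return resp
```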
I'm actually unsure here. Looking at the rate limit response below, I think it may be because we're making many requests concurrently, which the GitHub docs explicitly call out as a no-no 😬: https://docs.github.com/en/rest/guides/best-practices-for-integrators?apiVersion=2022-11-28#dealing-with-secondary-rate-limits
We could wait 60 seconds after hitting this limit, but that's going to slow things down. Here's the current rate limit response:
{
"resources": {
"core": {
"limit": 15000,
"used": 3,
"remaining": 14997,
"reset": 1688135146
},
"search": {
"limit": 30,
"used": 0,
"remaining": 30,
"reset": 1688131627
},
"graphql": {
"limit": 12500,
"used": 120,
"remaining": 12380,
"reset": 1688135139
},
"integration_manifest": {
"limit": 5000,
"used": 0,
"remaining": 5000,
"reset": 1688135167
},
"source_import": {
"limit": 100,
"used": 0,
"remaining": 100,
"reset": 1688131627
},
"code_scanning_upload": {
"limit": 1000,
"used": 0,
"remaining": 1000,
"reset": 1688135167
},
"actions_runner_registration": {
"limit": 10000,
"used": 0,
"remaining": 10000,
"reset": 1688135167
},
"scim": {
"limit": 15000,
"used": 0,
"remaining": 15000,
"reset": 1688135167
},
"dependency_snapshots": {
"limit": 100,
"used": 0,
"remaining": 100,
"reset": 1688131627
},
"code_search": {
"limit": 10,
"used": 0,
"remaining": 10,
"reset": 1688131627
}
},
"rate": {
"limit": 15000,
"used": 3,
"remaining": 14997,
"reset": 1688135146
}
}
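That response comes from the `GET /rate_limit` endpoint; for reference, it can be fetched like this (the token is a placeholder for the installation's credential):

```python
import os

import requests

resp = requests.get(
    "https://api.github.com/rate_limit",
    headers={"Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}"},
)
resp.raise_for_status()
resources = resp.json()["resources"]
# Neither the core nor the graphql budget is close to exhausted here, which
# points at the secondary limit, which this endpoint does not report.
print(resources["core"]["remaining"], resources["graphql"]["remaining"])
```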
Why 60 seconds? The linked docs say 1 second.
Oh, I was just saying that because that's what the Retry-After header said.
I suppose if the rate limit is hit, then yes, it will have to wait for the duration of the Retry-After header. But maybe setting a delay of 1 second between requests would ensure the rate limit is never hit?
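A fixed spacing between requests could be enforced with something as small as this (a sketch, not Kodiak's code):

```python
import asyncio
import time


class OneRequestPerSecond:
    """Serialize requests and keep at least one second between them."""

    def __init__(self, min_interval: float = 1.0) -> None:
        self.min_interval = min_interval
        self._last_request = 0.0
        self._lock = asyncio.Lock()

    async def wait(self) -> None:
        # Call before each API request.
        async with self._lock:
            elapsed = time.monotonic() - self._last_request
            if elapsed < self.min_interval:
                await asyncio.sleep(self.min_interval - elapsed)
            self._last_request = time.monotonic()
```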
This one has been ready to merge for about 20 minutes: vercel/vercel#10164
Thanks for the ping here. I think I need to synchronize the GraphQL calls so we only make one per installation at a time; it seems like the REST API isn't returning the secondary rate limit errors. I triggered a refresh of that pull request and it merged (you can trigger a refresh by editing the pull request title, description, labels, or commits). Kodiak wasn't actively looking at this PR; I think Kodiak hit the secondary rate limit and dropped the event, because we don't retry on those errors. My TODOs are:
- synchronize GraphQL calls so only one runs per installation at a time (sketched below)
- retry/requeue pull requests when we hit a rate limit instead of dropping the event
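For illustration, a minimal sketch of that per-installation synchronization; the lock bookkeeping and the `client.post_graphql` helper are made up, not Kodiak's actual code:

```python
import asyncio
from collections import defaultdict

# One lock per installation so at most one GraphQL request is in flight at a time.
_graphql_locks = defaultdict(asyncio.Lock)


async def run_graphql(installation_id: str, client, query: str, variables: dict) -> dict:
    async with _graphql_locks[installation_id]:
        # Hypothetical helper that POSTs to https://api.github.com/graphql.
        return await client.post_graphql(query, variables)
```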
Now I'm seeing a new error message. Perhaps GitHub's API is down? Reference: vercel/next.js#52178
I think that's because of the secondary rate limit. I'm going to try to temporarily reduce the GraphQL requests by limiting the ingest workers to 1.
Hey @styfle, sorry for the delayed reply. Last week, after my message above, I made a change to requeue pull requests when we hit API rate limits, which I think ensures we no longer drop pull requests because of those limits. I'll take a look at the PR you linked.
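The requeue behavior could look roughly like this; the exception type and queue here are illustrative names, not the actual change:

```python
import asyncio


class ApiRateLimitExceeded(Exception):
    """Raised by the API layer when GitHub reports a primary or secondary rate limit."""


async def evaluate_with_requeue(queue: asyncio.Queue, pr, evaluate_pr, delay_seconds: float = 60.0) -> None:
    # Instead of dropping the event when a rate limit is hit, put the PR back
    # on the installation queue so it gets re-evaluated later.
    try:
        await evaluate_pr(pr)
    except ApiRateLimitExceeded:
        await asyncio.sleep(delay_seconds)
        await queue.put(pr)
```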
This error is unrelated to the original issue (https://github.com/vercel/next.js/runs/14992559502). Here's the issue for that error: #853. I talked to GitHub a while back and the product team there said this was expected behavior.
Is this the same bug as #853?
Closing since it seems the remaining issues we've had are the same as #853 |