Kodiak missing from checks #874

Closed
styfle opened this issue Jun 23, 2023 · 27 comments
Labels
bug Something isn't working

Comments

@styfle

styfle commented Jun 23, 2023

Kodiak is missing from the checks on this PR: vercel/vercel#10133

Even though the automerge label is applied, it seems like it will never merge https://github.com/vercel/vercel/blob/main/.kodiak.toml#L4C1-L4C1

Is Kodiak down right now?

[screenshot]
@styfle styfle added the bug Something isn't working label Jun 23, 2023
@chdsbd
Owner

chdsbd commented Jun 23, 2023

I've been looking at this for a while and I think the issue is the number of webhook events in the installation's queue.

Kodiak has a single queue per installation to process all webhooks. Whenever a PR is merged, Kodiak enqueues all open PRs (up to 100 per repo) to recheck merge status. So if you have repos with a large number of open PRs, there's going to be a large number of PRs added to the queue.

There are currently 321 PRs in the queue for evaluation, so Kodiak will take some time to evaluate them given GitHub API rate limits.

I wonder if we can filter out PRs more here or increase concurrency:

kodiak/bot/kodiak/queue.py

Lines 214 to 215 in d6f1720

prs = await api_client.get_open_pull_requests(base=branch_name)
if prs is None:
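
For illustration only, a hedged sketch of what a label filter around those quoted lines could look like. It assumes the PR objects follow the GitHub REST shape (a "labels" list of {"name": ...} dicts) and that the configured automerge label is available at this point; none of that is confirmed against Kodiak's actual code.

# Hypothetical sketch, not a patch: filter the open PRs down to those carrying
# the configured automerge label before enqueueing them for re-evaluation.
prs = await api_client.get_open_pull_requests(base=branch_name)
if prs is not None:
    automerge_label = "automerge"  # assumed; in practice this comes from .kodiak.toml
    prs = [
        pr
        for pr in prs
        if any(label.get("name") == automerge_label for label in pr.get("labels", []))
    ]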

@chdsbd
Owner

chdsbd commented Jun 23, 2023

I think the move going forward is to update Kodiak to have more tasks working on the webhook queue. If we have rate limit issues, we can figure something else out.

@styfle
Author

styfle commented Jun 23, 2023

up to 100 per repo to recheck merge status. So if you have repos with a large number of open PRs, there's going to be a large number of PRs added to the queue.

Is it per repo or per org? The repo only has 56 open PRs, which is less than the "100 per repo".

https://github.com/vercel/vercel/pulls?q=is%3Apr+is%3Aopen

@styfle
Author

styfle commented Jun 23, 2023

I tested it and confirmed that all 56 return in a single API response:

curl -sSL 'https://api.github.com/repos/vercel/vercel/pulls?per_page=100' | jq '. | length'

@chdsbd
Owner

chdsbd commented Jun 23, 2023

It’s per organization, so it can add up across repos

@styfle
Author

styfle commented Jun 23, 2023

Since PRs are technically Issues, you could change the API call from /pulls to /issues and filter by the label.

For example, this returns the 1 open PR that has the pr: automerge label:

https://api.github.com/repos/vercel/vercel/issues?per_page=100&labels=pr%3A+automerge

That only works when the repo is configured with an automerge_label, but it could reduce the number of PRs in the queue quite a bit.
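
For reference, a minimal Python version of that query (illustrative only; same endpoint and parameters as the URL above):

import requests

# List open PRs with the automerge label via the issues endpoint (PRs are issues too).
resp = requests.get(
    "https://api.github.com/repos/vercel/vercel/issues",
    params={"per_page": 100, "labels": "pr: automerge", "state": "open"},
    headers={"Accept": "application/vnd.github+json"},
)
resp.raise_for_status()
# The issues endpoint can also return plain issues; PR items carry a "pull_request" key.
prs = [item for item in resp.json() if "pull_request" in item]
print(len(prs))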

@chdsbd
Owner

chdsbd commented Jun 26, 2023

My plan is to increase the concurrency for webhook ingestion.

@chdsbd
Owner

chdsbd commented Jun 27, 2023

I've deployed a change to increase the concurrency for your organization to 5 ingest workers. I'm going to monitor the GitHub API rate limits and increase the worker count if the rate limiting isn't an issue
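
For context, the general shape of that change is something like the sketch below. It is illustrative only, using an in-process asyncio queue and placeholder names; Kodiak's real queueing and event handling are not shown here.

import asyncio

INGEST_WORKER_COUNT = 5  # bumped from 1 for this installation

async def process_webhook_event(event: dict) -> None:
    """Placeholder for the real webhook handling logic."""
    ...

async def ingest_worker(queue: asyncio.Queue) -> None:
    # Each worker drains the same per-installation webhook queue.
    while True:
        event = await queue.get()
        try:
            await process_webhook_event(event)
        finally:
            queue.task_done()

async def run_ingest(queue: asyncio.Queue) -> None:
    workers = [asyncio.create_task(ingest_worker(queue)) for _ in range(INGEST_WORKER_COUNT)]
    await asyncio.gather(*workers)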

@chdsbd
Owner

chdsbd commented Jun 28, 2023

I realized yesterday that Kodiak throttles API requests to 5000/hour by default. Since your organization is on GitHub Enterprise, Kodiak has a 15,000/hour rate limit.

I've deployed a change that enables your organization to utilize the 15k rate limit

@styfle
Author

styfle commented Jun 28, 2023

Wonderful thanks!

I'll close this issue for now and I can re-open if it happens again 👍

@styfle styfle closed this as completed Jun 28, 2023
@styfle
Author

styfle commented Jun 28, 2023

Re-opening since the problem is happening again now.

Kodiak shows nothing in the queue, but there is a PR that says it's 3rd in the queue, and it has been for about 12 minutes now.

[screenshots]

@styfle styfle reopened this Jun 28, 2023
@chdsbd
Owner

chdsbd commented Jun 29, 2023

It might be the rate limit. The logging on rate limits is lacking, so I'll turn down the internal Kodiak limit for now.

@chdsbd
Owner

chdsbd commented Jun 29, 2023

I added some better logging; it seems like we're hitting a different rate limit.

I'm seeing errors for GraphQL queries (https://api.github.com/graphql) and check-runs (https://api.github.com/repos/vercel/api/check-runs):

{
  "documentation_url": "https://docs.github.com/en/free-pro-team@latest/rest/overview/resources-in-the-rest-api#secondary-rate-limits",
  "message": "You have exceeded a secondary rate limit. Please wait a few minutes before you try again."
}

https://docs.github.com/en/free-pro-team@latest/rest/overview/resources-in-the-rest-api#secondary-rate-limits

@chdsbd
Owner

chdsbd commented Jun 30, 2023

I think we can update our throttling logic to wait for the "retry-after" time when we hit a secondary rate limit
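
One possible shape for that, assuming an httpx-style async client (a sketch, not Kodiak's actual HTTP layer):

import asyncio
import httpx

async def request_with_retry(client: httpx.AsyncClient, method: str, url: str, **kwargs) -> httpx.Response:
    while True:
        res = await client.request(method, url, **kwargs)
        # Secondary rate limit errors come back as 403/429 with a Retry-After header.
        if res.status_code in (403, 429) and "retry-after" in res.headers:
            await asyncio.sleep(int(res.headers["retry-after"]))
            continue
        return res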

@chdsbd
Owner

chdsbd commented Jun 30, 2023

I'm actually unsure here; looking at the /rate_limit API, we have plenty of room. The only header we're receiving on the secondary rate limit error is retry-after: 60.

I think it may be because we're making many requests concurrently, which the GitHub docs explicitly call out as a no-no 😬 : https://docs.github.com/en/rest/guides/best-practices-for-integrators?apiVersion=2022-11-28#dealing-with-secondary-rate-limits

Make requests for a single user or client ID serially. Do not make requests for a single user or client ID concurrently.

We could wait 60 seconds after hitting this limit, but that's going to slow things down

{
  "resources": {
    "core": {
      "limit": 15000,
      "used": 3,
      "remaining": 14997,
      "reset": 1688135146
    },
    "search": {
      "limit": 30,
      "used": 0,
      "remaining": 30,
      "reset": 1688131627
    },
    "graphql": {
      "limit": 12500,
      "used": 120,
      "remaining": 12380,
      "reset": 1688135139
    },
    "integration_manifest": {
      "limit": 5000,
      "used": 0,
      "remaining": 5000,
      "reset": 1688135167
    },
    "source_import": {
      "limit": 100,
      "used": 0,
      "remaining": 100,
      "reset": 1688131627
    },
    "code_scanning_upload": {
      "limit": 1000,
      "used": 0,
      "remaining": 1000,
      "reset": 1688135167
    },
    "actions_runner_registration": {
      "limit": 10000,
      "used": 0,
      "remaining": 10000,
      "reset": 1688135167
    },
    "scim": {
      "limit": 15000,
      "used": 0,
      "remaining": 15000,
      "reset": 1688135167
    },
    "dependency_snapshots": {
      "limit": 100,
      "used": 0,
      "remaining": 100,
      "reset": 1688131627
    },
    "code_search": {
      "limit": 10,
      "used": 0,
      "remaining": 10,
      "reset": 1688131627
    }
  },
  "rate": {
    "limit": 15000,
    "used": 3,
    "remaining": 14997,
    "reset": 1688135146
  }
}

@styfle
Author

styfle commented Jun 30, 2023

Why 60 seconds? The linked docs say 1 second.

If you're making a large number of POST, PATCH, PUT, or DELETE requests for a single user or client ID, wait at least one second between each request.

@cdignam-segment

Oh, I was just saying that because that’s what the retry-after header said

@styfle
Author

styfle commented Jun 30, 2023

I suppose if the rate limit is hit, then yes it will have to wait for the duration of Retry-After.

But maybe setting a delay of 1 second between requests would ensure the rate limit is never hit?
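
As a toy illustration of that spacing idea (assumed names, not Kodiak's actual throttle):

import asyncio
import time

class WriteSpacer:
    """Keep write requests (POST/PATCH/PUT/DELETE) at least `min_interval` seconds apart."""

    def __init__(self, min_interval: float = 1.0) -> None:
        self._min_interval = min_interval
        self._last = 0.0
        self._lock = asyncio.Lock()

    async def wait(self) -> None:
        async with self._lock:
            delay = self._min_interval - (time.monotonic() - self._last)
            if delay > 0:
                await asyncio.sleep(delay)
            self._last = time.monotonic()

# Usage: await spacer.wait() immediately before each write request.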

@styfle
Author

styfle commented Jul 5, 2023

This one has been ready to merge for about 20 minutes: vercel/vercel#10164

[screenshot]

@chdsbd
Owner

chdsbd commented Jul 5, 2023

Thanks for the ping here. I think I need to synchronize the graphql calls so we only make one per installation at a time. It seems like the REST API isn't returning the secondary rate limit errors.

I triggered a refresh of that pull request and it merged (you can do this by editing the pull request title, description, labels, or commits).

Kodiak wasn't actively looking at this PR; I think Kodiak hit the secondary rate limit and dropped the event, because we don't retry on those errors.

My TODOs are:

  • add some sort of lock so we only make one concurrent graphql request per installation
  • when we hit a secondary rate limit, enqueue the pull request again for evaluation instead of dropping it
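
Roughly, those two items could look like the sketch below. It is a hedged illustration with made-up names (send_query, requeue_pull_request), not a commitment to a particular implementation.

import asyncio
from collections import defaultdict
from typing import Awaitable, Callable, Optional

import httpx

# One lock per installation so GraphQL calls for a given installation run serially.
graphql_locks: defaultdict[str, asyncio.Lock] = defaultdict(asyncio.Lock)

async def run_graphql(
    installation_id: str,
    send_query: Callable[[], Awaitable[httpx.Response]],   # hypothetical: performs the GraphQL POST
    requeue_pull_request: Callable[[], Awaitable[None]],    # hypothetical: re-enqueues the PR for evaluation
) -> Optional[dict]:
    async with graphql_locks[installation_id]:
        res = await send_query()
    if res.status_code in (403, 429) and "retry-after" in res.headers:
        # Instead of dropping the event, put the pull request back on the queue.
        await requeue_pull_request()
        return None
    return res.json()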

@styfle
Author

styfle commented Jul 5, 2023

Now I'm seeing a new error message. Perhaps GitHub's API is down?

[screenshot]

Reference: vercel/next.js#52178

@chdsbd
Owner

chdsbd commented Jul 5, 2023

I think that's because of the secondary rate limit. I'm going to try to temporarily reduce the GraphQL requests by limiting the ingest workers to 1.

@styfle
Author

styfle commented Jul 12, 2023

I'm seeing that error again today:

problem contacting GitHub API with method 'pull_request/update_branch'

[screenshot]

vercel/next.js#52580

@chdsbd
Owner

chdsbd commented Jul 14, 2023

Hey @styfle,

Sorry for the delayed reply. Last week, after my message above, I made a change to requeue pull requests when we hit API rate limits, which I think ensures we no longer drop pull requests on those errors.

I’ll take a look at the PR you linked

@chdsbd
Owner

chdsbd commented Jul 14, 2023

This error is unrelated to this original issue

https://github.com/vercel/next.js/runs/14992559502

Here’s the issue for this error: #853

I talked to GitHub a while back and the product team there said this was expected behavior

@styfle
Author

styfle commented Aug 3, 2023

Is this the same bug as #853 ?

kodiakhq: status — ⚠️ problem contacting GitHub API with method 'pull_request/merge'

[screenshot]

vercel/next.js#53520

@styfle
Author

styfle commented Aug 8, 2023

Closing since it seems the remaining issues we've had are the same as #853

@styfle styfle closed this as completed Aug 8, 2023