fix: prevent possible slow resolution of rate limiter #433

Open
wants to merge 1 commit into
base: master

@sam-super sam-super commented Nov 28, 2024

By submitting a PR to this repository, you agree to the terms within the Auth0 Code of Conduct. Please see the contributing guidelines for how to create and submit a high-quality PR for this repo.

Description

We have found scenarios at scale where our requests lock up for 10+ seconds, waiting on something in the rate limiter while resolving a JWKS endpoint. We saw this when we had a steady rate (around 2 per second) of requests trying to validate a kid which didn't exist in the JWKS endpoint. Each of those requests was a cache miss, and the cache misses caused the rate limiter to kick in.

We believe the intention is for the rate limiter to return immediately and not pause to wait for available tokens (hence passing the fireImmediately arg to the RateLimiter constructor).

However, maybe because of a race condition at scale (unfortunately we can't reproduce this locally), it seems possible to end up trying to resolve a token from the bucket even when there are no tokens left:
https://github.com/jhurliman/node-rate-limiter/blob/main/src/RateLimiter.ts#L83
Once we are on this path, the rate limiter can trigger a wait for new 'tokens' to become available:
https://github.com/jhurliman/node-rate-limiter/blob/main/src/TokenBucket.ts#L96

This fix uses the simpler tryRemoveTokens method, which synchronously returns a boolean indicating whether the tokens could be taken (and so can't pause execution to wait for tokens to become available).
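
To illustrate the idea, here is a minimal, self-contained sketch of a non-blocking check. It is not the code in this PR and not the limiter package itself: `SimpleTokenBucket`, `rateLimit`, the injectable `now` clock, and the error message are all hypothetical stand-ins. The only thing it borrows from the description above is the shape of a `tryRemoveTokens(count)` call that returns a boolean straight away.

```javascript
// Hypothetical stand-in for a token bucket, used only for illustration.
class SimpleTokenBucket {
  constructor(capacity, intervalMs, now = () => Date.now()) {
    this.capacity = capacity;     // maximum tokens held at once
    this.intervalMs = intervalMs; // time taken to refill an empty bucket
    this.now = now;               // injectable clock (an assumption, eases testing)
    this.tokens = capacity;
    this.lastRefill = now();
  }

  // Top the bucket up in proportion to the time elapsed since the last refill.
  refill() {
    const current = this.now();
    const elapsed = current - this.lastRefill;
    if (elapsed <= 0) return;
    const added = (elapsed / this.intervalMs) * this.capacity;
    this.tokens = Math.min(this.capacity, this.tokens + added);
    this.lastRefill = current;
  }

  // Synchronous: true if the tokens were taken, false otherwise. Never waits.
  tryRemoveTokens(count) {
    this.refill();
    if (count > this.tokens) return false;
    this.tokens -= count;
    return true;
  }
}

// Hypothetical wrapper: reject at once when the bucket is empty
// instead of queueing the caller until tokens come back.
function rateLimit(getSigningKey, bucket) {
  return async (kid) => {
    if (!bucket.tryRemoveTokens(1)) {
      throw new Error('Too many requests to the JWKS endpoint');
    }
    return getSigningKey(kid);
  };
}
```

Because the check is a plain boolean, there is no code path on which a caller can end up waiting for a refill; the worst case is an immediate rejection.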

Although we can't write a specific test for this, the new code seems simpler and passes the existing rate-limit tests, so it seems like a good change anyway.

Testing

There are existing tests to cover this.

Checklist

  • I have added documentation for new/changed functionality in this PR or in auth0.com/docs
  • All active GitHub checks for tests, formatting, and security are passing
  • The correct base branch is being used, if not the default branch

@sam-super sam-super requested a review from a team as a code owner November 28, 2024 07:42