Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proxy control plane rate limiter #5785

Merged
merged 30 commits into from
Nov 15, 2023
Merged

Conversation

khanova
Copy link
Contributor

@khanova khanova commented Nov 6, 2023

Problem

Proxy might overload the control plane.

Summary of changes

Implement rate limiter for proxy<->control plane connection.
Resolves #5707

Used implementation ideas from https://github.com/conradludgate/squeeze/

Checklist before requesting a review

  • I have performed a self-review of my code.
  • If it is a core feature, I have added thorough tests.
  • Do we need to implement analytics? if so did you add the relevant metrics to the dashboard?
  • If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.

Checklist before merging

  • Do not forget to reformat commit message to not include the above checklist

Cargo.lock Outdated Show resolved Hide resolved
Cargo.lock Outdated Show resolved Hide resolved
Copy link

github-actions bot commented Nov 6, 2023

2388 tests run: 2272 passed, 0 failed, 116 skipped (full report)


Flaky tests (3)

Postgres 16

  • test_branching_with_pgbench[flat-1-10]: debug
  • test_pageserver_restarts_under_worload: debug
  • test_empty_tenant_size: debug

Code coverage (full report)

  • functions: 54.8% (9035 of 16495 functions)
  • lines: 81.5% (51968 of 63747 lines)

The comment gets automatically updated with the latest test results
466ce49 at 2023-11-14T18:19:34.742Z :recycle:

proxy/src/bin/proxy.rs Outdated Show resolved Hide resolved
@khanova khanova requested a review from vadim2404 November 7, 2023 10:24
@khanova khanova marked this pull request as ready for review November 7, 2023 10:24
@khanova khanova requested a review from a team as a code owner November 7, 2023 10:24
@conradludgate
Copy link
Contributor

conradludgate commented Nov 7, 2023

Let me give some backstory on the pin-list semaphore idea - Although I see it's not included in this current PR (but the pin-list dependency is still there)

why was it necessary

I needed a way to remove available permits at will - eg when the congestion control algorithm wants to decrease the available concurrency.

tokio::sync::Semaphore only allows adding additional permits. To remove permits, you need to await an acquire and then forget them. This will have a lagged effect as these acquire calls with be at the back of the queue.

Solution

Write our own semaphore. Easier said than done. Inspired by tokio's implementation which uses a Mutex<LinkedList> where the linked list is intrusive. Writing intrusive linked lists requires unsafe code which I would like to avoid. Thankfully, a friend of mine authored this crate called pin-list which has a safe abstraction for an efficient intrusive-linked list.

I took the code from tokio, and built it on top of pin-list and made a few notable changes:

  1. Instead of a single permits available coutner, I switched to 2 counters
  2. I removed some of the code that dealt with acquiring multiple permits as it added a bit of extra unnecessary complexity

@khanova
Copy link
Contributor Author

khanova commented Nov 7, 2023

Let me give some backstory on the pin-list semaphore idea - Although I see it's not included in this current PR (but the pin-list dependency is still there)

why was it necessary

I needed a way to remove available permits at will - eg when the congestion control algorithm wants to decrease the available concurrency.

tokio::sync::Semaphore only allows adding additional permits. To remove permits, you need to await an acquire and then forget them. This will have a lagged effect as these acquire calls with be at the back of the queue.

Solution

Write our own semaphore. Easier said than done. Inspired by tokio's implementation which uses a Mutex<LinkedList> where the linked list is intrusive. Writing intrusive linked lists requires unsafe code which I would like to avoid. Thankfully, a friend of mine authored this crate called pin-list which has a safe abstraction for an efficient intrusive-linked list.

I took the code from tokio, and built it on top of pin-list and made a few notable changes:

  1. Instead of a single permits available coutner, I switched to 2 counters
  2. I removed some of the code that dealt with acquiring multiple permits as it added a bit of extra unnecessary complexity

I see, yes, implementing semaphore is a non-trivial task.

Originally I wanted to reuse your implementation with pin-list, but it looks like the bot is not very happy about this dependency.

My assumption was that it should be very rare situation when the control plane is overloaded and there is still non-trivial amount of available permits. Than it is not a problem to forget permits on the release. But this assumption might be completely wrong.

What do you think about it?

Copy link
Contributor

@hlinnaka hlinnaka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some prometheus metrics to monitor the rate limiting would be nice.

If I understand correctly, this is a global rate limit on the number of requests to the control plane. Does it have any "fairness" built into it? If one user sends a lot of requests, can it saturate the limiter easily, effectively causing an outage for everyone else?

In production, we currently run three console/control plane instances behind a load balancer. If one of them is overloaded for some reason and fails all requests, but others are working correctly, how does the rate limiting algorithm behave? In the future, we will also have separate control plane instances in each region.

More tests would be nice. There are unit tests for the algorithm, but I'd also like to see some python tests, testing the throttling in the real proxy.

proxy/src/bin/proxy.rs Outdated Show resolved Hide resolved
proxy/src/bin/proxy.rs Outdated Show resolved Hide resolved
proxy/src/bin/proxy.rs Outdated Show resolved Hide resolved
@hlinnaka
Copy link
Contributor

hlinnaka commented Nov 7, 2023

If I understand correctly, this is a global rate limit on the number of requests to the control plane. Does it have any "fairness" built into it? If one user sends a lot of requests, can it saturate the limiter easily, effectively causing an outage for everyone else?

I see that #5799 addresses that, with a per-endpoint lock. When both of these PRs are merged, I presume we will acquire the per-endpoint lock first, and the global limiter permit after that. That seems OK. Per IP address limiting would be nice too, to avoid DoSsing the control plane with 'get_auth_info' requests or saturating this rate limiter, but that's a different story.

@khanova khanova requested a review from conradludgate November 9, 2023 15:50
proxy/src/rate_limiter/limit_algorithm.rs Outdated Show resolved Hide resolved
proxy/src/bin/proxy.rs Outdated Show resolved Hide resolved
@khanova khanova enabled auto-merge (squash) November 14, 2023 16:18
@khanova khanova merged commit 2f0d245 into main Nov 15, 2023
37 checks passed
@khanova khanova deleted the proxy-control-plane-rate-limiter branch November 15, 2023 09:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

proxy: dynamic rate limiting
5 participants