Proxy control plane rate limiter #5785

khanova · 2023-11-06T09:43:50Z

Problem

The proxy might overload the control plane.

Summary of changes

Implement a rate limiter for the proxy<->control plane connection.
Resolves #5707

Used implementation ideas from https://github.com/conradludgate/squeeze/

Checklist before requesting a review

I have performed a self-review of my code.
If it is a core feature, I have added thorough tests.
Do we need to implement analytics? If so, did you add the relevant metrics to the dashboard?
If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.

Checklist before merging

Do not forget to reformat commit message to not include the above checklist


github-actions · 2023-11-06T10:26:14Z

2388 tests run: 2272 passed, 0 failed, 116 skipped (full report)

Flaky tests (3)

Postgres 16

test_branching_with_pgbench[flat-1-10]: debug
test_pageserver_restarts_under_worload: debug
test_empty_tenant_size: debug

Code coverage (full report)

functions: 54.8% (9035 of 16495 functions)
lines: 81.5% (51968 of 63747 lines)

The comment gets automatically updated with the latest test results: 466ce49 at 2023-11-14T18:19:34.742Z


conradludgate · 2023-11-07T10:54:19Z

Let me give some backstory on the pin-list semaphore idea, although I see it's not included in this PR (but the pin-list dependency is still there).

Why was it necessary

I needed a way to remove available permits at will - eg when the congestion control algorithm wants to decrease the available concurrency.
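To make "decrease the available concurrency" concrete, here is a minimal sketch of one common congestion control scheme, AIMD (additive increase, multiplicative decrease). Whether this PR uses exactly this scheme is an assumption, and the type names, the `Outcome` enum and the constants are all hypothetical.

```rust
/// Hypothetical AIMD concurrency limit. Names and constants are
/// illustrative assumptions, not taken from the PR.
#[derive(Debug)]
struct Aimd {
    limit: usize,
    min_limit: usize,
    max_limit: usize,
    decrease_factor: f64,
}

/// Result of one control plane request, as seen by the limiter.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Outcome {
    Success,
    Overload,
}

impl Aimd {
    fn new(initial: usize, min_limit: usize, max_limit: usize, decrease_factor: f64) -> Self {
        Self { limit: initial, min_limit, max_limit, decrease_factor }
    }

    /// Update the limit from one request outcome and return the new limit.
    fn update(&mut self, outcome: Outcome) -> usize {
        self.limit = match outcome {
            // Additive increase: one more concurrent request is allowed.
            Outcome::Success => (self.limit + 1).min(self.max_limit),
            // Multiplicative decrease: shrink the limit by a constant factor.
            Outcome::Overload => {
                let reduced = (self.limit as f64 * self.decrease_factor).floor() as usize;
                reduced.max(self.min_limit)
            }
        };
        self.limit
    }
}

fn main() {
    let mut aimd = Aimd::new(10, 1, 100, 0.5);
    println!("after success: {}", aimd.update(Outcome::Success));
    println!("after overload: {}", aimd.update(Outcome::Overload));
}
```

The multiplicative step is the case that needs permits to be removed: the limit can drop by several permits at once, and the semaphore has to follow.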

tokio::sync::Semaphore only allows adding additional permits. To remove permits, you need to await an acquire and then forget them. This will have a lagged effect as these acquire calls will be at the back of the queue.

Solution

Write our own semaphore. Easier said than done. It is inspired by tokio's implementation, which uses a Mutex<LinkedList> where the linked list is intrusive. Writing intrusive linked lists requires unsafe code, which I would like to avoid. Thankfully, a friend of mine authored a crate called pin-list, which has a safe abstraction for an efficient intrusive linked list.

I took the code from tokio, and built it on top of pin-list and made a few notable changes:

Instead of a single permits-available counter, I switched to 2 counters
I removed some of the code that dealt with acquiring multiple permits as it added a bit of extra unnecessary complexity
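The comment does not say which two counters are meant. The sketch below assumes they are the current limit and the number of permits in flight, which is one way to let a lower limit take effect immediately. It uses only the standard library and a non-blocking `try_acquire`, so it has no waiter queue and is not the pin-list implementation; `Limiter`, `Counters` and `Permit` are hypothetical names.

```rust
use std::sync::{Arc, Mutex};

/// Assumed pair of counters: the current limit and the permits handed out.
#[derive(Debug)]
struct Counters {
    limit: usize,
    in_flight: usize,
}

struct Limiter {
    state: Arc<Mutex<Counters>>,
}

/// Held for the duration of a request; gives the slot back on drop.
struct Permit {
    state: Arc<Mutex<Counters>>,
}

impl Limiter {
    fn new(limit: usize) -> Self {
        Self { state: Arc::new(Mutex::new(Counters { limit, in_flight: 0 })) }
    }

    /// Hand out a permit only while fewer than `limit` are in flight.
    fn try_acquire(&self) -> Option<Permit> {
        let mut s = self.state.lock().unwrap();
        if s.in_flight < s.limit {
            s.in_flight += 1;
            Some(Permit { state: Arc::clone(&self.state) })
        } else {
            None
        }
    }

    /// Raising or lowering the limit is a plain store; no acquire is needed.
    fn set_limit(&self, limit: usize) {
        self.state.lock().unwrap().limit = limit;
    }

    fn available(&self) -> usize {
        let s = self.state.lock().unwrap();
        s.limit.saturating_sub(s.in_flight)
    }
}

impl Drop for Permit {
    fn drop(&mut self) {
        self.state.lock().unwrap().in_flight -= 1;
    }
}

fn main() {
    let limiter = Limiter::new(3);
    let a = limiter.try_acquire().unwrap();
    let _b = limiter.try_acquire().unwrap();
    limiter.set_limit(1); // takes effect at once, even with 2 in flight
    assert!(limiter.try_acquire().is_none());
    drop(a);
    assert!(limiter.try_acquire().is_none()); // one still in flight, limit is 1
    println!("available: {}", limiter.available());
}
```

Because the limit is stored separately from the in-flight count, lowering it never has to wait behind queued acquirers.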


khanova · 2023-11-07T11:28:49Z

I see, yes, implementing a semaphore is a non-trivial task.

Originally I wanted to reuse your implementation with pin-list, but it looks like the bot is not very happy about this dependency.

My assumption was that it should be a very rare situation for the control plane to be overloaded while a non-trivial number of permits is still available. In that case, it is not a problem to forget permits on release. But this assumption might be completely wrong.

What do you think about it?
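A toy model of the "forget permits on release" idea, to show where the assumption matters. This is a sketch with hypothetical names (`Pool`, `to_forget`), not the code in this PR: a reduction is only recorded, and each released permit is dropped instead of returned until the reduction has been paid off.

```rust
/// Toy model of a counting semaphore that shrinks by forgetting permits
/// on release. Names are illustrative assumptions, not the PR's code.
#[derive(Debug)]
struct Pool {
    /// Permits that can be handed out right now.
    available: usize,
    /// Permits that still have to be dropped to reach the new limit.
    to_forget: usize,
}

impl Pool {
    fn new(permits: usize) -> Self {
        Self { available: permits, to_forget: 0 }
    }

    /// Take one permit if any is available.
    fn try_acquire(&mut self) -> bool {
        if self.available > 0 {
            self.available -= 1;
            true
        } else {
            false
        }
    }

    /// Ask for the limit to shrink by `n`; nothing is removed yet.
    fn reduce(&mut self, n: usize) {
        self.to_forget += n;
    }

    /// Return a permit, or forget it if a reduction is still owed.
    fn release(&mut self) {
        if self.to_forget > 0 {
            self.to_forget -= 1;
        } else {
            self.available += 1;
        }
    }
}

fn main() {
    // Saturated pool: the reduction is absorbed by the next releases.
    let mut busy = Pool::new(3);
    for _ in 0..3 {
        assert!(busy.try_acquire());
    }
    busy.reduce(2);
    for _ in 0..3 {
        busy.release();
    }
    println!("busy pool after releases: {:?}", busy);

    // Idle pool: the permits stay usable until requests come back.
    let mut idle = Pool::new(3);
    idle.reduce(2);
    println!("idle pool right after reduce: {:?}", idle);
}
```

In the saturated case the limit converges as soon as in-flight requests finish. In the idle case the old permits can still be acquired after the reduction, which is the situation the assumption treats as rare.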

hlinnaka

Some prometheus metrics to monitor the rate limiting would be nice.

If I understand correctly, this is a global rate limit on the number of requests to the control plane. Does it have any "fairness" built into it? If one user sends a lot of requests, can it saturate the limiter easily, effectively causing an outage for everyone else?

In production, we currently run three console/control plane instances behind a load balancer. If one of them is overloaded for some reason and fails all requests, but others are working correctly, how does the rate limiting algorithm behave? In the future, we will also have separate control plane instances in each region.

More tests would be nice. There are unit tests for the algorithm, but I'd also like to see some Python tests that exercise the throttling in the real proxy.


hlinnaka · 2023-11-07T12:32:21Z

If I understand correctly, this is a global rate limit on the number of requests to the control plane. Does it have any "fairness" built into it? If one user sends a lot of requests, can it saturate the limiter easily, effectively causing an outage for everyone else?

I see that #5799 addresses that, with a per-endpoint lock. When both of these PRs are merged, I presume we will acquire the per-endpoint lock first, and the global limiter permit after that. That seems OK. Per-IP-address limiting would be nice too, to avoid DoSing the control plane with 'get_auth_info' requests or saturating this rate limiter, but that's a different story.
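A rough sketch of that acquisition order, using only the standard library. Everything here is hypothetical (the `ControlPlaneClient` type, its fields, and the endpoint names), and the global limiter is reduced to a simple in-flight counter; it only illustrates "per-endpoint lock first, global permit second".

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Hypothetical client that serializes requests per endpoint and then
/// applies a global concurrency limit.
struct ControlPlaneClient {
    endpoint_locks: Mutex<HashMap<String, Arc<Mutex<()>>>>,
    global_in_flight: Mutex<usize>,
    global_limit: usize,
}

impl ControlPlaneClient {
    fn new(global_limit: usize) -> Self {
        Self {
            endpoint_locks: Mutex::new(HashMap::new()),
            global_in_flight: Mutex::new(0),
            global_limit,
        }
    }

    /// Look up (or create) the lock for one endpoint.
    fn endpoint_lock(&self, endpoint: &str) -> Arc<Mutex<()>> {
        let mut locks = self.endpoint_locks.lock().unwrap();
        locks.entry(endpoint.to_string()).or_default().clone()
    }

    /// Run `request` if a global permit is free; `None` means throttled.
    fn call<T>(&self, endpoint: &str, request: impl FnOnce() -> T) -> Option<T> {
        // 1. Per-endpoint lock: one endpoint occupies at most one global slot.
        let lock = self.endpoint_lock(endpoint);
        let _endpoint_guard = lock.lock().unwrap();

        // 2. Global permit, taken only after the endpoint lock is held.
        {
            let mut in_flight = self.global_in_flight.lock().unwrap();
            if *in_flight >= self.global_limit {
                return None;
            }
            *in_flight += 1;
        }

        let result = request();
        *self.global_in_flight.lock().unwrap() -= 1;
        Some(result)
    }
}

fn main() {
    let client = ControlPlaneClient::new(1);
    // The outer call holds the only global permit, so the inner call to a
    // different endpoint is throttled.
    let nested = client.call("ep-a", || client.call("ep-b", || "auth info"));
    println!("{:?}", nested);
}
```

Taking the endpoint lock first means a burst against one endpoint queues on its own lock rather than consuming every global permit.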


Anna Khanova added 4 commits November 3, 2023 11:19

Initial commit

b2c1996

Merge branch 'main' of github.com:neondatabase/neon into proxy-contro…

bd75723

…l-plane-rate-limiter

Update rate limiter

627df49

Merge branch 'main' of github.com:neondatabase/neon into proxy-contro…

90cc70a

…l-plane-rate-limiter

arnica-github-connector bot reviewed Nov 6, 2023

Cargo.lock (outdated)

arnica-github-connector bot reviewed Nov 6, 2023

Cargo.lock (outdated)

vadim2404 reviewed Nov 7, 2023

proxy/src/rate_limiter/limit_algorithm.rs (outdated)

vadim2404 reviewed Nov 7, 2023

proxy/src/bin/proxy.rs (outdated)

Anna Khanova added 3 commits November 7, 2023 11:18

Merge branch 'main' of github.com:neondatabase/neon into proxy-contro…

576e98f

…l-plane-rate-limiter

Update semaphore

bbe3404

Address issues.

e4ea604

khanova requested a review from vadim2404 November 7, 2023 10:24

khanova marked this pull request as ready for review November 7, 2023 10:24

khanova requested a review from a team as a code owner November 7, 2023 10:24

Anna Khanova added 2 commits November 7, 2023 12:17

Remove pin-list

9b296e4

Merge branch 'main' of github.com:neondatabase/neon into proxy-contro…

e2ad5b2

…l-plane-rate-limiter

hlinnaka reviewed Nov 7, 2023

proxy/src/bin/proxy.rs (outdated)

proxy/src/bin/proxy.rs (outdated)

proxy/src/bin/proxy.rs (outdated)

Anna Khanova added 4 commits November 9, 2023 16:17

Added pytest

695bdac

Merge branch 'main' of github.com:neondatabase/neon into proxy-contro…

60c5805

…l-plane-rate-limiter

Report metrics.

6f0fce6

Merge branch 'main' of github.com:neondatabase/neon into proxy-contro…

e9d7072

…l-plane-rate-limiter

khanova requested a review from conradludgate November 9, 2023 15:50

Anna Khanova added 4 commits November 9, 2023 16:52

Remove unused import

6fb44ea

Fix tests

b06bcd4

Fix typo

1c1e6ae

Fix python codestyle

af5f0e0

Anna Khanova added 7 commits November 9, 2023 18:09

Merge branch 'main' of github.com:neondatabase/neon into proxy-contro…

ca560e0

…l-plane-rate-limiter

Fmt

8f18088

Merge branch 'main' of github.com:neondatabase/neon into proxy-contro…

8c227c5

…l-plane-rate-limiter

More metrics

0ffe93b

Merge branch 'main' of github.com:neondatabase/neon into proxy-contro…

c3610f3

…l-plane-rate-limiter

Guard add_permits

f409b60

Fix flag

b13648c

conradludgate reviewed Nov 10, 2023

proxy/src/rate_limiter/limit_algorithm.rs (outdated)

proxy/src/bin/proxy.rs (outdated)

Anna Khanova added 3 commits November 10, 2023 18:23

Fix test

54ae71a

Address issues

9852c47

Merge branch 'main' of github.com:neondatabase/neon into proxy-contro…

1d2dbc7

…l-plane-rate-limiter

khanova requested a review from conradludgate November 10, 2023 17:56

Fmt

6144a07

conradludgate approved these changes Nov 14, 2023

Merge branch 'main' into proxy-control-plane-rate-limiter

eb3e8fc

khanova enabled auto-merge (squash) November 14, 2023 16:18

Merge branch 'main' into proxy-control-plane-rate-limiter

466ce49

khanova merged commit 2f0d245 into main Nov 15, 2023
37 checks passed

khanova deleted the proxy-control-plane-rate-limiter branch November 15, 2023 09:16
