Connection rate limiting #948
Conversation
Codecov Report. Attention: Patch coverage is
Additional details and impacted files:
@@            Coverage Diff            @@
##            master     #948   +/-   ##
=========================================
- Coverage     82.1%    82.1%   -0.1%
=========================================
  Files          886      889      +3
  Lines       236417   236614    +197
=========================================
+ Hits        194266   194407    +141
- Misses       42151    42207     +56
do we not intend to backport this change? +325 LOC with no tests? a new dependency? doesn't seem like it
Cargo.toml (Outdated)
@@ -214,6 +214,7 @@
 generic-array = { version = "0.14.7", default-features = false }
 gethostname = "0.2.3"
 getrandom = "0.2.10"
 goauth = "0.13.1"
+governor = "0.6.3"
introducing a new dependency will be a hard no on backport. is it really necessary? we already have the stream rate limiter logic that could be reused instead
I looked at our existing implementation and governor's. Governor's in-memory store implementation is pretty simple and well abstracted, as evidenced by how simple it is to instantiate the limiter and just check. Internally it is conscious of memory usage and efficiency, e.g. in the DirectRateLimiter case only an AtomicU64 is used for the state and no mutex is involved. Adapting the EMA code would be more complicated. Secondly, I am not keen to reinvent the wheel when this crate has been downloaded millions of times and is used by a couple of thousand projects.
you are missing the point. it's (probably) fine for master. we are not taking a new dependency to the stable branch
This should get in early and get tested on testnet. It has been tested in 3gG for some time.
Why is it a hard no for backport? Agreed that this is preferred if possible, as it is the simplest.
it's not tested. how hard is this, guys? do we really all need first-hand experience crashing mb to understand the gravity of what we're doing here?
I am not proposing a direct backport to v1.17 without going through tests. In my opinion every release to mb needs to go through thorough tests -- not just changes like these. Please review this PR based on master, which is what this PR is about.
I have changed the implementation to use a very simple rate limiter without using governor, for backport purposes.
LGTM. I can approve if comments from @t-nelson have been addressed.
LGTM, but would like @t-nelson's updated opinion too
@t-nelson @pgarg66 @ryleung-solana Need your help to get this moving forward.
streamer/src/nonblocking/quic.rs (Outdated)
pub const DEFAULT_MAX_CONNECTIONS_PER_IPADDR_PER_MINUTE: u64 = 8;
const TOTAL_CONNECTIONS_PER_SECOND: u64 = 2500;

const CONNECITON_RATE_LIMITER_CLEANUP_THRESHOLD: usize = 100_000;
This could probably be named better. It's hard to understand its use by its name.
will rename
    return false;
}

self.count = self.count.saturating_add(1);
This seems a bad place to increment the count. It's expecting the user of the API to not call is_allowed() multiple times for the same connection/stream.
Not sure I understand your concern. When a request is allowed to go through, we increment the count within that throttle window. I think our stream throttling did the same thing, and so does governor. The difference here is that the counter is stored inside the rate limiter.
The function is named is_allowed(). Generally, such functions can be called any number of times yielding the same results. In this case, if it's called repeatedly, self.count will exceed self.limit, returning a different result.
We should move the count update out to a different function. Or, rename it and document it better. It seems error prone to me in its current form.
I will update the documentation. This function is not supposed to be run multiple times for the same request.
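One way to resolve the naming concern discussed above is to split the idempotent query from the state update, so that is_allowed() can be called any number of times without consuming budget. The sketch below is illustrative only, not the PR's actual code; the names WindowLimiter and record are hypothetical.

```rust
use std::time::{Duration, Instant};

/// Fixed-window limiter with the check and the update separated.
struct WindowLimiter {
    limit: u64,
    window: Duration,
    window_start: Instant,
    count: u64,
}

impl WindowLimiter {
    fn new(limit: u64, window: Duration) -> Self {
        Self { limit, window, window_start: Instant::now(), count: 0 }
    }

    /// Idempotent: reports whether one more request would be admitted,
    /// without mutating any state.
    fn is_allowed(&self) -> bool {
        // An expired window counts as empty.
        if Instant::now().duration_since(self.window_start) >= self.window {
            return self.limit > 0;
        }
        self.count < self.limit
    }

    /// Consumes one unit of budget; call exactly once per admitted request.
    fn record(&mut self) {
        let now = Instant::now();
        if now.duration_since(self.window_start) >= self.window {
            // Start a fresh window before counting this request.
            self.window_start = now;
            self.count = 0;
        }
        self.count = self.count.saturating_add(1);
    }
}

fn main() {
    let mut limiter = WindowLimiter::new(2, Duration::from_secs(60));
    assert!(limiter.is_allowed());
    assert!(limiter.is_allowed()); // repeated queries don't consume budget
    limiter.record();
    limiter.record();
    assert!(!limiter.is_allowed()); // budget exhausted for this window
}
```

The cost of this design is that admission becomes two calls instead of one, so callers that race on the same limiter would need external synchronization; a combined check-and-consume method avoids that at the price of the misleading name.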
Backports to the stable branch are to be avoided unless absolutely necessary for fixing bugs, security issues, and perf regressions. Changes intended for backport should be structured such that a minimum effective diff can be committed separately from any refactoring, plumbing, cleanup, etc that are not strictly necessary to achieve the goal. Any of the latter should go only into master and ride the normal stabilization schedule.
Backports to the beta branch are to be avoided unless absolutely necessary for fixing bugs, security issues, and perf regressions. Changes intended for backport should be structured such that a minimum effective diff can be committed separately from any refactoring, plumbing, cleanup, etc that are not strictly necessary to achieve the goal. Any of the latter should go only into master and ride the normal stabilization schedule. Exceptions include CI/metrics changes, CLI improvements and documentation updates on a case by case basis.
* use rate limit on connectings; missing file
* Change connection rate limit to 8/min instead of 4/s
* Addressed some feedback from Trent
* removed some comments
* fix test failures which are opening connections more frequently
* moved the flag up
* turn off rate limiting to debug CI
* Fix CI test failures
* differentiate the two throttling cases in stats: across connections or per ip addr
* fmt issues
* Addressed some feedback from Trent
* Added unit tests; cleanup connection cache rate limiter if exceeding certain threshold; CONNECITON_RATE_LIMITER_CLEANUP_THRESHOLD to 100_000; clippy issues; sort crates
* revert Cargo.lock changes
* Addressed some feedback from Pankaj

(cherry picked from commit f54c120)

Conflicts:
client/src/connection_cache.rs
quic-client/tests/quic_client.rs
streamer/src/nonblocking/quic.rs
streamer/src/quic.rs
validator/src/cli.rs
* use rate limit on connectings; missing file
* Change connection rate limit to 8/min instead of 4/s
* Addressed some feedback from Trent
* removed some comments
* fix test failures which are opening connections more frequently
* moved the flag up
* turn off rate limiting to debug CI
* Fix CI test failures
* differentiate the two throttling cases in stats: across connections or per ip addr
* fmt issues
* Addressed some feedback from Trent
* Added unit tests; cleanup connection cache rate limiter if exceeding certain threshold; CONNECITON_RATE_LIMITER_CLEANUP_THRESHOLD to 100_000; clippy issues; sort crates
* revert Cargo.lock changes
* Addressed some feedback from Pankaj

(cherry picked from commit f54c120)

Conflicts:
streamer/src/quic.rs
validator/src/cli.rs
    continue;
}

if rate_limiter.len() > CONNECITON_RATE_LIMITER_CLEANUP_SIZE_THRESHOLD {
Would this check not cause a lot of trouble in the worst case, when we have more than the threshold number of entries? Every connection would cause an iteration over 100k entries. Maybe retain_recent should not be called more often than every 10ms, for example?
Am I right that we might be in a situation where we have > 100k IP addresses in the map and all of these connections are recent (<60s), so we will go over this dashmap over and over without removing enough connections to get below 100k? Or is this situation only hypothetical, because having ~100k IP addresses in the last minute is an almost impossible event?
"Am I right that we might be in a situation where we have > 100k IP addresses in the map and all of these connections are recent (<60s), so we will go over this dashmap over and over without removing enough connections to get below 100k?"
Yes, this looks correct, except that we also have a check above that does not allow more than 150k connections per 60s (based on TOTAL_CONNECTIONS_PER_SECOND = 2500).
"Or is this situation only hypothetical, because having ~100k IP addresses in the last minute is an almost impossible event?"
Not necessarily 100k IPs; the node allows multiple connections from one IP, 8 iirc? Still sounds hard, but possible?
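The fix the reviewer suggests can be sketched as a time gate in front of the expensive sweep: even if the map stays above the size threshold, a full pass over it runs at most once per interval. This is a minimal illustrative sketch, not the PR's code; the names CleanupGate and should_cleanup, and the 10ms interval, are assumptions taken from the comment above.

```rust
use std::time::{Duration, Instant};

const CLEANUP_SIZE_THRESHOLD: usize = 100_000;
// Minimum spacing between full-map sweeps, per the reviewer's suggestion.
const CLEANUP_INTERVAL: Duration = Duration::from_millis(10);

/// Gates a retain_recent-style sweep so it cannot run on every connection.
struct CleanupGate {
    last_cleanup: Option<Instant>,
}

impl CleanupGate {
    fn new() -> Self {
        Self { last_cleanup: None }
    }

    /// Returns true when a sweep should run now: the map is over the
    /// size threshold AND at least CLEANUP_INTERVAL has elapsed since
    /// the previous sweep (or no sweep has run yet).
    fn should_cleanup(&mut self, map_len: usize) -> bool {
        let now = Instant::now();
        let interval_elapsed = self
            .last_cleanup
            .map_or(true, |t| now.duration_since(t) >= CLEANUP_INTERVAL);
        if map_len > CLEANUP_SIZE_THRESHOLD && interval_elapsed {
            self.last_cleanup = Some(now);
            return true;
        }
        false
    }
}

fn main() {
    let mut gate = CleanupGate::new();
    assert!(gate.should_cleanup(150_000));  // over threshold, no prior sweep
    assert!(!gate.should_cleanup(150_000)); // throttled: too soon after last sweep
    assert!(!gate.should_cleanup(50_000));  // under threshold
}
```

This bounds the worst case described above: with 100k+ recent entries, the sweep cost is amortized over the interval instead of being paid on every new connection.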
Problem
A client can be abusive and create connections too fast, overloading the server. Even though we have a per-connection limit, enforcing it involves heavier operations like taking a lock on the connection table and evicting other connections. Also, many different clients can collectively create too many connections too fast and overwhelm the server. This is observed especially around the time when the node becomes a leader.
Summary of Changes
Introduce a connection rate limiter:
Limit the connection rate from a single IP to 8/minute.
Limit the global connection rate to 2500/second -- 2500 is estimated from the default connection cache table size, which is generous.
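The two limits above compose as two checks in sequence: a cheap global fixed-window budget is consumed first, then the per-IP budget. The sketch below illustrates that ordering under stated assumptions; WindowLimiter and accept_connection are hypothetical names, not the PR's actual types.

```rust
use std::collections::HashMap;
use std::net::IpAddr;
use std::time::{Duration, Instant};

/// Simple fixed-window rate limiter: `limit` admissions per `window`.
struct WindowLimiter {
    limit: u64,
    window: Duration,
    window_start: Instant,
    count: u64,
}

impl WindowLimiter {
    fn new(limit: u64, window: Duration) -> Self {
        Self { limit, window, window_start: Instant::now(), count: 0 }
    }

    /// Check-and-consume: admits the request and counts it, or rejects.
    fn is_allowed(&mut self) -> bool {
        let now = Instant::now();
        if now.duration_since(self.window_start) >= self.window {
            self.window_start = now; // new window, reset the counter
            self.count = 0;
        }
        if self.count >= self.limit {
            return false;
        }
        self.count = self.count.saturating_add(1);
        true
    }
}

/// Global budget first (2500/s), then the per-IP budget (8/min).
fn accept_connection(
    global: &mut WindowLimiter,
    per_ip: &mut HashMap<IpAddr, WindowLimiter>,
    ip: IpAddr,
) -> bool {
    if !global.is_allowed() {
        return false;
    }
    per_ip
        .entry(ip)
        .or_insert_with(|| WindowLimiter::new(8, Duration::from_secs(60)))
        .is_allowed()
}

fn main() {
    let mut global = WindowLimiter::new(2500, Duration::from_secs(1));
    let mut per_ip = HashMap::new();
    let ip: IpAddr = "10.0.0.1".parse().unwrap();
    let admitted =
        (0..10).filter(|_| accept_connection(&mut global, &mut per_ip, ip)).count();
    assert_eq!(admitted, 8); // per-IP cap admits 8 of 10
}
```

Checking the global limiter first keeps the hot rejection path away from the per-IP map, which is the structure whose growth the cleanup-threshold discussion above is about.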
Fixes #