Connection rate limiting #948

Merged: 14 commits merged into anza-xyz:master on May 15, 2024

Conversation

lijunwangs

Problem

A client can be abusive and create connections fast enough to overload the server. Even though we have a per-connection limit, enforcing it involves heavier operations such as taking a lock on the connection table and evicting other connections. Also, many different clients can collectively create connections fast enough to overwhelm the server. This is observed especially around the time when the node becomes a leader.

Summary of Changes

Introduce a connection rate limiter.

Limit connection rates from a single IP to 8 per minute.

Limit the global connection rate to 2500 per second; 2500 is estimated from the default connection cache table size, which is generous.
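
A minimal sketch of how these two checks might be applied on the accept path is below. The function name and the limiter callbacks are illustrative, not the PR's exact code; only the default values and the check ordering come from this PR and its review discussion.

```rust
use std::net::IpAddr;

pub const DEFAULT_MAX_CONNECTIONS_PER_IPADDR_PER_MINUTE: u64 = 8;
pub const TOTAL_CONNECTIONS_PER_SECOND: u64 = 2500;

/// `global_is_allowed` and `ip_is_allowed` stand in for whatever rate-limiter
/// implementation backs the two limits (governor in early revisions of this
/// PR, a simple in-house limiter in the merged version).
fn should_accept_connection(
    remote: IpAddr,
    global_is_allowed: impl FnOnce() -> bool,
    ip_is_allowed: impl FnOnce(IpAddr) -> bool,
) -> bool {
    // Global check first: at most 2500 new connections per second across all
    // peers (i.e. at most 150,000 per minute).
    if !global_is_allowed() {
        return false;
    }
    // Per-IP check: at most 8 new connections per minute from one address.
    ip_is_allowed(remote)
}
```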

Fixes #

@lijunwangs force-pushed the connection_rate_limiting branch from 2ff4009 to 73170a8 on April 22, 2024 01:12

codecov-commenter commented Apr 22, 2024

Codecov Report

Attention: Patch coverage is 86.04651%, with 24 lines in your changes missing coverage. Please review.

Project coverage is 82.1%. Comparing base (aa2f078) to head (72c53b3).
Report is 109 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff            @@
##           master     #948     +/-   ##
=========================================
- Coverage    82.1%    82.1%   -0.1%     
=========================================
  Files         886      889      +3     
  Lines      236417   236614    +197     
=========================================
+ Hits       194266   194407    +141     
- Misses      42151    42207     +56     

pgarg66 previously approved these changes Apr 23, 2024

@t-nelson left a comment


do we not intend to backport this change? +325 LOC with no tests? new dependency? doesn't seem like it

streamer/src/nonblocking/connection_rate_limiter.rs (outdated, resolved)
validator/src/cli.rs (resolved)
streamer/src/nonblocking/quic.rs (outdated, resolved)
streamer/src/nonblocking/connection_rate_limiter.rs (outdated, resolved)
Cargo.toml (outdated)
@@ -214,6 +214,7 @@ generic-array = { version = "0.14.7", default-features = false }
gethostname = "0.2.3"
getrandom = "0.2.10"
goauth = "0.13.1"
governor = "0.6.3"


introducing a new dependency will be a hard no on backport. is it really necessary? we already have the stream rate limiter logic that could be reused instead

lijunwangs (author)


I looked at our existing implementation and governor's. I think governor's in-memory store implementation is pretty simple and well abstracted, as evidenced by how simple it is to instantiate the limiter and just check it. Internally it is conscious of memory usage and efficiency; e.g., in the DirectRateLimiter case only an AtomicU64 is used for the state and no mutex is involved. Adapting the EMA code is more complicated. Secondly, I am not keen to reinvent the wheel for a tool downloaded millions of times and used by a couple of thousand projects.
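
For reference, a hedged sketch of how governor's limiters could be wired up for the two limits discussed here; this shows construction and checking only, and is not this PR's code (the merged version ultimately dropped governor).

```rust
use std::{net::IpAddr, num::NonZeroU32};

use governor::{Quota, RateLimiter};

fn demo(remote: IpAddr) {
    // Unkeyed ("direct") limiter for the global 2500/s budget; its state is a
    // single atomic integer internally, with no mutex.
    let global = RateLimiter::direct(Quota::per_second(NonZeroU32::new(2500).unwrap()));
    // Keyed limiter tracking each remote IP against the 8/minute budget.
    let per_ip = RateLimiter::keyed(Quota::per_minute(NonZeroU32::new(8).unwrap()));

    // check()/check_key() consume one cell when allowed and return Err otherwise.
    let allowed = global.check().is_ok() && per_ip.check_key(&remote).is_ok();
    println!("connection from {remote} allowed: {allowed}");
}
```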


you are missing the point. it's (probably) fine for master. we are not taking a new dependency to the stable branch

lijunwangs (author)


This should get in early and get tested on testnet. It has been tested on 3gG for some time.

@ryleung-solana Apr 28, 2024


Why is it a hard no for backport? Agreed that this is preferred if possible, as it is the simplest.


it's not tested. how hard is this, guys? do we really all need first-hand experience crashing mb to understand the gravity of what we're doing here?

lijunwangs (author)


I am not proposing a direct backport to v1.17 without going through tests. In my opinion, every release to mb needs to go through tests, not just changes like these. Please review this PR based on master, which is what this PR is about.

lijunwangs (author)


I have changed the implementation to use a very simple rate limiter, without governor, for backport purposes.
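
For illustration, a minimal fixed-window sketch in the spirit of such a simple limiter; the type, field, and method names here are placeholders rather than the merged code.

```rust
use std::time::{Duration, Instant};

/// Allows at most `limit` requests per `interval` (fixed window).
pub struct SimpleRateLimiter {
    limit: u64,
    interval: Duration,
    window_start: Instant,
    count: u64,
}

impl SimpleRateLimiter {
    pub fn new(limit: u64, interval: Duration) -> Self {
        Self {
            limit,
            interval,
            window_start: Instant::now(),
            count: 0,
        }
    }

    /// Returns true and records the request if it fits in the current window.
    /// Note: this both checks and consumes, so call it once per request.
    pub fn is_allowed(&mut self) -> bool {
        let now = Instant::now();
        // Roll over to a new window once the previous one has elapsed.
        if now.duration_since(self.window_start) >= self.interval {
            self.window_start = now;
            self.count = 0;
        }
        if self.count >= self.limit {
            return false;
        }
        self.count = self.count.saturating_add(1);
        true
    }
}
```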


pgarg66 commented Apr 26, 2024

LGTM. I can approve if comments from @t-nelson have been addressed.

ryleung-solana previously approved these changes Apr 28, 2024

@ryleung-solana left a comment


LGTM, but would like @t-nelson's updated opinion too.

@lijunwangs force-pushed the connection_rate_limiting branch from 0f27864 to 57627ae on May 6, 2024 22:30
@lijunwangs (author)

@t-nelson @pgarg66 @ryleung-solana Need your help to get this moving forward.

pub const DEFAULT_MAX_CONNECTIONS_PER_IPADDR_PER_MINUTE: u64 = 8;
const TOTAL_CONNECTIONS_PER_SECOND: u64 = 2500;

const CONNECITON_RATE_LIMITER_CLEANUP_THRESHOLD: usize = 100_000;


This could probably be named better. It's hard to understand its use by its name.

lijunwangs (author)


will rename

return false;
}

self.count = self.count.saturating_add(1);


This seems like a bad place to increment the count. It expects the user of the API not to call is_allowed() multiple times for the same connection/stream.

lijunwangs (author)


Not sure I understand your concern. When a request is allowed to go through, we increment the count within that throttle window. I think our stream throttling did the same thing, and so does governor. The difference here is just that the counter is stored inside the rate limiter.



The function is named is_allowed(). Generally, such functions can be called any number of times yielding the same results. In this case, if it's called repeatedly, self.count will exceed self.limit, returning a different result.

We should move the count update out to a different function. Or, rename it and document it better. It seems error prone to me in its current form.

lijunwangs (author)


I will update the documentation. This function is not supposed to be run multiple times for the same request.
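
Building on the SimpleRateLimiter sketch above, one hedged way to realize the split suggested here; method names are illustrative only, and the merged code keeps a single, documented is_allowed() instead.

```rust
impl SimpleRateLimiter {
    /// Idempotent query: reports whether one more request would fit in the
    /// current window, without recording anything. (Window rollover handling
    /// is omitted here for brevity.)
    pub fn would_allow(&self) -> bool {
        self.count < self.limit
    }

    /// Records one accepted request; callers invoke this exactly once per
    /// request that passed would_allow().
    pub fn record(&mut self) {
        self.count = self.count.saturating_add(1);
    }
}
```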

@lijunwangs merged commit f54c120 into anza-xyz:master on May 15, 2024
49 checks passed

mergify bot commented May 15, 2024

Backports to the stable branch are to be avoided unless absolutely necessary for fixing bugs, security issues, and perf regressions. Changes intended for backport should be structured such that a minimum effective diff can be committed separately from any refactoring, plumbing, cleanup, etc that are not strictly necessary to achieve the goal. Any of the latter should go only into master and ride the normal stabilization schedule.


mergify bot commented May 15, 2024

Backports to the beta branch are to be avoided unless absolutely necessary for fixing bugs, security issues, and perf regressions. Changes intended for backport should be structured such that a minimum effective diff can be committed separately from any refactoring, plumbing, cleanup, etc that are not strictly necessary to achieve the goal. Any of the latter should go only into master and ride the normal stabilization schedule. Exceptions include CI/metrics changes, CLI improvements and documentation updates on a case by case basis.

mergify bot pushed a commit that referenced this pull request May 15, 2024
* use rate limit on connectings

use rate limit on connectings; missing file

* Change connection rate limit to 8/min instead of 4/s

* Addressed some feedback from Trent

* removed some comments

* fix test failures which are opening connections more frequently

* moved the flag up

* turn off rate limiting to debug CI

* Fix CI test failures

* differentiate of the two throttling cases in stats: across connections or per ip addr

* fmt issues

* Addressed some feedback from Trent

* Added unit tests

Cleanup connection cache rate limiter if exceeding certain threshold

missing files

CONNECITON_RATE_LIMITER_CLEANUP_THRESHOLD to 100_000

clippy issue

clippy issue

sort crates

* revert Cargo.lock changes

* Addressed some feedback from Pankaj

(cherry picked from commit f54c120)

# Conflicts:
#	client/src/connection_cache.rs
#	quic-client/tests/quic_client.rs
#	streamer/src/nonblocking/quic.rs
#	streamer/src/quic.rs
#	validator/src/cli.rs
mergify bot pushed a commit that referenced this pull request May 15, 2024
* use rate limit on connectings

use rate limit on connectings; missing file

* Change connection rate limit to 8/min instead of 4/s

* Addressed some feedback from Trent

* removed some comments

* fix test failures which are opening connections more frequently

* moved the flag up

* turn off rate limiting to debug CI

* Fix CI test failures

* differentiate of the two throttling cases in stats: across connections or per ip addr

* fmt issues

* Addressed some feedback from Trent

* Added unit tests

Cleanup connection cache rate limiter if exceeding certain threshold

missing files

CONNECITON_RATE_LIMITER_CLEANUP_THRESHOLD to 100_000

clippy issue

clippy issue

sort crates

* revert Cargo.lock changes

* Addressed some feedback from Pankaj

(cherry picked from commit f54c120)

# Conflicts:
#	streamer/src/quic.rs
#	validator/src/cli.rs
continue;
}

if rate_limiter.len() > CONNECITON_RATE_LIMITER_CLEANUP_SIZE_THRESHOLD {


Wouldn't this check cause a lot of trouble in the worst case, when we have more than the threshold number of entries? Every new connection would then trigger an iteration over 100k+ entries.
Maybe retain_recent should not be called more often than every 10 ms, for example?


Am I right that we might be in a situation where we have > 100k IP addresses in the map and all of these connections are recent (< 60s), so we would go over this dashmap over and over without removing enough entries to get below 100k? Or is this situation only hypothetical, because having ~100k IP addresses in the last minute is an almost impossible event?



Am I right that we might be in a situation where we have > 100k IP addresses in the map and all of these connections are recent (< 60s), so we would go over this dashmap over and over without removing enough entries to get below 100k?

Yes, this looks correct, except that we also have a check above that does not allow more than 150k connections per 60s (based on TOTAL_CONNECTIONS_PER_SECOND = 2500).

Or is this situation only hypothetical, because having ~100k IP addresses in the last minute is an almost impossible event?

It's not necessary to have 100k IPs; the node allows multiple connections from one IP, 8 iirc? Still sounds hard, but possible?
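
One hedged sketch of the mitigation suggested above: sweep the per-IP map only when it is both over the size threshold and enough time has passed since the last sweep. The constant values and the retain_recent name follow this discussion; the merged code may differ.

```rust
use std::time::{Duration, Instant};

const CLEANUP_SIZE_THRESHOLD: usize = 100_000;
const CLEANUP_MIN_INTERVAL: Duration = Duration::from_millis(10);

struct CleanupState {
    last_cleanup: Instant,
}

impl CleanupState {
    /// `map_len` is the current number of tracked IPs; `retain_recent` is the
    /// caller-supplied sweep that drops entries older than the 60 s window.
    fn maybe_cleanup(&mut self, map_len: usize, retain_recent: impl FnOnce()) {
        let now = Instant::now();
        if map_len > CLEANUP_SIZE_THRESHOLD
            && now.duration_since(self.last_cleanup) >= CLEANUP_MIN_INTERVAL
        {
            retain_recent();
            self.last_cleanup = now;
        }
    }
}
```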

7 participants