Connection rate limiting #948
Conversation
Codecov Report. Attention: Patch coverage is
Additional details and impacted files:
@@            Coverage Diff            @@
##            master     #948   +/-   ##
=========================================
- Coverage     82.1%    82.1%   -0.1%
=========================================
  Files          886      889      +3
  Lines       236417   236614    +197
=========================================
+ Hits        194266   194407    +141
- Misses       42151    42207     +56
do we not intend to backport this change? +325 LOC with no tests? a new dependency? doesn't seem like it
Cargo.toml (Outdated)
@@ -214,6 +214,7 @@
 generic-array = { version = "0.14.7", default-features = false }
 gethostname = "0.2.3"
 getrandom = "0.2.10"
 goauth = "0.13.1"
+governor = "0.6.3"
introducing a new dependency will be a hard no on backport. is it really necessary? we already have the stream rate limiter logic that could be reused instead
I looked at our existing implementation and governor's. Governor's in-memory store implementation is pretty simple and well abstracted, as evidenced by how simple it is to instantiate the limiter and just check. Internally it is conscious of memory usage and efficiency, e.g. in the DirectRateLimiter case only an AtomicU64 is used for the state and no mutex is involved. Adapting the EMA code would be more complicated. Secondly, I am not keen to reinvent the wheel when this crate has been downloaded millions of times and is used by a couple of thousand projects.
you are missing the point. it's (probably) fine for master. we are not taking a new dependency to the stable branch
This should get in early and get tested on testnet. It has been tested in 3gG for some time.
Why is it a hard no for backport? Agreed that this is preferred if possible, as it is the simplest.
it's not tested. how hard is this, guys? do we really all need first-hand experience crashing mb to understand the gravity of what we're doing here?
I am not proposing a direct backport to v1.17 without going through tests. In my opinion every release to mb needs to go through thorough tests -- not just changes like these. Please review this PR based on master, which is what this PR is about.
I have changed the implementation to use a very simple rate limiter without using governor, for backport purposes.
LGTM. I can approve if comments from @t-nelson have been addressed.
LGTM, but would like @t-nelson's updated opinion too
@t-nelson @pgarg66 @ryleung-solana Need your help to get this moving forward.
streamer/src/nonblocking/quic.rs (Outdated)
pub const DEFAULT_MAX_CONNECTIONS_PER_IPADDR_PER_MINUTE: u64 = 8;
const TOTAL_CONNECTIONS_PER_SECOND: u64 = 2500;

const CONNECITON_RATE_LIMITER_CLEANUP_THRESHOLD: usize = 100_000;
This could probably be named better. It's hard to understand its use by its name.
will rename
    return false;
}

self.count = self.count.saturating_add(1);
This seems a bad place to increment the count. It's expecting the user of the API to not call is_allowed() multiple times for the same connection/stream.
Not sure I understand your concern. When a request is allowed to go through, we increment the count within that throttle window. I think our stream throttling did the same thing, and so does governor. The difference here is that the counter is stored inside the rate limiter.
The function is named is_allowed(). Generally, such functions can be called any number of times yielding the same results. In this case, if it's called repeatedly, self.count will exceed self.limit, returning a different result.
We should move the count update out to a different function. Or, rename it and document it better. It seems error prone to me in its current form.
I will update the documentation. This function is not supposed to be run multiple times for the same request.
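One way to resolve the naming concern discussed above is to split the idempotent query from the state update, so that is_allowed() can be called any number of times without consuming budget. The sketch below is illustrative only, not the PR's actual code; the names WindowLimiter and record are hypothetical.

```rust
use std::time::{Duration, Instant};

/// Fixed-window limiter with the check and the update separated.
struct WindowLimiter {
    limit: u64,
    window: Duration,
    window_start: Instant,
    count: u64,
}

impl WindowLimiter {
    fn new(limit: u64, window: Duration) -> Self {
        Self { limit, window, window_start: Instant::now(), count: 0 }
    }

    /// Idempotent: reports whether one more request would be admitted,
    /// without mutating any state.
    fn is_allowed(&self) -> bool {
        // An expired window counts as empty.
        if Instant::now().duration_since(self.window_start) >= self.window {
            return self.limit > 0;
        }
        self.count < self.limit
    }

    /// Consumes one unit of budget; call exactly once per admitted request.
    fn record(&mut self) {
        let now = Instant::now();
        if now.duration_since(self.window_start) >= self.window {
            // Start a fresh window before counting this request.
            self.window_start = now;
            self.count = 0;
        }
        self.count = self.count.saturating_add(1);
    }
}

fn main() {
    let mut limiter = WindowLimiter::new(2, Duration::from_secs(60));
    assert!(limiter.is_allowed());
    assert!(limiter.is_allowed()); // repeated queries don't consume budget
    limiter.record();
    limiter.record();
    assert!(!limiter.is_allowed()); // budget exhausted for this window
}
```

The cost of this design is that admission becomes two calls instead of one, so callers that race on the same limiter would need external synchronization; a combined check-and-consume method avoids that at the price of the misleading name.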
Backports to the stable branch are to be avoided unless absolutely necessary for fixing bugs, security issues, and perf regressions. Changes intended for backport should be structured such that a minimum effective diff can be committed separately from any refactoring, plumbing, cleanup, etc that are not strictly necessary to achieve the goal. Any of the latter should go only into master and ride the normal stabilization schedule.
Backports to the beta branch are to be avoided unless absolutely necessary for fixing bugs, security issues, and perf regressions. Changes intended for backport should be structured such that a minimum effective diff can be committed separately from any refactoring, plumbing, cleanup, etc that are not strictly necessary to achieve the goal. Any of the latter should go only into master and ride the normal stabilization schedule. Exceptions include CI/metrics changes, CLI improvements and documentation updates on a case by case basis.
* use rate limit on connectings; missing file
* Change connection rate limit to 8/min instead of 4/s
* Addressed some feedback from Trent
* removed some comments
* fix test failures which are opening connections more frequently
* moved the flag up
* turn off rate limiting to debug CI
* Fix CI test failures
* differentiate the two throttling cases in stats: across connections or per ip addr
* fmt issues
* Addressed some feedback from Trent
* Added unit tests; cleanup connection cache rate limiter if exceeding certain threshold; CONNECITON_RATE_LIMITER_CLEANUP_THRESHOLD to 100_000; clippy issues; sort crates
* revert Cargo.lock changes
* Addressed some feedback from Pankaj

(cherry picked from commit f54c120)

Conflicts:
client/src/connection_cache.rs
quic-client/tests/quic_client.rs
streamer/src/nonblocking/quic.rs
streamer/src/quic.rs
validator/src/cli.rs
* use rate limit on connectings; missing file
* Change connection rate limit to 8/min instead of 4/s
* Addressed some feedback from Trent
* removed some comments
* fix test failures which are opening connections more frequently
* moved the flag up
* turn off rate limiting to debug CI
* Fix CI test failures
* differentiate the two throttling cases in stats: across connections or per ip addr
* fmt issues
* Addressed some feedback from Trent
* Added unit tests; cleanup connection cache rate limiter if exceeding certain threshold; CONNECITON_RATE_LIMITER_CLEANUP_THRESHOLD to 100_000; clippy issues; sort crates
* revert Cargo.lock changes
* Addressed some feedback from Pankaj

(cherry picked from commit f54c120)

Conflicts:
streamer/src/quic.rs
validator/src/cli.rs
    continue;
}

if rate_limiter.len() > CONNECITON_RATE_LIMITER_CLEANUP_SIZE_THRESHOLD {
Would this check not cause a lot of trouble in the worst case, when we have more than the threshold number of entries? Every connection would cause an iteration over 100k entries. Maybe retain_recent should not be called more often than every 10ms, for example?
Am I right that we might be in a situation where we have > 100k IP addresses in the map and all of these connections are recent (<60s), so we will go over this dashmap over and over without removing enough connections to get below 100k? Or is this situation only hypothetical, because having ~100k IP addresses in the last minute is an almost impossible event?
"Am I right that we might be in a situation where we have > 100k IP addresses in the map and all of these connections are recent (<60s), so we will go over this dashmap over and over without removing enough connections to get below 100k?"
Yes, this looks correct, except that we also have a check above that does not allow more than 150k connections per 60s (based on TOTAL_CONNECTIONS_PER_SECOND = 2500).
"Or is this situation only hypothetical, because having ~100k IP addresses in the last minute is an almost impossible event?"
Not necessarily 100k IPs; the node allows multiple connections from one IP, 8 iirc? Still sounds hard, but possible?
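The fix the reviewer suggests can be sketched as a time gate in front of the expensive sweep: even if the map stays above the size threshold, a full pass over it runs at most once per interval. This is a minimal illustrative sketch, not the PR's code; the names CleanupGate and should_cleanup, and the 10ms interval, are assumptions taken from the comment above.

```rust
use std::time::{Duration, Instant};

const CLEANUP_SIZE_THRESHOLD: usize = 100_000;
// Minimum spacing between full-map sweeps, per the reviewer's suggestion.
const CLEANUP_INTERVAL: Duration = Duration::from_millis(10);

/// Gates a retain_recent-style sweep so it cannot run on every connection.
struct CleanupGate {
    last_cleanup: Option<Instant>,
}

impl CleanupGate {
    fn new() -> Self {
        Self { last_cleanup: None }
    }

    /// Returns true when a sweep should run now: the map is over the
    /// size threshold AND at least CLEANUP_INTERVAL has elapsed since
    /// the previous sweep (or no sweep has run yet).
    fn should_cleanup(&mut self, map_len: usize) -> bool {
        let now = Instant::now();
        let interval_elapsed = self
            .last_cleanup
            .map_or(true, |t| now.duration_since(t) >= CLEANUP_INTERVAL);
        if map_len > CLEANUP_SIZE_THRESHOLD && interval_elapsed {
            self.last_cleanup = Some(now);
            return true;
        }
        false
    }
}

fn main() {
    let mut gate = CleanupGate::new();
    assert!(gate.should_cleanup(150_000));  // over threshold, no prior sweep
    assert!(!gate.should_cleanup(150_000)); // throttled: too soon after last sweep
    assert!(!gate.should_cleanup(50_000));  // under threshold
}
```

This bounds the worst case described above: with 100k+ recent entries, the sweep cost is amortized over the interval instead of being paid on every new connection.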
Problem
A client can be abusive and create connections too fast, overloading the server. Even though we have a per-connection limit, enforcing it involves heavier operations like taking a lock on the connection table and evicting other connections. Also, many different clients can collectively create too many connections too fast and overwhelm the server. This is observed especially around the time when the node becomes a leader.
Summary of Changes
Introduce a connection rate limiter:
Limit the connection rate from a single IP to 8/minute.
Limit the global connection rate to 2500/second -- 2500 is estimated from the default connection cache table size, which is generous.
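The two limits above compose as two checks in sequence: a cheap global fixed-window budget is consumed first, then the per-IP budget. The sketch below illustrates that ordering under stated assumptions; WindowLimiter and accept_connection are hypothetical names, not the PR's actual types.

```rust
use std::collections::HashMap;
use std::net::IpAddr;
use std::time::{Duration, Instant};

/// Simple fixed-window rate limiter: `limit` admissions per `window`.
struct WindowLimiter {
    limit: u64,
    window: Duration,
    window_start: Instant,
    count: u64,
}

impl WindowLimiter {
    fn new(limit: u64, window: Duration) -> Self {
        Self { limit, window, window_start: Instant::now(), count: 0 }
    }

    /// Check-and-consume: admits the request and counts it, or rejects.
    fn is_allowed(&mut self) -> bool {
        let now = Instant::now();
        if now.duration_since(self.window_start) >= self.window {
            self.window_start = now; // new window, reset the counter
            self.count = 0;
        }
        if self.count >= self.limit {
            return false;
        }
        self.count = self.count.saturating_add(1);
        true
    }
}

/// Global budget first (2500/s), then the per-IP budget (8/min).
fn accept_connection(
    global: &mut WindowLimiter,
    per_ip: &mut HashMap<IpAddr, WindowLimiter>,
    ip: IpAddr,
) -> bool {
    if !global.is_allowed() {
        return false;
    }
    per_ip
        .entry(ip)
        .or_insert_with(|| WindowLimiter::new(8, Duration::from_secs(60)))
        .is_allowed()
}

fn main() {
    let mut global = WindowLimiter::new(2500, Duration::from_secs(1));
    let mut per_ip = HashMap::new();
    let ip: IpAddr = "10.0.0.1".parse().unwrap();
    let admitted =
        (0..10).filter(|_| accept_connection(&mut global, &mut per_ip, ip)).count();
    assert_eq!(admitted, 8); // per-IP cap admits 8 of 10
}
```

Checking the global limiter first keeps the hot rejection path away from the per-IP map, which is the structure whose growth the cleanup-threshold discussion above is about.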
Fixes #