sleep instead of drop when stream rate exceeded limit; #939
Conversation
Backports to the stable branch are to be avoided unless absolutely necessary for fixing bugs, security issues, and perf regressions. Changes intended for backport should be structured such that a minimum effective diff can be committed separately from any refactoring, plumbing, cleanup, etc. that are not strictly necessary to achieve the goal. Any of the latter should go only into master and ride the normal stabilization schedule.
Backports to the beta branch are to be avoided unless absolutely necessary for fixing bugs, security issues, and perf regressions. Changes intended for backport should be structured such that a minimum effective diff can be committed separately from any refactoring, plumbing, cleanup, etc. that are not strictly necessary to achieve the goal. Any of the latter should go only into master and ride the normal stabilization schedule. Exceptions include CI/metrics changes, CLI improvements and documentation updates on a case-by-case basis.
lgtm, but someone else should review since I wrote some of the original code
pub(crate) fn reset_throttling_params_if_needed(&self) -> tokio::time::Instant {
    let last_throttling_instant = *self.last_throttling_instant.read().unwrap();
    if tokio::time::Instant::now().duration_since(last_throttling_instant)
        > STREAM_THROTTLING_INTERVAL
    {
        let mut last_throttling_instant = self.last_throttling_instant.write().unwrap();
It was already the case before this PR, so it doesn't block this PR, but this is
racy: we're doing a check with a read lock, then dropping it and acquiring a write
lock. Multiple threads can find that they're past STREAM_THROTTLING_INTERVAL while
holding the read lock and then each update the last throttling instant.
I think that this connection-wide limit can be enforced without locks. Every connection task keeps
its last_throttling_instant and stream_count. The max streams allowed per connection is scaled by
the number of connections opened by a peer, which can be passed as an atomic to each connection
task.
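A rough sketch of what that lock-free shape could look like. All names here (`ConnectionThrottle`, `peer_connection_count`, the 100ms interval value) are illustrative assumptions, not existing agave types:

```rust
use std::sync::{
    atomic::{AtomicUsize, Ordering},
    Arc,
};
use std::time::Duration;
use tokio::time::Instant;

// Illustrative value, not the actual agave constant.
const STREAM_THROTTLING_INTERVAL: Duration = Duration::from_millis(100);

struct ConnectionThrottle {
    // Owned by a single connection task, so no locking is needed.
    last_throttling_instant: Instant,
    stream_count: u64,
    // Shared with the connection table; updated when the peer opens or closes connections.
    peer_connection_count: Arc<AtomicUsize>,
    // Per-peer stream budget for one throttling interval.
    max_streams_per_peer: u64,
}

impl ConnectionThrottle {
    // Scale the per-peer budget by how many connections the peer has open.
    fn max_streams_for_interval(&self) -> u64 {
        let connections = self.peer_connection_count.load(Ordering::Relaxed).max(1) as u64;
        (self.max_streams_per_peer / connections).max(1)
    }

    // Reset the per-task counters when a new throttling interval starts.
    fn reset_if_needed(&mut self) {
        if self.last_throttling_instant.elapsed() > STREAM_THROTTLING_INTERVAL {
            self.last_throttling_instant = Instant::now();
            self.stream_count = 0;
        }
    }

    // Record a new stream; returns false if the per-interval budget is exceeded.
    fn on_new_stream(&mut self) -> bool {
        self.reset_if_needed();
        self.stream_count += 1;
        self.stream_count <= self.max_streams_for_interval()
    }
}
```

Since each connection task owns its own counters, only the connection count needs to be shared, and a relaxed atomic load is enough for that.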
I created a separate issue to address this, as it is orthogonal to this PR.
It should not be racy, as we check again for the timeout after acquiring the write lock. Only one thread will acquire the write lock at a time, and the first one to acquire it will update the last_throttling_instant.
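For reference, a minimal sketch of the double-checked pattern being described, assuming a `RwLock<tokio::time::Instant>` field like the one in the diff (not the exact agave code; the struct name and interval value are placeholders):

```rust
use std::sync::RwLock;
use std::time::Duration;
use tokio::time::Instant;

// Illustrative value, not the actual agave constant.
const STREAM_THROTTLING_INTERVAL: Duration = Duration::from_millis(100);

struct ThrottleState {
    last_throttling_instant: RwLock<Instant>,
}

impl ThrottleState {
    fn reset_throttling_params_if_needed(&self) -> Instant {
        // Fast path: most calls only need the read lock.
        let last = *self.last_throttling_instant.read().unwrap();
        if Instant::now().duration_since(last) > STREAM_THROTTLING_INTERVAL {
            let mut last = self.last_throttling_instant.write().unwrap();
            // Re-check under the write lock: another thread may have already
            // reset the interval between us dropping the read lock and
            // acquiring the write lock, so only the first writer resets it.
            if Instant::now().duration_since(*last) > STREAM_THROTTLING_INTERVAL {
                *last = Instant::now();
            }
            *last
        } else {
            last
        }
    }
}
```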
Codecov Report
Attention: Patch coverage is
Additional details and impacted files

@@            Coverage Diff             @@
##           master     #939      +/-  ##
=========================================
- Coverage    81.9%    81.9%     -0.1%
=========================================
  Files         853      853
  Lines      231955   231963        +8
=========================================
+ Hits       189984   189990        +6
- Misses      41971    41973        +2
streamer/src/nonblocking/quic.rs (outdated)
// left of this read interval so the peer backs off.
let throttle_duration = STREAM_THROTTLING_INTERVAL
    .saturating_sub(throttle_interval_start.elapsed())
    .saturating_sub(connection.rtt());
Can we explain in a comment why we are subtracting connection.rtt()? It will help with future maintainability of the code.
The comment could say something like:
// ... so the peer backs off.
//
// We subtract `connection.rtt()` from the time we sleep to minimize drift. That's the estimated amount
// of time needed for us to send credits back to the peer, and for the peer to send the next chunk of
// data, so that it falls near the start of the next interval window.
What's the average value for rtt? Wondering if this will reduce the throttling window enough that throttling never happens.
It's per connection and the quic protocol describes how to compute and update it. It's meant to be a good estimate of the actual rtt and is used internally by quinn to piggyback control frames to minimize overhead. It can only be way off if there's a bug in quinn, but at that point, many other things wouldn't work as intended anyway.
Thinking more about this, I think the sleep value should be the amount of time that makes this packet itself satisfy the rate limit, and I do not think we need to consider the RTT, which is about the next packet. Sleeping for STREAM_THROTTLING_INTERVAL minus the elapsed part of the interval is also questionable; that may actually be too large. From the math we should be able to figure out the amount of time to sleep to satisfy the rate limit requirement.
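One way to make the "sleep just enough to satisfy the rate limit" idea concrete. `streams_received`, `max_streams_per_interval`, and the 100ms interval value are hypothetical inputs used only for illustration, not actual agave fields:

```rust
use std::time::Duration;
use tokio::time::Instant;

// Illustrative value, not the actual agave constant.
const STREAM_THROTTLING_INTERVAL: Duration = Duration::from_millis(100);

// Smallest sleep such that accepting this stream keeps the observed rate at or
// below `max_streams_per_interval` streams per STREAM_THROTTLING_INTERVAL:
// we need elapsed >= streams_received * interval / max_streams_per_interval.
fn min_sleep_to_satisfy_rate_limit(
    interval_start: Instant,
    streams_received: u64,
    max_streams_per_interval: u64,
) -> Duration {
    let target = STREAM_THROTTLING_INTERVAL
        .mul_f64(streams_received as f64 / max_streams_per_interval.max(1) as f64);
    target.saturating_sub(interval_start.elapsed())
}
```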
The consideration of RTT makes sense only if you assume the streams are sent sequentially. That is not true: the client can send streams simultaneously. While we think it will take an RTT to get the next stream, the next stream might have already arrived. I think the logic is simpler if we just consider this stream, the one causing the violation, rather than optimizing for the next stream, which we do not know when it might arrive.
For a well-behaved client it would be OK. I am wondering if a malicious client could implement/hack a custom quic client that causes a high rtt but also has a high stream count. A simple mitigation would be to cap the rtt value used in this calculation.

I'll check the spec, but yeah, you're right: in principle a malicious client could intentionally game the rtt to make its connection more bursty. Not sure what a good cap for rtt could be.
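A cap could be as simple as clamping the value before it enters the calculation; the 100ms constant below is a placeholder, not a tuned recommendation:

```rust
use std::time::Duration;

// Hypothetical cap on the rtt used for throttling; 100ms is a placeholder.
const MAX_THROTTLE_RTT: Duration = Duration::from_millis(100);

fn capped_rtt(rtt: Duration) -> Duration {
    rtt.min(MAX_THROTTLE_RTT)
}
```

The earlier snippet would then subtract `capped_rtt(connection.rtt())` instead of the raw value.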
> While we think it will take a RTT to get the next stream, the next stream might have already arrived

No, because by sleeping, if the peer keeps sending, the receive window will fill up, which is the whole point of sleeping. So the next stream the peer sends will come after our sleeping time. If in the meantime there are queued streams within the receive window, that's fine.
> Not sure what a good cap for rtt could be.

I say, for the purpose of landing this soon, let's remove rtt as a variable. Then we can play with taking rtt into consideration for computing the receive window and/or the throttle time, and we can let that ride the trains.
done
LGTM. It has a merge conflict right now.
469b738 to 112af05 (Compare)
Consider connection count of staked nodes when calculating allowed PPS
remove rtt from throttle_duration calculation
removed connection count in StreamerCounter -- we do not need it at this point
ea3008a to 28469a9 (Compare)
I have rebased onto master and addressed the conflicts. @pgarg66
no blockers here, right?
* sleep instead of drop when stream rate exceeded limit;
  Consider connection count of staked nodes when calculating allowed PPS
  remove rtt from throttle_duration calculation
  removed connection count in StreamerCounter -- we do not need it at this point
* remove connection count related changes -- they are unrelated to this PR
* revert unintended changes

(cherry picked from commit 137a982)

# Conflicts:
#   streamer/src/nonblocking/quic.rs
#   streamer/src/nonblocking/stream_throttle.rs
* sleep instead of drop when stream rate exceeded limit;
  Consider connection count of staked nodes when calculating allowed PPS
  remove rtt from throttle_duration calculation
  removed connection count in StreamerCounter -- we do not need it at this point
* remove connection count related changes -- they are unrelated to this PR
* revert unintended changes

(cherry picked from commit 137a982)
…rt of #939) (#990)
sleep instead of drop when stream rate exceeded limit; (#939)

* sleep instead of drop when stream rate exceeded limit;
  Consider connection count of staked nodes when calculating allowed PPS
  remove rtt from throttle_duration calculation
  removed connection count in StreamerCounter -- we do not need it at this point
* remove connection count related changes -- they are unrelated to this PR
* revert unintended changes

(cherry picked from commit 137a982)
Co-authored-by: Lijun Wang <[email protected]>
…rt of anza-xyz#939) (anza-xyz#990)
sleep instead of drop when stream rate exceeded limit; (anza-xyz#939)

* sleep instead of drop when stream rate exceeded limit;
  Consider connection count of staked nodes when calculating allowed PPS
  remove rtt from throttle_duration calculation
  removed connection count in StreamerCounter -- we do not need it at this point
* remove connection count related changes -- they are unrelated to this PR
* revert unintended changes

(cherry picked from commit 137a982)
Co-authored-by: Lijun Wang <[email protected]>
Problem
Currently we drop the stream right away when the stream rate limit is exceeded. The problem is that the client can retry right away. In addition, the stream might already have been received and could be used, so dropping it wastes resources.
Summary of Changes
Change to sleep instead of drop. Thanks to @alessandrod -- we sleep based on the estimated time that needs to elapse to satisfy the rate limit, taking the connection's RTT into account.
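A rough sketch of the behavior change, with hypothetical helper and parameter names (the real logic lives in streamer/src/nonblocking/quic.rs, and per the review above the rtt term was ultimately dropped from the throttle calculation):

```rust
use std::time::Duration;
use tokio::time::{sleep, Instant};

// Illustrative value, not the actual agave constant.
const STREAM_THROTTLING_INTERVAL: Duration = Duration::from_millis(100);

// If the peer exceeded its stream budget for the current interval, sleep for
// the rest of the interval instead of dropping the stream.
async fn maybe_throttle(
    throttle_interval_start: Instant,
    streams_in_interval: u64,
    max_streams_per_interval: u64,
) {
    if streams_in_interval > max_streams_per_interval {
        let throttle_duration =
            STREAM_THROTTLING_INTERVAL.saturating_sub(throttle_interval_start.elapsed());
        if !throttle_duration.is_zero() {
            sleep(throttle_duration).await;
        }
    }
}
```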
Fixes #