
sleep instead of drop when stream rate exceeded limit; #939

Merged

Conversation

lijunwangs

Problem

Currently we drop the stream right away when the stream rate limit is exceeded. The problem is that the client can retry immediately. In addition, the stream might already have been received and be usable, so dropping it wastes resources.

Summary of Changes

Change to sleep instead of drop. Thanks to @alessandrod -- we sleep based on the estimated time that needs to elapse to satisfy the rate limit, taking the connection's RTT into account.

Fixes #


mergify bot commented Apr 20, 2024

Backports to the stable branch are to be avoided unless absolutely necessary for fixing bugs, security issues, and perf regressions. Changes intended for backport should be structured such that a minimum effective diff can be committed separately from any refactoring, plumbing, cleanup, etc that are not strictly necessary to achieve the goal. Any of the latter should go only into master and ride the normal stabilization schedule.


mergify bot commented Apr 20, 2024

Backports to the beta branch are to be avoided unless absolutely necessary for fixing bugs, security issues, and perf regressions. Changes intended for backport should be structured such that a minimum effective diff can be committed separately from any refactoring, plumbing, cleanup, etc that are not strictly necessary to achieve the goal. Any of the latter should go only into master and ride the normal stabilization schedule. Exceptions include CI/metrics changes, CLI improvements and documentation updates on a case by case basis.

@alessandrod left a comment

lgtm, but someone else should review since I wrote some of the original code

pub(crate) fn reset_throttling_params_if_needed(&self) -> tokio::time::Instant {
    let last_throttling_instant = *self.last_throttling_instant.read().unwrap();
    if tokio::time::Instant::now().duration_since(last_throttling_instant)
        > STREAM_THROTTLING_INTERVAL
    {
        let mut last_throttling_instant = self.last_throttling_instant.write().unwrap();


It was already the case before this PR so it doesn't block this PR, but this is racy: we do a check with a read lock held, then drop it and acquire a write lock. Multiple threads can find that they're past STREAM_THROTTLING_INTERVAL with the read lock and each update the last throttling instant.

I think that this connection-wide limit can be enforced without locks. Every connection task keeps
its last_throttling_instant and stream_count. The max streams allowed per connection is scaled by
the number of connections opened by a peer, which can be passed as an atomic to each connection
task.
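
A minimal sketch of the lock-free scheme described above (the struct, field names, and interval value are assumptions for illustration, not code from this PR):

use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::time::Duration;
use tokio::time::Instant;

const STREAM_THROTTLING_INTERVAL: Duration = Duration::from_millis(100); // placeholder value

// Owned by a single connection task, so no lock is needed for the
// per-connection state; only the peer's connection count is shared.
struct ConnectionThrottle {
    last_throttling_instant: Instant,
    stream_count: u64,
    peer_connection_count: Arc<AtomicUsize>,
}

impl ConnectionThrottle {
    // Scale the per-peer stream budget by the number of open connections.
    fn max_streams_per_interval(&self, per_peer_limit: u64) -> u64 {
        let connections = self.peer_connection_count.load(Ordering::Relaxed).max(1) as u64;
        (per_peer_limit / connections).max(1)
    }

    // Reset the interval without locking: this task is the only writer.
    fn reset_if_needed(&mut self) {
        if self.last_throttling_instant.elapsed() > STREAM_THROTTLING_INTERVAL {
            self.last_throttling_instant = Instant::now();
            self.stream_count = 0;
        }
    }
}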

@lijunwangs (Author)
I created a separate issue to address this, as it is orthogonal to this PR.


@pgarg66 Apr 20, 2024


It should not be racy: we check the timeout again after acquiring the write lock. Only one thread can hold the write lock at a time, and the first one to acquire it will update last_throttling_instant.
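
For illustration, a sketch of the double-check pattern being described; the wrapper struct, interval value, and re-check body are assumptions, with only the names from the excerpt above reused:

use std::sync::RwLock;
use std::time::Duration;
use tokio::time::Instant;

const STREAM_THROTTLING_INTERVAL: Duration = Duration::from_millis(100); // placeholder value

struct ThrottleState {
    last_throttling_instant: RwLock<Instant>,
}

impl ThrottleState {
    fn reset_throttling_params_if_needed(&self) -> Instant {
        let last = *self.last_throttling_instant.read().unwrap();
        if last.elapsed() > STREAM_THROTTLING_INTERVAL {
            let mut last = self.last_throttling_instant.write().unwrap();
            // Re-check under the write lock: another thread may have reset
            // the instant between dropping the read lock and acquiring the
            // write lock, in which case we must not reset it again.
            if last.elapsed() > STREAM_THROTTLING_INTERVAL {
                *last = Instant::now();
            }
            *last
        } else {
            last
        }
    }
}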

@codecov-commenter

codecov-commenter commented Apr 20, 2024

Codecov Report

Attention: Patch coverage is 45.00000% with 11 lines in your changes missing coverage. Please review.

Project coverage is 81.9%. Comparing base (b745dc9) to head (28469a9).
Report is 7 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff            @@
##           master     #939     +/-   ##
=========================================
- Coverage    81.9%    81.9%   -0.1%     
=========================================
  Files         853      853             
  Lines      231955   231963      +8     
=========================================
+ Hits       189984   189990      +6     
- Misses      41971    41973      +2     

// left of this read interval so the peer backs off.
let throttle_duration = STREAM_THROTTLING_INTERVAL
    .saturating_sub(throttle_interval_start.elapsed())
    .saturating_sub(connection.rtt());
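
For context, a minimal sketch of how a duration computed this way could be applied, assuming the surrounding tokio task (the call site is not part of this excerpt):

// Sleep instead of dropping the stream; a zero duration means the
// rate limit is already satisfied.
if !throttle_duration.is_zero() {
    tokio::time::sleep(throttle_duration).await;
}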

Can we explain in a comment why we are subtracting connection.rtt()? It will help with future maintainability of the code.


The comment could say something like:

// ... so the peer backs off.
//
// We subtract `connection.rtt()` from the time we sleep to minimize drift. That's the estimated
// amount of time needed for us to send credits back to the peer, and for the peer to send the next
// chunk of data, so that it falls near the start of the next interval window.


What's the average value for rtt? I'm wondering if this will reduce the throttling window enough that throttling never happens.


It's per connection, and the QUIC protocol describes how to compute and update it. It's meant to be a good estimate of the actual RTT and is used internally by quinn to piggyback control frames to minimize overhead. It can only be way off if there's a bug in quinn, but at that point many other things wouldn't work as intended anyway.

@lijunwangs (Author)

Thinking more about this, I think the sleep duration should be just the amount of the interval that makes this packet itself satisfy the rate limit. I do not think we need to consider the RTT, which is about the next packet. Sleeping for STREAM_THROTTLING_INTERVAL minus the elapsed part of the interval is also questionable; that may actually be too large. From the math we should be able to figure out the amount of time to sleep to satisfy the rate limit requirement.
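
A hypothetical sketch of that math (all names and the formula are assumptions for illustration, not code from this PR): sleep just long enough that this stream itself fits within the allowed rate.

use std::time::Duration;

// With `max_streams_per_interval` streams allowed per interval, the
// earliest compliant time for the n-th stream is
// n * interval / max_streams_per_interval after the interval start;
// sleep for whatever portion of that has not yet elapsed.
fn required_sleep(
    streams_received: u32,
    max_streams_per_interval: u32,
    elapsed_in_interval: Duration,
    interval: Duration,
) -> Duration {
    let earliest_compliant = interval
        .mul_f64(streams_received as f64 / max_streams_per_interval.max(1) as f64);
    earliest_compliant.saturating_sub(elapsed_in_interval)
}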

@lijunwangs (Author)

The consideration of RTT makes sense only if you assume the streams are sent sequentially. That is not true: the client can send streams simultaneously. While we think it will take an RTT to get the next stream, the next stream might have already arrived. I think the logic is simpler if we just consider this stream, which is causing the violation, and skip the optimization for the next stream, which we do not know when it might come.


For a well-behaved client it would be OK. I am wondering if a malicious client could hack together a custom QUIC client that causes high rtt but also has a high stream count. A simple mitigation would be to cap the rtt value used in this calculation.

I'll check the spec, but yeah, you're right: in principle a malicious client could intentionally game the rtt to make its connection more bursty. Not sure what a good cap for rtt could be.
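
A minimal sketch of the cap being floated, mirroring the excerpt above (the cap value is an arbitrary placeholder):

const MAX_RTT_CREDIT: Duration = Duration::from_millis(50); // placeholder cap

let throttle_duration = STREAM_THROTTLING_INTERVAL
    .saturating_sub(throttle_interval_start.elapsed())
    .saturating_sub(connection.rtt().min(MAX_RTT_CREDIT));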


@lijunwangs

While we think it will take a RTT to get the next stream, the next stream might have already arrived

No, because if we sleep and the peer keeps sending, the receive window will fill up, which is the whole point of sleeping. So the next stream the peer sends will come after our sleep. If in the meantime there are queued streams within the receive window, that's fine.


Not sure what a good cap for rtt could be.

For the purpose of landing this soon, I say let's remove rtt as a variable. Then we can experiment with taking rtt into consideration for computing the receive window and/or the throttle time, and let that ride the trains.
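
With rtt removed as suggested, the calculation from the earlier excerpt would reduce to (a sketch mirroring that excerpt):

let throttle_duration =
    STREAM_THROTTLING_INTERVAL.saturating_sub(throttle_interval_start.elapsed());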

@lijunwangs (Author)

done

@pgarg66

pgarg66 commented Apr 22, 2024

LGTM. It has a merge conflict right now.

@lijunwangs force-pushed the throttle_client_by_sleep_not_drop branch from 469b738 to 112af05 on April 22, 2024 20:08
Consider connection count of staked nodes when calculating allowed PPS

remove rtt from throttle_duration calculation

removed connection count in StreamerCounter -- we do not need it at this point
@lijunwangs force-pushed the throttle_client_by_sleep_not_drop branch from ea3008a to 28469a9 on April 22, 2024 21:58
@lijunwangs (Author)

LGTM. It has a merge conflict right now.

I have rebased on master and addressed the conflicts. @pgarg66

@t-nelson

no blockers here, right?

@lijunwangs lijunwangs merged commit 137a982 into anza-xyz:master Apr 23, 2024
38 checks passed
mergify bot pushed a commit that referenced this pull request Apr 23, 2024
* sleep instead of drop when stream rate exceeded limit;

Consider connection count of staked nodes when calculating allowed PPS

remove rtt from throttle_duration calculation

removed connection count in StreamerCounter -- we do not need it at this point

* remove connection count related changes -- they are unrelated to this PR

* revert unintended changes

(cherry picked from commit 137a982)

# Conflicts:
#	streamer/src/nonblocking/quic.rs
#	streamer/src/nonblocking/stream_throttle.rs
mergify bot pushed a commit that referenced this pull request Apr 23, 2024

* sleep instead of drop when stream rate exceeded limit;

(cherry picked from commit 137a982)
lijunwangs added a commit that referenced this pull request Apr 23, 2024

…rt of #939) (#990)

sleep instead of drop when stream rate exceeded limit; (#939)

(cherry picked from commit 137a982)

Co-authored-by: Lijun Wang <[email protected]>
anwayde pushed a commit to firedancer-io/agave that referenced this pull request Jul 23, 2024

…rt of anza-xyz#939) (anza-xyz#990)

sleep instead of drop when stream rate exceeded limit; (anza-xyz#939)

(cherry picked from commit 137a982)

Co-authored-by: Lijun Wang <[email protected]>