
Various cleanup and fixes #302

Closed

wants to merge 1 commit

Conversation

edaniels
Member

Description

I've spent the past week debugging what's going on with throughput and stalling when sending a lot of small and large data through SCTP. I have draft work in my fork that adds the modified Nagle's algorithm as well as a missing sender-side implementation of Silly Window Syndrome avoidance (a MUST in the RFC). Both help considerably with throughput and stability, but I need to verify they don't cause any unexpected throughput regressions before moving forward. They're also hiding what I think may be a bug in this library that causes no progress to be made even while RTXs are happening, so I want to solve that first.
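
For context (this sketch is not part of this PR's diff), the sender-side SWS avoidance / Nagle-style rule from the RFC looks roughly like the following; the names and the full-size threshold are illustrative assumptions, not pion/sctp's API:

```go
package main

import "fmt"

const fullSizedChunk = 1200 // assumed usable DATA payload per packet

// shouldSendNow reports whether the sender ought to bundle and transmit now.
// pending: bytes queued locally but not yet sent; inFlight: sent but unacked;
// peerRwnd: the receiver's last advertised window.
func shouldSendNow(pending, inFlight, peerRwnd int) bool {
	avail := peerRwnd - inFlight
	if avail <= 0 || pending == 0 {
		return false
	}
	// Send if a full-sized chunk fits in the peer's window...
	if pending >= fullSizedChunk && avail >= fullSizedChunk {
		return true
	}
	// ...or if nothing is outstanding and this send drains the queue,
	// so tiny chunks are not dribbled out while data is already in flight.
	return inFlight == 0 && pending <= avail
}

func main() {
	fmt.Println(shouldSendNow(100, 5000, 64000)) // false: small write, data in flight
	fmt.Println(shouldSendNow(100, 0, 64000))    // true: nothing outstanding
}
```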

Changes

  • Add comments referring back to RFCs
  • Allow associations to be named
  • Fix some typos
  • Fix a bug where we unmarshal a packet twice, doubling its size when mandatory checksumming is enabled (see the checksum sketch after this list)
  • Add some more stats for debugging
  • Make useZeroChecksum atomic
  • Add locks to queues to avoid races caught by the race detector
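
On the checksumming item above: for reference, SCTP's mandatory checksum is a CRC32c (Castagnoli) computed over the packet with the checksum field in the common header zeroed (RFC 9260, appendix B). A minimal sketch, with an illustrative function name rather than this library's code:

```go
package main

import (
	"fmt"
	"hash/crc32"
)

var castagnoli = crc32.MakeTable(crc32.Castagnoli)

// sctpChecksum computes the CRC32c over an SCTP packet with the checksum
// field (bytes 8-11 of the 12-byte common header) zeroed out first.
func sctpChecksum(pkt []byte) uint32 {
	buf := make([]byte, len(pkt))
	copy(buf, pkt)
	copy(buf[8:12], []byte{0, 0, 0, 0})
	return crc32.Checksum(buf, castagnoli)
}

func main() {
	// A zeroed 12-byte common header with no chunks, just to show usage.
	pkt := make([]byte, 12)
	fmt.Printf("%08x\n", sctpChecksum(pkt))
}
```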

@edaniels edaniels requested review from enobufs and Sean-Der February 24, 2024 00:07
@edaniels edaniels force-pushed the fixes3 branch 5 times, most recently from 7e59d29 to d2282c1 Compare February 24, 2024 00:18

codecov bot commented Feb 24, 2024

Codecov Report

Attention: Patch coverage is 93.59606%, with 13 lines in your changes missing coverage. Please review.

Project coverage is 81.17%. Comparing base (5359da5) to head (224c60f).

Files              Patch %   Lines
association.go     90.72%    7 Missing and 2 partials ⚠️
payload_queue.go   90.24%    4 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #302      +/-   ##
==========================================
+ Coverage   80.61%   81.17%   +0.55%     
==========================================
  Files          49       49              
  Lines        4121     4254     +133     
==========================================
+ Hits         3322     3453     +131     
- Misses        653      656       +3     
+ Partials      146      145       -1     
Flag Coverage Δ
go 81.17% <93.59%> (+0.55%) ⬆️
wasm 67.55% <82.26%> (+0.65%) ⬆️


@edaniels edaniels force-pushed the fixes3 branch 4 times, most recently from dd83fe5 to 5384b60 Compare February 24, 2024 00:41
@enobufs
Member

enobufs commented Feb 28, 2024

I've spent the past week debugging what's going on with throughput and stalling when sending a lot of small and large data through SCTP

What was the cause of stalling?

@enobufs
Member

enobufs commented Feb 28, 2024

Add locks to queues to avoid races caught by the race detector

I see you added mutexes to payloadQueue and reassemblyQueue. Were these really necessary when we already have mutexes at the Association (and also Stream) level? Mutexes are not computationally free. (Where were the data races?)

@edaniels
Member Author

I've spent the past week debugging what's going on with throughput and stalling when sending a lot of small and large data through SCTP

What was the cause of stalling?

Still undetermined and I need a consistent repro, but where I left off was the reassembly queue receiving SSNs less than the one it currently expects, because so many messages were sent that the SSN overflowed.
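
For reference, the SSN is a 16-bit field, so after 65535 ordered messages on a stream it wraps back to 0, and any ordering check has to use serial-number arithmetic rather than a plain comparison. A minimal sketch of that comparison (the function name is illustrative):

```go
package main

import "fmt"

// sna16LT reports whether a logically precedes b under 16-bit serial
// number arithmetic (RFC 1982 style), so that after wraparound SSN 0 is
// treated as newer than SSN 65535.
func sna16LT(a, b uint16) bool {
	return (a < b && b-a < 1<<15) || (a > b && a-b > 1<<15)
}

func main() {
	fmt.Println(sna16LT(65535, 0)) // true: 0 comes "after" 65535 post-wrap
	fmt.Println(sna16LT(0, 65535)) // false
	fmt.Println(sna16LT(5, 10))    // true
}
```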

@edaniels
Member Author

Add locks to queues to avoid races caught by the race detector

I see you added mutexes to payloadQueue and reassemblyQueue. Were these really necessary when we already have mutexes at the Association (and also Stream) level? Mutexes are not computationally free. (Where were the data races?)

Running with -race would occasionally have these fail. Do you think I should go back and try to use the locks already in the association/streams? Although not computationally free, I think the cost is similar to our use of sync.Cond/channels for the wait/signal/broadcast system on reads and writes.
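
For illustration, the per-queue locking under discussion looks roughly like this; the type and field names are simplified stand-ins, not the actual payloadQueue:

```go
package main

import "sync"

// chunk is a stand-in for the queue's element type.
type chunk struct{ tsn uint32 }

// lockedQueue guards its own state with a mutex instead of relying on the
// caller holding the Association or Stream lock.
type lockedQueue struct {
	mu     sync.Mutex
	chunks map[uint32]*chunk
}

func (q *lockedQueue) push(c *chunk) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if q.chunks == nil {
		q.chunks = make(map[uint32]*chunk)
	}
	q.chunks[c.tsn] = c
}

func (q *lockedQueue) pop(tsn uint32) (*chunk, bool) {
	q.mu.Lock()
	defer q.mu.Unlock()
	c, ok := q.chunks[tsn]
	if ok {
		delete(q.chunks, tsn)
	}
	return c, ok
}

func main() {
	q := &lockedQueue{}
	q.push(&chunk{tsn: 42})
	q.pop(42)
}
```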

@edaniels
Member Author

@enobufs, it seems like a good idea to just split this PR up into more agreeable chunks so that we can discuss the changes individually. I'll do that soon!

@edaniels edaniels closed this Feb 28, 2024
@edaniels
Member Author

@enobufs good call on the mutexes. Whatever messing around I was doing was causing races, but I can't get them to reproduce now, so I'm excluding those changes from the other PRs!

Member

@enobufs enobufs left a comment


One minor comment I forgot to hit 'submit' on ...

@@ -177,7 +187,7 @@ type Association struct {
cumulativeTSNAckPoint uint32
advancedPeerTSNAckPoint uint32
useForwardTSN bool
- useZeroChecksum bool
+ useZeroChecksum uint32
Member


As we are moving to go 1.19, we should use atomic.Bool.
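
For illustration, both styles side by side (variable names are illustrative):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

func main() {
	// Pre-1.19 style, as in this diff: a uint32 flag with sync/atomic helpers.
	var useZeroChecksum uint32
	atomic.StoreUint32(&useZeroChecksum, 1)
	fmt.Println(atomic.LoadUint32(&useZeroChecksum) == 1) // true

	// Go 1.19+: atomic.Bool carries the intent without the 0/1 encoding.
	var zeroChecksum atomic.Bool
	zeroChecksum.Store(true)
	fmt.Println(zeroChecksum.Load()) // true
}
```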
