-
Notifications
You must be signed in to change notification settings - Fork 7.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix MpscLinkedQueue GC issues #7799
Conversation
Handle empty queue first, then share most of the implementation for non-empty scenarios (spin and non-spin).
Similar to https://github.com/JCTools/JCTools/blob/master/jctools-core/src/main/java/org/jctools/queues/MpscLinkedQueue.java#L120, null out the next pointer in the discarded consumer node when polling from the queue. If not, we leave behind a (potentially long) chain of connected garbage nodes. If we're unlucky (for example one of the early nodes is promoted to old generation, triggering nepotism), this can cause GC issues as now we have a long linked list which must be marked by young collections. Reproducer: ``` import io.reactivex.rxjava3.internal.queue.MpscLinkedQueue; public class MpscLinkedQueueGC { public static void main(String[] args) { MpscLinkedQueue<Integer> queue = new MpscLinkedQueue<>(); for (int i = 0; i < 10; i++) System.gc(); // tenure consumer node while (true) { queue.offer(123); queue.poll(); } } } ``` ``` Before fix: $ java -Xlog:gc -Xmx1G -cp build/classes/java/main MpscLinkedQueueGC.java ... [1.261s] GC(20) Pause Young (Normal) (G1 Preventive Collection) 115M->115M(204M) 209.335ms [1.385s] GC(23) Pause Young (Normal) (G1 Evacuation Pause) 148M->149M(204M) 31.491ms [1.417s] GC(24) Pause Young (Normal) (G1 Evacuation Pause) 157M->158M(204M) 19.333ms [1.453s] GC(25) Pause Young (Normal) (G1 Evacuation Pause) 166M->167M(599M) 22.678ms [1.966s] GC(26) Pause Young (Normal) (G1 Evacuation Pause) 249M->249M(497M) 305.238ms ... After fix: $ java -Xlog:gc -Xmx1G -cp build/classes/java/main MpscLinkedQueueGC.java ... [1.169s] GC(14) Pause Young (Normal) (G1 Evacuation Pause) 304M->2M(506M) 0.755ms [1.558s] GC(15) Pause Young (Normal) (G1 Evacuation Pause) 304M->2M(506M) 0.689ms [1.948s] GC(16) Pause Young (Normal) (G1 Evacuation Pause) 304M->2M(506M) 0.800ms [2.337s] GC(17) Pause Young (Normal) (G1 Evacuation Pause) 304M->2M(506M) 0.714ms ... ```
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please only fix the GC issue.
This reverts commit 4c6c6fa.
This reverts commit b8582b9.
Similar to https://github.com/JCTools/JCTools/blob/master/jctools-core/src/main/java/org/jctools/queues/MpscLinkedQueue.java#L120, null out the next pointer in the discarded consumer node when polling from the queue. If not, we leave behind a (potentially long) chain of connected garbage nodes. If we're unlucky (for example one of the early nodes is promoted to old generation, triggering nepotism), this can cause GC issues as now we have a long linked list which must be marked by young collections. Reproducer: ``` import io.reactivex.rxjava3.internal.queue.MpscLinkedQueue; public class MpscLinkedQueueGC { public static void main(String[] args) { MpscLinkedQueue<Integer> queue = new MpscLinkedQueue<>(); for (int i = 0; i < 10; i++) System.gc(); // tenure consumer node while (true) { queue.offer(123); queue.poll(); } } } ``` ``` Before fix: $ java -Xlog:gc -Xmx1G -cp build/classes/java/main MpscLinkedQueueGC.java ... [1.261s] GC(20) Pause Young (Normal) (G1 Preventive Collection) 115M->115M(204M) 209.335ms [1.385s] GC(23) Pause Young (Normal) (G1 Evacuation Pause) 148M->149M(204M) 31.491ms [1.417s] GC(24) Pause Young (Normal) (G1 Evacuation Pause) 157M->158M(204M) 19.333ms [1.453s] GC(25) Pause Young (Normal) (G1 Evacuation Pause) 166M->167M(599M) 22.678ms [1.966s] GC(26) Pause Young (Normal) (G1 Evacuation Pause) 249M->249M(497M) 305.238ms ... After fix: $ java -Xlog:gc -Xmx1G -cp build/classes/java/main MpscLinkedQueueGC.java ... [1.169s] GC(14) Pause Young (Normal) (G1 Evacuation Pause) 304M->2M(506M) 0.755ms [1.558s] GC(15) Pause Young (Normal) (G1 Evacuation Pause) 304M->2M(506M) 0.689ms [1.948s] GC(16) Pause Young (Normal) (G1 Evacuation Pause) 304M->2M(506M) 0.800ms [2.337s] GC(17) Pause Young (Normal) (G1 Evacuation Pause) 304M->2M(506M) 0.714ms ... ```
Updated with the minimal fix. |
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## 3.x #7799 +/- ##
============================================
- Coverage 99.62% 99.59% -0.03%
- Complexity 6801 6803 +2
============================================
Files 752 752
Lines 47707 47713 +6
Branches 6401 6402 +1
============================================
- Hits 47527 47519 -8
- Misses 84 89 +5
- Partials 96 105 +9 ☔ View full report in Codecov by Sentry. |
in reactor core, which refs to self. currConsumerNode.soNext(currConsumerNode); |
Hmm, thanks @He-Pin. There's a comment there:
I don't fully understand the implications here. Do you think setting null introduces a bug from someone thinking this is the producer node? |
In JCTools, indeed it is set to self. |
I've looked into our simple implementation and null should be fine. We only care about the currentConsumerNode in poll and a successful poll will lose that old node with whatever next pointer, null or self. In any case, I prepared #7801 to act quickly and apply the same logic as in JCTools. |
Thanks! (My original inspiration was https://github.com/JCTools/JCTools/blob/master/jctools-core/src/main/java/org/jctools/queues/MpscLinkedQueue.java#L120 which uses null, for what it's worth). |
Similar to https://github.com/JCTools/JCTools/blob/master/jctools-core/src/main/java/org/jctools/queues/ MpscLinkedQueue.java#L120, null out the next pointer in the discarded consumer node when polling from the queue. If not, we leave behind a (potentially long) chain of connected garbage nodes. If we're unlucky (for example one of the early nodes is promoted to old generation, triggering nepotism), this can cause GC issues as now we have a long linked list which must be marked by young collections.
I noticed this in one of my applications, a heap dump showed an unreachable list of a few hundred thousand nodes all with null values.
Note: There are two commits here. One refactors poll() to (IMO) simplify the different cases, and the second actually fixes the GC issue. If preferred I can just fix the GC issue without the refactoring.Reproducer: