Fix MpscLinkedQueue GC issues #7799

Merged: 5 commits, Nov 22, 2024

@olivergillespie olivergillespie commented Nov 21, 2024

Similar to https://github.com/JCTools/JCTools/blob/master/jctools-core/src/main/java/org/jctools/queues/MpscLinkedQueue.java#L120, null out the next pointer in the discarded consumer node when polling from the queue. Otherwise, we leave behind a (potentially long) chain of connected garbage nodes. If we're unlucky (for example, one of the early nodes is promoted to the old generation, triggering nepotism), this can cause GC issues, because young collections must then mark the whole linked list.

I noticed this in one of my applications: a heap dump showed an unreachable list of a few hundred thousand nodes, all with null values.

Note: There are two commits here. One refactors poll() to (IMO) simplify the different cases, and the second actually fixes the GC issue. If preferred I can just fix the GC issue without the refactoring.
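
To make the unlinking step concrete, here is a minimal sketch of a linked MPSC queue whose `poll()` clears the discarded node's next pointer. `SketchMpscQueue`, its `Node` type, and the field names are hypothetical and simplified for illustration; this is not the RxJava or JCTools source.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical, simplified MPSC linked queue; not the RxJava class itself. */
final class SketchMpscQueue<T> {

    /** A node; the inherited AtomicReference value is the "next" pointer. */
    static final class Node<E> extends AtomicReference<Node<E>> {
        E value;
        Node(E value) { this.value = value; }
    }

    private final AtomicReference<Node<T>> producerNode = new AtomicReference<>();
    Node<T> consumerNode; // only read and written by the single consumer thread

    SketchMpscQueue() {
        Node<T> stub = new Node<>(null); // initial empty node shared by both ends
        consumerNode = stub;
        producerNode.set(stub);
    }

    boolean offer(T value) {
        Node<T> next = new Node<>(value);
        Node<T> prev = producerNode.getAndSet(next); // producers serialize here
        prev.lazySet(next);                          // publish the link to the consumer
        return true;
    }

    T poll() {
        Node<T> curr = consumerNode;
        Node<T> next = curr.get();
        if (next == null) {
            if (curr == producerNode.get()) {
                return null; // queue is empty
            }
            // A producer has swapped producerNode but not linked yet: spin.
            while ((next = curr.get()) == null) {
                Thread.onSpinWait();
            }
        }
        T value = next.value;
        next.value = null;   // next becomes the new value-less consumer node
        consumerNode = next;
        curr.lazySet(null);  // the fix: the discarded node no longer points at newer nodes
        return value;
    }
}
```

Without the `curr.lazySet(null)` line, every discarded node would keep referencing its successor, so a single discarded node that ends up in the old generation could keep the whole chain after it reachable from the collector's point of view.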

Reproducer:

```
import io.reactivex.rxjava3.internal.queue.MpscLinkedQueue;

public class MpscLinkedQueueGC {
    public static void main(String[] args) {
        MpscLinkedQueue<Integer> queue = new MpscLinkedQueue<>();
        for (int i = 0; i < 10; i++) System.gc(); // tenure consumer node
        while (true) {
            queue.offer(123);
            queue.poll();
        }
    }
}
```

Before fix:

```
$ java -Xlog:gc -Xmx1G -cp build/classes/java/main MpscLinkedQueueGC.java
...
[1.261s] GC(20) Pause Young (Normal) (G1 Preventive Collection) 115M->115M(204M) 209.335ms
[1.385s] GC(23) Pause Young (Normal) (G1 Evacuation Pause) 148M->149M(204M) 31.491ms
[1.417s] GC(24) Pause Young (Normal) (G1 Evacuation Pause) 157M->158M(204M) 19.333ms
[1.453s] GC(25) Pause Young (Normal) (G1 Evacuation Pause) 166M->167M(599M) 22.678ms
[1.966s] GC(26) Pause Young (Normal) (G1 Evacuation Pause) 249M->249M(497M) 305.238ms
...
```

After fix:

```
$ java -Xlog:gc -Xmx1G -cp build/classes/java/main MpscLinkedQueueGC.java
...
[1.169s] GC(14) Pause Young (Normal) (G1 Evacuation Pause) 304M->2M(506M) 0.755ms
[1.558s] GC(15) Pause Young (Normal) (G1 Evacuation Pause) 304M->2M(506M) 0.689ms
[1.948s] GC(16) Pause Young (Normal) (G1 Evacuation Pause) 304M->2M(506M) 0.800ms
[2.337s] GC(17) Pause Young (Normal) (G1 Evacuation Pause) 304M->2M(506M) 0.714ms
...
```

Handle empty queue first, then share most of the
implementation for non-empty scenarios (spin and
non-spin).
@akarnokd akarnokd left a comment

Please only fix the GC issue.

@olivergillespie

Updated with the minimal fix.

codecov bot commented Nov 21, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 99.59%. Comparing base (9b55d01) to head (898c19d).
Report is 113 commits behind head on 3.x.

Additional details and impacted files
```
@@             Coverage Diff              @@
##                3.x    #7799      +/-   ##
============================================
- Coverage     99.62%   99.59%   -0.03%
- Complexity     6801     6803       +2
============================================
  Files           752      752
  Lines         47707    47713       +6
  Branches       6401     6402       +1
============================================
- Hits          47527    47519       -8
- Misses           84       89       +5
- Partials         96      105       +9
```


@akarnokd akarnokd merged commit ede5cfc into ReactiveX:3.x Nov 22, 2024
5 checks passed

He-Pin commented Nov 26, 2024

In Reactor Core, the equivalent code sets the next pointer to the node itself:

			currConsumerNode.soNext(currConsumerNode);

@olivergillespie olivergillespie deleted the mpsc-queue-gc-issues branch November 26, 2024 11:46
@olivergillespie

> In Reactor Core, the equivalent code sets the next pointer to the node itself:
>
> `currConsumerNode.soNext(currConsumerNode);`

Hmm, thanks @He-Pin. There's a comment there:

// Fix up the next ref of currConsumerNode to prevent promoted nodes from keeping new ones alive.
// We use a reference to self instead of null because null is already a meaningful value (the next of
// producer node is null).

I don't fully understand the implications here. Do you think setting it to null could introduce a bug where some code mistakes this node for the producer node?
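
To illustrate the distinction that comment draws, here is a small single-threaded sketch. `UnlinkMarkerSketch`, `classify`, and `discardHead` are hypothetical names made up for this illustration; they do not come from Reactor Core or RxJava. It only shows that a node unlinked with null looks the same as a tail node when judged by its next pointer alone, while a self-linked node is unambiguous. Both markers cut the reference to newer nodes.

```java
/** Hypothetical illustration only; not Reactor Core's or RxJava's code. */
final class UnlinkMarkerSketch {

    static final class Node {
        Node next; // plain field: this sketch is single-threaded
    }

    enum Kind { LINKED, NEXT_IS_NULL, SELF_LINKED }

    /** Classifies a node purely by looking at its next pointer. */
    static Kind classify(Node n) {
        if (n.next == null) {
            return Kind.NEXT_IS_NULL; // a tail node, or a node unlinked with null
        }
        if (n.next == n) {
            return Kind.SELF_LINKED;  // can only be a discarded node
        }
        return Kind.LINKED;
    }

    /** Builds a -> b, discards a using the chosen marker, and returns {a, b}. */
    static Node[] discardHead(boolean selfLink) {
        Node a = new Node();
        Node b = new Node();
        a.next = b;
        a.next = selfLink ? a : null; // either way, a no longer references b
        return new Node[] { a, b };
    }
}
```

Whether the ambiguity matters depends on whether any code path inspects a discarded node's next pointer after the poll.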

@akarnokd

I've looked into our simple implementation and null should be fine.

We only care about the currentConsumerNode in poll, and a successful poll drops that old node regardless of its next pointer, whether null or self.

In any case, I prepared #7801 to act quickly and apply the same logic as in JCTools.

@olivergillespie

Thanks! (My original inspiration was https://github.com/JCTools/JCTools/blob/master/jctools-core/src/main/java/org/jctools/queues/MpscLinkedQueue.java#L120 which uses null, for what it's worth).
