LambdaForm methods inlining #12162

liqunl · 2021-03-09T20:48:56Z

This commit adds support for inlining the LambdaForm methods used in the
OpenJDK MethodHandle implementation. It changes the inlining heuristic
to treat LambdaForm methods the same way as thunk archetypes, and it
extends the emulation of interpreter execution to LambdaForm methods in
order to find call sites to inline.

The InterpreterEmulator change includes:

1. Support iterating more bytecodes. InterpreterEmulator was originally
   designed for thunk archetypes, which use only a limited number of
   bytecodes. However, a LambdaForm can contain any bytecode. This commit
   adds support for the bytecodes that are commonly used in LambdaForm
   methods. Support for all bytecodes will be done separately.
2. Track object info in local slots. LambdaForm methods use local
   variables to store intermediate results that are used as call arguments
   later. Since the refining of MethodHandle INL (invokeBasic, linkTo*)
   requires object info, we have to track the object info stored in local
   variables.
3. Prex arg info propagation to callee in with state mode. Currently,
   prex arg info propagation is done for non-static methods and when
   peeking is done. However, LambdaForm methods are static. Since we track
   the operand stack state, we can create prex arg info from the operand
   stack and propagate it down to the callee.
4. Add new operand classes corresponding to PrexArgument.
5. Add merge functions on operands to support merging local slot states
   and operand stack states. We have to merge information coming from
   different blocks to continue tracking the operand stack state and the
   local slot state.
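
The merge functions described above could look roughly like the sketch below. `SlotInfo`, `makeKnown`, `mergeSlot`, and `mergeFrame` are hypothetical, simplified stand-ins, not the operand classes added by this commit. The sketch only assumes that a slot keeps its object info when both incoming paths agree on it and becomes unknown otherwise.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, simplified model of what the emulator knows about one
// local slot or operand stack entry.
struct SlotInfo
   {
   enum Kind { Unknown, KnownObject };
   Kind kind = Unknown;
   int knownObjectIndex = -1;   // only meaningful when kind == KnownObject
   };

// Convenience constructor for a slot holding a known object.
inline SlotInfo makeKnown(int knownObjectIndex)
   {
   SlotInfo info;
   info.kind = SlotInfo::KnownObject;
   info.knownObjectIndex = knownObjectIndex;
   return info;
   }

// Merge the info arriving from two predecessor blocks.
// Object info survives only when both sides agree on the same object.
inline SlotInfo mergeSlot(const SlotInfo &a, const SlotInfo &b)
   {
   if (a.kind == SlotInfo::KnownObject &&
       b.kind == SlotInfo::KnownObject &&
       a.knownObjectIndex == b.knownObjectIndex)
      return a;
   return SlotInfo();   // disagreement or missing info: fall back to Unknown
   }

// Merge a whole frame (local slots or operand stack) entry by entry.
inline std::vector<SlotInfo> mergeFrame(const std::vector<SlotInfo> &a,
                                        const std::vector<SlotInfo> &b)
   {
   assert(a.size() == b.size());   // frames at a join point have the same shape
   std::vector<SlotInfo> result(a.size());
   for (std::size_t i = 0; i < a.size(); ++i)
      result[i] = mergeSlot(a[i], b[i]);
   return result;
   }
```

Falling back to unknown on any disagreement is the conservative choice: the emulator loses an inlining opportunity but never acts on object info that holds on only one path.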

This commit also fixes the following problem:

MutableCallSiteOperand extends KnownObjOperand, which is not right:
KnownObjOperand doesn't rely on any assumption, but
MutableCallSiteOperand represents a known object only while the call site
target remains unchanged. InterpreterEmulator does a few operations on
KnownObjOperand which shouldn't be applied to MutableCallSiteOperand
(for example, creating known objects for final fields of known objects).
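
One possible shape for the corrected hierarchy is sketched below. The class layout, `asKnownObjOperand`, and `mayFoldFinalFields` are illustrative assumptions, not the definitions in InterpreterEmulator. The point is only that making the two operand kinds siblings keeps operations that are valid for unconditional knowledge away from the assumption-dependent one.

```cpp
#include <cassert>

class KnownObjOperand;

// Hypothetical common base for emulator operands.
class Operand
   {
   public:
   virtual ~Operand() {}
   // Returns non-null only for operands that are known unconditionally.
   virtual const KnownObjOperand *asKnownObjOperand() const { return nullptr; }
   };

// Knowledge that holds without any runtime assumption.
class KnownObjOperand : public Operand
   {
   public:
   explicit KnownObjOperand(int knownObjectIndex)
      : _knownObjectIndex(knownObjectIndex) {}
   const KnownObjOperand *asKnownObjOperand() const override { return this; }
   int knownObjectIndex() const { return _knownObjectIndex; }
   private:
   int _knownObjectIndex;
   };

// Sibling of KnownObjOperand, not a subclass: the target is known only
// while the call site target remains unchanged.
class MutableCallSiteOperand : public Operand
   {
   public:
   MutableCallSiteOperand(int callSiteIndex, int targetKnownObjectIndex)
      : _callSiteIndex(callSiteIndex),
        _targetKnownObjectIndex(targetKnownObjectIndex) {}
   int callSiteIndex() const { return _callSiteIndex; }
   int targetKnownObjectIndex() const { return _targetKnownObjectIndex; }
   private:
   int _callSiteIndex;
   int _targetKnownObjectIndex;
   };

// Example of an operation that is sound only for unconditional knowledge,
// such as creating known objects for final fields of a known object.
inline bool mayFoldFinalFields(const Operand &op)
   {
   return op.asKnownObjOperand() != nullptr;
   }
```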

Part of #10618

Signed-off-by: Liqun Liu [email protected]

liqunl · 2021-03-09T20:58:14Z

Depends on #11634.
I added the commits from #11634 that this PR depends on. I will remove them before merge.

liqunl · 2021-03-09T21:30:25Z

@0xdaryl @vijaysun-omr May I ask for your review?

vijaysun-omr · 2021-03-11T23:30:06Z

runtime/compiler/optimizer/InlinerTempForJ9.cpp

@@ -4224,13 +4229,18 @@ TR_MultipleCallTargetInliner::exceedsSizeThreshold(TR_CallSite *callSite, int by
     // HACK: Get frequency from both sources, and use both.  You're
     // only cold if you're cold according to both.

+     bool isLambdaFormGeneratedMethod = comp()->fej9()->isLambdaFormGeneratedMethod(callerResolvedMethod);
+     // TODO: we should ignore frequency for thunk archetype, however, this require performance evaluation
+     bool frequencyIsInaccurate = isLambdaFormGeneratedMethod;


Could you please explain this further ? i.e. why is the frequency inaccurate for lambda form methods ?

The LambdaForm in a MethodHandle can change. The change usually happens on customization: once a MethodHandle has been executed frequently enough, a customized LambdaForm replaces the existing one. The frequency information of the original LambdaForm is not carried over to the customized one, so the customized one might seem cold when it is compiled or inlined.
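
A minimal sketch of how such a flag might feed the coldness decision follows. `isColdForInlining` and its boolean parameters are hypothetical; the diff above only shows the flag being computed, so the way it is consumed here is an assumption. The "cold according to both" rule comes from the comment in the diff.

```cpp
#include <cassert>

// Hypothetical helper: decide whether a call site counts as cold for the
// purpose of the inliner's size threshold.
inline bool isColdForInlining(bool coldByFirstSource,
                              bool coldBySecondSource,
                              bool callerIsLambdaFormGenerated)
   {
   // A customized LambdaForm does not inherit the frequency data of the
   // LambdaForm it replaces, so its frequency cannot be trusted.
   bool frequencyIsInaccurate = callerIsLambdaFormGenerated;
   if (frequencyIsInaccurate)
      return false;

   // Otherwise a call site is cold only if both sources agree.
   return coldByFirstSource && coldBySecondSource;
   }
```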

vijaysun-omr · 2021-03-12T23:06:13Z

Given the large amount of changes to the interpreter emulator, I would like to take a step back and ask how this code is getting tested in general.

I would also like us to consider what are some bugs that have been hit in the past, so that we can think about whether we could have variants of those kinds of problems with these changes.

liqunl · 2021-03-30T14:47:03Z

Update: I've measured performance with this change on jruby and nashorn, and it is on par with baseline. I'm now looking at a failure on zlinux.

vijaysun-omr · 2021-03-31T13:43:16Z

Thanks @liqunl could you please share more details about the performance experiments you alluded to in your prior comment ?

liqunl · 2021-04-01T15:22:39Z

Actually, there are a few benchmarks with regressions; I mistook a higher score for a better one on them. Overall, the change is helping or not hurting. I'll look at the ones that regressed. For jruby, the benchmarks I ran are from https://github.com/headius/bench2018, the suite that we investigated before.

red_black
performance (LOWER is better)
config	baseline	inlineMH
mean	1.01	1.2(-16%)
std	0.02	0.04

aref
performance (HIGHER is better)
'Instantiation'
config	baseline	inlineMH
mean	23.58	31.05(+32%)
std	0.26	0.64

hash
performance (HIGHER is better)
'bm_hash_small4'
config	baseline	inlineMH
mean	202.64	216.99(+7%)
std	7.06	3.29
'bm_hash_long'
config	baseline	inlineMH
mean	81.77	80.34(-2%)
std	3.09	1.68
'bm_hash_small8'
config	baseline	inlineMH
mean	142.91	151.24(+6%)
std	5.19	8.3
'bm_bighash'
config	baseline	inlineMH
mean	223.13	234.86(+5%)
std	13.69	12.49
'bm_hash_small2'
config	baseline	inlineMH
mean	280.5	295.66(+5%)
std	11.37	2.1

mandelbrot
performance (LOWER is better)
config	baseline	inlineMH
mean	3.28	2.72(+21%)
std	0.07	0.06

kwargs
performance (LOWER is better)
config	baseline	inlineMH
mean	1.47	1.6(-8%)
std	0.03	0.05

weird_sort
performance (HIGHER is better)
'weird'
config	baseline	inlineMH
mean	11.8	11.07(-6%)
std	0.16	0.29
'normal'
config	baseline	inlineMH
mean	18.73	30.28(+62%)
std	0.37	1.38

For nashorn, it's the benchmark provided by the user in issue #5371.

nashorn
performance (LOWER is better)
config	baseline	inlineMH
closureTest	81760.180	145942.900 (-78.5%)
javaTest	394310.934	391017.628 (+0.8%)
loadJS	444556698.310	421764198.560 (+5%)
mathTest	70877351.707	69516350.532 (+2%)
objectTest	184565.074	198183.275 (-7%)
regexTest	19488951.290	19718357.701 (-1.2%)

Any difference within 10% is within the noise margin, ~~so we're seeing a significant improvement on closureTest~~ and the others are on par with baseline.

Edit: for the nashorn numbers lower is better, so closureTest has a large regression with this change. I will look at this as well.

vijaysun-omr · 2021-04-01T20:14:22Z

Okay thanks @liqunl I agree with your prioritization of looking at the few regressions in this set of numbers first.

liqunl · 2021-04-08T19:00:18Z

@vijaysun-omr I've fixed the regression and the test failure. With this change, I'm seeing a performance gain on all jruby benchmarks with the OpenJ9 MethodHandle implementation. On the nashorn benchmarks, the performance is on par with the baseline. I will post the numbers once all the runs have finished. Could you review this PR?

vijaysun-omr · 2021-04-08T22:45:09Z

Thanks @liqunl. It isn't completely clear to me what the final delta fix was that helped you attain the performance you mentioned in your last comment. Is it possible to describe and/or point me to that piece of code in this non-trivial commit so that I don't have to go through the entire commit again?

vijaysun-omr · 2021-04-08T22:46:10Z

Jenkins test sanity all jdk8,jdk11

liqunl · 2021-04-09T14:30:01Z

runtime/compiler/optimizer/InterpreterEmulator.cpp

+   // TODO: add code to record dead path and ignore it in object info propagation, enable
+   // the following code if branch folding is possible in LambdaForm methods
+   //
+   if (false && second && first)


@vijaysun-omr Here's the first change to attain the performance. The change in this PR propagates the state of the local slots and the operand stack in the bytecode iterator's order. A block gets its local slot and operand stack state from its predecessors, but this requires all predecessors to be visited before the successor. Otherwise we set the local slot and operand stack state of the block being visited to unknown, which causes a failure to find a call target for a MethodHandle call, because the target search requires a known MethodHandle object as the receiver of the call.

The disabled code here does branch folding, so we may skip interpreting a block's bytecodes. That introduces blocks that are never visited, which stops the propagation of object info to their successors.

Similar branch folding exists in ILGen, so we don't generate IL for bytecodes on dead paths; this change won't result in more nodes being generated.

I see, thanks for the good explanation. That makes sense, but it was not very obvious until you clarified :)
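
The ordering requirement described above can be pictured with a small sketch. `Block` and `canInheritState` are hypothetical names, and the CFG encoding is made up for illustration. The only assumption is that a block may take its entry state from its predecessors once all of them have been visited, and must otherwise start from unknown state.

```cpp
#include <cassert>
#include <vector>

// Hypothetical, minimal view of a bytecode block during emulation.
struct Block
   {
   std::vector<int> predecessors;   // indices of predecessor blocks
   bool visited = false;            // set once the block's bytecodes were interpreted
   };

// A block can inherit local slot and operand stack state only if every
// predecessor has already been interpreted. A predecessor skipped by
// branch folding is never visited, so its successors lose object info.
inline bool canInheritState(const std::vector<Block> &cfg, int blockIndex)
   {
   for (int pred : cfg[blockIndex].predecessors)
      {
      if (!cfg[pred].visited)
         return false;
      }
   return true;
   }
```

For a CFG where block 2 has predecessors 0 and 1, skipping block 1 (as branch folding would) leaves `canInheritState(cfg, 2)` false even after block 0 has been visited.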

liqunl · 2021-04-09T14:45:50Z

runtime/compiler/optimizer/InterpreterEmulator.cpp

+   if (canFallThru)
+      {
+      debugTrace(tracer(), "maintainStackForIf canFallThrough to bcIndex=%d\n", fallThruBC);
+      genTarget(fallThruBC);


@vijaysun-omr This change generates the fall-through block first to make sure the predecessor is interpreted before the successor. This only helps in thunk archetypes, since each thunk archetype (except those of leaf MethodHandles) contains customization logic that looks like the following, where the fall-through is a predecessor of the branch target.

    if (ILGenMacros.isShareableThunk()) {
        undoCustomizationLogic(next);
        undoCustomizationLogic(filters);
    }
    if (!ILGenMacros.isCustomThunk()) {
        doCustomizationLogic();
    }

However, this won't help with more complicated control flow, such as nested if statements. A separate item will be created to do a reverse post-order traversal at the CFG level.

Okay that sounds important since it can reduce the effectiveness of the interpretation that the whole scheme depends on.
Can you please open that issue and make sure it is on the plan to be worked on by talking to Daryl ?
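
For reference, a reverse post-order traversal can be sketched as below. This is a generic textbook version over a hypothetical adjacency-list CFG (`Cfg`, `postOrder`, and `reversePostOrder` are illustrative names), not the traversal that was later implemented. In an acyclic CFG the resulting order visits every predecessor before its successors; with loops, back edges are the exception.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// successors[b] lists the successor block indices of block b
// (hypothetical CFG encoding for this sketch).
typedef std::vector<std::vector<int> > Cfg;

// Depth-first walk that records a block after all of its successors.
static void postOrder(const Cfg &successors, int block,
                      std::vector<bool> &seen, std::vector<int> &order)
   {
   seen[block] = true;
   for (int succ : successors[block])
      {
      if (!seen[succ])
         postOrder(successors, succ, seen, order);
      }
   order.push_back(block);
   }

// Reverse post-order of the blocks reachable from entry.
inline std::vector<int> reversePostOrder(const Cfg &successors, int entry)
   {
   assert(entry >= 0 && entry < static_cast<int>(successors.size()));
   std::vector<bool> seen(successors.size(), false);
   std::vector<int> order;
   postOrder(successors, entry, seen, order);
   std::reverse(order.begin(), order.end());
   return order;
   }
```

For the nested-if shape 0 -> {1, 4}, 1 -> {2, 3}, 2 -> {3}, 3 -> {4}, this yields the order 0, 1, 2, 3, 4, so each join block is reached only after both of its predecessors.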

liqunl · 2021-04-09T14:55:05Z

runtime/compiler/optimizer/InterpreterEmulator.cpp

+   // TODO: customized lambda form method is similar, it may not
+   // be executed enough before it gets inlined
+   //
+   if (_callerIsThunkArchetype)


@vijaysun-omr Another problem with the previous commit is that it refactored the code such that we only compute the return value of non-cold calls. However, because thunk archetypes are never interpreted, their call bytecodes always appear cold. This includes the call to MutableCallSite.getTarget(), whose return value is used to guide MutableCallSite inlining. MutableCallSite inlining was broken because the previous commit didn't compute this call's value due to incorrect coldness info.

Because MutableCallSite.getTarget() is treated as cold in the baseline, it won't be inlined in the first inlining pass (it is inlined in targeted inlining only if a pass is requested). The fix here inlines MutableCallSite.getTarget() in the first pass, which gives us a performance gain in the jruby benchmarks.

Okay, this sounds critical and I doubt that filtering on cold paths in this code is the right way to go.

One problem is the check on isUnresolvedInCP. Since thunk archetypes are never interpreted and we always do compile-time resolution on them, the majority of their cp entries will appear unresolved. I think we should at least ignore isUnresolvedInCP for calls in thunk archetypes. @vijaysun-omr What do you think?

Yes that makes sense. Perhaps these routines from the cold block marker code help you think about and handle these sorts of cases :

https://github.com/eclipse/omr/blob/cf3ef5ce33a7993f5749463a17d39e4c8e6b6db0/compiler/optimizer/LocalOpts.cpp#L7924

and

https://github.com/eclipse/omr/blob/cf3ef5ce33a7993f5749463a17d39e4c8e6b6db0/compiler/optimizer/LocalOpts.cpp#L7986
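
The suggestion discussed in this thread, ignoring the constant-pool resolution status for calls in thunk archetypes, can be sketched as a small predicate. `isCallUnresolvedOrCold` and its boolean parameters are hypothetical simplifications of the emulator's state, and the sketch is not the code that was merged.

```cpp
#include <cassert>

// Hypothetical predicate: should the emulator skip computing the return
// value of (and creating a call site for) the current call?
inline bool isCallUnresolvedOrCold(bool hasResolvedMethod,
                                   bool isUnresolvedInCP,
                                   bool methodIsCold,
                                   bool callerIsThunkArchetype)
   {
   // Without a resolved method there is nothing to inline.
   if (!hasResolvedMethod)
      return true;

   // Thunk archetypes are never interpreted, so their cp entries look
   // unresolved even though they are resolved at compile time. Ignore
   // the resolution status for them.
   if (callerIsThunkArchetype)
      return methodIsCold;

   return isUnresolvedInCP || methodIsCold;
   }
```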

liqunl · 2021-04-09T15:02:08Z

@vijaysun-omr I added a few comments to explain the changes I made to fix the regression in previous commit.

liqunl · 2021-04-09T15:03:34Z

runtime/compiler/optimizer/InterpreterEmulator.cpp

 void
 InterpreterEmulator::visitInvokevirtual()
   {
   int32_t cpIndex = next2Bytes();
   auto calleeMethod = (TR_ResolvedJ9Method*)_calltarget->_calleeMethod;
   bool isUnresolvedInCP;
-   TR_ResolvedMethod * resolvedMethod = calleeMethod->getResolvedPossiblyPrivateVirtualMethod(comp(), cpIndex, true, &isUnresolvedInCP);
+   // Calls in thunk archetype won't be executed by inliner, so they may appear as unresolved


Typo here, 'inliner' should be 'interpreter', will fix it before the merge.

liqunl · 2021-04-09T15:05:28Z

The pull request build failed because of a connection issue:

stderr: fatal: unable to access 'https://github.com/eclipse/openj9.git/': Could not resolve host: github.com

liqunl · 2021-04-09T15:48:43Z

Perf numbers. This change on OpenJ9 MethodHandle implementation vs baseline

red_black
performance (LOWER is better)
config	baseline	inlineMH
mean	1.0	0.94(+6%)
std	0.02	0.03

aref
performance (HIGHER is better)
'Instantiation'
config	baseline	inlineMH
mean	23.03	36.8(+60%)
std	0.69	0.55

hash
performance (HIGHER is better)
'bm_hash_small4'
config	baseline	inlineMH
mean	209.9	217.7(+4%)
std	10.67	8.04
'bm_hash_long'
config	baseline	inlineMH
mean	80.31	83.83(+4%)
std	2.99	3.07
'bm_hash_small8'
config	baseline	inlineMH
mean	146.88	144.66(-2%)
std	5.9	16.32
'bm_bighash'
config	baseline	inlineMH
mean	222.56	227.3(+2%)
std	18.86	13.25
'bm_hash_small2'
config	baseline	inlineMH
mean	288.65	297.44(+3%)
std	9.24	6.2

mandelbrot
performance (LOWER is better)
config	baseline	inlineMH
mean	3.31	2.64(+25%)
std	0.06	0.03

kwargs
performance (LOWER is better)
config	baseline	inlineMH
mean	1.47	1.34(+10%)
std	0.01	0.01

weird_sort
performance (HIGHER is better)
'weird'
config	baseline	inlineMH
mean	12.21	14.73(+21%)
std	0.47	0.25
'normal'
config	baseline	inlineMH
mean	18.52	22.87(+23%)
std	0.57	0.74

nashorn
performance (LOWER is better)
config	baseline	inlineMH
loadJS	430822761.366	451870630.684(-5%)
mathTest	71251840.876	84302404.724(-18%)
objectTest	184900.535	184510.721(+0%)
regexTest	19596387.452	19367823.655(+1%)
javaTest	389886.415	387203.316(+1%)
closureTest	83344.562	80929.398(+3%)

The 18% regression on nashorn mathTest occurs because inlining MutableCallSite.getTarget() increases the size of the hottest method and results in a compile failure due to excessive memory usage. Disabling inlining of this method fixes the regression. However, the hottest method is very large (it contains many invokedynamic bytecodes with long MethodHandle chains), and even with the baseline it is at the edge of breaking the memory limit after inlining, so it is very sensitive to memory consumption. We should still inline MutableCallSite.getTarget() given its benefit in jruby.

liqunl · 2021-04-09T15:52:49Z

runtime/compiler/optimizer/InterpreterEmulator.cpp

+      // call site list and take up some inlining budget, causing less methods
+      // to be inlined. Don't create call site for them
+      //
+      switch (resolvedMethod->getRecognizedMethod())


@vijaysun-omr Calls to the following methods are inside blocks that branch folding would fold away, so with folding they are never added to the inlining list. Since branch folding is now disabled, the following code makes sure we still don't add them to the inlining candidate list, which saves inlining budget for other methods.

vijaysun-omr · 2021-04-10T11:03:38Z

Jenkins test sanity all jdk8,jdk11

liqunl · 2021-04-13T14:49:01Z

@vijaysun-omr I added a new commit to address your comment regarding ignoring the coldness info. Will squash the commits before merge. Could you review again?

liqunl · 2021-04-15T14:02:29Z

@0xdaryl May I ask for your review?

0xdaryl · 2021-04-19T23:40:17Z

Jenkins test sanity all jdk11,jdknext

liqunl · 2021-04-20T02:12:35Z

@0xdaryl The two commits need to be merged into one and there is a typo I'll fix after the PR build completes.

0xdaryl · 2021-04-20T10:34:33Z

CI completed successfully. Please squash.

liqunl · 2021-04-20T14:12:06Z

@0xdaryl Commits are squashed.

0xdaryl · 2021-04-20T14:30:38Z

No need to re-run CI as previous runs passed cleanly. Merging.

This commit adds support for changes introduced in eclipse-openj9#12162. New interpreter emulator code that requires holding VM access while accessing object on the heap is wrapped in front-end calls enabling it to be done safely on JITServer. Signed-off-by: Dmitry Ten <[email protected]>

liqunl force-pushed the adoptOpenjdkMH/inliner branch 2 times, most recently from 3c2926d to 8d958b3 Compare March 9, 2021 20:56

liqunl force-pushed the adoptOpenjdkMH/inliner branch from 8d958b3 to 00d893e Compare March 10, 2021 15:27

vijaysun-omr reviewed Mar 11, 2021

liqunl force-pushed the adoptOpenjdkMH/inliner branch 2 times, most recently from 8a986c5 to dad1ed4 Compare March 20, 2021 03:02

liqunl force-pushed the adoptOpenjdkMH/inliner branch from dad1ed4 to f96cb12 Compare March 24, 2021 15:29

liqunl mentioned this pull request Mar 30, 2021

Adopt OpenJDK MethodHandle implementation - JIT #10618

Open

15 tasks

liqunl force-pushed the adoptOpenjdkMH/inliner branch 4 times, most recently from 76c58a4 to 28ed488 Compare April 8, 2021 18:54

liqunl force-pushed the adoptOpenjdkMH/inliner branch from 28ed488 to 4ffe4ba Compare April 8, 2021 20:00

vijaysun-omr approved these changes Apr 15, 2021

0xdaryl added the comp:jit label Apr 19, 2021

0xdaryl self-assigned this Apr 19, 2021

0xdaryl approved these changes Apr 20, 2021

liqunl force-pushed the adoptOpenjdkMH/inliner branch from 99340e8 to 399bf2d Compare April 20, 2021 14:11

liqunl changed the title ~~LambdaForm methods inlining part 1~~ LambdaForm methods inlining Apr 20, 2021

0xdaryl merged commit 21998f1 into eclipse-openj9:master Apr 20, 2021

This was referenced Apr 21, 2021

Update linkToVirtual to correctly handle J9JNIMethodID->vTableIndex #12513

Merged

OJDK MH: testJITServer failures #11923

Closed

dmitry-ten mentioned this pull request Apr 21, 2021

Add JITServer support for LambdaForm methods inlining #12525

Merged
