LambdaForm methods inlining #12162

Merged: 1 commit, Apr 20, 2021

liqunl
Contributor

@liqunl liqunl commented Mar 9, 2021

This commit adds support for inlining the LambdaForm methods used in the
OpenJDK MethodHandle implementation. It changes the inlining heuristics to
treat LambdaForm methods the same way as thunk archetypes, and it extends
the interpreter emulation to LambdaForm methods so that call sites to
inline can be found.

The InterpreterEmulator change includes:

  1. Support iterating more bytecodes. InterpreterEmulator was originally
    designed for thunk archetypes, which use only a limited number of
    bytecodes. However, a LambdaForm method can contain any bytecode. This
    commit adds support for the bytecodes that are commonly used in
    LambdaForm methods. Support for all bytecodes will be done separately.

  2. Track object info in local slots. LambdaForm methods use local
    variables to store intermediate results that are used as call arguments
    later. Since the refining of MethodHandle INL (invokeBasic, linkTo*)
    requires object info, we have to track the object info stored in local
    variables.

  3. Prex arg info propagation to the callee in with-state mode. Currently,
    prex arg info propagation is done for non-static methods and when
    peeking is done. However, LambdaForm methods are static. Since we track
    the operand stack state, we can create prex arg info from the operand
    stack and propagate it down to the callee.

  4. Add new operand classes corresponding to PrexArgument.

  5. Add merge functions on operand to support merging local slot states
    and operand stack states. We have to merge information coming from
    different blocks to continue tracking operand stack state and local
    slots state.
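
As a rough illustration of items 2 and 5, the merge of local slot or operand stack states coming from two predecessors could look like the sketch below. The `Operand` struct, its two kinds, and the function names are hypothetical simplifications for this example and are not the classes added by this PR; the only idea taken from the description is that object info survives a merge when both incoming paths agree on it.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, simplified operand: either nothing is known about the value,
// or it is a known object identified by an index. Illustrative names only.
struct Operand
   {
   enum Kind { Unknown, KnownObj };
   Kind kind = Unknown;
   int knownObjectIndex = -1; // meaningful only when kind == KnownObj

   static Operand unknown() { return Operand(); }
   static Operand knownObj(int index)
      {
      Operand op;
      op.kind = KnownObj;
      op.knownObjectIndex = index;
      return op;
      }
   };

// Merge two operands that reach the same slot from different predecessors.
// The object info is kept only if both paths agree on it.
inline Operand mergeOperands(const Operand &a, const Operand &b)
   {
   if (a.kind == Operand::KnownObj && b.kind == Operand::KnownObj
       && a.knownObjectIndex == b.knownObjectIndex)
      return a;
   return Operand::unknown();
   }

// Merge two whole states (local slots or operand stack) slot by slot.
inline std::vector<Operand> mergeStates(const std::vector<Operand> &x,
                                        const std::vector<Operand> &y)
   {
   assert(x.size() == y.size() && "merged states must have the same shape");
   std::vector<Operand> merged;
   merged.reserve(x.size());
   for (std::size_t i = 0; i < x.size(); ++i)
      merged.push_back(mergeOperands(x[i], y[i]));
   return merged;
   }
```

With this sketch, a slot that holds the same known object on both paths stays known after the merge, while a slot that differs between the paths degrades to unknown.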

This commit also fixes the following problem:

MutableCallSiteOperand extends KnownObjOperand, which is not right:
KnownObjOperand doesn't rely on any assumption, whereas
MutableCallSiteOperand represents a known object only while the call site
target remains unchanged. InterpreterEmulator does a few operations on
KnownObjOperand which shouldn't be applied to MutableCallSiteOperand
(for example, creating a known object for final fields of known objects).
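
To make the hierarchy problem concrete, here is a minimal sketch in which the two operand kinds are siblings. All members, the `isUnconditionallyKnown` query, and `canFoldFinalFieldLoad` are hypothetical names invented for this example; the actual classes in the PR may be shaped differently.

```cpp
#include <cassert>

// Hypothetical, simplified operand hierarchy (the real classes carry more state).
class Operand
   {
   public:
   virtual ~Operand() {}
   // True only when the object identity holds without any runtime assumption.
   virtual bool isUnconditionallyKnown() const { return false; }
   };

class KnownObjOperand : public Operand
   {
   public:
   explicit KnownObjOperand(int index) : _knownObjectIndex(index)
      {
      assert(index >= 0);
      }
   int knownObjectIndex() const { return _knownObjectIndex; }
   bool isUnconditionallyKnown() const override { return true; }
   private:
   int _knownObjectIndex;
   };

// A sibling of KnownObjOperand rather than a subclass: the target object is
// valid only while the MutableCallSite keeps its current target.
class MutableCallSiteOperand : public Operand
   {
   public:
   MutableCallSiteOperand(int targetIndex, int callSiteIndex)
      : _targetIndex(targetIndex), _callSiteIndex(callSiteIndex)
      {
      assert(targetIndex >= 0 && callSiteIndex >= 0);
      }
   int targetIndex() const { return _targetIndex; }
   int callSiteIndex() const { return _callSiteIndex; }
   private:
   int _targetIndex;
   int _callSiteIndex;
   };

// Creating a known object for a final field of the receiver is only sound
// when the receiver does not depend on an assumption.
inline bool canFoldFinalFieldLoad(const Operand &receiver)
   {
   return receiver.isUnconditionallyKnown();
   }
```

Because `MutableCallSiteOperand` no longer inherits from `KnownObjOperand`, code written against `KnownObjOperand` cannot be applied to it by accident.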

Part of #10618

Signed-off-by: Liqun Liu [email protected]

@liqunl liqunl force-pushed the adoptOpenjdkMH/inliner branch 2 times, most recently from 3c2926d to 8d958b3 Compare March 9, 2021 20:56
@liqunl
Contributor Author

liqunl commented Mar 9, 2021

Depends on #11634.
I added the commits from #11634 that this PR depends on. I will remove them before merge.

@liqunl
Contributor Author

liqunl commented Mar 9, 2021

@0xdaryl @vijaysun-omr May I ask for your review?

@liqunl liqunl force-pushed the adoptOpenjdkMH/inliner branch from 8d958b3 to 00d893e Compare March 10, 2021 15:27
@@ -4224,13 +4229,18 @@ TR_MultipleCallTargetInliner::exceedsSizeThreshold(TR_CallSite *callSite, int by
// HACK: Get frequency from both sources, and use both. You're
// only cold if you're cold according to both.

bool isLambdaFormGeneratedMethod = comp()->fej9()->isLambdaFormGeneratedMethod(callerResolvedMethod);
// TODO: we should ignore frequency for thunk archetype, however, this require performance evaluation
bool frequencyIsInaccurate = isLambdaFormGeneratedMethod;
Contributor

Could you please explain this further, i.e. why is the frequency inaccurate for LambdaForm methods?

Contributor Author

The LambdaForm in a MethodHandle can change. The change usually happens on customization: once a MethodHandle has been executed frequently enough, a customized LambdaForm replaces the existing one. The frequency information of the original LambdaForm is not inherited by the customized one, so the customized LambdaForm might seem cold when it is compiled or inlined.
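
A minimal sketch of how such a flag could feed a coldness decision is shown below. The diff only shows how `frequencyIsInaccurate` is computed, so the way it is consumed here, the `CallSiteFrequency` struct, and the threshold parameter are assumptions made for illustration. The "cold only if cold according to both sources" rule comes from the code comment in the diff.

```cpp
#include <cassert>

// Hypothetical inputs: two independent frequency estimates for one call site.
struct CallSiteFrequency
   {
   int blockFrequency;    // estimate from the first source
   int bytecodeFrequency; // estimate from the second source
   };

// A call site is treated as cold only if both estimates say so, and never
// when the frequency data is known to be unreliable (for example, because
// the caller is a LambdaForm generated method).
inline bool isColdCallSite(const CallSiteFrequency &freq,
                           int coldThreshold,
                           bool frequencyIsInaccurate)
   {
   assert(coldThreshold >= 0);
   if (frequencyIsInaccurate)
      return false;
   return freq.blockFrequency < coldThreshold
       && freq.bytecodeFrequency < coldThreshold;
   }
```

Under these assumptions, a freshly customized LambdaForm with near-zero counts is not penalized as cold, while ordinary callers keep the existing two-source check.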

@vijaysun-omr
Contributor

Given the large amount of changes to the interpreter emulator, I would like to take a step back and ask how this code is getting tested in general.

I would also like us to consider what are some bugs that have been hit in the past, so that we can think about whether we could have variants of those kinds of problems with these changes.

@liqunl liqunl force-pushed the adoptOpenjdkMH/inliner branch 2 times, most recently from 8a986c5 to dad1ed4 Compare March 20, 2021 03:02
@liqunl liqunl force-pushed the adoptOpenjdkMH/inliner branch from dad1ed4 to f96cb12 Compare March 24, 2021 15:29
@liqunl
Contributor Author

liqunl commented Mar 30, 2021

Update: I've measured performance with this change on jruby and nashorn, and it is on par with the baseline. I'm now looking at a failure on zlinux.

@vijaysun-omr
Contributor

Thanks @liqunl. Could you please share more details about the performance experiments you alluded to in your prior comment?

@liqunl
Contributor Author

liqunl commented Apr 1, 2021

Actually, there are a few benchmarks with regressions; I mistakenly treated a higher score as better for them. Overall, the change is helping or not hurting. I'll look at the ones with regressions. For jruby, the benchmark I ran is https://github.com/headius/bench2018, the one that we investigated before.

red_black
performance (LOWER is better)
config    baseline    inlineMH
mean      1.01        1.2 (-16%)
std       0.02        0.04

aref
performance (HIGHER is better)
'Instantiation'
config    baseline    inlineMH
mean      23.58       31.05 (+32%)
std       0.26        0.64

hash
performance (HIGHER is better)
'bm_hash_small4'
config    baseline    inlineMH
mean      202.64      216.99 (+7%)
std       7.06        3.29
'bm_hash_long'
config    baseline    inlineMH
mean      81.77       80.34 (-2%)
std       3.09        1.68
'bm_hash_small8'
config    baseline    inlineMH
mean      142.91      151.24 (+6%)
std       5.19        8.3
'bm_bighash'
config    baseline    inlineMH
mean      223.13      234.86 (+5%)
std       13.69       12.49
'bm_hash_small2'
config    baseline    inlineMH
mean      280.5       295.66 (+5%)
std       11.37       2.1

mandelbrot
performance (LOWER is better)
config    baseline    inlineMH
mean      3.28        2.72 (+21%)
std       0.07        0.06

kwargs
performance (LOWER is better)
config    baseline    inlineMH
mean      1.47        1.6 (-8%)
std       0.03        0.05

weird_sort
performance (HIGHER is better)
'weird'
config    baseline    inlineMH
mean      11.8        11.07 (-6%)
std       0.16        0.29
'normal'
config    baseline    inlineMH
mean      18.73       30.28 (+62%)
std       0.37        1.38

For nashorn, I used the benchmark provided by the user in issue #5371.

nashorn
performance (LOWER is better)
config         baseline         inlineMH
closureTest    81760.180        145942.900 (-78.5%)
javaTest       394310.934       391017.628 (+0.8%)
loadJS         444556698.310    421764198.560 (+5%)
mathTest       70877351.707     69516350.532 (+2%)
objectTest     184565.074       198183.275 (-7%)
regexTest      19488951.290     19718357.701 (-1.2%)

Any difference within 10% is within the noise margin, so we're seeing a significant improvement on closureTest and the others are on par with baseline.

Edit: the nashorn numbers are lower-is-better, so closureTest has a large regression with this change. I will look at this as well.

@vijaysun-omr
Contributor

Okay, thanks @liqunl. I agree with your prioritization of looking at the few regressions in this set of numbers first.

@liqunl liqunl force-pushed the adoptOpenjdkMH/inliner branch 4 times, most recently from 76c58a4 to 28ed488 Compare April 8, 2021 18:54
@liqunl
Contributor Author

liqunl commented Apr 8, 2021

@vijaysun-omr I've fixed the regression and the test failure. With this change, I'm seeing performance gains on all jruby benchmarks with the OpenJ9 MethodHandle implementation. On the nashorn benchmarks, the performance is on par with the baseline. I will post the numbers once all the runs have finished. Could you review this PR?

@liqunl liqunl force-pushed the adoptOpenjdkMH/inliner branch from 28ed488 to 4ffe4ba Compare April 8, 2021 20:00
@vijaysun-omr
Contributor

Thanks @liqunl. It isn't completely clear to me what the final delta fix was that helped you attain the performance you mentioned in your last comment. Is it possible to describe it and/or point me to that piece of code in this non-trivial commit, so that I don't have to go through the entire commit again?

@vijaysun-omr
Contributor

Jenkins test sanity all jdk8,jdk11

// TODO: add code to record dead path and ignore it in object info propagation, enable
// the following code if branch folding is possible in LambdaForm methods
//
if (false && second && first)
Contributor Author

@liqunl liqunl Apr 9, 2021


@vijaysun-omr Here's the first change needed to attain the performance. The change in this PR propagates the state of the local slots and the operand stack in the bytecode iterator's order. A block gets its local slot and operand stack states from its predecessors, but this requires all the predecessors to be visited before the successor. Otherwise, we set the local slot and operand stack states of the block being visited to unknown. That causes a failure to find a call target for a MethodHandle call, because the target search requires a known MethodHandle object as the receiver of the call.

The disabled code here does branch folding, which may skip interpreting a block's bytecodes. That introduces blocks that are never visited and stops the propagation of object info to their successors.

Similar branch folding exists in ILGen, so we don't gen bytecodes in dead paths, and this change won't result in more nodes being generated.
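
The interaction between visiting order and branch folding can be sketched as follows. The `Block` struct, the use of the block index as a stand-in for bytecode order, and the function name are all assumptions for this illustration; the real emulator tracks full operand states rather than a single flag.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical block description: predecessors by index, plus whether branch
// folding decided to skip interpreting the block.
struct Block
   {
   std::vector<int> predecessors;
   bool skippedByFolding = false;
   };

// Walk the blocks in index order (standing in for bytecode order) and report,
// per block, whether it could inherit its entry state from its predecessors.
// A block with any unvisited predecessor has its state reset to unknown.
inline std::vector<bool> entryStateInherited(const std::vector<Block> &blocks)
   {
   std::vector<bool> visited(blocks.size(), false);
   std::vector<bool> inherited(blocks.size(), false);
   for (std::size_t i = 0; i < blocks.size(); ++i)
      {
      if (blocks[i].skippedByFolding)
         continue; // never interpreted, so it never counts as visited
      bool allPredsVisited = true;
      for (int pred : blocks[i].predecessors)
         {
         assert(pred >= 0 && static_cast<std::size_t>(pred) < blocks.size());
         if (!visited[static_cast<std::size_t>(pred)])
            allPredsVisited = false;
         }
      inherited[i] = allPredsVisited;
      visited[i] = true;
      }
   return inherited;
   }
```

For a diamond-shaped CFG (0 to 1 and 2, both to 3), the join block 3 inherits its state when every block is interpreted, but loses it as soon as folding skips block 2, which is the situation the disabled code avoids.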

Contributor

I see, thanks for the good explanation. That makes sense, but it was not very obvious until you clarified :)

if (canFallThru)
{
debugTrace(tracer(), "maintainStackForIf canFallThrough to bcIndex=%d\n", fallThruBC);
genTarget(fallThruBC);
Contributor Author

@vijaysun-omr This change generates the fall-through block first to make sure the predecessor is interpreted before the successor. This only helps in thunk archetypes, as each thunk archetype (except those of the leaf MethodHandle) contains customization logic that looks like the following, where the fall-through is a predecessor of the branch target.

    if (ILGenMacros.isShareableThunk()) {
        undoCustomizationLogic(next);
        undoCustomizationLogic(filters);
    }
    if (!ILGenMacros.isCustomThunk()) {
        doCustomizationLogic();
    }

However, this won't help when we have more complicated control flow, such as nested if statements. So a separate item will be created to do a reverse post-order traversal at the CFG level.
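
For reference, a reverse post-order traversal is the standard way to visit every block of an acyclic CFG after all of its predecessors. The sketch below is a generic depth-first implementation over a hypothetical successor-list encoding; it is not the planned OpenJ9 change, and loops (back edges) would still need separate handling.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// successors[b] lists the blocks reachable from block b (hypothetical encoding).
typedef std::vector<std::vector<int> > SuccessorLists;

// Depth-first walk that records a block only after all of its successors.
static void postOrderVisit(const SuccessorLists &successors, int block,
                           std::vector<bool> &seen, std::vector<int> &postOrder)
   {
   seen[static_cast<std::size_t>(block)] = true;
   for (int succ : successors[static_cast<std::size_t>(block)])
      if (!seen[static_cast<std::size_t>(succ)])
         postOrderVisit(successors, succ, seen, postOrder);
   postOrder.push_back(block);
   }

// Reversing the post order yields an order in which, for an acyclic CFG,
// every block appears after all of its predecessors.
inline std::vector<int> reversePostOrder(const SuccessorLists &successors, int entry)
   {
   assert(entry >= 0 && static_cast<std::size_t>(entry) < successors.size());
   std::vector<bool> seen(successors.size(), false);
   std::vector<int> order;
   postOrderVisit(successors, entry, seen, order);
   std::reverse(order.begin(), order.end());
   return order;
   }
```

On a nested-if shape such as 0 to {1, 4}, 1 to {2, 3}, 2 to 3, 3 to 4, this visits the join blocks 3 and 4 only after every path into them has been interpreted, which "fall-through first" alone does not guarantee.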

Contributor

Okay, that sounds important, since it can reduce the effectiveness of the interpretation that the whole scheme depends on.
Can you please open that issue and make sure it is on the plan to be worked on by talking to Daryl?

Contributor Author

Will do

// TODO: customized lambda form method is similar, it may not
// be executed enough before it gets inlined
//
if (_callerIsThunkArchetype)
Contributor Author

@vijaysun-omr Another problem with the previous commit is that it refactored the code such that we only compute the return value of non-cold calls. However, because thunk archetypes are never interpreted, their call bytecodes always appear cold. This includes the call to MutableCallSite.getTarget(), whose return value is used to guide MutableCallSite inlining. MutableCallSite inlining was broken because the previous commit didn't compute this call's value due to incorrect coldness info.

Contributor Author

Because MutableCallSite.getTarget() is treated as cold in the baseline, it isn't inlined in the first inlining pass (it is inlined in targeted inlining only if a pass is requested). The fix here inlines MutableCallSite.getTarget() in the first pass, which gives us the performance gain in the jruby benchmarks.

Contributor

Okay, this sounds critical and I doubt that filtering on cold paths in this code is the right way to go.

Contributor Author

One problem is the check on isUnresolvedInCP. Since thunk archetypes are never interpreted and we always do compile-time resolution on them, the majority of their cp entries will appear unresolved. I think we should at least ignore isUnresolvedInCP for calls in thunk archetypes. @vijaysun-omr What do you think?
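
The proposal in this thread could be sketched as the predicate below. The `CallInfo` struct and `shouldComputeReturnValue` are hypothetical names, and the rule for non-thunk callers is an assumption about the baseline filter; only the idea of ignoring coldness and `isUnresolvedInCP` for thunk archetype callers comes from the discussion.

```cpp
#include <cassert>

// Hypothetical summary of what the emulator knows about one call bytecode.
struct CallInfo
   {
   bool appearsCold;
   bool isUnresolvedInCP;
   };

// Thunk archetypes are never interpreted, so neither signal says anything
// about how hot or how resolvable the call really is; ignore both for them.
inline bool shouldComputeReturnValue(const CallInfo &call, bool callerIsThunkArchetype)
   {
   if (callerIsThunkArchetype)
      return true;
   return !call.appearsCold && !call.isUnresolvedInCP;
   }

// Usage: a call such as MutableCallSite.getTarget() inside a thunk archetype
// looks cold and unresolved, yet its return value is still computed.
inline void usageExample()
   {
   CallInfo getTargetCall = { true, true };
   assert(shouldComputeReturnValue(getTargetCall, true));
   assert(!shouldComputeReturnValue(getTargetCall, false));
   }
```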

Contributor

@vijaysun-omr vijaysun-omr Apr 12, 2021


@liqunl
Contributor Author

liqunl commented Apr 9, 2021

@vijaysun-omr I added a few comments to explain the changes I made to fix the regression in previous commit.

void
InterpreterEmulator::visitInvokevirtual()
{
int32_t cpIndex = next2Bytes();
auto calleeMethod = (TR_ResolvedJ9Method*)_calltarget->_calleeMethod;
bool isUnresolvedInCP;
TR_ResolvedMethod * resolvedMethod = calleeMethod->getResolvedPossiblyPrivateVirtualMethod(comp(), cpIndex, true, &isUnresolvedInCP);
// Calls in thunk archetype won't be executed by inliner, so they may appear as unresolved
Contributor Author

Typo here, 'inliner' should be 'interpreter', will fix it before the merge.

@liqunl
Contributor Author

liqunl commented Apr 9, 2021

The pull request build failed because of a connection issue:

stderr: fatal: unable to access 'https://github.com/eclipse/openj9.git/': Could not resolve host: github.com

@liqunl
Contributor Author

liqunl commented Apr 9, 2021

Perf numbers. This change on OpenJ9 MethodHandle implementation vs baseline

red_black
performance (LOWER is better)
config    baseline    inlineMH
mean      1.0         0.94 (+6%)
std       0.02        0.03

aref
performance (HIGHER is better)
'Instantiation'
config    baseline    inlineMH
mean      23.03       36.8 (+60%)
std       0.69        0.55

hash
performance (HIGHER is better)
'bm_hash_small4'
config    baseline    inlineMH
mean      209.9       217.7 (+4%)
std       10.67       8.04
'bm_hash_long'
config    baseline    inlineMH
mean      80.31       83.83 (+4%)
std       2.99        3.07
'bm_hash_small8'
config    baseline    inlineMH
mean      146.88      144.66 (-2%)
std       5.9         16.32
'bm_bighash'
config    baseline    inlineMH
mean      222.56      227.3 (+2%)
std       18.86       13.25
'bm_hash_small2'
config    baseline    inlineMH
mean      288.65      297.44 (+3%)
std       9.24        6.2

mandelbrot
performance (LOWER is better)
config    baseline    inlineMH
mean      3.31        2.64 (+25%)
std       0.06        0.03

kwargs
performance (LOWER is better)
config    baseline    inlineMH
mean      1.47        1.34 (+10%)
std       0.01        0.01

weird_sort
performance (HIGHER is better)
'weird'
config    baseline    inlineMH
mean      12.21       14.73 (+21%)
std       0.47        0.25
'normal'
config    baseline    inlineMH
mean      18.52       22.87 (+23%)
std       0.57        0.74

nashorn
performance (LOWER is better)
config         baseline         inlineMH
loadJS         430822761.366    451870630.684 (-5%)
mathTest       71251840.876     84302404.724 (-18%)
objectTest     184900.535       184510.721 (+0%)
regexTest      19596387.452     19367823.655 (+1%)
javaTest       389886.415       387203.316 (+1%)
closureTest    83344.562        80929.398 (+3%)

The 18% regression on nashorn mathTest occurs because inlining MutableCallSite.getTarget() increases the size of the hottest method and results in a compile failure due to excessive memory usage. Disabling inlining of this method fixes the regression. However, the hottest method is very large (it contains many invokedynamic bytecodes with long MethodHandle chains), and even with the baseline it is at the edge of breaking the memory limit after inlining, so it is very sensitive to memory consumption. We should still inline MutableCallSite.getTarget() given its benefit in jruby.

// call site list and take up some inlining budget, causing less methods
// to be inlined. Don't create call site for them
//
switch (resolvedMethod->getRecognizedMethod())
Contributor Author

@vijaysun-omr Calls to the following methods are inside blocks that branch folding would fold away, so with folding they are never added to the inlining list. Since branch folding is now disabled, the following code makes sure we don't add them to the inlining candidate list, which saves inlining budget for other methods.

@vijaysun-omr
Contributor

Jenkins test sanity all jdk8,jdk11

@liqunl
Contributor Author

liqunl commented Apr 13, 2021

@vijaysun-omr I added a new commit to address your comment regarding ignoring the coldness info. Will squash the commits before merge. Could you review again?

@liqunl
Contributor Author

liqunl commented Apr 15, 2021

@0xdaryl May I ask for your review?

@0xdaryl
Contributor

0xdaryl commented Apr 19, 2021

Jenkins test sanity all jdk11,jdknext

@0xdaryl 0xdaryl self-assigned this Apr 19, 2021
@liqunl
Contributor Author

liqunl commented Apr 20, 2021

@0xdaryl The two commits need to be merged into one and there is a typo I'll fix after the PR build completes.

@0xdaryl
Contributor

0xdaryl commented Apr 20, 2021

CI completed successfully. Please squash.

@liqunl liqunl force-pushed the adoptOpenjdkMH/inliner branch from 99340e8 to 399bf2d Compare April 20, 2021 14:11
@liqunl liqunl changed the title from "LambdaForm methods inlining part 1" to "LambdaForm methods inlining" Apr 20, 2021
@liqunl
Contributor Author

liqunl commented Apr 20, 2021

@0xdaryl Commits are squashed.

@0xdaryl
Contributor

0xdaryl commented Apr 20, 2021

No need to re-run CI as previous runs passed cleanly. Merging.

@0xdaryl 0xdaryl merged commit 21998f1 into eclipse-openj9:master Apr 20, 2021
dmitry-ten added a commit to dmitry-ten/openj9 that referenced this pull request Apr 21, 2021
This commit adds support for changes introduced in eclipse-openj9#12162.
New interpreter emulator code that requires holding VM
access while accessing object on the heap is wrapped in front-end
calls enabling it to be done safely on JITServer.

Signed-off-by: Dmitry Ten <[email protected]>