-
Notifications
You must be signed in to change notification settings - Fork 728
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
LambdaForm methods inlining #12162
LambdaForm methods inlining #12162
Conversation
3c2926d
to
8d958b3
Compare
@0xdaryl @vijaysun-omr May I ask for your review? |
8d958b3
to
00d893e
Compare
@@ -4224,13 +4229,18 @@ TR_MultipleCallTargetInliner::exceedsSizeThreshold(TR_CallSite *callSite, int by | |||
// HACK: Get frequency from both sources, and use both. You're | |||
// only cold if you're cold according to both. | |||
|
|||
bool isLambdaFormGeneratedMethod = comp()->fej9()->isLambdaFormGeneratedMethod(callerResolvedMethod); | |||
// TODO: we should ignore frequency for thunk archetype, however, this require performance evaluation | |||
bool frequencyIsInaccurate = isLambdaFormGeneratedMethod; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please explain this further ? i.e. why is the frequency inaccurate for lambda form methods ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LambdaForm in a MethodHandle can change. The change usually happen when customization happens, i.e. when a MethodHandle has been executed frequent enough, a customized LambdaForm will replace the existing LambdaForm. The frequency information of the original LambdaForm is not inherited to the customized one, thus the customized one might seem cold when it's compiled or inlined.
Given the large amount of changes to the interpreter emulator, I would like to take a step back and ask how this code is getting tested in general. I would also like us to consider what are some bugs that have been hit in the past, so that we can think about whether we could have variants of those kinds of problems with these changes. |
8a986c5
to
dad1ed4
Compare
dad1ed4
to
f96cb12
Compare
Update: I've done performance measurement with this change on jruby and nashorn, the performance is on par with baseline. I'm now looking at a failure on zlinux. |
Thanks @liqunl could you please share more details about the performance experiments you alluded to in your prior comment ? |
Actually, there is a few benchmarks with regression, I mistaken higher score as better for them. Overall, the change is helping or not hurting. I'll look at those with regression. For jruby, the benchmark I ran is https://github.com/headius/bench2018, the one that we investigated before.
For nashorn, it's the one provided by user in this issue #5371
Any difference within 10% is within the noise margin, Edit: the nashorn number is the lower the better, so the |
Okay thanks @liqunl I agree with your prioritization of looking at the few regressions in this set of numbers first. |
76c58a4
to
28ed488
Compare
@vijaysun-omr I've fixed the regression and the test failure. With this change, I'm seeing performance gain on all jruby benchmarks on OpenJ9 MethodHandle implementation. On nashorn benchmarks, the performance is on par with the baseline. Will put the numbers once I have all the runs finished. Could you review this PR? |
28ed488
to
4ffe4ba
Compare
Thanks @liqunl. It is'nt completely clear to me what the final delta fix was that helped you attain the performance you mentioned in your last comment. Is it possible to describe and/or point me to that piece of code in this non trivial commit so that I don't have to go through the entire commit again ? |
Jenkins test sanity all jdk8,jdk11 |
// TODO: add code to record dead path and ignore it in object info propagation, enable | ||
// the following code if branch folding is possible in LambdaForm methods | ||
// | ||
if (false && second && first) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vijaysun-omr Here's the first change to attain the performance. The change in this PR will propagate state in local slots and operand stack in bytecode iterator's order. A block will get local slots and operand stack states from its predecessors, but this requires all the predecessors to be visited before the successor, otherwise we set the local slots and operand stack state of the block being visited to unknown, this will cause the failure to find a call target for a MethodHandle call as the target searching requires known MethodHandle object as the receiver of the call.
The disabled code here does branch folding, so we may skip interpreting a block's bytecodes, introducing blocks that are never visited, stopping the propagation of object info to their successors.
Similar branch folding exists in ILGen, so we don't gen bytecodes in dead path, this change won't result in more nodes being generated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, thanks for the good explanation. That makes sense but not very obvious until you clarified :)
if (canFallThru) | ||
{ | ||
debugTrace(tracer(), "maintainStackForIf canFallThrough to bcIndex=%d\n", fallThruBC); | ||
genTarget(fallThruBC); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vijaysun-omr This change is to gen fall through block first to make sure predecessor is interpreted before successor. This only helps in thunk archetype as each thunk archetype (except those of the leaf MethodHandle) contains customization logic which looks like the following, where the fall through is predecessor of the branch target.
if (ILGenMacros.isShareableThunk()) {
undoCustomizationLogic(next);
undoCustomizationLogic(filters);
}
if (!ILGenMacros.isCustomThunk()) {
doCustomizationLogic();
}
However, this won't help when we have more complicated control flow, such as nested if statement. So a separate item will be created to do a reverse post order traversal at CFG level.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay that sounds important since it can reduce the effectiveness of the interpretation that the whole scheme depends on.
Can you please open that issue and make sure it is on the plan to be worked on by talking to Daryl ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do
// TODO: customized lambda form method is similar, it may not | ||
// be executed enough before it gets inlined | ||
// | ||
if (_callerIsThunkArchetype) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vijaysun-omr Another problem with previous commit is that it refactored the code such that we only compute return value of non-cold calls. However, because thunk archetypes are never interpreted, so the call bytecodes always appear as cold. This includes the call to MutableCallSite.getTarget()
, whose return value is used to guide MutableCallSite inlining. MutableCallSite inlining is broken because previous commit doesn't compute this call's value due to incorrect coldness info.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because MutableCallSite.getTarget()
is treated as cold in the baseline, it won't be inlined in the first inlining pass (will be inlined in targeted inlining only if a pass is requested), so the fix here will inline MutableCallSite.getTarget()
in the first pass which gives us performance gain in jruby benchmarks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, this sounds critical and I doubt that filtering on cold paths in this code is the right way to go.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One problem is the check on isUnresolvedInCP
. Since the thunk archetypes are never interpreted, and we always do compile time resolve on them, so the majority of its cp entries will appear unresolved. I think we should at least ignore isUnresolvedInCP
for calls in thunk archetypes. @vijaysun-omr What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes that makes sense. Perhaps these routines from the cold block marker code help you think about and handle these sorts of cases :
and
@vijaysun-omr I added a few comments to explain the changes I made to fix the regression in previous commit. |
void | ||
InterpreterEmulator::visitInvokevirtual() | ||
{ | ||
int32_t cpIndex = next2Bytes(); | ||
auto calleeMethod = (TR_ResolvedJ9Method*)_calltarget->_calleeMethod; | ||
bool isUnresolvedInCP; | ||
TR_ResolvedMethod * resolvedMethod = calleeMethod->getResolvedPossiblyPrivateVirtualMethod(comp(), cpIndex, true, &isUnresolvedInCP); | ||
// Calls in thunk archetype won't be executed by inliner, so they may appear as unresolved |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo here, 'inliner' should be 'interpreter', will fix it before the merge.
The Pull request failed of connection issue
|
Perf numbers. This change on OpenJ9 MethodHandle implementation vs baseline
The 18% regression on nashorn mathTest is because the inlining of |
// call site list and take up some inlining budget, causing less methods | ||
// to be inlined. Don't create call site for them | ||
// | ||
switch (resolvedMethod->getRecognizedMethod()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vijaysun-omr Calls to the following methods are inside the blocks that are folded by branch folding, so they won't be added to the inlining list. Since branch folding is disabled, the following code is to make sure we don't add them to inlining candidate list to save inlining budget for other methods.
Jenkins test sanity all jdk8,jdk11 |
@vijaysun-omr I added a new commit to address your comment regarding ignoring the coldness info. Will squash the commits before merge. Could you review again? |
@0xdaryl May I ask for your review? |
Jenkins test sanity all jdk11,jdknext |
@0xdaryl The two commits need to be merged into one and there is a typo I'll fix after the PR build completes. |
CI completed successfully. Please squash. |
This commit adds support to inline LambdaForm methods used in OpenJDK MethodHandle implementation. It includes changes to inlining heuristic to treat LambdaForm methods in the same way as thunk archetype, also changes to support emulating interpreter execution of LambdaForm methods to find call sites to inline. The InterpreterEmulator change includes: 1. Support iterating more bytecodes. InterpreterEmulator was originally designed for thunk archetypes, which only uses a limited number of bytecodes. However, LambdaForm can contain any bytecodes. This commit adds support to bytecodes that are commonly used in LambdaForm methods. The support for all bytecodes will be done separately. 2. Track object info in local slots. LambdaForm methods use local variables to store intermediate results that are used as call arguments later. Since the refining of MethodHandle INL (invokeBasic, linkTo*) requires object info, we have to track the object info stored in local variables. 3. Prex arg info propagation to callee in with state mode. Currently, prex arg info propagation is done for non-static methods and when peeking is done. However, LambdaForm methods are static. Since we track the operand stack state, we can create prex arg info from operand stack and propagate it down to callee. 4. Add new operand classes corresponding to PrexArgument. 5. Add merge functions on operand to support merging local slot states and operand stack states. We have to merge information coming from different blocks to continue tracking operand stack state and local slots state. This commit also fixes the following problem: MutableCallSiteOperand extends KnownObjOperand, which is not right. As knownObjOperand doesn't rely on any assumption, but MutableCallSiteOperand represents a known object only when the call site target remain unchanged. InterpeterEmulator does a few operations on KnownObjOperand which shouldn't be applied to MutableCallSiteOperand (for example creating known object for final fields of known objects). Signed-off-by: Liqun Liu <[email protected]>
99340e8
to
399bf2d
Compare
@0xdaryl Commits are squashed. |
No need to re-run CI as previous runs passed cleanly. Merging. |
This commit adds support for changes introduced in eclipse-openj9#12162. New interpreter emulator code that requires holding VM access while accessing object on the heap is wrapped in front-end calls enabling it to be done safely on JITServer. Signed-off-by: Dmitry Ten <[email protected]>
This commit adds support to inline LambdaForm methods used in OpenJDK
MethodHandle implementation. It includes changes to inlining heuristic
to treat LambdaForm methods in the same way as thunk archetype, also
changes to support emulating interpreter execution of LambdaForm methods
to find call sites to inline.
The InterpreterEmulator change includes:
Support iterating more bytecodes. InterpreterEmulator was originally
designed for thunk archetypes, which only uses a limited number of
bytecodes. However, LambdaForm can contain any bytecodes. This commit
adds support to bytecodes that are commonly used in LambdaForm methods.
The support for all bytecodes will be done separately.
Track object info in local slots. LambdaForm methods use local
variables to store intermediate results that are used as call arguments
later. Since the refining of MethodHandle INL (invokeBasic, linkTo*)
requires object info, we have to track the object info stored in local
variables.
Prex arg info propagation to callee in with state mode. Currently,
prex arg info propagation is done for non-static methods and when
peeking is done. However, LambdaForm methods are static. Since we track
the operand stack state, we can create prex arg info from operand stack
and propagate it down to callee.
Add new operand classes corresponding to PrexArgument.
Add merge functions on operand to support merging local slot states
and operand stack states. We have to merge information coming from
different blocks to continue tracking operand stack state and local
slots state.
This commit also fixes the following problem:
MutableCallSiteOperand extends KnownObjOperand, which is not right. As
knownObjOperand doesn't rely on any assumption, but
MutableCallSiteOperand represents a known object only when the call site
target remain unchanged. InterpeterEmulator does a few operations on
KnownObjOperand which shouldn't be applied to MutableCallSiteOperand
(for example creating known object for final fields of known objects).
Part of #10618
Signed-off-by: Liqun Liu [email protected]