Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Intermittent Failing Test: Lucene.Net.Search.TestMultiTermQueryRewrites.TestMaxClauseLimitations() #1050

Closed
1 task done
NightOwl888 opened this issue Nov 26, 2024 · 4 comments · Fixed by #1054
Closed
1 task done
Labels
good-first-issue Good for newcomers is:bug pri:high test-failure up-for-grabs This issue is open to be worked on by anyone

Comments

@NightOwl888
Copy link
Contributor

Is there an existing issue for this?

  • I have searched the existing issues

Describe the bug

The Lucene.Net.Search.TestMultiTermQueryRewrites.TestMaxClauseLimitations() test fails under rare random conditions, but does not seem to repeat with the same random seed and culture.

Expected Behavior

The Lucene.Net.Search.TestMultiTermQueryRewrites.TestMaxClauseLimitations() test should always succeed.

Steps To Reproduce

Add the [Repeat(100000)] attribute to the test and it will fail frequently.

Exceptions (if any)

TestMaxClauseLimitations
   Source: TestMultiTermQueryRewrites.cs line 295
   Duration: 2.6 sec

  Message: 
Expected: CheckMaxClauseCount, Actual: Collect

To reproduce this test result:

Option 1:

 Apply the following assembly-level attributes:

[assembly: Lucene.Net.Util.RandomSeed("0x933174838b8c2dfd")]
[assembly: NUnit.Framework.SetCulture("fr-BL")]

Option 2:

 Use the following .runsettings file:

<RunSettings>
  <TestRunParameters>
    <Parameter name="tests:seed" value="0x933174838b8c2dfd" />
    <Parameter name="tests:culture" value="fr-BL" />
  </TestRunParameters>
</RunSettings>

Option 3:

 Create the following lucene.testsettings.json file somewhere between the test assembly and the root of your drive:

{
  "tests": {
     "seed": "0x933174838b8c2dfd",
     "culture": "fr-BL"
  }
}


  Stack Trace: 
TestMultiTermQueryRewrites.CheckMaxClauseLimitation(RewriteMethod method, String memberName) line 268
TestMultiTermQueryRewrites.TestMaxClauseLimitations() line 297
InvokeStub_TestMultiTermQueryRewrites.TestMaxClauseLimitations(Object, Object, IntPtr*)
1)    at Lucene.Net.Search.TestMultiTermQueryRewrites.CheckMaxClauseLimitation(RewriteMethod method, String memberName) in F:\Projects\lucenenet\src\Lucene.Net.Tests\Search\TestMultiTermQueryRewrites.cs:line 262
TestMultiTermQueryRewrites.TestMaxClauseLimitations() line 297
InvokeStub_TestMultiTermQueryRewrites.TestMaxClauseLimitations(Object, Object, IntPtr*)

Lucene.NET Version

4.8.0-beta00017

.NET Version

net8.0

Operating System

Windows

Anything else?

It is known to fail on Windows x64 with net8.0. It may fail also fail under other configurations.

This test should be compared with the behavior in Lucene 4.8.0. For some reason we are sometimes seeing the exception thrown from the Collect() method when we are expecting it to always be thrown from the CheckMaxClauseCount() method according to the way the test is written.

Note also that the assert message is commented out, and it should not be.

@NightOwl888 NightOwl888 added up-for-grabs This issue is open to be worked on by anyone is:bug test-failure good-first-issue Good for newcomers pri:high labels Nov 26, 2024
@NightOwl888 NightOwl888 added this to the 4.8.0-beta00018 milestone Nov 26, 2024
@paulirwin
Copy link
Contributor

I can reproduce with enough repeats on net8.0 macOS arm64. I'm fairly certain that what we're seeing is .NET 8's Dynamic PGO inlining this trivial method at runtime. Much more further reading, but search the page for "inline" to see that it does inline trivial methods if it determines that is beneficial: https://devblogs.microsoft.com/dotnet/performance-improvements-in-net-8/

We could force prevent this with [MethodImpl(MethodImplOptions.NoInlining)] on CheckMaxClauseCount (and initial results in my testing shows that seems to fix it), but IMO we should not rely on that. I think we should remove this assertion, even though it exists upstream (where Java did not have anything like .NET 8's Dynamic PGO to affect this stack trace), with a LUCENENET-specific comment explaining why we're removing it:

//  Maybe remove this assert in later versions, when internal API changes:
Assert.AreEqual("CheckMaxClauseCount", new StackTrace(e, false).GetFrames()[0].GetMethod().Name); //, "Should throw BooleanQuery.TooManyClauses with a stacktrace containing checkMaxClauseCount()");

We should only care that a BooleanQuery.TooManyClausesException is thrown here, and not which specific method throws it.

@NightOwl888
Copy link
Contributor Author

Actually, it would be a bug if we didn't have [MethodImpl(MethodImplOptions.NoInlining)] on CheckMaxClauseCount. That is a requirement of all of the tests that check for method names in the stack (and always has been). These "who called me" checks exist in Lucene for a reason.

There are a few dozen of these stack trace checks in the tests. We took [MethodImpl(MethodImplOptions.NoInlining)] a bit too far, because it was difficult to distinguish between identically named methods as to which one was expected in the stack (see #931). So, we can remove some of these without causing problems. However, this is a case where we didn't go far enough.

While I would like to find a better way to check these scenarios, I am having trouble coming up with a way doesn't significantly deviate from the upstream code. Checking the stack seems to be the easiest approach. And with optimizations enabled it requires [MethodImpl(MethodImplOptions.NoInlining)] to function.

@paulirwin
Copy link
Contributor

These "who called me" checks exist in Lucene for a reason.

What is that reason? If anything, the comment on it indicates a keenness to remove the assertion, rather than keep it, because it acknowledges that it's testing an internal API. Users should only be concerned that the exception was thrown as expected, not that a specific protected method in the call stack was the one that did the check.

Perhaps we could ask the Lucene team for their thoughts, but my gut tells me that if they had anything as awesome as Dynamic PGO at the time, they would lean on the side of removing this assertion rather than harming performance by forcing it not to be inlined, just to keep the assertion in the test (which has no meaningful benefit to end users).

Remember that Dynamic PGO is the runtime observing ways that it can tangibly improve performance based on real-world measured behavior, and then making those changes. By inhibiting this with NoInlining, we're choosing to harm performance just for purity with the original assertion. IMO Dynamic PGO's inlining is a feature, not a bug. And I think we should do a sweep to remove all such assertions and NoInlining that was added to make these assertions pass. Let's err on the side of letting the runtime give us better performance, than having to make sure our stack trace matches some decade+ old version of the JDK's behavior. Especially when that assertion's author was even questioning whether it should be there in the first place. We could be considered the "later version" of this code that is changing the internal API, at runtime, via Dynamic PGO.

@paulirwin
Copy link
Contributor

I reviewed the other cases where NoInlining was added so that the stack trace stays "inspectable," and this case stands out as different. The other cases are inside test classes that change the behavior at test runtime; in this case it is only used for an assertion. Based on my rationale above, I've submitted a PR to comment out this assertion and add a comment explaining why, rather than disable PGO or force NoInlining. The assertion inside the try block even says what it's testing: Assert.Fail("Should throw BooleanQuery.TooManyClauses"); - the important thing here is that the exception is thrown, not who threw it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
good-first-issue Good for newcomers is:bug pri:high test-failure up-for-grabs This issue is open to be worked on by anyone
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants