Intermittent Failing Test: Lucene.Net.Search.TestMultiTermQueryRewrites.TestMaxClauseLimitations() #1050

NightOwl888 · 2024-11-26T13:28:42Z

Is there an existing issue for this?

I have searched the existing issues

Describe the bug

The Lucene.Net.Search.TestMultiTermQueryRewrites.TestMaxClauseLimitations() test fails under rare random conditions, but does not seem to repeat with the same random seed and culture.

Expected Behavior

The Lucene.Net.Search.TestMultiTermQueryRewrites.TestMaxClauseLimitations() test should always succeed.

Steps To Reproduce

Add the [Repeat(100000)] attribute to the test and it will fail frequently.

Exceptions (if any)

TestMaxClauseLimitations
   Source: TestMultiTermQueryRewrites.cs line 295
   Duration: 2.6 sec

  Message: 
Expected: CheckMaxClauseCount, Actual: Collect

To reproduce this test result:

Option 1:

 Apply the following assembly-level attributes:

[assembly: Lucene.Net.Util.RandomSeed("0x933174838b8c2dfd")]
[assembly: NUnit.Framework.SetCulture("fr-BL")]

Option 2:

 Use the following .runsettings file:

<RunSettings>
  <TestRunParameters>
    <Parameter name="tests:seed" value="0x933174838b8c2dfd" />
    <Parameter name="tests:culture" value="fr-BL" />
  </TestRunParameters>
</RunSettings>

Option 3:

 Create the following lucene.testsettings.json file somewhere between the test assembly and the root of your drive:

{
  "tests": {
     "seed": "0x933174838b8c2dfd",
     "culture": "fr-BL"
  }
}


  Stack Trace: 
TestMultiTermQueryRewrites.CheckMaxClauseLimitation(RewriteMethod method, String memberName) line 268
TestMultiTermQueryRewrites.TestMaxClauseLimitations() line 297
InvokeStub_TestMultiTermQueryRewrites.TestMaxClauseLimitations(Object, Object, IntPtr*)
1)    at Lucene.Net.Search.TestMultiTermQueryRewrites.CheckMaxClauseLimitation(RewriteMethod method, String memberName) in F:\Projects\lucenenet\src\Lucene.Net.Tests\Search\TestMultiTermQueryRewrites.cs:line 262
TestMultiTermQueryRewrites.TestMaxClauseLimitations() line 297
InvokeStub_TestMultiTermQueryRewrites.TestMaxClauseLimitations(Object, Object, IntPtr*)

Lucene.NET Version

4.8.0-beta00017

.NET Version

net8.0

Operating System

Windows

Anything else?

It is known to fail on Windows x64 with net8.0. It may fail also fail under other configurations.

This test should be compared with the behavior in Lucene 4.8.0. For some reason we are sometimes seeing the exception thrown from the Collect() method when we are expecting it to always be thrown from the CheckMaxClauseCount() method according to the way the test is written.

Note also that the assert message is commented out, and it should not be.

The text was updated successfully, but these errors were encountered:

paulirwin · 2024-11-26T19:49:02Z

I can reproduce with enough repeats on net8.0 macOS arm64. I'm fairly certain that what we're seeing is .NET 8's Dynamic PGO inlining this trivial method at runtime. Much more further reading, but search the page for "inline" to see that it does inline trivial methods if it determines that is beneficial: https://devblogs.microsoft.com/dotnet/performance-improvements-in-net-8/

We could force prevent this with [MethodImpl(MethodImplOptions.NoInlining)] on CheckMaxClauseCount (and initial results in my testing shows that seems to fix it), but IMO we should not rely on that. I think we should remove this assertion, even though it exists upstream (where Java did not have anything like .NET 8's Dynamic PGO to affect this stack trace), with a LUCENENET-specific comment explaining why we're removing it:

//  Maybe remove this assert in later versions, when internal API changes:
Assert.AreEqual("CheckMaxClauseCount", new StackTrace(e, false).GetFrames()[0].GetMethod().Name); //, "Should throw BooleanQuery.TooManyClauses with a stacktrace containing checkMaxClauseCount()");

We should only care that a BooleanQuery.TooManyClausesException is thrown here, and not which specific method throws it.

NightOwl888 · 2024-11-26T20:11:30Z

Actually, it would be a bug if we didn't have [MethodImpl(MethodImplOptions.NoInlining)] on CheckMaxClauseCount. That is a requirement of all of the tests that check for method names in the stack (and always has been). These "who called me" checks exist in Lucene for a reason.

There are a few dozen of these stack trace checks in the tests. We took [MethodImpl(MethodImplOptions.NoInlining)] a bit too far, because it was difficult to distinguish between identically named methods as to which one was expected in the stack (see #931). So, we can remove some of these without causing problems. However, this is a case where we didn't go far enough.

While I would like to find a better way to check these scenarios, I am having trouble coming up with a way doesn't significantly deviate from the upstream code. Checking the stack seems to be the easiest approach. And with optimizations enabled it requires [MethodImpl(MethodImplOptions.NoInlining)] to function.

paulirwin · 2024-11-26T20:59:36Z

These "who called me" checks exist in Lucene for a reason.

What is that reason? If anything, the comment on it indicates a keenness to remove the assertion, rather than keep it, because it acknowledges that it's testing an internal API. Users should only be concerned that the exception was thrown as expected, not that a specific protected method in the call stack was the one that did the check.

Perhaps we could ask the Lucene team for their thoughts, but my gut tells me that if they had anything as awesome as Dynamic PGO at the time, they would lean on the side of removing this assertion rather than harming performance by forcing it not to be inlined, just to keep the assertion in the test (which has no meaningful benefit to end users).

Remember that Dynamic PGO is the runtime observing ways that it can tangibly improve performance based on real-world measured behavior, and then making those changes. By inhibiting this with NoInlining, we're choosing to harm performance just for purity with the original assertion. IMO Dynamic PGO's inlining is a feature, not a bug. And I think we should do a sweep to remove all such assertions and NoInlining that was added to make these assertions pass. Let's err on the side of letting the runtime give us better performance, than having to make sure our stack trace matches some decade+ old version of the JDK's behavior. Especially when that assertion's author was even questioning whether it should be there in the first place. We could be considered the "later version" of this code that is changing the internal API, at runtime, via Dynamic PGO.

paulirwin · 2024-12-02T23:52:42Z

I reviewed the other cases where NoInlining was added so that the stack trace stays "inspectable," and this case stands out as different. The other cases are inside test classes that change the behavior at test runtime; in this case it is only used for an assertion. Based on my rationale above, I've submitted a PR to comment out this assertion and add a comment explaining why, rather than disable PGO or force NoInlining. The assertion inside the try block even says what it's testing: Assert.Fail("Should throw BooleanQuery.TooManyClauses"); - the important thing here is that the exception is thrown, not who threw it.

NightOwl888 added up-for-grabs This issue is open to be worked on by anyone is:bug test-failure good-first-issue Good for newcomers pri:high labels Nov 26, 2024

NightOwl888 added this to the 4.8.0-beta00018 milestone Nov 26, 2024

NightOwl888 mentioned this issue Nov 26, 2024

Expressions.JS: Bitwise left shift bugfixes, #1034 #1048

Merged

4 tasks

paulirwin added a commit to paulirwin/lucene.net that referenced this issue Dec 2, 2024

Disable Tiered (Dynamic) PGO in unit tests, apache#1050

1f164ff

paulirwin added a commit to paulirwin/lucene.net that referenced this issue Dec 2, 2024

Fix intermittent failed TestMaxClauseLimitations tests, apache#1050

ab9a69f

paulirwin mentioned this issue Dec 2, 2024

Fix intermittent failed TestMaxClauseLimitations tests, #1050 #1054

Merged

4 tasks

paulirwin closed this as completed in #1054 Dec 3, 2024

paulirwin added a commit that referenced this issue Dec 3, 2024

Fix intermittent failed TestMaxClauseLimitations tests, #1050 (#1054)

d1386d5

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Intermittent Failing Test: Lucene.Net.Search.TestMultiTermQueryRewrites.TestMaxClauseLimitations() #1050

Intermittent Failing Test: Lucene.Net.Search.TestMultiTermQueryRewrites.TestMaxClauseLimitations() #1050

NightOwl888 commented Nov 26, 2024

paulirwin commented Nov 26, 2024

NightOwl888 commented Nov 26, 2024

paulirwin commented Nov 26, 2024

paulirwin commented Dec 2, 2024

Intermittent Failing Test: Lucene.Net.Search.TestMultiTermQueryRewrites.TestMaxClauseLimitations() #1050

Intermittent Failing Test: Lucene.Net.Search.TestMultiTermQueryRewrites.TestMaxClauseLimitations() #1050

Comments

NightOwl888 commented Nov 26, 2024

Is there an existing issue for this?

Describe the bug

Expected Behavior

Steps To Reproduce

Exceptions (if any)

Lucene.NET Version

.NET Version

Operating System

Anything else?

paulirwin commented Nov 26, 2024

NightOwl888 commented Nov 26, 2024

paulirwin commented Nov 26, 2024

paulirwin commented Dec 2, 2024