Why is there a drastic difference in search time between lucenenet in c# and lucene in java while the other statistics are roughly comparable? #333

Closed
parikshitphukan17 opened this issue Sep 1, 2020 · 6 comments
Labels: is:invalid (This doesn't seem right), is:question (Further information is requested)

Comments

@parikshitphukan17

I have implemented a Lucene POC in Java and in .NET. The stats are roughly comparable except for the search time (the time required to get the matching docs). The Java application takes roughly 9-10 seconds, whereas the .NET one took 52 minutes. I indexed 99,000 documents comprising PDF, DOC, TXT, and other file types. Indexing for both POCs was performed on the same files. Is this disparity in search time expected because the Java version is superior, or is there an error in my Lucene.NET code?

@eladmarg
Contributor

eladmarg commented Sep 1, 2020

This is interesting, but such a difference doesn't make sense.

Going from 10 seconds to 52 minutes (~3,100 seconds, roughly a 300x slowdown) is not reasonable in any way.
So this is probably a typo, but still, even a 5x difference is not realistic.

I think you should re-test your program.
I do believe there are many places where we can optimize Lucene.NET and do even better than Java, thanks to the better capabilities of .NET.

@jeme
Contributor

jeme commented Sep 1, 2020

This certainly sounds suspicious. We have an index with over a million documents, each with thousands of fields.

If I do a free-text search (meaning it accesses all fields) starting and ending with a wildcard (e.g. *a*), which is probably the worst-case scenario I can think of, it takes merely ~50 seconds. That includes loading the actual 9 megabytes of data the search produces and returning it over the wire. (Not completely satisfying, but considering the punishment I just directed at the index and the wire, it's somewhat understandable.)

@NightOwl888
Contributor

Without seeing your test, there is not a lot I can tell you. You didn't even mention which versions of Lucene/Lucene.NET you are testing. Of course, the only apples-to-apples way to test this would be to run either:

  • Lucene.NET 3.0.3 against its closest counterpart in Lucene, 3.0.1
  • Lucene.NET 4.8.0 against Lucene 4.8.0

Since Lucene 4.8.0 was designed to run with Java 6, you would also need to get a copy of a Java 6 runtime to run it on. And since Java 6 is not available for download from any official source anymore, I strongly suspect you are not doing either of these exact version tests on the version of Java it was designed to run on.

Do note that we have recently set up benchmarks across each of the betas, and we have approximately doubled search performance since 4.8.0-beta00007, so if you are testing on an older beta you will definitely see performance degradation.

Of course, it is possible you have stumbled upon a severe bottleneck in a specific Analyzer, Tokenizer, Codec, Query, or other component, but again, without seeing the code there isn't much we can do. Could you post this POC somewhere in a form where it can be run in both Java and .NET without too much extra configuration, and some setup instructions to get it up and running?
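To narrow down where such a bottleneck might sit before posting a full POC, one option is to time each step of the run separately rather than the run as a whole. The following is a minimal sketch using only the Java standard library; the class name `PhaseTimer`, the phase names, and the sleeping placeholder lambdas are all hypothetical and stand in for the real steps (opening the index, running the query, loading the matching documents). The same structure could be mirrored on the .NET side with a stopwatch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: accumulates wall-clock time per named phase. */
class PhaseTimer {
    // Insertion-ordered so the report lists phases in the order they ran.
    private final Map<String, Long> nanosByPhase = new LinkedHashMap<>();

    /** Runs the work once and adds its elapsed time to the phase's total. */
    public void time(String phase, Runnable work) {
        long start = System.nanoTime();
        work.run();
        nanosByPhase.merge(phase, System.nanoTime() - start, Long::sum);
    }

    /** Total time recorded for a phase in milliseconds (0 if never timed). */
    public long millis(String phase) {
        return nanosByPhase.getOrDefault(phase, 0L) / 1_000_000;
    }

    /** One line per phase: name and accumulated milliseconds. */
    public String report() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Long> e : nanosByPhase.entrySet()) {
            sb.append(String.format("%-16s %8d ms%n",
                    e.getKey(), e.getValue() / 1_000_000));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        PhaseTimer timer = new PhaseTimer();
        // Placeholders only: replace each lambda with the real step of the POC.
        timer.time("open-index", () -> pause(20));
        timer.time("search", () -> pause(50));
        timer.time("load-documents", () -> pause(30));
        System.out.print(timer.report());
    }

    // Stand-in for real work so the sketch runs without Lucene on the classpath.
    private static void pause(long ms) {
        try {
            Thread.sleep(ms);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

If one phase dominates the report, that phase is the part of the code worth posting first.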

@parikshitphukan17
Author

parikshitphukan17 commented Sep 2, 2020

Hi,
Sorry for not giving the specifics of the POC that I made. They are as follows:

Lucene (Java) version 8.6.0
Lucene.NET version 3.0.3

Unfortunately, I am using multiple libraries for text extraction, which requires some additional configuration. What I can do is post the index creation and search code snippets that I used for Lucene.NET. Would that be enough? I won't be posting the Java version, as it's working fine and I do not need help with that, but do let me know if you need that too.

@jeme
Contributor

jeme commented Sep 7, 2020

Considering that your request concerns a "difference" between the two, posting both is still relevant if anyone is to help you spot any notable difference between the two implementations.

Other things that might prove interesting are:

  • Storage information (Storage can in many cases be a bottleneck, this can especially be the case if you store and retrieve the documents in the index...)
  • Index size (Both for Java and .NET) - That is the files in the index.
  • Memory Capacity and Footprint (Both for Java and .NET)
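For the Java side, the index size and a rough heap figure can be collected with the standard library alone. This is only a sketch: the class name `IndexStats` and the default directory name `index` are hypothetical, and the heap number is a coarse snapshot of the running JVM, not a full memory profile. The .NET POC would need an equivalent measurement for the comparison to be meaningful.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

/** Hypothetical helper: reports index size on disk and current heap usage. */
class IndexStats {

    /** Sums the sizes of all regular files under the given directory. */
    static long directorySizeBytes(Path dir) throws IOException {
        try (Stream<Path> files = Files.walk(dir)) {
            return files.filter(Files::isRegularFile)
                        .mapToLong(IndexStats::sizeOf)
                        .sum();
        }
    }

    // Files.size throws a checked exception, so wrap it for use in the stream.
    private static long sizeOf(Path file) {
        try {
            return Files.size(file);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Heap currently in use by this JVM (allocated minus free), in bytes. */
    static long usedHeapBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    public static void main(String[] args) throws IOException {
        // "index" is an assumed default; pass the real index path as an argument.
        Path indexDir = Paths.get(args.length > 0 ? args[0] : "index");
        System.out.printf("index size: %d bytes%n", directorySizeBytes(indexDir));
        System.out.printf("heap used:  %d bytes (max %d)%n",
                usedHeapBytes(), Runtime.getRuntime().maxMemory());
    }
}
```

Calling `usedHeapBytes()` right after the search completes, inside the same process, gives a figure that is more comparable across runs than one taken at startup.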

That being said, 4.8 currently has full focus, so you may ultimately be better off asking this on forums such as StackOverflow.
If there is a severe bottleneck in Lucene.NET 3.0.3, I very much doubt that the team will address it at this point. Full focus is on getting the 4.8 version out, and the team will then likely point to that as the solution. Obviously, if the problem can be replicated on Lucene.NET 4.8.0, it will be addressed.

But posting your code and any other relevant information here and/or in a StackOverflow question could mean that someone can point to something that is not done in an optimal way. As I said above, we have over 4 times the documents and a huge number of fields, and I can't even get near what you describe, even with the most evil queries I can imagine.

NightOwl888 added the is:invalid and is:question labels on Nov 5, 2020
@NightOwl888
Contributor

Since this is not really a fair comparison (4.x+ is a completely different design than 3.x) and, as others pointed out, something is probably misconfigured or misused to produce results like the ones you are seeing, I am considering this matter closed. But as @jeme pointed out, you might be more successful getting help with the issue if you post some code.

However, since this seems to be more of a usability issue than an actual bug, it doesn't belong here. Please ask on StackOverflow or on the user mailing list instead.
