Recommend scoring hits with BM25(k1=0.9,b=0.4). #46
Currently, different engines use different parameters for BM25: Tantivy and Lucene use (k1=1.2, b=0.75), while PISA uses (k1=0.9, b=0.4). Robertson et al. initially suggested 1.2/0.75 as good defaults for BM25, but Trotman et al. later suggested that 0.9/0.4 would make better defaults, and this seems to be the consensus nowadays.
The ranking function matters because it affects which hits may be skipped via dynamic pruning, which in turn affects search performance.
Closes quickwit-oss#45
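To make the effect of the two parameter sets concrete, here is a minimal sketch in Python. It assumes the textbook BM25 term-score formula; the engines discussed here may use variants (for example, a different IDF or a dropped constant factor), so treat this as an illustration rather than any engine's actual implementation. The function names, the `idf` value, and the document lengths are hypothetical.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, idf, k1, b):
    """Textbook BM25 contribution of one query term to one document's score."""
    # Length normalization: b=0 ignores document length, b=1 normalizes fully.
    norm = 1.0 - b + b * (doc_len / avg_doc_len)
    return idf * tf * (k1 + 1.0) / (tf + k1 * norm)


def bm25_term_upper_bound(idf, k1):
    """Limit of the term score as tf grows without bound.

    Dynamic pruning needs some upper bound per term; real engines usually
    store tighter per-term or per-block maxima, so this is only a loose bound.
    """
    return idf * (k1 + 1.0)


# The two parameter sets discussed in this PR.
PARAMS = {"1.2/0.75": (1.2, 0.75), "0.9/0.4": (0.9, 0.4)}

if __name__ == "__main__":
    idf = 2.0            # hypothetical IDF of the query term
    avg_doc_len = 100.0  # hypothetical average document length
    for name, (k1, b) in PARAMS.items():
        for doc_len in (50.0, 100.0, 400.0):
            score = bm25_term_score(3, doc_len, avg_doc_len, idf, k1, b)
            bound = bm25_term_upper_bound(idf, k1)
            print(f"{name} doc_len={doc_len:>5}: score={score:.3f} bound={bound:.3f}")
```

Because both the scores and their upper bounds change with k1 and b, the set of documents that a pruning algorithm can safely skip changes as well, which is why benchmark results are only comparable when all engines use the same parameters.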
I believe that PISA does not require changes, though it would be nice to make the BM25 configuration more explicit in the query logic. What do you think, @amallia? I could use some help making that change, as I'm not too familiar with the PISA API. It looks like Tantivy supports configuring the ranking function, but I'm not proficient in Rust and could use some help there too.
The PR does not make the change for Tantivy, does it?
It does not, indeed. I would like to change it, but I am not familiar with Rust and am unsure how to do it. I could use some help.
Separately I checked more search engines and IR toolkits:
So there doesn't actually seem to be a consensus. The point from the DFR paper that theory meets practice with 1.2/0.75 is quite convincing. Unless I find more evidence that 0.9/0.4 is more effective, I am considering switching PISA to 1.2/0.75 instead of switching Lucene and Tantivy to 0.9/0.4.