Use relative term frequency instead of absolute #317
The score is calculated in tntsearch/src/TNTSearch.php, lines 113 to 122 in a763e66. The problem is that `$document['hit_count']` returns the absolute number of times a term occurs in the document. As a result, a spam document that simply contains every word 10 times gets a higher score than a document in which some terms occur more often than others.

Example:
Let's take the example from https://en.wikipedia.org/wiki/Tf%E2%80%93idf#Example_of_tf%E2%80%93idf but with slightly different hit counts.
We have 3 documents:
Document 1
Document 2
Spam document
We search for "example":
idf = log10(3 / 2) ≈ 0.1761, because the term "example" occurs in 2 of the 3 documents.
tf as it is currently calculated in TNTSearch:
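As a quick check of the idf figure, here is a minimal Python sketch. The function name `idf` is only illustrative and not part of TNTSearch, and the base-10 logarithm is an assumption, chosen because it reproduces the 0.1761 quoted above.

```python
import math


def idf(total_docs: int, docs_with_term: int) -> float:
    """Inverse document frequency with a base-10 logarithm (assumed base)."""
    return math.log10(total_docs / docs_with_term)


# "example" occurs in 2 of the 3 documents.
print(round(idf(3, 2), 4))  # 0.1761
```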
So the spam document gets a higher score than document 2, although it simply contains every word with the same frequency. Document 2's most frequent word, on the other hand, is "example", so imho it should get the higher score.
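According to Wikipedia's definition, term frequency is the number of times a term occurs in a document divided by the total number of terms in that document:
tf(t, d) = f(t, d) / (sum of f(t', d) over all terms t' in d)
(Wikipedia also lists some other definitions for the denominator, but I can't find the one used in TNTSearch.)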
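To make the difference concrete, here is a rough Python sketch that ranks three documents for "example" using an absolute and a relative term frequency. The word counts are hypothetical and are not the figures from this description. The helper names are made up, and the score is simplified to tf * idf, which is an assumption; the actual scoring code in TNTSearch may weight the hit count differently.

```python
import math

# Hypothetical word counts per document, for illustration only.
docs = {
    "Document 1": {"this": 1, "is": 1, "a": 2, "sample": 1},
    "Document 2": {"this": 1, "is": 1, "another": 2, "example": 3},
    "Spam document": {
        "this": 10, "is": 10, "a": 10,
        "another": 10, "sample": 10, "example": 10,
    },
}


def inverse_document_frequency(term, docs):
    """log10(total documents / documents containing the term)."""
    containing = sum(1 for counts in docs.values() if term in counts)
    return math.log10(len(docs) / containing)


def absolute_tf(term, counts):
    """Raw hit count: how often the term occurs in the document."""
    return counts.get(term, 0)


def relative_tf(term, counts):
    """Hit count divided by the total number of terms in the document."""
    return counts.get(term, 0) / sum(counts.values())


def rank(term, docs, tf):
    """Document names ordered by the simplified score tf * idf, best first."""
    weight = inverse_document_frequency(term, docs)
    scores = {name: tf(term, counts) * weight for name, counts in docs.items()}
    return sorted(scores, key=scores.get, reverse=True)


print(rank("example", docs, absolute_tf))
print(rank("example", docs, relative_tf))
```

With these made-up counts, the absolute variant puts the spam document first (10 hits against 3), while the relative variant puts Document 2 first (3 of 7 words against 10 of 60).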
So with Wikipedia's definition of term frequency, there would be the following results: