I have a generated index which after loading will not match certain keywords in the inverted index. However some tricks will cause lunr to find matches for the exact same keywords.
I was not able to create a truly minimal example, but the original generated index (and the reduced version) is manageable. I could not attach JSON files, so I have created a gist which includes both the full index and the reduced one referenced in the code below. There is also a reproduce script which illustrates the behaviour.
Loading the index and searching for the keyword will not work:
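let fs = require ('fs');
let lunr = require ('lunr');
let filename = './index-data.compact.json';
let search = 'EIFUW001R00';
// Parse the serialized index and rebuild the lunr.Index from it
let data = JSON.parse (fs.readFileSync (filename));
let index = lunr.Index.load (data);
console.log (index.search (search));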
But the keyword is in the inverted index, as manually inspecting the JSON shows.
In the reduced example I have removed all the entries in the inverted index starting from that match. When I add just the first entry (the one we try to match) we do find a match. However, when the entry after that is also added, no matches are found.
Even weirder is that altering some lunr internals (the ids of the TokenSet) will cause everything to work again.
data.invertedIndex.push (/* ... */);
data.invertedIndex.push (/* ... */);
lunr.TokenSet._nextId++;
let index = lunr.Index.load (data);
console.log (index.search (search));
Changing any of the entries in the inverted index before eifuw001r00 will make the search work too. I have found a few changes that still cause the search to fail, but nearly all changes make it work. This is also the reason why I have a rather long index in the reproduction example.
I get the impression that the behaviour is related to the lookup in this.minimizedNodes. The result of TokenSet.toString (), which includes the id, is used as the key for that lookup. I'm guessing that it matches something that it should not and modifies a token set's edges, which causes it to lose information.
I have also tried to peek inside the generation of the token set inside the TokenSet.Builder (using the code in the gist). Everything seems to be going fine until the call to TokenSet.Builder.finish (). After that it seems like it only knows about eifuw0 and eifuw1 instead of eifuw001r00 and eifuw001r01.
Any idea what is causing this?
Is it expected behaviour, or is this a bug?
Is there a way to fix, or detect this?
Kind regards
jmxti changed the title from "Index.load () createds instance which does not find words present in the inverted index" to "Index.load () creates instance which does not find words present in the inverted index" on Sep 28, 2021
Some further debugging shows that at some point I have a token set with two edges, one for '0' and one for '1', which both point to a single final token set that happens to have id 4. Computing toString () for it yields '0' for not final, then '0' + '4' for the first edge and '1' + '4' for the second edge, or '00414'.
I also happen to have a token set with a single edge for '0' which points to a token set with id 414. Computing toString () for that one yields '0' for not final, then '0' + '414' for the only edge, which is also '00414'.
Both of them have the same string representation but they don't represent the same thing. The first represents that your search term ends with either a '0' or a '1', while the second represents that you can have a '0' followed by some other characters.
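The collision described above can be reproduced outside lunr with a small sketch. The `node` and `key` helpers below are hypothetical stand-ins, not lunr's actual classes, and the ids 10, 11 and 500 are made up; only the key format (final flag, then label and child id per edge, without separators) follows the description above.

```javascript
// Simplified stand-in for a token set node; not lunr's real TokenSet.
function node(id, final, edges) {
  return { id: id, final: final, edges: edges || {} }
}

// Key format as described above: '1' or '0' for the final flag, then
// label + child id for every edge in sorted label order, no separators.
function key(n) {
  var str = n.final ? '1' : '0'
  Object.keys(n.edges).sort().forEach(function (label) {
    str = str + label + n.edges[label].id
  })
  return str
}

var end = node(4, true)                               // final node with id 4
var deeper = node(414, false, { r: node(500, true) }) // node with id 414

var endsWithZeroOrOne = node(10, false, { '0': end, '1': end })
var zeroThenMore = node(11, false, { '0': deeper })

var keyA = key(endsWithZeroOrOne)
var keyB = key(zeroThenMore)
console.log(keyA, keyB) // 00414 00414

// A registry keyed by that string cannot tell the two nodes apart, so a
// lookup for the second node returns the first one instead.
var minimizedNodes = {}
minimizedNodes[keyA] = endsWithZeroOrOne
var replacement = minimizedNodes[keyB]
console.log(replacement === endsWithZeroOrOne) // true
```

If a builder then swaps `zeroThenMore` for `replacement`, the path through node 414 is gone, which would match the truncated terms observed earlier.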
The reason I could not reduce the number of entries in the inverted index is that doing so would cause that last id, 414, to change. This is also the reason lunr.TokenSet._nextId++ makes it seem like the problem is fixed, and probably also the reason why changes to some of the other terms appear to fix it: if the number of characters changes, the ids change again before we get to the problem area.
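A rough illustration of why shifting the ids hides the problem, under the simplifying assumption that bumping the counter moves both ids by the same offset (in a real build the shift may not be uniform). The `key` and `keysWithOffset` helpers are hypothetical and only mimic the unseparated key format discussed above.

```javascript
// Hypothetical helper: edges are [label, childId] pairs in sorted order.
function key(final, edges) {
  var str = final ? '1' : '0'
  for (var i = 0; i < edges.length; i++) {
    str = str + edges[i][0] + edges[i][1]
  }
  return str
}

// Build the keys of the two problem nodes with every id moved by `offset`.
function keysWithOffset(offset) {
  return [
    key(false, [['0', 4 + offset], ['1', 4 + offset]]),
    key(false, [['0', 414 + offset]])
  ]
}

var before = keysWithOffset(0)
var after = keysWithOffset(1)
console.log(before) // [ '00414', '00414' ]
console.log(after)  // [ '00515', '00415' ]
```

The collision only exists for this exact combination of ids, so any change that moves them makes the symptom disappear without removing the underlying ambiguity.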
If I change TokenSet.toString () so that it places markers in the generated string, the problem disappears. Something like:
lunr.TokenSet.prototype.toString = function () {
  if (this._str) { return this._str; }

  var str = this.final ? '1' : '0',
      labels = Object.keys(this.edges).sort(),
      len = labels.length

  for (var i = 0; i < len; i++) {
    var label = labels[i],
        node = this.edges[label]

    str = str + ', L(' + label + ')I(' + node.id + ')'
  }

  return str
}
I don't think this can cause the same collisions since a label is only a single character, and an id can only contain numbers.
But:
- I don't know what the performance impact could be.
- I don't know if the string is used in other places.
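As a quick check of the marker idea, the sketch below compares the unseparated key format with the marked one for the two token sets from the debugging notes above. `plainKey` and `markedKey` are hypothetical helpers written for this comparison; they take edges as [label, childId] pairs rather than real lunr nodes.

```javascript
// Hypothetical helpers; edges are [label, childId] pairs in sorted order.
function plainKey(final, edges) {
  var str = final ? '1' : '0'
  for (var i = 0; i < edges.length; i++) {
    str = str + edges[i][0] + edges[i][1]
  }
  return str
}

function markedKey(final, edges) {
  var str = final ? '1' : '0'
  for (var i = 0; i < edges.length; i++) {
    str = str + ', L(' + edges[i][0] + ')I(' + edges[i][1] + ')'
  }
  return str
}

var twoEdges = [['0', 4], ['1', 4]] // '0' and '1' both lead to id 4
var oneEdge = [['0', 414]]          // '0' leads to id 414

console.log(plainKey(false, twoEdges) === plainKey(false, oneEdge)) // true
console.log(markedKey(false, twoEdges)) // 0, L(0)I(4), L(1)I(4)
console.log(markedKey(false, oneEdge))  // 0, L(0)I(414)
```

This only shows that the two specific nodes no longer collide; it says nothing about the performance cost of the longer strings.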