esa_matchfinder_find_all_matches drops all matches with zero offset #1

Open
hegdi opened this issue Aug 3, 2022 · 1 comment

hegdi commented Aug 3, 2022

Finally, a suffix-array match finder that has seen enough love to be actually usable! Thank you for that!
After playing with it for a bit, I realized that esa_matchfinder_find_all_matches drops all matches from the beginning of a parse.
So for data like this

hello, hello

the match at position 7 with offset 7 and length 5 is silently dropped, as the offset is zero there.
Is that intended behavior?
For me, the parse up to position 7 looks like this; the entry marked with + is the one reported back.

# position: ( offset, length ) [reported]
0: (   0,    5) 
0: (   0,    1) 

1: (   1,    4) 
1: (   1,    1) 

2: (   2,    3) 
2: (   2,    1) 

3: (   3,    2) 
3: (   1,    1) +

4: (   4,    1) 

5: (   5,    2) 

6: (   6,    1) 

7: (   7,    5) 
7: (   7,    1) 
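
As a cross-check of which matches exist in this input, here is a minimal brute-force sketch in Python. It is not part of esa-matchfinder; the function name `longest_earlier_match` is hypothetical, and it assumes the first column of the listing is the distance back to the earlier occurrence.

```python
def longest_earlier_match(data: bytes, pos: int) -> tuple:
    """Return (distance, length) of the longest match for data[pos:] that
    starts at an earlier position, or (0, 0) if there is none."""
    best_distance, best_length = 0, 0
    for src in range(pos):
        length = 0
        # Overlapping matches are allowed, as in LZ77-style parsing.
        while pos + length < len(data) and data[src + length] == data[pos + length]:
            length += 1
        if length > best_length:
            best_distance, best_length = pos - src, length
    return best_distance, best_length


print(longest_earlier_match(b"hello, hello", 7))  # (7, 5)
```

Under that assumption, the brute-force result for position 7 is the (7, 5) entry from the listing, whose earlier occurrence starts at index 0 of the input.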

IlyaGrebnov (Owner) commented Aug 3, 2022

@hegdi Unfortunately, this is by design, and I noted it in the README: "due to implementation details the esa-matchfinder can not find any matches with offset 0." This can certainly be fixed, but at a performance cost, and based on my testing this limitation does not make any difference in compression ratio. Alternatively, you can extend the input text with an additional symbol at the beginning. The actual symbol does not matter, because it won't be matched anyway.
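
The padding workaround can be sketched roughly as follows, in Python for brevity. This does not call esa-matchfinder: `limited_finder` is a hypothetical brute-force stand-in that, by assumption, never reports a match whose earlier occurrence starts at index 0 of its buffer, and `find_with_padding` is an illustrative helper that prepends one byte and shifts the reported indices back by one.

```python
from typing import List, Tuple

Match = Tuple[int, int, int]  # (position, source index, length)


def limited_finder(buf: bytes, min_length: int = 2) -> List[Match]:
    """Hypothetical stand-in: longest match per position, but index 0
    is never considered as a match source (the assumed limitation)."""
    matches = []
    for pos in range(1, len(buf)):
        best_src, best_len = 0, 0
        for src in range(1, pos):  # deliberately skips source index 0
            length = 0
            while pos + length < len(buf) and buf[src + length] == buf[pos + length]:
                length += 1
            if length > best_len:
                best_src, best_len = src, length
        if best_len >= min_length:
            matches.append((pos, best_src, best_len))
    return matches


def find_with_padding(data: bytes, finder=limited_finder, pad: bytes = b"\x00") -> List[Match]:
    """Prepend one arbitrary byte, search, then map indices back to `data`."""
    assert len(pad) == 1
    padded = pad + data
    # Every index in the padded buffer is one larger than in the original.
    return [(pos - 1, src - 1, length) for pos, src, length in finder(padded)]


print(limited_finder(b"hello, hello")[0])     # (8, 1, 4)
print(find_with_padding(b"hello, hello")[0])  # (7, 0, 5)
```

Without padding, the stand-in misses the length-5 match at position 7 and first reports the shorter match at position 8. With one byte of padding, the original index 0 becomes index 1 in the padded buffer, so the match is found and mapped back. With the real library, the same index shift would have to be applied to whatever positions and offsets it reports.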
