-
Notifications
You must be signed in to change notification settings - Fork 21
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Affine gap distance #12
Comments
If a state machine with three states per class (like in the original paper) is used then exactly this effect can be learned. This is because there are different parameters for transitions from (say) a previous deletion to a current deletion than from a previous match/substitution to a current deletion. See also #7. transitions = [
(0, 0, (1, 1)),
(1, 0, (1, 1)),
(2, 0, (1, 1)),
(0, 1, (0, 1)),
(1, 1, (0, 1)),
(2, 1, (0, 1)),
(0, 2, (1, 0)),
(1, 2, (1, 0)),
(2, 2, (1, 0)),
(0+3, 0+3, (1, 1)),
(1+3, 0+3, (1, 1)),
(2+3, 0+3, (1, 1)),
(0+3, 1+3, (0, 1)),
(1+3, 1+3, (0, 1)),
(2+3, 1+3, (0, 1)),
(0+3, 2+3, (1, 0)),
(1+3, 2+3, (1, 0)),
(2+3, 2+3, (1, 0))
] I guess that this should be about 10x more computationally expensive than the current If the lattice building is the bottleneck then we can again make a fast, specialised |
Would you mind implementing this using the general machinery? If it performs well, I'll write a fast version. |
Sure - in the mean time I also thought of a slightly faster way to do this, and that is to only add the additional states to the |
What was the trick? |
Currently dedupe uses an affine gap distance. The edit distance is very similar to the Levenshtein distance except that the cost of extending a gap (a deletion or insertion) is less than opening the gap.
This works really well for the kinds of strings we deal with in record linkage. How would we implement this for pyhacrf.
The text was updated successfully, but these errors were encountered: