Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Affine gap distance #12

Open
fgregg opened this issue Jun 9, 2015 · 4 comments
Open

Affine gap distance #12

fgregg opened this issue Jun 9, 2015 · 4 comments

Comments

@fgregg
Copy link
Contributor

fgregg commented Jun 9, 2015

Currently dedupe uses an affine gap distance. The edit distance is very similar to the Levenshtein distance except that the cost of extending a gap (a deletion or insertion) is less than opening the gap.

This works really well for the kinds of strings we deal with in record linkage. How would we implement this for pyhacrf.

@dirko
Copy link
Owner

dirko commented Jun 10, 2015

If a state machine with three states per class (like in the original paper) is used then exactly this effect can be learned. This is because there are different parameters for transitions from (say) a previous deletion to a current deletion than from a previous match/substitution to a current deletion. See also #7.

transitions = [
 (0, 0, (1, 1)),
 (1, 0, (1, 1)),
 (2, 0, (1, 1)),
 (0, 1, (0, 1)),
 (1, 1, (0, 1)),
 (2, 1, (0, 1)),
 (0, 2, (1, 0)),
 (1, 2, (1, 0)),
 (2, 2, (1, 0)),

 (0+3, 0+3, (1, 1)),
 (1+3, 0+3, (1, 1)),
 (2+3, 0+3, (1, 1)),
 (0+3, 1+3, (0, 1)),
 (1+3, 1+3, (0, 1)),
 (2+3, 1+3, (0, 1)),
 (0+3, 2+3, (1, 0)),
 (1+3, 2+3, (1, 0)),
 (2+3, 2+3, (1, 0))
]

I guess that this should be about 10x more computationally expensive than the current DefaultStateMachine. Unfortunately atm I can't think of a simpler scheme that will achieve the same effect.

If the lattice building is the bottleneck then we can again make a fast, specialised ThreeStateMachine class using your precomputing trick.

@fgregg
Copy link
Contributor Author

fgregg commented Aug 6, 2015

Would you mind implementing this using the general machinery? If it performs well, I'll write a fast version.

@dirko
Copy link
Owner

dirko commented Aug 10, 2015

Sure - in the mean time I also thought of a slightly faster way to do this, and that is to only add the additional states to the match class (or to the non-match class). That should halve the additional transitions.

@fgregg
Copy link
Contributor Author

fgregg commented May 9, 2016

What was the trick?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants