fix(dict): Remove only corrections if a space could be inserted as well #792
base: master
Conversation
3fd67dc to ec32cf5
@@ -0,0 +1,1000 @@
the
Let's leave this off for now because we'd need to work out cases like "extrememe"
I worked these cases out by adding the check:

    if only_correction.ends_with(suffix) {
        // We still want to correct e.g. "extrememe" to "extreme".
        return true;
    }
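For illustration, the exception can be exercised in a small self-contained sketch (a minimal approximation; `keep_entry` is a hypothetical name, not the actual verify.rs code):

```rust
// Sketch of the suffix exception: a typo of the form
// "{correction}{common_word}" is normally filtered out, but when the
// correction itself ends with the common word (as "extreme" ends with
// "me"), the typo is more likely a stutter, so the entry is kept.
fn keep_entry(only_correction: &str, suffix: &str) -> bool {
    // We still want to correct e.g. "extrememe" to "extreme".
    only_correction.ends_with(suffix)
}

fn main() {
    // "extrememe" = "extreme" + "me"; "extreme" ends with "me": keep.
    assert!(keep_entry("extreme", "me"));
    // "shouldbe" = "should" + "be"; "should" does not end with "be": drop.
    assert!(!keep_entry("should", "be"));
    println!("ok");
}
```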
I think I'd still prefer not to be constrained by this very mechanical process. It can provide insight, but I don't trust it to be applied automatically.
The current process now seems to make exactly the changes we want to all our 63,200 entries, which does inspire some confidence in me. Besides, the process is very easy to adapt, so I think we can adjust it whenever we find that it filters something out too eagerly.
> The current process now seems to make exactly the changes we want to all our 63,200 entries

Except this isn't exactly what I want (see the other thread). Arbitrarily combining words that don't make sense when combined means we lose corrections we would otherwise have.
ec32cf5 to e240e50
e240e50 to ea162f0
ea162f0 to b80e29d
b80e29d to 68cce1a
I have updated the PR to now also detect
crates/typos-dict/assets/words.csv (Outdated)

    aand,and
    aanother,another
    aapply,apply
    aack
The redundant a's are a separate thing and we should be correcting these.
The challenge with blindly checking concatenated words is that it doesn't filter out the cases where the combination doesn't make sense.
In general, I'd feel better if we just looked at what changed due to the spaces and applied it to those. We could then separately decide which of these changes might make sense. As is, I'm seeing a lot that don't, and I don't want to take the time to decide that.
Oh, the logic I had did actually already detect these with

    if only_correction.starts_with(prefix) {
        return false;
    }

I just must have omitted resetting words.csv and running `SNAPSHOTS=overwrite cargo test verify` again.
So the many cases have now been reduced to:
aequidistant
aequivalent
afor
amuch
anumber
ascripts
asudo
imakes
isimilar
itheir
itheirs
iwithout
which I think it makes sense not to correct automatically.
> iwithout
and many of these don't make sense from a "word combining" perspective. Things we can correct in master will become uncorrectable with this change.
The typo dictionary words.csv previously contained a bunch of problematic entries such as:

    abouta,about
    algorithmi,algorithm
    attachen,attach
    shouldbe,should
    anumber,number

which resulted in wrong automatic corrections if the following spaces (indicated by ␣) were accidentally missed:

    about␣a
    algorithm␣i developed
    attach␣en masse
    should␣be
    a␣number

Many of these entries were introduced by taking entries from the codespell-dict and removing corrections containing spaces (since typos currently doesn't support them), e.g. the codespell dictionary contains:

    abouta->about a, about,
    shouldbe->should, should be,

This commit updates `tests/verify.rs` to automatically remove corrections in the form of `{correction}{common_word},{correction}` or `{common_word}{correction},{correction}`, where `{common_word}` is one of the 1000 most frequent English words (except if `{correction}` also ends/starts in `{common_word}`, since we still want to correct e.g. "extrememe" to "extreme").

The top-1000-most-frequent-words.csv file was generated by running:

    curl https://norvig.com/ngrams/count_1w.txt \
      | head -n1024 \
      | awk '{print $1;}' \
      | grep -vE '^([^ia]|al|re)$' \
      > top-1000-most-frequent-words.csv
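The removal rule from the commit message can be sketched as a standalone function (a minimal approximation, not the actual tests/verify.rs implementation; `should_remove` and its parameters are names I chose for illustration):

```rust
// Drop a dictionary entry "typo,correction" when the typo is the
// correction concatenated with a common English word on either side,
// unless the correction itself ends/starts with that word (the
// "extrememe" -> "extreme" exception).
fn should_remove(typo: &str, correction: &str, common_words: &[&str]) -> bool {
    for &word in common_words {
        // "{correction}{common_word}" form, e.g. "abouta" = "about" + "a".
        if typo == format!("{correction}{word}") && !correction.ends_with(word) {
            return true;
        }
        // "{common_word}{correction}" form, e.g. "anumber" = "a" + "number".
        if typo == format!("{word}{correction}") && !correction.starts_with(word) {
            return true;
        }
    }
    false
}

fn main() {
    let common = ["a", "be", "me"];
    assert!(should_remove("abouta", "about", &common));    // "about a"
    assert!(should_remove("anumber", "number", &common));  // "a number"
    assert!(should_remove("shouldbe", "should", &common)); // "should be"
    // "extreme" ends with "me", so the "extrememe" entry survives.
    assert!(!should_remove("extrememe", "extreme", &common));
    println!("all checks passed");
}
```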
68cce1a to 60aad40
The typo dictionary words.csv previously contained a bunch of problematic entries such as:

Which resulted in wrong corrections if the following spaces (indicated by ␣) were accidentally missed:

Many of these entries were introduced by taking entries from the codespell-dict and removing corrections containing spaces (since typos currently doesn't support them), e.g. the codespell dictionary contains:

This commit updates `tests/verify.rs` to automatically remove entries in the form of `{correction}{common_word},{correction}`, where `{common_word}` is one of the 1000 most frequent English words.

The top-1000-most-frequent-words.csv file was generated by running: