Validate typos in the dictionary against a dictionary of valid words #1140

Open
peternewman opened this issue May 28, 2019 · 11 comments
Labels
dictionary Changes to the dictionary enhancement

Comments

@peternewman
Collaborator

We'll need to find a list of valid words from somewhere, but this keeps happening to varying degrees of detectability, e.g. #1014 (comment)

@peternewman
Collaborator Author

In my scripts, before inserting an "a->b" pair into my list, I check it against the usual historical dictionaries you can find on Unix (like /usr/share/dict/american-english). But of course, there are a lot of specific (and not so specific...) words not included.

Originally posted by @Gelma in #1014 (comment)
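
The pre-insertion check described above could look roughly like this. It is only a sketch, not Gelma's actual script: the helper names `load_words` and `is_valid_word` are made up for illustration, and a tiny inline list stands in for a system word list such as /usr/share/dict/american-english.

```python
from typing import Iterable, Set


def load_words(lines: Iterable[str]) -> Set[str]:
    """Build a lowercase set of valid words from one-word-per-line input."""
    return {line.strip().lower() for line in lines if line.strip()}


def is_valid_word(typo: str, valid_words: Set[str]) -> bool:
    """Return True if the proposed 'typo' is itself a known valid word."""
    return typo.strip().lower() in valid_words


# Inline sample standing in for a real word list (assumption for the demo).
sample_words = load_words(["calendar", "Handel", "handle", "leery"])

# Candidate (typo, correction) pairs to vet before adding them.
candidates = [("Handel", "Handle"), ("calender", "calendar")]

# Pairs whose typo side is a real word should not go into the dictionary.
rejected = [pair for pair in candidates if is_valid_word(pair[0], sample_words)]
print(rejected)
```

With a real word list, `load_words` can be fed an open file handle instead of the inline list. The comparison is case-insensitive, which is a design choice of this sketch; proper nouns then count as valid words too.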

@sebweb3r
Contributor

sebweb3r commented Aug 11, 2020

If you have aspell installed, you can dump the aspell word list:
aspell -d en dump master | aspell -l en expand > words.en.txt

I checked it (the probably outdated version from Debian stable) with codespell. The result is:

words.en.txt:6357: Aline ==> Align
words.en.txt:12212: Ines ==> Lines
words.en.txt:18891: OD ==> OF
words.en.txt:18895: Oder ==> Order, odor
words.en.txt:18899: OT ==> TO, OF, OR
words.en.txt:21359: Thur ==> Their
words.en.txt:21631: Thant ==> Than
words.en.txt:22099: BA ==> BY, BE
words.en.txt:22100: Ba ==> By, be
words.en.txt:26781: Bridget ==> Bridged
words.en.txt:37392: Handel ==> Handle
words.en.txt:43699: Claus ==> Clause
words.en.txt:49978: Capetown ==> Cape town
words.en.txt:58942: Leary ==> Leery
words.en.txt:59428: LSAT ==> LAST
words.en.txt:60627: Muhammadan ==> Muslim
words.en.txt:60629: Mohammedans ==> Muslims
words.en.txt:67337: Noe ==> Not, no, node, know, now
words.en.txt:69667: ND ==> AND, 2ND
words.en.txt:69668: Nd ==> And, 2nd
words.en.txt:69671: Ned ==> Need
words.en.txt:90954: Somme ==> Some
words.en.txt:95817: Sade ==> Sad
words.en.txt:99166: Te ==> The, be
words.en.txt:103710: Donn ==> Done, don
words.en.txt:108316: Tuscon ==> Tucson
words.en.txt:112621: Weill ==> Will
words.en.txt:114694: Waring ==> Warning
words.en.txt:117705: Chanel ==> Channel
words.en.txt:124114: Parana ==> Piranha
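
For anyone who wants to post-process a listing like the one above, here is a small parsing sketch. It assumes only the line shape visible in this comment (`<file>:<line>: <typo> ==> <suggestions>`); the `Finding` type and `parse_finding` function are illustrative names, not part of codespell.

```python
import re
from typing import List, NamedTuple, Optional


class Finding(NamedTuple):
    path: str
    line_no: int
    typo: str
    suggestions: List[str]


# Matches lines shaped like: "words.en.txt:6357: Aline ==> Align"
LINE_RE = re.compile(
    r"^(?P<path>.+?):(?P<line>\d+): (?P<typo>\S+) ==> (?P<fixes>.+)$"
)


def parse_finding(line: str) -> Optional[Finding]:
    """Parse one report line; return None if it has a different shape."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    # Suggestions are comma-separated; a suggestion may contain spaces.
    fixes = [fix.strip() for fix in match.group("fixes").split(",") if fix.strip()]
    return Finding(
        match.group("path"), int(match.group("line")), match.group("typo"), fixes
    )


example = parse_finding("words.en.txt:67337: Noe ==> Not, no, node, know, now")
print(example)
```

Lines without the file and line prefix are not matched by this pattern and yield `None`.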

@sebweb3r
Contributor

sebweb3r commented Aug 11, 2020

hunspell with en_GB, dumped via the command below, results in:
unmunch /usr/share/hunspell/en_GB.dic /usr/share/hunspell/en_GB.aff > words.en.hunspell.GB.txt

algebraical ==> algebraic
alls ==> all, falls
Alway ==> Always
amened ==> amended, amend
anonyms ==> anonymous
Appling ==> Applying, appalling
arbitral ==> arbitrary
Aske ==> Ask
aspected ==> expected
Asser ==> Assert
ba ==> by, be
BA ==> BY, BE
Ba ==> By, be
Bacup ==> Backup
BEng ==> being
Berkley ==> Berkeley
bion ==> bio
BrE ==> be, brie
Bridget ==> Bridged
brose ==> browse, rose
cacheing ==> caching
caesarian ==> caesarean
calender ==> calendar
calenders ==> calendars
Cann ==> Can
cannister ==> canister
cannisters ==> canisters
canonicalizations ==> canonicalization
Chanel ==> Channel
charas ==> chars
Claus ==> Clause
co-ordinate ==> coordinate
co-ordinates ==> coordinates
Commerical ==> Commercial
complier ==> compiler
compliers ==> compilers
connexion ==> connection
contiguities ==> continuities
convertor ==> converter
convertors ==> converters
Corse ==> Course
Cound ==> Could, count
decompresser ==> decompressor
delink ==> unlink
Delting ==> Deleting
delusionally ==> delusively
demographical ==> demographic
Depden ==> Depend
despatch ==> dispatch
dessicate ==> desiccate
dessication ==> desiccation
dessicated ==> desiccated
digitalise ==> digitize
digitalising ==> digitizing
digitalize ==> digitize
digitalizing ==> digitizing
discernable ==> discernible
drats ==> drafts
earlies ==> earliest
easer ==> easier, eraser
Ede ==> Edge
Effient ==> Efficient
equipments ==> equipment
extraversion ==> extroversion
extravert ==> extrovert
extraverts ==> extroverts
fightings ==> fighting
fightings ==> fighting
Flagg ==> Flag
floatation ==> flotation
focussed ==> focused
refocussed ==> refocused
focussed ==> focused
focusses ==> focuses
informations ==> information
formate ==> format
formates ==> formats
informations ==> information
Frome ==> From
funguses ==> fungi
Gardai ==> Gardaí
geometrician ==> geometer
Guatamala ==> Guatemala
Hald ==> Held
hander ==> handler
Handel ==> Handle
happing ==> happening, happen
Harth ==> Hearth
heathy ==> healthy
heigh ==> height, high
homogenous ==> homogeneous
Humber ==> Number
incidently ==> incidentally
incudes ==> includes
infectuous ==> infectious
intension ==> intention
internation ==> international
interpolar ==> interpolator
interpretor ==> interpreter
invokable ==> invocable
keypair ==> key pair
keypairs ==> key pairs
keyserver ==> key server
keyservers ==> key servers
Leary ==> Leery
leat ==> lead, leak, least, leaf
leats ==> least
Mabe ==> Maybe
Mata ==> Meta, mater
meeds ==> needs
miniscule ==> minuscule
Monserrat ==> Montserrat
Muhammadan ==> Muslim
commutating ==> commuting
commutated ==> commuted
Nd ==> And, 2nd
Ned ==> Need
ned ==> need
Noth ==> North
OD ==> OF
Oder ==> Order, odor
ons ==> owns
OT ==> TO, OF, OR
overrideable ==> overridable
patten ==> pattern, patent
pattens ==> patterns, patents
Pattens ==> Patterns, patents
penality ==> penalty
Pennal ==> Panel
compliers ==> compilers
Poer ==> Power
Pont ==> Point
Ponting ==> Pointing
pre-empt ==> preempt
precent ==> percent, prescient
quitted ==> quit
raison ==> reason, raisin
readapted ==> re-adapted
recommand ==> recommend
recommanded ==> recommended
recommands ==> recommends
reoccurrence ==> recurrence
revaluated ==> reevaluated
scaleability ==> scalability
scaleable ==> scalable
setted ==> set
Skelton ==> Skeleton
Smoot ==> Smooth
Somme ==> Some
Sowe ==> Sow, so we
squirl ==> squirrel
Stoer ==> Store
targetting ==> targeting
targetted ==> targeted
Te ==> The, be
Tey ==> They
Thant ==> Than
this'd ==> this would
Thur ==> Their
Toi ==> To, toy
trigged ==> triggered
Tring ==> Trying, string, ring
Troup ==> Troupe
unmistakeably ==> unmistakably
Varian ==> Variant
Vermillion ==> Vermilion
Wass ==> Was
Wil ==> Will, well
Winn ==> Win
Worser ==> Worse
worthing ==> worth, meriting
Worthing ==> Worth, meriting

@peternewman
Collaborator Author

Thanks for this, @sebweb3r. As you may have seen, we got some core stuff in via #1142. I'm not quite sure what my "other examples need checking" comment meant with regard to not closing this issue then.

I'm a bit unclear which way round your checks were done. Is this codespell run against aspell's and hunspell's dictionaries?

Also

-seledted->sekected
+seledted->selected

Originally posted by @peternewman in #1619

@sebweb3r
Contributor

Sorry for not being precise. I dumped the aspell and hunspell dictionaries, then checked the dumps with codespell.

So all of these lines are words that are "wrong" according to codespell but exist in aspell or hunspell.

But I'm not sure whether one wants to delete all of these corrections.

@sebweb3r
Contributor

I haven't seen #1142 yet, but I will have a closer look.

@lurch
Contributor

lurch commented Aug 13, 2020

Going the other way around, and running aspell against codespell's correct-words (generated with cat dictionary.txt | cut -d'>' -f2 | sort | uniq > codespell_corrections.txt) also suggests:

  • acrued->accured should be acrued->accrued
  • acknodledgment->acknowledgment should be acknodledgment->acknowledgement
  • acknoledgment->acknowledgment should be acknoledgment->acknowledgement
  • aquries->acquries should be aquries->acquires
  • adjacentcy->adjacence should probably be adjacentcy->adjacency, adjacence, ?
  • affinitied->affinitized should probably be affinitied->affinities, affinitized, ?
  • asymetri->assymetry should be asymetri->asymmetry (and could be asymetri->asymmetric, asymmetry,)

(And possibly many more? I got bored of checking... 😉 (as there are many entries in codespell that aspell doesn't recognise, but a Google search suggests are still spelled correctly) )

If codespell is going to suggest corrections, those corrections ought to be spelled correctly 😀
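
The reverse check could be sketched in Python along these lines. This is a rough illustration rather than what was actually run: it assumes entries of the form `typo->fix1, fix2,` as seen in the examples above, treats every comma-separated item as a correction, and uses a hand-written set in place of a real aspell word list. The function names are hypothetical.

```python
from typing import Dict, Iterable, List, Set


def parse_corrections(entry: str) -> List[str]:
    """Split 'typo->fix1, fix2,' into its individual corrections."""
    _, _, rhs = entry.strip().partition("->")
    return [fix.strip() for fix in rhs.split(",") if fix.strip()]


def unknown_corrections(
    entries: Iterable[str], valid_words: Set[str]
) -> Dict[str, List[str]]:
    """Map each typo to the corrections that are not in valid_words."""
    suspicious: Dict[str, List[str]] = {}
    for entry in entries:
        typo = entry.partition("->")[0].strip()
        bad = [
            fix for fix in parse_corrections(entry) if fix.lower() not in valid_words
        ]
        if bad:
            suspicious[typo] = bad
    return suspicious


# Entries taken from the list above; the word set is a stand-in for aspell.
entries = ["acrued->accured", "aquries->acquries", "asymetri->asymmetric, asymmetry,"]
valid = {"accrued", "acquires", "asymmetric", "asymmetry"}
print(unknown_corrections(entries, valid))
```

Anything reported this way still needs a human look, since a correction missing from the word list may simply be a valid technical term.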

sebweb3r added a commit to sebweb3r/codespell that referenced this issue Aug 13, 2020
@sebweb3r
Contributor

@lurch that's why I never let spellcheckers automatically fix the errors.
I basically checked the correct spellings in #1624 already (against aspell-enUS dict) ;-)
I added some of your suggestions.

acknowledgment depends on enUS or enGB #1623 (One of the physics journals insists on using the variant without e. But they have both spellings on their introductions webpage :-) )

sebweb3r added a commit to sebweb3r/codespell that referenced this issue Aug 13, 2020
@lurch
Contributor

lurch commented Aug 13, 2020

acknowledgment depends on enUS or enGB

Ooops, I didn't realise that it had multiple spellings (like color / colour), sorry!

I added some of your suggestions.

Cool 👍

sebweb3r added a commit to sebweb3r/codespell that referenced this issue Aug 13, 2020
@peternewman
Collaborator Author

So I added some checking, but we need #1485 to have a larger dictionary and fewer false positives, or we need to split the main and rare dictionaries into corrections that are in the word list and those that aren't, so we can prioritise checking the non-dictionary words more carefully. Currently it doesn't check the corrections, as lots of valid technical terms aren't in the aspell word list.
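
The split suggested above might be sketched as follows. This is a guess at one possible shape, not the checking that was actually added: `split_entries` and its `allowlist` parameter are hypothetical, and the word set is a stand-in for the aspell word list.

```python
from typing import AbstractSet, Iterable, List, Tuple


def split_entries(
    entries: Iterable[str],
    valid_words: AbstractSet[str],
    allowlist: AbstractSet[str] = frozenset(),
) -> Tuple[List[str], List[str]]:
    """Partition 'typo->fixes' entries by whether every fix is a known word.

    allowlist is a hypothetical extra set of accepted technical terms.
    """
    known = set(valid_words) | set(allowlist)
    in_dictionary: List[str] = []
    needs_review: List[str] = []
    for entry in entries:
        _, _, rhs = entry.strip().partition("->")
        fixes = [fix.strip().lower() for fix in rhs.split(",") if fix.strip()]
        if fixes and all(fix in known for fix in fixes):
            in_dictionary.append(entry)
        else:
            needs_review.append(entry)
    return in_dictionary, needs_review


entries = ["recommand->recommend", "affinitied->affinitized"]
checked, review = split_entries(entries, {"recommend"})
print(checked, review)
```

Entries in the second list are the ones that would get the more careful manual check; passing known technical terms via `allowlist` moves them into the first list.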

sebweb3r added a commit to sebweb3r/codespell that referenced this issue Aug 18, 2020
sebweb3r added a commit to sebweb3r/codespell that referenced this issue Aug 30, 2020
sebweb3r added a commit to sebweb3r/codespell that referenced this issue Sep 2, 2020
@DimitriPapadopoulos
Collaborator

@peternewman Words not in the aspell dictionary can be added after #2933. Such words need to be whitelisted, because some specialised words will always be missing from aspell or other dictionaries, no matter how large the dictionary is. Can we close this issue?


4 participants