-
Notifications
You must be signed in to change notification settings - Fork 472
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Python spellchecking API #3425
base: main
Are you sure you want to change the base?
Conversation
e0e9bca
to
f9eaba2
Compare
@nthykier let me know when I should look. I at least just approved CI runs. One option to make it so that I don't have to approve them would be to make some other small PR that we merge. For example it could be just your couple of speed improvements. It would make an ugly rebase issue here, though, so understandable if you'd rather make a different simple PR (or just keep waiting fro me to click "approve" on the runs). |
e606739
to
b766592
Compare
No new code is introduced; only existing code is shuffled around and the functions moved are unchanged as well.
When the spelling dictionaries are loaded, previously the correction line was just stored in memory as a simple text. Through out the code, callers would then have to deal with the `data` attribute, correctly `split()` + `strip()` it. With this change, the dictionary parsing code now encapsulates this problem. The auto-correction works from the assumption that there is only one candidate. This assumption is invariant and seem to be properly maintained in the code. Therefore, we can just pick the first candidate word when doing a correction. In the code, the following name changes are performed: * `Misspelling.data` -> `Misspelling.candidates` * `fixword` -> `candidates` when used for multiple candidates (`fixword` remains for when it is a correction) On performance: Performance-wise, this change moves computation from "checking" time to "startup" time. The performance cost does not appear to be noticeable in my baseline (codespell-project#3419). Though, keep the corpus weakness on the ratio of cased vs. non-cased corrections with multiple candidates in mind. The all lowercase typo is now slightly more expensive (it was passed throughout `fix_case` and fed directly into the `print` in the original code. In the new code, it will always need a `join`). There are still an overweight of lower-case only corrections in general, so the unconditional `.join` alone is not sufficient to affect the performance noticeably.
aea338c
to
b3034fa
Compare
This is as close to a 1:1 conversion as possible. It might change whhen we get to designing the API. The callers have been refactored to only perform the lookup once. This was mostly to keep the code more readable. The performance cost does seem noticable, which is unsurprising. This method has a higher cost towards non-matches which is the most common case. This commit causes the performance to drop roughly 10% on its and we are now slower than the goal.
The refactor is a stepping stone towards the next commit where the inner loop is moved to the `Spellchecker`.
With this rewrite, performance improved slightly and is now down to 7% slower than the baseline (6s vs. 5.6s). There is deliberate an over-indentation left in this commit, since that makes this commit easier to review (without ignoring space changes).
Deliberately in a separate. There are no functional changes, but there are some reformatting changes (line merges) as a consequence of the de-indent.
The `Spellchecker` only needs the `group` method from the `re.Match`. With a bit of generics and typing protocols, we can make the `Spellchecker` work with any token type that has a `group()` method. The `codespell` command line tool still assumes `re.Match` but it can get that via its own line tokenizer, so it all works out for everyone.
The new API has introduced extra overhead per line being spellchecked. One way of optimizing out this overhead, is to spellcheck fewer lines. An obvious choice here, is to optimize out empty and whitespace-only lines, since they will not have any typos at all (on account of not having any words). A side-effect of this change is that we now spellcheck lines with trailing whitespace stripped. Semantically, this gives the same result (per "whitespace never has typos"). Performance-wise, it is faster in theory because the strings are now shorter (since we were calling `.rstrip()` anyway). In pratice, I am not sure we are going to find any real corpus where the trailing whitespace is noteworthy from a performance point of view. On the performance corpus from codespell-project#3491, this takes out ~0.4s of runtime brining us down to slightly above the 5.6s that made the baseline.
This makes the API automatically avoid some declared false-positives that the command line tool would also filter.
The changes to provide a public API had some performance related costs of about 1% runtime. There is no trivial way to offset this any further without undermining the API we are building. However, we can pull performance-related shenanigans to compenstate for the cost introduced. The codespell codebase unsurprisingly spends a vast majority of its runtime in various regex related code such as `search` and `finditer`. The best way to optimize runtime spend in regexes is to not do a regex in the first place, since the regex engine has a rather steep overhead over regular string primitives (that is the cost of flexibility). If the regex rarely matches and there is a very easy static substring that can be used to rule out the match, then you can speed up the code by using `substring in string` as a conditional to skip the regex. This is assuming the regex is used enough for the performance to matter. An obvious choice here falls on the `codespell:ignore` regex, because it has a very distinctive substring in the form of `codespell:ignore`, which will rule out almost all lines that will not match. With this little trick, runtime goes from ~5.6s to ~4.9s on the corpus mentioned in codespell-project#3419.
b3034fa
to
ce280c9
Compare
Per review comment.
Thanks. This PR should be ready for review now. :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Otherwise looks pretty straightforward !
|
||
class UnknownBuiltinDictionaryError(ValueError): | ||
def __init__(self, name: str) -> None: | ||
super().__init__(f"Unknown built-in dictionary: {name}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think codecov is right, these classes aren't used anywhere. Omit for now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I originally had a ValueError
with an error message, but that triggers a linter warning about using long error messages with exceptions and that it is better to set the message in the __init__
(which requires the subclass shown). Therefore, omitting it triggers a red CI and is not an option unless you want me to tweak the linter to skip that check.
Alternative, I can add a test for it, since that would solve the codecov issue while avoiding the linter issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To clarify, we (or at least I) do not see UnknownBuiltinDictionaryError
raised or caught in the current code, and we cannot find load_builtin_dictionaries
either. Why do you need to define it in this PR?
In any case, the linter rule you refer to has to be TRY003. Without seeing the code that raises or catches the exception, I cannot easily suggest ways to pacify the linter. However, if you only raise the exception once, isn't ValueError
sufficient?
message = f"Unknown built-in dictionary: {name}"
raise ValueError(message)
After this nice refactoring work, I suspect the maximum value of some complexity metrics might be lowered: Lines 169 to 174 in 3760e61
|
Quite possibly. What do you (as a project) expect here? That I lower them to the default values for |
Exactly. The default values won't work, but pylint will show you new lower values. That can of course be deferred to a later PR or left to other maintainers - at your convenience. |
Adding a Spellchecker API
The proposed PR is a follow up to #3419 that introduces a
Spellchecker
class. This PR deliberate does not close #3419, since the public package name should probably be changed first and an__all__
provided in its__init__,py
. In its simplest form, a consumer would do something like this:There is also a
check_lower_cased_word
method for single word checks which as a part of refactoring. I left it in because it has a simpler API if you have words you want to have checked rather than lines.Reviewing this PR
I have optimized the review the PR by reviewing each commit individually with the aim of understanding how we got to the end result. It comes with the following commits:
Each commit is written with the intend to be a small self-contained change. This also means that some commits are just moving code around or even seem a bit sub-optimal with following commits just using them as stepping stone. If you see something in a commit that you want changed, please note that it might be improved in a later commit. I still kept this as one PR since some of the commits ends up with a considerable performance regression in that commit, which is then resolved in later commits. This is all a trade off between keeping commits as standalone and reviewable as possible vs. being able to cut the PR at any given point time. I preferred the former.
Note that GitHub's PR review function (like
Approve
orRequest Changes
) operates on the entire PR even though you review it commit by commit.How the API interfaces with codespell related activities
Dictionary loads
The API by default just loads the default builtin dictionaries. The constructor accepts a sequence/list of names such as
clear
,rare
. After that, the caller can manually load custom dictionaries viaload_dictionary_from_file
. This is the default mode and aimed at the API consumers to quickly get started with the common case.The
codespell
command is not a common case. Here, dictionaries can be loaded in any other. To facilitate that, the following flow can be used:Various exclusions or regex based matches/ignores
Exclusions features supported directly in the API:
codespell:ignore
). Fully handled by default, caller can opt-out.Not supported directly by the API in its current form. Caller must facilitate these:
-X
). Caller should load exclude the line instead.(The codespell tokenizer "factory" plus related default regexes could become part of the API if desired)
What is to be included in the API
These are the items that I would see go into the
__all__
of the codespell API in a later commit / PR.Spellchecker
(introduced in this PR). All its methods would become public API.Misspelling
(existing class). All its data fields would become public API (or at least,candidates
+fix
would be neede for me). Also, we should probably rename it toCorrection
orDictionaryResult
, etc., since it is not a misspelling but the codespell correctional data.codespell
uses a different way to remember its interactively chosen corrections.DetectedMisspelling
(introduced in this PR). Subject to renaming of the class or its properties.LineTokenizer
+Token
+ their generic constraint (T
) for typing purposes.That is the basic API I envisioned and I would need. This API would still have a considerable amount of "bring your own" code, notably the tokenizer is something people will probably struggle with. My project has its own with its own rules for how things should be split into tokens (like ignoring quoted words), so this was not a goal for me. However, it could still be important for you, since it will affect more casual API consumers that did not solve this problem ahead of time. :)
An alternative to more API would to have more examples of how to emulate the codespell command line tool.
Performance cost of the API
It surprised me a bit that at some point the total regression exceeded 10% on the corpus I used for testing (see #3419 for the details on the performance test setup). It was restored to an ~8% regression before the first
Speed up
commit.The first
Speed up
commit reduces the regression to<= 1.5%
(as in, the~0.060s
ballpark) The secondSpeed up
commit was "unnecessary show off" to get below the baseline. 😆These commits speed up commits are technically a "compensation" changes. As an example, I believe you would get a similar boost of performance by retrofitting the second speed up commit on the existing code base for a similar win. The first speed up commit is partly a compensation, partly a counter of the regression since the "overhead per line" got measurably higher (accordingly, reducing the number of redundant lines now makes the improvement considerably more worth it).
The end result is
5.6s -> 4.9s
with an API and 1½-2 performance related compensations.(Side note for GitHub: This also closes #3434.)