utf8n_to_uvchr(): Simplify and fix some overlongs edge cases #22757
Merged
+100
−110
khwilliamson force-pushed the overlong_calc branch from 236578e to d7e6167 (November 24, 2024 18:17)
There are shortcuts available that cut these 8 names to 2.
It turns out that the information generated in this block is only needed if the final conditional in this complicated group of them is true, which checks if the caller wants anything special for certain classes of code points. Because that final condition is subsidiary, the block was getting executed just to be thrown away.
As a first step in simplifying this overly complicated series of conditionals, pull out the first one into a separate 'if'. The next commits will do more.
This hoists a clause in a complex conditional to the 'if' statement above it, converting that to two conditionals from one, while decreasing the number in the much larger interior 'if' by 1. This is in preparation for further simplifications in the next few commits.
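The hoisting described in this commit message can be illustrated with a minimal sketch. The function names and the four boolean conditions are hypothetical placeholders, not the actual tests in utf8n_to_uvchr(); the sketch also assumes the outer block contains nothing but the inner 'if', which is what makes the two forms equivalent.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical "before" shape: one clause in the outer 'if',
 * three clauses in the larger interior 'if'. */
static int check_before(bool a, bool b, bool c, bool d)
{
    int result = 0;
    if (a) {
        if (b && c && d) {
            result = 1;
        }
    }
    return result;
}

/* Hypothetical "after" shape: clause 'b' is hoisted to the outer 'if',
 * which now has two clauses, and the interior 'if' has one fewer. */
static int check_after(bool a, bool b, bool c, bool d)
{
    int result = 0;
    if (a && b) {
        if (c && d) {
            result = 1;
        }
    }
    return result;
}
```

Both forms give the same result for every combination of inputs; the benefit is only that the interior conditional becomes smaller and easier to simplify further.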
This splits these into an 'if' clause and an 'else' clause.
This makes things a bit simpler, but mainly leads to further simplifications in the next commits.
More rigorous testing of the overlong malformation, yet to be committed, showed that this needs to be handled specially. This commit does part of that. Perl extended UTF-8 means you are using a start byte not recognized by any UTF-8 standard. Suppose it is an overlong sequence that reduces down to something representable using standard UTF-8. The string still used non-standard UTF-8 to get there, so should still be called out when the input parameters to this function ask for that. This commit is a first step towards that.
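To make the case above concrete, here is a minimal sketch of an overlong sequence that begins with a non-standard start byte yet reduces to an ordinary code point. It assumes an ASCII platform, assumes that the extended start bytes are 0xFE and 0xFF, and assumes that 0xFE introduces a 7-byte sequence whose start byte carries no payload bits. All function names are hypothetical and are not Perl's own macros or functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 0xFE and 0xFF are the start bytes that no UTF-8 standard
 * defines. The test looks at the bytes actually used, not at the value. */
static bool uses_extended_start(const uint8_t *s, size_t len)
{
    assert(s != NULL && len > 0);
    return s[0] >= 0xFE;
}

/* Decode a 7-byte sequence introduced by 0xFE (assumed layout): each of
 * the six continuation bytes contributes its low 6 bits. */
static uint64_t decode_fe_sequence(const uint8_t *s)
{
    assert(s != NULL && s[0] == 0xFE);
    uint64_t cp = 0;
    for (size_t i = 1; i < 7; i++) {
        assert((s[i] & 0xC0) == 0x80);   /* must be a continuation byte */
        cp = (cp << 6) | (uint64_t)(s[i] & 0x3F);
    }
    return cp;
}

/* A value-only test: does the code point exceed what the original
 * 31-bit UTF-8 definition could represent? */
static bool value_needs_extended(uint64_t cp)
{
    return cp > 0x7FFFFFFF;
}
```

For the overlong input FE 80 80 80 80 80 81, the decoded value is 1, so a value-only test reports nothing unusual, while a test on the start byte still sees the non-standard encoding. That difference is the situation the commit message describes.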
By not overriding the computed value of malformed input until later in the function, we can eliminate this temporary variable. This paves the way to a much bigger simplification in the next commit.
It turns out that the work being done in the first block is only used in the second block. If that block doesn't get executed, the first block's effort is thrown away. So fold the first block into the second. This results in a bunch of temporaries that were used to communicate between the blocks being able to be removed. More detailed comments are added.
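The folding described above can be sketched as follows. This is an illustration under assumptions, not the actual code: the helper, the counter, and the 'wanted' flag are hypothetical stand-ins for the first block's work and the second block's condition.

```c
#include <assert.h>
#include <stdbool.h>

/* Counts how often the (hypothetical) first block's work is performed. */
static int work_calls = 0;

static int first_block_work(int input)
{
    work_calls++;
    return input * 2;
}

/* Before: the work is always done, and a temporary carries the result
 * to a second block that may never run. */
static int process_before(int input, bool wanted)
{
    int temp = first_block_work(input);
    if (wanted) {
        return temp;
    }
    return 0;
}

/* After: the work is folded into the block that uses it, so it is
 * skipped when that block does not execute, and the communicating
 * temporary no longer has to live at function scope. */
static int process_after(int input, bool wanted)
{
    if (wanted) {
        return first_block_work(input);
    }
    return 0;
}
```

The two versions return the same values; only the amount of discarded work and the number of long-lived temporaries differ.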
Don't execute this loop if it would be pointless.
Make sure it isn't being called with unexpected input --
Admittedly this saves little work, but while reading the code I realized that some function exits ignore this initialization. Instead, move the initialization to later, where it is actually needed.
Remove excess indentation
More rigorous testing of the overlong malformation, yet to be committed, showed that this didn't work as intended. The IS_UTF8_START_BYTE() excludes start bytes that always lead to overlong sequences. Fortunately the logic caused that to be mostly bypassed. But this commit fixes it all.
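As an illustration of start bytes that always lead to overlong sequences, the sketch below uses standard UTF-8 on an ASCII platform, where the 2-byte start bytes 0xC0 and 0xC1 can only encode values below 0x80. The function names are hypothetical and are not Perl's macros; they only contrast a purely structural start-byte test with one that excludes the always-overlong start bytes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Decode a 2-byte sequence: 5 payload bits from the start byte,
 * 6 from the continuation byte. */
static uint32_t decode_two_byte(uint8_t start, uint8_t cont)
{
    assert((start & 0xE0) == 0xC0);   /* 110xxxxx */
    assert((cont & 0xC0) == 0x80);    /* 10xxxxxx */
    return ((uint32_t)(start & 0x1F) << 6) | (uint32_t)(cont & 0x3F);
}

/* Structural test: any byte with its top two bits set can begin a
 * multi-byte sequence, including ones that are necessarily overlong. */
static bool is_structural_start(uint8_t b)
{
    return (b & 0xC0) == 0xC0;
}

/* Stricter test: excludes 0xC0 and 0xC1, whose sequences always encode
 * a value that fits in a single byte. */
static bool is_non_overlong_start(uint8_t b)
{
    return b >= 0xC2;
}
```

The largest value reachable from 0xC1 is 0x7F (C1 BF), which a single byte can represent, while C2 80 decodes to 0x80, the first value that genuinely needs two bytes. Code that looks for overlongs therefore cannot rely on a test that has already filtered out 0xC0 and 0xC1.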
khwilliamson force-pushed the overlong_calc branch from d7e6167 to bfac0e3 (November 24, 2024 18:32)
These commits came about from my reading the code and doing more rigorous testing (the commits for which are WIP), which revealed some corner-case issues with handling overlong sequences and sequences that overflow the platform's word size. The changes here address the overlong ones. Current test cases do not catch these obscure bugs; tests that do will come in later pull requests.