utf8n_to_uvchr(): Simplify and fix some overlongs edge cases #22757

Merged
merged 14 commits into from
Nov 24, 2024

khwilliamson

These commits came from my reading the code and from more rigorous testing (the commits for which are still WIP), which revealed some corner-case issues in the handling of overlong sequences and of sequences that overflow the platform's word size. The changes here address the overlong cases. Current test cases do not catch these obscure bugs; tests for them will come in later pull requests.
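
For readers unfamiliar with the term, an overlong sequence is one that spends more bytes on a code point than the shortest encoding needs. The following is only a rough sketch of that definition for standard 1 to 4 byte UTF-8. It is not the code in utf8n_to_uvchr(), and the names `min_utf8_len`, `decode_lenient` and `is_overlong` are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Shortest number of bytes needed to encode cp in standard UTF-8
 * (code points up to 0x10FFFF).  Returns 0 if cp is out of that range. */
static size_t min_utf8_len(uint32_t cp)
{
    if (cp < 0x80)      return 1;
    if (cp < 0x800)     return 2;
    if (cp < 0x10000)   return 3;
    if (cp <= 0x10FFFF) return 4;
    return 0;
}

/* Decode one sequence of 1 to 4 bytes WITHOUT rejecting overlongs.
 * Returns the number of bytes consumed, or 0 if the input is too short
 * or is not a well-formed start byte plus continuation bytes. */
static size_t decode_lenient(const unsigned char *s, size_t avail, uint32_t *cp)
{
    size_t len, i;
    uint32_t v;

    if (avail == 0) return 0;
    if (s[0] < 0x80)                { *cp = s[0]; return 1; }
    else if ((s[0] & 0xE0) == 0xC0) { len = 2; v = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; v = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; v = s[0] & 0x07; }
    else return 0;

    if (avail < len) return 0;
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;   /* not a continuation byte */
        v = (v << 6) | (s[i] & 0x3F);
    }
    *cp = v;
    return len;
}

/* Returns 1 if the sequence is overlong, 0 if it is the shortest form,
 * and -1 if it could not be decoded or is above 0x10FFFF. */
static int is_overlong(const unsigned char *s, size_t avail)
{
    uint32_t cp;
    size_t used = decode_lenient(s, avail, &cp);
    size_t need;

    if (used == 0) return -1;
    need = min_utf8_len(cp);
    if (need == 0) return -1;
    return used > need;
}
```

With this sketch, the two bytes C0 80 decode to 0 but are reported as overlong, because 0 fits in a single byte, while C3 A9 (U+00E9) is the shortest form and is not.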

  • This set of changes may require a perldelta entry, but it is premature

There are shortcuts available that cut these 8 names to 2.

It turns out that the information generated in this block is only needed
if the final conditional in this complicated group of them is true,
which checks if the caller wants anything special for certain classes of
code points.  Because that final condition is subsidiary, the block was
getting executed just to be thrown away.

As a first step in simplifying this overly complicated series of
conditionals, pull out the first one into a separate 'if'.  The next
commits will do more.

This hoists a clause in a complex conditional to the 'if' statement
above it, converting that to two conditionals from one, while decreasing
the number in the much larger interior 'if' by 1.

This is in preparation for further simplifications in the next few
commits.

This splits these into an 'if' clause and an 'else' clause.

This makes things a bit simpler, but mainly leads to further
simplifications in the next commits.

More rigorous testing of the overlong malformation, yet to be committed,
showed that this needs to be handled specially.  This commit does part
of that.

Perl extended UTF-8 means you are using a start byte not recognized by
any UTF-8 standard.  Suppose it is an overlong sequence that reduces
down to something representable using standard UTF-8.  The string still
used non-standard UTF-8 to get there, so should still be called out when
the input parameters to this function ask for that.  This commit is a
first step towards that.
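
To make the case above concrete, here is a small sketch under several assumptions: an ASCII platform, 0xFE and 0xFF as the start bytes that no UTF-8 standard recognizes, and a 7-byte layout for 0xFE (no payload bits in the start byte, six continuation bytes of 6 bits each). The struct and function names are hypothetical and this is not how utf8n_to_uvchr() is written. It only illustrates that "used an extended start byte" and "decoded to a small value" are independent facts.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: on an ASCII platform, 0xFE and 0xFF are the start bytes
 * outside every UTF-8 standard. */
static int is_extended_start_byte(unsigned char b)
{
    return b >= 0xFE;
}

/* Hypothetical report of what was seen while decoding. */
struct decode_report {
    uint64_t cp;            /* decoded value */
    int      used_extended; /* 1 if the start byte was non-standard */
    int      overlong;      /* 1 if a shorter encoding would have worked */
};

/* Decode a 7-byte sequence that starts with 0xFE (assumed layout: six
 * continuation bytes, 36 payload bits).  Returns 1 on success, 0 if the
 * input is too short or malformed. */
static int decode_fe_sequence(const unsigned char *s, size_t avail,
                              struct decode_report *r)
{
    size_t i;
    uint64_t v = 0;

    if (avail < 7 || s[0] != 0xFE) return 0;
    for (i = 1; i < 7; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;   /* not a continuation byte */
        v = (v << 6) | (s[i] & 0x3F);
    }
    r->cp = v;
    /* Recorded from the start byte alone, whatever the value turns out to be. */
    r->used_extended = is_extended_start_byte(s[0]);
    /* A 6-byte sequence holds at most 31 bits, so anything below 2**31
     * did not need the 7-byte form. */
    r->overlong = v < ((uint64_t)1 << 31);
    return 1;
}
```

Feeding it FE 80 80 80 80 81 81 yields the value 0x41 with both `overlong` and `used_extended` set, which is the situation described above: the result is plain 'A', yet the input still used non-standard UTF-8 to get there.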
By not overriding the computed value of malformed input until later in
the function, we can eliminate this temporary variable.  This paves the
way to a much bigger simplification in the next commit.

It turns out that the work being done in the first block is only used in
the second block.  If that block doesn't get executed, the first block's
effort is thrown away.  So fold the first block into the second.  This
allows the removal of a bunch of temporaries that were used only to
communicate between the blocks.
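
The shape of this refactoring can be sketched as follows. This is an illustration only, not the actual code: `analyze`, `check_before`, `check_after` and the `expensive_calls` counter are hypothetical stand-ins, and the surrogate test merely plays the role of "some class of code points the caller may care about".

```c
#include <stdint.h>

/* Counts how often the costly analysis runs, so the difference between
 * the two versions can be observed. */
static int expensive_calls = 0;

/* Hypothetical stand-in for the work done in the first block. */
static int analyze(uint32_t cp)
{
    expensive_calls++;
    return cp >= 0xD800 && cp <= 0xDFFF;
}

/* Before: the first block always runs and hands its result to the
 * second block through a temporary. */
static int check_before(uint32_t cp, int caller_cares)
{
    int is_special = analyze(cp);   /* wasted when caller_cares is 0 */

    if (caller_cares) {
        return is_special;
    }
    return 0;
}

/* After: the first block is folded into the second, so the work is done
 * only when its result is used and the temporary disappears. */
static int check_after(uint32_t cp, int caller_cares)
{
    if (caller_cares) {
        return analyze(cp);
    }
    return 0;
}
```

Both versions return the same answers; the folded one simply never calls `analyze` when the caller has not asked for the check.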

More detailed comments are added.

Don't execute this loop if it would be pointless.

Make sure it isn't being called with unexpected input.

Admittedly not much work, but I realized in code reading that there
are function exits that ignore this initialization.  Instead
move the initialization to later, where it is actually needed.

Remove excess indentation.

More rigorous testing of the overlong malformation, yet to be committed,
showed that this didn't work as intended.

IS_UTF8_START_BYTE() excludes start bytes that always lead to overlong sequences, so it is the wrong test when looking for overlongs. Fortunately the logic caused that test to be mostly bypassed, but this commit fixes the remaining cases.
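
As a rough illustration of why such a predicate is a poor fit for overlong detection, consider the sketch below. It assumes an ASCII platform, where C0 and C1 are the start bytes that can only ever begin an overlong sequence (a two-byte sequence starting with either one decodes to a value below 0x80). The function names are hypothetical and are not Perl's macros.

```c
/* Any byte that has the bit pattern of a start byte, overlong or not
 * (assumption: ASCII platform). */
static int is_start_byte_any(unsigned char b)
{
    return b >= 0xC0;
}

/* Start bytes that can begin a shortest-form sequence.  C0 and C1 are
 * left out because whatever follows them is necessarily overlong. */
static int is_start_byte_non_overlong(unsigned char b)
{
    return b >= 0xC2;
}

/* Start bytes that always lead to an overlong sequence. */
static int start_byte_always_overlong(unsigned char b)
{
    return is_start_byte_any(b) && !is_start_byte_non_overlong(b);
}
```

Code that asks the stricter question before examining a sequence never gets to see C0 or C1, which are exactly the bytes an overlong check needs to look at.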
@khwilliamson khwilliamson merged commit f254d77 into Perl:blead Nov 24, 2024
33 checks passed
@khwilliamson khwilliamson deleted the overlong_calc branch November 24, 2024 19:17