utf8n_to_uvchr(): Simplify and fix some overlongs edge cases #22757

khwilliamson · 2024-11-17T22:51:34Z

These commits came about from my reading the code and doing more rigorous testing (the commits for which are WIP), which revealed some corner-case issues in handling overlong sequences and sequences that overflow the platform's word size. The changes here address the overlong ones. Current test cases do not catch these obscure bugs; tests for them will come in later pull requests.

This set of changes may require a perldelta entry, but it is premature.

There are shortcuts available that cut these 8 names to 2.

It turns out that the information generated in this block is only needed if the final conditional in this complicated group of them is true, which checks if the caller wants anything special for certain classes of code points. Because that final condition is subsidiary, the block was getting executed just to be thrown away.
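The general pattern can be illustrated with a minimal sketch. It is not the code from utf8.c: `WANT_SPECIAL`, `classify()`, and both `needs_report_*` functions are hypothetical names invented here, and the surrogate check merely stands in for the real classification work.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag: the caller wants special handling for some code points. */
#define WANT_SPECIAL 0x1u

static int expensive_calls = 0;   /* counts how often the costly step runs */

/* Stand-in for the costly classification work (here: is it a surrogate?). */
static unsigned classify(unsigned long cp)
{
    expensive_calls++;
    return (cp >= 0xD800 && cp <= 0xDFFF) ? 1u : 0u;
}

/* Before: the classification always runs, even if its result is discarded. */
static bool needs_report_eager(unsigned long cp, unsigned flags)
{
    assert((flags & ~WANT_SPECIAL) == 0);   /* only one flag exists here */
    unsigned cls = classify(cp);
    return (flags & WANT_SPECIAL) && cls != 0;
}

/* After: test the cheap, deciding condition first; classify only if needed. */
static bool needs_report_lazy(unsigned long cp, unsigned flags)
{
    assert((flags & ~WANT_SPECIAL) == 0);
    if (!(flags & WANT_SPECIAL))
        return false;
    return classify(cp) != 0;
}
```

Both functions return the same answer for every input; the lazy one simply skips `classify()` when the caller asked for nothing special.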

As a first step in simplifying this overly complicated series of conditionals, pull out the first one into a separate 'if'. The next commits will do more.

This hoists a clause in a complex conditional to the 'if' statement above it, converting that to two conditionals from one, while decreasing the number in the much larger interior 'if' by 1. This is in preparation for further simplifications in the next few commits.
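As a rough illustration of that kind of hoisting (the conditions `a` through `d` and the function names are placeholders, not the actual tests in utf8.c), the transformation looks like this. It is only valid when the outer block contains nothing besides the inner 'if', which is assumed here.

```c
#include <assert.h>
#include <stdbool.h>

/* Before: the outer 'if' tests one condition, the inner one tests three. */
static int before(bool a, bool b, bool c, bool d)
{
    if (a) {
        if (b && c && d)
            return 1;
    }
    return 0;
}

/* After: 'b' is hoisted to the outer 'if', leaving two conditions inside. */
static int after(bool a, bool b, bool c, bool d)
{
    if (a && b) {
        if (c && d)
            return 1;
    }
    return 0;
}

/* Walks all 16 input combinations and asserts the two forms agree.
 * Returns the number of combinations checked. */
static int check_equivalent(void)
{
    int checked = 0;
    for (int bits = 0; bits < 16; bits++) {
        bool a = (bits & 1) != 0, b = (bits & 2) != 0;
        bool c = (bits & 4) != 0, d = (bits & 8) != 0;
        assert(before(a, b, c, d) == after(a, b, c, d));
        checked++;
    }
    return checked;
}
```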

This splits these into an if clause, and an else clause

This makes things a bit simpler, but mainly leads to further simplifications in the next commits.

More rigorous testing of the overlong malformation, yet to be committed, showed that this needs to be handled specially. This commit does part of that. Perl extended UTF-8 means you are using a start byte not recognized by any UTF-8 standard. Suppose it is an overlong sequence that reduces down to something representable using standard UTF-8. The string still used non-standard UTF-8 to get there, so it should still be called out when the input parameters to this function ask for that. This commit is a first step towards that.
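To make the case concrete, here is a simplified decoder sketch, not the logic in utf8.c. The flag names and the functions `min_len()` and `decode()` are hypothetical. It assumes an ASCII platform, treats start byte 0xFE as beginning a 7-byte sequence, and rejects 0xFF rather than modelling its longer form. It reports "overlong" and "extended" independently, so a sequence can be both.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result flags for this sketch (not Perl's real names). */
#define GOT_OVERLONG 0x1u
#define GOT_EXTENDED 0x2u
#define GOT_ERROR    0x4u

/* Fewest bytes the generic UTF-8 bit pattern needs for cp (1..7). */
static size_t min_len(uint64_t cp)
{
    if (cp < 0x80)          return 1;
    if (cp < 0x800)         return 2;
    if (cp < 0x10000)       return 3;
    if (cp < 0x200000)      return 4;
    if (cp < 0x4000000)     return 5;
    if (cp < 0x80000000ULL) return 6;
    return 7;
}

/* Decodes one sequence at s; stores the value in *cp and returns flags. */
static unsigned decode(const uint8_t *s, size_t len, uint64_t *cp)
{
    assert(s != NULL && cp != NULL && len >= 1);

    uint8_t start = s[0];
    if (start < 0x80) { *cp = start; return 0; }
    if (start < 0xC0 || start == 0xFF)   /* continuation byte, or unmodelled */
        return GOT_ERROR;

    size_t need = 0;                     /* leading 1 bits = sequence length */
    for (uint8_t mask = 0x80; start & mask; mask >>= 1)
        need++;
    if (len < need)
        return GOT_ERROR;

    uint64_t v = start & (0xFFu >> (need + 1));   /* start byte payload */
    for (size_t i = 1; i < need; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return GOT_ERROR;
        v = (v << 6) | (s[i] & 0x3Fu);
    }
    *cp = v;

    unsigned flags = 0;
    if (need > min_len(v)) flags |= GOT_OVERLONG;
    if (start >= 0xFE)     flags |= GOT_EXTENDED;  /* non-standard start byte */
    return flags;
}
```

With this sketch, the bytes `FE 80 80 80 80 81 81` decode to U+0041 and come back flagged as both overlong and extended, which is the situation described above: the value is ordinary, but the encoding used a non-standard start byte to reach it.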

By not overriding the computed value of malformed input until later in the function, we can eliminate this temporary variable. This paves the way to a much bigger simplification in the next commit.

It turns out that the work being done in the first block is only used in the second block. If that block doesn't get executed, the first block's effort is thrown away. So fold the first block into the second. This allows a bunch of temporaries that were used to communicate between the blocks to be removed. More detailed comments are added.

Don't execute this loop if it would be pointless.

Make sure it isn't being called with unexpected input --

Admittedly not much work, but I realized in code reading that there are function exits that ignore this initialization. Instead, move the initialization to later, where it is actually needed.

Remove excess indentation

More rigorous testing of the overlong malformation, yet to be committed, showed that this didn't work as intended. IS_UTF8_START_BYTE() excludes start bytes that always lead to overlong sequences. Fortunately the logic caused that to be mostly bypassed. But this commit fixes it all.
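A small sketch of why such a predicate is the wrong tool when looking for overlongs. The function names below are hypothetical and this is not the macro's definition; it only relies on the fact that, on ASCII platforms, the bytes 0xC0 and 0xC1 can begin nothing but overlong two-byte sequences.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Any byte with the two high bits set could begin a multi-byte sequence. */
static bool is_any_start_byte(uint8_t b)
{
    return b >= 0xC0;
}

/* Stricter test that leaves out 0xC0 and 0xC1, which always yield overlongs. */
static bool is_non_overlong_start_byte(uint8_t b)
{
    return b >= 0xC2;
}

/* Largest code point a two-byte sequence beginning with b can encode. */
static unsigned two_byte_max(uint8_t b)
{
    assert(b >= 0xC0 && b <= 0xDF);     /* two-byte start bytes only */
    return ((b & 0x1Fu) << 6) | 0x3Fu;
}
```

Because `two_byte_max(0xC1)` is 0x7F, everything those two start bytes can encode already fits in a single byte. Code that first filters with the stricter predicate therefore never sees exactly the bytes an overlong check most needs to examine.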

khwilliamson force-pushed the overlong_calc branch from 236578e to d7e6167 Compare November 24, 2024 18:17

khwilliamson added 14 commits November 24, 2024 11:19

utf8.c: Replace macros by more compact equivalents

9be0c1f

There are shortcuts available that cut these 8 names to 2.

utf8.c: Split conditionals

12285fd

As a first step in simplifying this overly complicated series of conditionals, pull out the first one into a separate 'if'. The next commits will do more.

utf8.c: Further simplify complex conditional

857fe56

This splits these into an if clause, and an else clause

utf8.c: Swap order of blocks

8a4d5f9

This makes things a bit simpler, but mainly leads to further simplifications in the next commits.

utf8.c: Remove intermediate value

88bb717

By not overriding the computed value of malformed input until later in the function, we can eliminate this temporary variable. This paves the way to a much bigger simplification in the next commit.

utf8.c: Don't throw away work

953bbd9

Don't execute this loop if it would be pointless.

utf8n_to_uvchr_msgs_helper: Add assertion

7881d75

Make sure it isn't being called with unexpected input --

utf8n_to_uvchr_msgs_helper: Don't throw away work

7f8a862

Admittedly not much work, but I realized in code reading that there are function exits that ignore this initialization. Instead, move the initialization to later, where it is actually needed.

utf8.c: White-space only

c9a6111

Remove excess indentation

khwilliamson force-pushed the overlong_calc branch from d7e6167 to bfac0e3 Compare November 24, 2024 18:32

khwilliamson merged commit f254d77 into Perl:blead Nov 24, 2024
33 checks passed

khwilliamson deleted the overlong_calc branch November 24, 2024 19:17
