utf8n_to_uvchr_msgs_helper(): Refactor expression
More rigorous testing of the overlong malformation, yet to be committed,
showed that this expression didn't work as intended.

The IS_UTF8_START_BYTE() excludes start bytes that always lead to overlong sequences, which makes it the wrong test when the goal is to detect overlongs.  Fortunately the surrounding logic caused that test to be mostly bypassed.  This commit fixes the expression.
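The distinction the message draws can be sketched in a few lines of C. This is only an illustration under stated assumptions: it covers ASCII platforms only (so there is no NATIVE_TO_I8() translation), the function names are hypothetical stand-ins rather than Perl's macros, and the strict test assumes that 0xC0 and 0xC1 are the start bytes being excluded, since in UTF-8 they can only begin overlong two-byte sequences.

```c
#include <stdbool.h>

/* Hypothetical stand-in: a byte that has the *form* of a start byte.
 * On ASCII platforms that means the two high bits are set. */
static bool is_syntactic_start(unsigned char b)
{
    return b >= 0xC0;
}

/* Hypothetical stand-in for a stricter test (assumption: ASCII platform).
 * It also rejects 0xC0 and 0xC1, which can only begin overlong
 * two-byte sequences. */
static bool is_strict_start(unsigned char b)
{
    return b >= 0xC2;
}

/* True for bytes that look like start bytes but that the strict test
 * rejects.  These are exactly the bytes an overlong check must not
 * lose sight of. */
static bool is_overlong_only_start(unsigned char b)
{
    return is_syntactic_start(b) && !is_strict_start(b);
}
```

Under these assumptions, is_overlong_only_start() holds only for 0xC0 and 0xC1. A continuation byte such as 0x80 fails both tests, so negating the strict test cannot tell it apart from an overlong start byte.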
khwilliamson committed Nov 24, 2024
1 parent 437a1fc commit d7e6167
Showing 1 changed file with 11 additions and 7 deletions.
18 changes: 11 additions & 7 deletions utf8.c
@@ -1534,15 +1534,19 @@ Perl__utf8n_to_uvchr_msgs_helper(const U8 *s,
 possible_problems |= UTF8_GOT_OVERFLOW;
 }

+/* Is the first byte of 's' a start byte in the UTF-8 encoding system, not
+ * excluding starting an overlong sequence? */
+#define UTF8_IS_SYNTACTIC_START_BYTE(s) (NATIVE_TO_I8(*s) >= 0xC0)
+
 /* Check for overlong. If no problems so far, 'uv' is the correct code
- * point value. Simply see if it is expressible in fewer bytes. Otherwise
- * we must look at the UTF-8 byte sequence itself to see if it is for an
- * overlong */
+ * point value. Simply see if it is expressible in fewer bytes. But if
+ * there are other malformations, we may be still be able to tell if this
+ * is an overlong by looking at the UTF-8 byte sequence itself */
 if ( ( LIKELY(! possible_problems)
- && UNLIKELY(expectlen > (STRLEN) OFFUNISKIP(uv)))
- || ( UNLIKELY(possible_problems)
- && ( UNLIKELY(! UTF8_IS_START(*s0))
- || (UNLIKELY(0 < is_utf8_overlong(s0, s - s0))))))
+ && UNLIKELY(expectlen > OFFUNISKIP(uv)))
+ || ( UNLIKELY(possible_problems)
+ && UTF8_IS_SYNTACTIC_START_BYTE(s0)
+ && UNLIKELY(0 < is_utf8_overlong(s0, s - s0))))
 {
 possible_problems |= UTF8_GOT_LONG;
 }
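To see how the old and new forms of the "other problems" branch differ, the sketch below models the two boolean expressions in isolation. It is not Perl's code. It assumes an ASCII platform, looks_overlong() is a simplified hypothetical stand-in for is_utf8_overlong() that only handles sequences of up to four bytes, and every name is invented for illustration. It also ignores any earlier checks in the real function that may keep some inputs from reaching this expression.

```c
#include <stdbool.h>
#include <stddef.h>

static bool is_cont(unsigned char b)            { return (b & 0xC0) == 0x80; }
static bool is_syntactic_start(unsigned char b) { return b >= 0xC0; }
static bool is_strict_start(unsigned char b)    { return b >= 0xC2; }

/* Simplified, hypothetical overlong test on the bytes seen so far. */
static bool looks_overlong(const unsigned char *s, size_t len)
{
    if (len == 0) return false;
    if (s[0] == 0xC0 || s[0] == 0xC1) return true;   /* 2-byte forms */
    if (len < 2 || !is_cont(s[1])) return false;
    if (s[0] == 0xE0 && s[1] < 0xA0) return true;    /* 3-byte forms */
    if (s[0] == 0xF0 && s[1] < 0x90) return true;    /* 4-byte forms */
    return false;
}

/* Shape of the old branch: a first byte that fails the strict start
 * test is treated as overlong without looking any further. */
static bool old_flags_long(bool problems, const unsigned char *s0, size_t seen)
{
    return problems && (!is_strict_start(s0[0]) || looks_overlong(s0, seen));
}

/* Shape of the new branch: the first byte must at least have the form
 * of a start byte, and the bytes seen must actually be overlong. */
static bool new_flags_long(bool problems, const unsigned char *s0, size_t seen)
{
    return problems && is_syntactic_start(s0[0]) && looks_overlong(s0, seen);
}

/* Sample inputs, each assumed to have some other malformation too. */
static const unsigned char stray_cont[]   = { 0x80 };        /* lone continuation */
static const unsigned char overlong_nul[] = { 0xC0, 0x80 };  /* overlong NUL */
static const unsigned char short_e0[]     = { 0xE0, 0x80 };  /* truncated overlong */
static const unsigned char short_euro[]   = { 0xE2, 0x82 };  /* truncated, not overlong */
```

In this model the two forms agree on overlong_nul, short_e0, and short_euro, and disagree only on stray_cont: the old form reports a lone continuation byte as overlong, while the new form does not.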