utf8n_to_uvchr(): Simplify and fix some overlongs edge cases #22757

Merged 14 commits on Nov 24, 2024

Commits on Nov 24, 2024

  1. utf8.c: Replace macros by more compact equivalents

    There are shortcuts available that cut these 8 names to 2.
    khwilliamson committed Nov 24, 2024
    9be0c1f
  2. utf8.c: Move most important conditional to be first

    It turns out that the information generated in this block is only needed
    if the final conditional in this complicated group of them is true,
    which checks if the caller wants anything special for certain classes of
    code points.  Because that final condition is subsidiary, the block was
    getting executed just to be thrown away.
    khwilliamson committed Nov 24, 2024
    a51ae5e
  3. utf8.c: Split conditionals

    As a first step in simplifying this overly complicated series of
    conditionals, pull out the first one into a separate 'if'.  The next
    commits will do more.
    khwilliamson committed Nov 24, 2024
    12285fd
  4. utf8.c: Further simplify a complex conditional

    This hoists a clause in a complex conditional to the 'if' statement
    above it, converting that to two conditionals from one, while decreasing
    the number in the much larger interior 'if' by 1.
    
    This is in preparation for further simplifications in the next few
    commits.
    khwilliamson committed Nov 24, 2024
    3d86fdf
  5. utf8.c: Further simplify complex conditional

    This splits the conditional into an 'if' clause and an 'else' clause.
    khwilliamson committed Nov 24, 2024
    857fe56
  6. utf8.c: Swap order of blocks

    This makes things a bit simpler, but mainly leads to further
    simplifications in the next commits.
    khwilliamson committed Nov 24, 2024
    8a4d5f9
  7. utf8.c: Check specially for perl-extended UTF-8

    More rigorous testing of the overlong malformation, yet to be committed,
    showed that this needs to be handled specially.  This commit does part
    of that.
    
    Perl extended UTF-8 means you are using a start byte not recognized by
    any UTF-8 standard.  Suppose it is an overlong sequence that reduces
    down to something representable using standard UTF-8.  The string still
    used non-standard UTF-8 to get there, so it should still be called out
    when the input parameters to this function ask for that.  This commit is
    a first step towards that.
    khwilliamson committed Nov 24, 2024
    8a3b341
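The point made in commit 7 can be sketched as follows: the "overlong" and "Perl extended" properties are independent, and the second is decided by the start byte alone, not by the decoded value. This is a minimal sketch and not the code in utf8.c. It assumes an ASCII platform, assumes that 0xFE is one of the start bytes outside every UTF-8 standard, and leaves out 0xFF to stay short. All `sketch_` names and flag names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag names for this sketch; utf8.c uses its own. */
enum { SKETCH_OVERLONG = 1, SKETCH_PERL_EXTENDED = 2 };

/* Sequence length announced by a start byte (ASCII platform assumed).
 * Returns 0 for continuation bytes and for 0xFF, which this sketch
 * does not handle. */
static size_t sketch_announced_len(uint8_t b)
{
    if (b < 0x80) return 1;
    if (b < 0xC0) return 0;     /* continuation byte, not a start byte */
    if (b < 0xE0) return 2;
    if (b < 0xF0) return 3;
    if (b < 0xF8) return 4;
    if (b < 0xFC) return 5;
    if (b < 0xFE) return 6;
    if (b == 0xFE) return 7;    /* assumed Perl-extended start byte */
    return 0;                   /* 0xFF: omitted here */
}

/* Fewest bytes able to represent cp (values above 36 bits not handled). */
static size_t sketch_min_len(uint64_t cp)
{
    if (cp < 0x80) return 1;
    if (cp < 0x800) return 2;
    if (cp < 0x10000) return 3;
    if (cp < 0x200000) return 4;
    if (cp < 0x4000000) return 5;
    if (cp < 0x80000000ULL) return 6;
    return 7;
}

/* Decodes one sequence.  Returns the bytes consumed, or 0 if the input is
 * truncated, malformed, or unsupported by this sketch. */
static size_t sketch_decode(const uint8_t *s, size_t avail,
                            uint64_t *cp, unsigned *flags)
{
    assert(s != NULL && cp != NULL && flags != NULL);
    if (avail == 0) return 0;

    size_t len = sketch_announced_len(s[0]);
    if (len == 0 || len > avail) return 0;

    /* Payload bits of the start byte; 0xFE carries none. */
    uint64_t v = (len == 1) ? s[0]
                            : (uint64_t)(s[0] & (0xFFu >> (len + 1)));
    for (size_t i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;   /* not a continuation */
        v = (v << 6) | (uint64_t)(s[i] & 0x3F);
    }

    *flags = 0;
    /* Decided by the start byte, even if the value turns out small. */
    if (s[0] >= 0xFE) *flags |= SKETCH_PERL_EXTENDED;
    /* Decided by comparing the announced length with the minimal one. */
    if (sketch_min_len(v) < len) *flags |= SKETCH_OVERLONG;

    *cp = v;
    return len;
}
```

With this sketch, the bytes FE 80 80 80 80 81 81 decode to U+0041 with both flags set, while C1 81 decodes to the same code point with only the overlong flag set. A caller that asked to be told about extended UTF-8 would still hear about the first input.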
  8. utf8.c: Remove intermediate value

    By not overriding the computed value of malformed input until later in
    the function, we can eliminate this temporary variable.  This paves the
    way to a much bigger simplification in the next commit.
    khwilliamson committed Nov 24, 2024
    88bb717
  9. utf8.c: Combine two blocks

    It turns out that the work being done in the first block is only used in
    the second block.  If that block doesn't get executed, the first block's
    effort is thrown away.  So fold the first block into the second.  This
    allows a bunch of temporaries that were used to communicate between
    the blocks to be removed.
    
    More detailed comments are added.
    khwilliamson committed Nov 24, 2024
    2286cf0
  10. utf8.c: Don't throw away work

    Don't execute this loop if it would be pointless.
    khwilliamson committed Nov 24, 2024
    953bbd9
  11. utf8n_to_uvchr_msgs_helper: Add assertion

    Make sure it isn't being called with unexpected input.
    khwilliamson committed Nov 24, 2024
    7881d75
  12. utf8n_to_uvchr_msgs_helper: Don't throw away work

    Admittedly not much work, but I realized while reading the code that
    there are function exits that ignore this initialization.  Instead,
    move the initialization to later, where it is actually needed.
    khwilliamson committed Nov 24, 2024
    7f8a862
  13. utf8.c: White-space only

    Remove excess indentation
    khwilliamson committed Nov 24, 2024
    c9a6111
  14. utf8n_to_uvchr_msgs_helper(): Refactor expression

    More rigorous testing of the overlong malformation, yet to be committed,
    showed that this didn't work as intended.
    
    IS_UTF8_START_BYTE() excludes start bytes that always lead to overlong
    sequences.  Fortunately the logic caused that to be mostly bypassed, but
    this commit fixes it all.
    khwilliamson committed Nov 24, 2024
    bfac0e3
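To illustrate the kind of problem commit 14 describes, here is a sketch of how a start-byte test that excludes always-overlong bytes can hide overlongs when it gates an overlong check. It does not reproduce the code path in utf8n_to_uvchr_msgs_helper(). It assumes an ASCII platform, where 0xC0 and 0xC1 can only begin overlong sequences. `sketch_is_start_byte()` is a hypothetical stand-in modelled on the commit's description of IS_UTF8_START_BYTE(), not the real macro, and the prefix table stops at 0xFC for brevity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical strict test: true only for bytes that can begin a
 * non-overlong multi-byte sequence, so 0xC0 and 0xC1 are excluded. */
static bool sketch_is_start_byte(uint8_t b)
{
    return b >= 0xC2;
}

/* Purely structural test: the byte has the bit pattern of a start byte. */
static bool sketch_has_start_byte_shape(uint8_t b)
{
    return (b & 0xC0) == 0xC0;
}

/* Overlong test from the first two bytes; b1 is assumed to be a
 * continuation byte (0x80..0xBF).  Start bytes above 0xFC are omitted. */
static bool sketch_overlong_by_prefix(uint8_t b0, uint8_t b1)
{
    switch (b0) {
    case 0xC0: case 0xC1: return true;  /* value is always below 0x80 */
    case 0xE0: return b1 < 0xA0;        /* value below 0x800 */
    case 0xF0: return b1 < 0x90;        /* value below 0x10000 */
    case 0xF8: return b1 < 0x88;        /* value below 0x200000 */
    case 0xFC: return b1 < 0x84;        /* value below 0x4000000 */
    default:   return false;
    }
}

/* Flawed: the strict gate rejects 0xC0 and 0xC1 before the overlong
 * check can see them, so those overlongs go unreported. */
static bool sketch_overlong_flawed(const uint8_t *s, size_t len)
{
    assert(s != NULL);
    if (len < 2 || !sketch_is_start_byte(s[0])) return false;
    return sketch_overlong_by_prefix(s[0], s[1]);
}

/* Fixed: gate on the structural test so every start byte is examined. */
static bool sketch_overlong_fixed(const uint8_t *s, size_t len)
{
    assert(s != NULL);
    if (len < 2 || !sketch_has_start_byte_shape(s[0])) return false;
    return sketch_overlong_by_prefix(s[0], s[1]);
}
```

For the input C1 81 the flawed variant reports no overlong and the fixed variant reports one; for E0 80 81 both agree, which is one way such a flaw can stay mostly hidden.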