Invalid UTF-8 character numbers in stringUtf8.ion #65
For the second-to-last sequence, the final string contains the UTF-8 code units of the code point itself. Each sequence contains the equivalent escaped UTF-16 surrogate pair. Please let me know if I'm misunderstanding something about the issue.
Right, I zeroed in on the wrong thing there. I still have a problem here, though. The second string in each of the UTF-16 surrogate tests contains two escape sequences.

I guess the question is: how is the parser supposed to compare string equivalencies without unescaping? If they are required to be escaped, why should the parser assume this string is to be interpreted as UTF-16 when un-escaped? My understanding of the spec was that the encoding was left open, but that it should be consistent for a given byte stream, as we have to choose an encoding up front.

Allowing escaped characters makes sense for hard-to-input or unprintable sequences, but not as a method of injecting a character that would otherwise be invalid in the provided context (encoding). If the intent is to allow a mixed stream of UTF-8, UTF-16, and UTF-32 encodings, then there must be a way to identify when the encoding changes mid-stream. The spec indicates that the escapes are code points, not that they are tied to any specific encoding.

Thanks for taking a look!
Right, the escapes represent code points. Once you un-escape the escape sequences, you are just calculating the Unicode code point (i.e., not UTF-8 or UTF-16 or any other encoding), which is effectively just an integer that maps to a particular Unicode character. If the stream's encoding is UTF-8, then it will have to escape, e.g., UTF-16 surrogate pairs. That surrogate pair maps to a Unicode code point that could have been equivalently represented natively via UTF-8 code units.

So, what you need to be able to do is map an encoding-agnostic sequence of Unicode code points to the String type of your programming language. Hopefully this is straightforward for everything that isn't an escape sequence; for escape sequences, you need to replace the escape sequence with the code point that you calculate by un-escaping before constructing the String. Once you've done that, the equivalence testing should be simple: just use the language-standard technique for comparing Strings.

There are examples of this in the string decoding logic of other Ion text parsing implementations; let me know if you have trouble finding them.
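To make the comparison approach above concrete, here is a minimal Rust sketch (a hypothetical helper, not taken from any Ion implementation) that expands 4-digit \u escapes into the code points they denote and then relies on ordinary String equality. It deliberately ignores Ion's other escape flavors and the surrogate-pair question that comes up next in this thread.

```rust
// Hypothetical sketch: expand \uXXXX escapes into their code points so that
// two string bodies can be compared as ordinary Rust Strings.
fn unescape_u(body: &str) -> Option<String> {
    let mut out = String::new();
    let mut chars = body.chars().peekable();
    while let Some(c) = chars.next() {
        if c == '\\' && chars.peek() == Some(&'u') {
            chars.next(); // consume the 'u'
            let hex: String = chars.by_ref().take(4).collect();
            let cp = u32::from_str_radix(&hex, 16).ok()?;
            // char::from_u32 rejects surrogate values (U+D800..=U+DFFF),
            // which is exactly where the pairing question below comes in.
            out.push(char::from_u32(cp)?);
        } else {
            out.push(c);
        }
    }
    Some(out)
}

fn main() {
    // \u00e9 and the literal character é denote the same code point,
    // so the unescaped strings compare equal.
    assert_eq!(unescape_u(r"caf\u00e9"), Some("café".to_string()));
    assert_eq!(unescape_u("café"), Some("café".to_string()));
}
```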
Alright, we're zeroing in:
To properly encode a UTF-16 surrogate pair into UTF-8, I would not escape the pair individually; I would determine the code point and translate that into UTF-8.

I think I object to the two escapes, though. Ironically, that is what is causing this problem for me. If the distinct code points (i.e. one per escape) are translated directly into their Unicode values, those values individually fall in the high-surrogate and low-surrogate ranges, which are only valid when adjacent as 16-bit code units in UTF-16, where they would be interpreted as a single code point in the supplementary planes.

I'm also a little confused about why this would be desired behavior, though I'm guessing it's due to the environment Ion grew up in (it's essentially leaking implementation details of UTF-16 into the string spec for Ion). Looking at the Python implementation, I see that we are not treating a 4-digit hex escape as a code point, but instead as one of a pair of UTF-16 "code units", as Wikipedia puts it. If this behavior is necessary, I think a bit more information on how escapes are intended to be processed should exist for the 4-digit hex escapes.
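To illustrate the point about lone surrogate values (the code point U+1F600 and its surrogate pair D83D/DE00 are an arbitrary example, not necessarily what the test file uses), here is how that looks from Rust:

```rust
fn main() {
    // Treating each 4-digit escape as an independent code point fails,
    // because lone surrogate values are not valid Unicode scalar values.
    assert_eq!(char::from_u32(0xD83D), None); // high surrogate
    assert_eq!(char::from_u32(0xDE00), None); // low surrogate

    // Decoded together as UTF-16 code units, the same two values yield a
    // single supplementary-plane character with a valid 4-byte UTF-8 form.
    let decoded: String = std::char::decode_utf16([0xD83Du16, 0xDE00u16])
        .map(|r| r.unwrap())
        .collect();
    assert_eq!(decoded, "\u{1F600}");
    assert_eq!(decoded.len(), 4); // four UTF-8 bytes, one code point
}
```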
It turns out the Strings and Clobs section actually addresses this and has the same concerns I do:
Given that context, could we either:
I don't like the idea of leaving them, as it encourages implementation of incorrect behavior. I'd be happy to submit a PR for either.
Let's break this down a bit. There's no such thing as a "4-digit hexadecimal Unicode code point". A code point is an integer; hexadecimal is just a way of writing integers; those are two independent things. A code point can be encoded in many different ways.

Every equivalence set in that file consists of a single Unicode code point, encoded several different ways. The comments in the file are misleading: there's no UTF-16 or UTF-32 anywhere in there. The text file is UTF-8, full stop. The encoding differences test the various flavors of escape sequences that the Ion text spec defines for use in string and symbol elements. The tests represent the intent that all of the escape flavors are interchangeable: they all represent exactly the same content in the data model. The differences are purely syntactic, not semantic, and Ion implementations are required to treat them accordingly.

Suppose I have a UTF-8 encoded text file made up of the 12 (ASCII) characters of a string literal containing nothing but a \U escape: a quote, a backslash, a U, eight hexadecimal digits, and a closing quote. Every byte in that file is ASCII, yet the value it denotes is a string containing a single Unicode code point. For code points that can be encoded in four hexadecimal digits, falling into what Unicode calls the "Basic Multilingual Plane", Ion offers a shorter escape sequence using \u followed by four hex digits.

The spec could use some clarification on this subject. The intent is that when two adjacent \u escapes form a UTF-16 surrogate pair, they are interpreted together as a single code point rather than as two. As you've observed, this is confusing, and Ion only allows it for compatibility with JSON, which doesn't have an escape form for code points that need more than four hex digits.

The passage you quote from Strings and Clobs (thank you!) is outdated; the surrogate pairing behavior was poorly understood and specified until relatively recently in Ion's lifetime. It's now incorrect.

Does this help?
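As a sketch of the pairing rule described above (the function name and structure are illustrative, not from any particular Ion implementation; only the constants come from the UTF-16 definition): two adjacent \u escapes whose values fall in the high- and low-surrogate ranges combine into a single code point, exactly as UTF-16 code units would.

```rust
// Illustrative sketch of the surrogate-pairing rule: two adjacent \uHHHH
// escapes in the high- and low-surrogate ranges denote one code point,
// computed the same way UTF-16 decoding does:
//
//   code_point = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
fn combine_surrogate_escapes(high: u32, low: u32) -> Option<char> {
    if (0xD800..=0xDBFF).contains(&high) && (0xDC00..=0xDFFF).contains(&low) {
        char::from_u32(0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00))
    } else {
        None
    }
}

fn main() {
    // e.g. the escape pair \ud83d\ude00 denotes the single code point U+1F600,
    // the same value a \U0001f600 escape or the raw UTF-8 bytes would denote.
    assert_eq!(combine_surrogate_escapes(0xD83D, 0xDE00), Some('\u{1F600}'));
}
```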
Yes, that helps a lot, thank you! I didn't realize that was a limitation of JSON, and without that explanation it would have felt bad to implement this behavior without sufficient justification (lifting the semantics of UTF-16 into an escaped version of itself - why on earth would we do that?).

I don't know if you looked at the table above that paragraph, but it also needs updating. It contains a pretty clear breakdown of the semantics I was describing above.

I think it would be important to capture the following points when documenting this:
I ran up against a challenge with a specific test:
iontestdata/good/equivs/utf8/stringUtf8.ion

The final two equivalency sexps in the test include invalid UTF-8 characters (the UTF-16 surrogate characters). According to RFC 3629, "The definition of UTF-8 prohibits encoding character numbers between U+D800 and U+DFFF, which are reserved for use with the UTF-16 encoding form (as surrogate pairs) and do not directly represent characters."
The Ion text spec has the following to say:
Looking at the four bytes that make up the challenging characters in question, both sequences start with 0b1111_0xxx, which indicates a four-byte UTF-8 code point. This confirms the UTF-8 encoding of the characters within the invalid range, and indicates that these unescaped strings are invalid UTF-8. This was identified by Rust's String type (which only supports valid UTF-8).
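For reference, this is how that surfaces in Rust; the bytes below are illustrative (the direct 3-byte encoding of U+D800 and the 4-byte encoding of U+1F600), not necessarily the exact bytes in stringUtf8.ion:

```rust
fn main() {
    // 0xED 0xA0 0x80 is what U+D800 would look like if encoded directly with
    // the UTF-8 bit patterns; RFC 3629 forbids it, and Rust rejects it.
    let surrogate_bytes: [u8; 3] = [0xED, 0xA0, 0x80];
    assert!(std::str::from_utf8(&surrogate_bytes).is_err());

    // By contrast, the 4-byte encoding of a supplementary-plane code point
    // (here U+1F600) is valid UTF-8 and round-trips cleanly.
    let valid_bytes: [u8; 4] = [0xF0, 0x9F, 0x98, 0x80];
    assert_eq!(std::str::from_utf8(&valid_bytes), Ok("\u{1F600}"));
}
```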
I can submit a PR to remove those two equivalencies if that seems like the correct approach here. It also seems incorrect to include them as escape sequences in a document that is UTF-8 encoded, unless there is some kind of indication that the document should be treated as UTF-16 prior to escape expansion, since the resulting UTF-8 string would also be invalid if the escapes were expanded. I would suggest making specific files for the different Unicode encodings if this kind of test is desired for parsing UTF-16 or UTF-32 data.