YamlStream.Load with JSON with emojis (even escaped) fails: "While scanning a quoted scalar, found invalid Unicode character escape code." #838
Comments
Those UTF-8 codes are actually invalid codes for UTF-8, according to Wikipedia anyway. I wonder if the .NET Core JSON library sees your character as 2 separate characters because it might be UTF-16. I'm not able to dig into this too much right now, but I'll try to take a closer look tonight. I wonder if you can specify the encoding in the JSON serializer, set it to UTF-16, and see what happens.
UTF-16? But JSON is always UTF-8 by definition, isn't it? In any case, I can find nothing in System.Text.Json about UTF-16. Let me know if I'm wrong.
Also, AFAIK many emojis are composed of 2 or even more Unicode characters. This SO question and answers may be helpful. Finally, I highly doubt that System.Text.Json (the official .NET JSON serialization API) is doing things wrong. That would require very concrete proof.
This will also help me in getting a fix into dotnet/runtime#42847. Not sure when I'll get it done, though.
Is there any workaround I can apply now for converting JSON to YAML? The emojis etc. don't have to be unescaped; I'm OK with anything that preserves the escape codes. I simply want my JSON content converted to YAML (tweaked using a YAML visitor). Right now it seems that the YAML conversion is attempting to decode the escaped characters, and is failing at that.
Possibly relevant: The Wikipedia article on JSON, section "Character encoding", says:
The latter is exactly what System.Text.Json does. And since YAML is a superset of JSON, I would expect any YAML implementation, such as YamlDotNet, to support such surrogate pairs.
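For readers unfamiliar with the escaping being discussed, here is a small illustration of how a JSON serializer can write a character outside the Basic Multilingual Plane as a pair of `\uXXXX` escapes. It uses Python's standard `json` module purely as a stand-in; it is not System.Text.Json or YamlDotNet, and those libraries' exact output (for example, hex digit casing) may differ.

```python
import json

# U+1F601 (grinning face with smiling eyes) lies above U+FFFF,
# so a single \uXXXX escape cannot represent it.
emoji = "\U0001F601"

# With the default ensure_ascii=True, Python's json module escapes
# non-ASCII characters; this one becomes a UTF-16 surrogate pair.
escaped = json.dumps(emoji)

# Parsing the escaped form recombines the pair into one character.
round_tripped = json.loads(escaped)

print(escaped)
print(round_tripped == emoji)
```

A parser that accepts JSON as input therefore has to treat two consecutive escapes in the surrogate ranges as one character rather than validating each escape on its own.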
Ideally I would like unescaped output (i.e., emojis in the YAML), but at least the following workaround lets me preserve the surrogate pair escape codes:

    open System

    // Unique placeholder that is very unlikely to occur in the JSON itself.
    let unicodeEscapeCodePlaceholder = Guid.NewGuid().ToString()

    // Hide every \u escape so that the YAML loader does not try to decode it.
    let escapeUnicodeEscapeCodes (str: string) =
        str.Replace(@"\u", unicodeEscapeCodePlaceholder)

    // Restore the \u escapes in the generated YAML.
    let unEscapeUnicodeEscapeCodes (str: string) =
        str.Replace(unicodeEscapeCodePlaceholder, @"\u")

    let formatAsYaml json =
        let json = escapeUnicodeEscapeCodes json
        let yaml = (* load with YamlStream and transform to YAML *)
        unEscapeUnicodeEscapeCodes yaml
I definitely agree that escaped surrogate pairs should be supported. I'm hoping to have some time this weekend to look at it. I did find where it was throwing a fit, though.
Happy to hear it! 😁
I've been doing a lot of research on this, going through the Unicode spec and all that. Surrogate pairs are how UTF-16 represents characters outside the Basic Multilingual Plane (code points above U+FFFF). JSON carries them as two consecutive \uXXXX escapes, whereas in raw UTF-8 files the character is encoded directly at the byte level (from my understanding). We just need to support the escaped version, just like the YAML spec shows (if I recall). It involves a fair amount of bit masking, so it's pretty complicated and may take some time, but it may go quicker than I think. I'll let you know when I have a PR ready by linking it to this issue.
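The bit masking mentioned here comes down to a small amount of arithmetic. The following is a minimal sketch in Python of how a scanner could combine an escaped surrogate pair into one code point; it is not YamlDotNet's actual implementation, and the names `combine_surrogates` and `decode_unicode_escapes` are made up for this illustration. As a simplification it assumes the input contains no escaped backslashes (`\\u`).

```python
import re

# Matches one \uXXXX escape (four hex digits).
_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})")


def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 high/low surrogate pair into a single code point."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("not a high surrogate")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("not a low surrogate")
    # Each surrogate carries 10 payload bits; the result is offset by 0x10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def decode_unicode_escapes(text: str) -> str:
    """Decode \\uXXXX escapes, joining surrogate pairs (hypothetical helper)."""
    out = []
    pos = 0
    while pos < len(text):
        match = _ESCAPE.match(text, pos)
        if match is None:
            # Ordinary character: copy it through unchanged.
            out.append(text[pos])
            pos += 1
            continue
        unit = int(match.group(1), 16)
        pos = match.end()
        if 0xD800 <= unit <= 0xDBFF:
            # A high surrogate must be followed directly by a low surrogate escape.
            nxt = _ESCAPE.match(text, pos)
            if nxt is None:
                raise ValueError("high surrogate without a following escape")
            unit = combine_surrogates(unit, int(nxt.group(1), 16))
            pos = nxt.end()
        elif 0xDC00 <= unit <= 0xDFFF:
            raise ValueError("low surrogate without a preceding high surrogate")
        out.append(chr(unit))
    return "".join(out)
```

For example, `decode_unicode_escapes(r"hi \ud83d\ude01")` yields `"hi 😁"`, while a lone `\ud83d` raises an error instead of producing an invalid character.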
Thanks! I have no idea what the internals of YamlDotNet are doing, but wouldn't it work to just keep the loaded text as-is and not attempt to decode the escape sequences, which it seems to be doing now?
This fix is going to be released in the latest NuGet package, which should be available in about 10-15 minutes.
It works. Thanks a lot! 😊
I am serializing something to JSON with System.Text.Json, and then converting it to YAML using YamlStream. However, if the JSON contains an emoji, even if it's escaped, YamlStream.Load throws:
Code to reproduce: