-
Notifications
You must be signed in to change notification settings - Fork 65
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add fallback encoding for non-UTF-8 lines #142
Conversation
For the record, this likely has the same caveat as irssi's original "recode" implementation: very rarely, multi-byte characters can end up split between buffer boundaries, which will cause a broken glyph or an incorrect fallback to the legacy encoding. The occasional broken glyphs will also happen when using only the utf8 decoder without fallback, but the fallback can make the broken behaviour more noticeable. Fixing this would require more extensive changes to how the buffers are handled, so I didn't touch it at this time. |
I'm not following why a fallback encoding is in use here. If you know the encoding to be using, then just set the encoding option instead of some fallback? And as for closing #7, I don't see how this auto detects the encoding. |
The idea is not to auto-detect an arbitrary encoding (which would be infeasible, and did not work in practice with jschardet), but rather to use a fallback to a known single-byte encoding for all lines that aren't valid UTF-8. This is the solution that's used by irssi and mIRC. |
For background, the actual problem this solves is the transition effects of switching from legacy charsets to Unicode. When you're a non-English-speaker hanging out with other native speakers of the same language, you'll almost certainly know which alternative encoding to expect. |
38dbfd4
to
bdb08e6
Compare
I tried to use on my setup, what uses utf-8 and iso-8859-1 in lounge. By default, some but not all ü are showns as ü as expected. Changed backup to latin1, no change. Suprisingly, problem was znc on middle, what buffers and converts it to mojibikes. after buffer messages, it works as expected. futher testing revealed, that problem might be in sqlite buffering. |
Is this going to go anywhere? I get priorities, but six months is pretty glacial. |
Pretty please, either implement this PR or find another way to do it. It's so massively annoying to see messages with ??????????? every day, if you happen to be in a channel where the alphabet consists of more than A to Z. |
bdb08e6
to
7bb64fd
Compare
Rebased the branch to bring it back up to date with the current master. Tested that it still works as expected with The Lounge 3.0.0. Is there something else that I could do to improve the odds of having this accepted? |
I got problems on decoding in thelounge client, when user name contains üõöä characters, if it is allowed from irc server. Other then that, this method works for me. |
I can see how using utf-8-validate is unwanted, as it compiles a native module (so it won't work in a browser for example). What about overriding the � to our own marker (null byte or something) in the iconv library, and if the converted string contains it, fallback to another encoding? See also ashtuchkin/iconv-lite#53 |
|
Also, good point about the nicknames, I'm not sure how to best resolve that. Unicode nicknames are not in IRCv3, but there has apparently been talk of allowing them. Some servers do, but I have no idea if any notable IRC networks are included. I suppose I could try to revise this so that the line is initially decoded and parsed as UTF-8, and maybe go for something like what xPaw suggests to keep the per-line overhead reasonable. Would this sound more like something that the maintainers would be willing to integrate? |
If you recieve data unicode format, you lose information, because � does not contain the bits to fix it. I managed some time ago to keep inbound and outbound connections on different encodings, what allowed to fix the problems and what i think should be a correct way to do it. |
@metsjeesus: Receiving byte strings from the network and interpreting them as characters are two completely different things. What is done here is essentially the same thing as decoding the raw byte string twice using different encodings and picking the correct result. There is no loss, because both operations use the same source data. |
When receiving text from a TCP socket, there's a rare chance with multibyte encodings that characters may end up split between two receive buffers. This can cause spurious decode errors if the buffers are decoded before splitting them into lines. This commit changes TCP / TLS transport behaviour so that incoming data is kept in byte buffers until complete lines can be formed.
Adds option to specify a fallback encoding, which defaults to 'cp1252'. As long as the default encoding is UTF-8 and a fallback encoding is set, any lines that are invalid UTF-8 will be decoded using the fallback encoding. Otherwise decoding behaviour remains unchanged.
7bb64fd
to
aa4a104
Compare
@xPaw: Unfortunately, the default character can't be overridden for any of the encodings that are covered by Node.js - including UTF-8 - so that approach is not possible with iconv-lite. Because there was no feedback from the maintainers, I ended up revising this branch so that the (rare) problem involving split multibyte characters should be fixed now, but otherwise the solution is the same. Retaining full Unicode nickname support would involve moving re-decoding from the transport object into the parsing phase, which would make the patches much more invasive. |
The more extensive changes to fix Unicode nickname support are available here: https://github.com/lorkki/irc-framework/tree/fix-unicode-nicknames |
We've been discussing UTF-8 and fallbacks quite a bit in #ircv3 past couple of days, and here's some observations. The way you currently implemented fallback means if the entire line can't be decoded as UTF8 it will decode it using fallback encoding for the entire line. This approach will bring in more problems (e.g. message-tags will more than likely require to be always UTF8), and in other cases user message can become completely garbled. HexChat for example removed fallbacks from UTF8 specifically because it was causing such problems with invalid utf8 messages. IRCCloud as another example decodes per byte, and fallbacks to latin1 only for characters that fail to decode to UTF8, not for the entire line. |
Yep, I'm aware of that, and I've also spent some time thinking about it after I last revisited this code. The other branch I linked to makes the fallback decoding work only on the message parameters - which in hindsight is probably still too much, and should be limited to just the "trailing" part. It should be considered a rough draft, and I've only tried it in very simple cases so far. I don't think that per-byte fallback for the entire message is really desireable either, because that will result in the behaviour of non-ASCII channel names and nicknames changing depending on whether the fallback is on and what's chosen as the fallback codec. (Potentially harmfully as well: consider a name that's interpreted as latin1 coming in and then re-encoded as UTF-8 when responding. I've personally seen things like a latin1 and UTF-8 version of the same channel name existing side by side.) At this point, attempting to really fix those cases also seems hardly worth the added complexity and UX issues anymore. IRCv3 seems to me to be cautiously edging towards standardising on UTF-8 anyway. Good catch re: the off-by-one error! Seems my local copy doesn't have it, so I must have messed something up in interactive rebase or just forgotten to push my latest changes. |
My opinion is that trying to guess and fallback to other encodings is just not worth the effort. I made a separate PR for a change I noticed you made in regards to buffer: #198 Agree to close this? |
@prawnsalad I propose we close this PR and change |
Also adds configuration option and updates documentation. Defaults to cp1252.
Closes #7