Add fallback encoding for non-UTF-8 lines #142

lorkki · 2018-04-01T17:04:45Z

Also adds configuration option and updates documentation. Defaults to cp1252.

Closes #7

lorkki · 2018-04-03T10:38:50Z

For the record, this likely has the same caveat as irssi's original "recode" implementation: very rarely, multi-byte characters can end up split between buffer boundaries, which will cause a broken glyph or an incorrect fallback to the legacy encoding. The occasional broken glyphs will also happen when using only the utf8 decoder without fallback, but the fallback can make the broken behaviour more noticeable.

Fixing this would require more extensive changes to how the buffers are handled, so I didn't touch it at this time.

prawnsalad · 2018-04-03T20:06:42Z

I'm not following why a fallback encoding is in use here. If you know the encoding to be using, then just set the encoding option instead of some fallback?

And as for closing #7, I don't see how this auto detects the encoding.

lorkki · 2018-04-03T20:17:29Z

The idea is not to auto-detect an arbitrary encoding (which would be infeasible, and did not work in practice with jschardet), but rather to use a fallback to a known single-byte encoding for all lines that aren't valid UTF-8. This is the solution that's used by irssi and mIRC.

lorkki · 2018-04-03T20:36:18Z

For background, the actual problem this solves is the transition effects of switching from legacy charsets to Unicode. When you're a non-English-speaker hanging out with other native speakers of the same language, you'll almost certainly know which alternative encoding to expect.

metsjeesus · 2018-04-18T17:48:49Z

I tried to use on my setup, what uses utf-8 and iso-8859-1 in lounge. By default, some but not all ü are showns as Ã¼ as expected. Changed backup to latin1, no change. Suprisingly, problem was znc on middle, what buffers and converts it to mojibikes. after buffer messages, it works as expected.

futher testing revealed, that problem might be in sqlite buffering.
Nope, still some characters are shown wrong. Its a tricky thing.

lorkki · 2018-09-26T07:55:36Z

Is this going to go anywhere? I get priorities, but six months is pretty glacial.

fnutt · 2018-10-06T19:51:01Z

Pretty please, either implement this PR or find another way to do it. It's so massively annoying to see messages with ??????????? every day, if you happen to be in a channel where the alphabet consists of more than A to Z.

lorkki · 2019-01-31T20:59:17Z

Rebased the branch to bring it back up to date with the current master. Tested that it still works as expected with The Lounge 3.0.0.

Is there something else that I could do to improve the odds of having this accepted?

metsjeesus · 2019-02-04T15:14:01Z

I got problems on decoding in thelounge client, when user name contains üõöä characters, if it is allowed from irc server. Other then that, this method works for me.

xPaw · 2019-02-04T15:31:57Z

I can see how using utf-8-validate is unwanted, as it compiles a native module (so it won't work in a browser for example).

What about overriding the � to our own marker (null byte or something) in the iconv library, and if the converted string contains it, fallback to another encoding?

https://github.com/ashtuchkin/iconv-lite/blob/efbbb0937ca8dda1c14e0b69958b9d6f20771f7a/lib/index.js#L15

See also ashtuchkin/iconv-lite#53

lorkki · 2019-02-04T16:00:27Z

utf-8-validate has a fallback to a JS implementation if native isn't supported. I also tested it using Kiwi IRC back when I originally made this PR.

lorkki · 2019-02-04T20:03:59Z

Also, good point about the nicknames, I'm not sure how to best resolve that. Unicode nicknames are not in IRCv3, but there has apparently been talk of allowing them. Some servers do, but I have no idea if any notable IRC networks are included.

I suppose I could try to revise this so that the line is initially decoded and parsed as UTF-8, and maybe go for something like what xPaw suggests to keep the per-line overhead reasonable. Would this sound more like something that the maintainers would be willing to integrate?

metsjeesus · 2019-02-04T23:19:01Z

If you recieve data unicode format, you lose information, because � does not contain the bits to fix it. I managed some time ago to keep inbound and outbound connections on different encodings, what allowed to fix the problems and what i think should be a correct way to do it.

lorkki · 2019-02-05T08:52:50Z

@metsjeesus: Receiving byte strings from the network and interpreting them as characters are two completely different things. What is done here is essentially the same thing as decoding the raw byte string twice using different encodings and picking the correct result. There is no loss, because both operations use the same source data.

When receiving text from a TCP socket, there's a rare chance with multibyte encodings that characters may end up split between two receive buffers. This can cause spurious decode errors if the buffers are decoded before splitting them into lines. This commit changes TCP / TLS transport behaviour so that incoming data is kept in byte buffers until complete lines can be formed.

Adds option to specify a fallback encoding, which defaults to 'cp1252'. As long as the default encoding is UTF-8 and a fallback encoding is set, any lines that are invalid UTF-8 will be decoded using the fallback encoding. Otherwise decoding behaviour remains unchanged.

lorkki · 2019-02-10T16:41:45Z

@xPaw: Unfortunately, the default character can't be overridden for any of the encodings that are covered by Node.js - including UTF-8 - so that approach is not possible with iconv-lite.

Because there was no feedback from the maintainers, I ended up revising this branch so that the (rare) problem involving split multibyte characters should be fixed now, but otherwise the solution is the same. Retaining full Unicode nickname support would involve moving re-decoding from the transport object into the parsing phase, which would make the patches much more invasive.

lorkki · 2019-02-10T18:18:48Z

The more extensive changes to fix Unicode nickname support are available here: https://github.com/lorkki/irc-framework/tree/fix-unicode-nicknames

xPaw · 2019-03-06T17:44:05Z

We've been discussing UTF-8 and fallbacks quite a bit in #ircv3 past couple of days, and here's some observations.

The way you currently implemented fallback means if the entire line can't be decoded as UTF8 it will decode it using fallback encoding for the entire line. This approach will bring in more problems (e.g. message-tags will more than likely require to be always UTF8), and in other cases user message can become completely garbled.

HexChat for example removed fallbacks from UTF8 specifically because it was causing such problems with invalid utf8 messages.

IRCCloud as another example decodes per byte, and fallbacks to latin1 only for characters that fail to decode to UTF8, not for the entire line.

lorkki · 2019-03-07T10:43:11Z

Yep, I'm aware of that, and I've also spent some time thinking about it after I last revisited this code. The other branch I linked to makes the fallback decoding work only on the message parameters - which in hindsight is probably still too much, and should be limited to just the "trailing" part. It should be considered a rough draft, and I've only tried it in very simple cases so far.

I don't think that per-byte fallback for the entire message is really desireable either, because that will result in the behaviour of non-ASCII channel names and nicknames changing depending on whether the fallback is on and what's chosen as the fallback codec. (Potentially harmfully as well: consider a name that's interpreted as latin1 coming in and then re-encoded as UTF-8 when responding. I've personally seen things like a latin1 and UTF-8 version of the same channel name existing side by side.)

At this point, attempting to really fix those cases also seems hardly worth the added complexity and UX issues anymore. IRCv3 seems to me to be cautiously edging towards standardising on UTF-8 anyway.

Good catch re: the off-by-one error! Seems my local copy doesn't have it, so I must have messed something up in interactive rebase or just forgotten to push my latest changes.

xPaw · 2019-03-07T10:45:38Z

My opinion is that trying to guess and fallback to other encodings is just not worth the effort.

I made a separate PR for a change I noticed you made in regards to buffer: #198

Agree to close this?

xPaw · 2019-10-29T10:27:46Z

@prawnsalad I propose we close this PR and change Multiple (+ auto detected) encoding support in readme to just say irc-fw defaults to UTF-8 with ability to change it, but there are no fallbacks.

lorkki force-pushed the decode-with-fallback branch from 38dbfd4 to bdb08e6 Compare April 8, 2018 08:06

lorkki force-pushed the decode-with-fallback branch from bdb08e6 to 7bb64fd Compare January 31, 2019 20:23

lorkki added 2 commits February 10, 2019 17:18

lorkki force-pushed the decode-with-fallback branch from 7bb64fd to aa4a104 Compare February 10, 2019 16:36

xPaw mentioned this pull request Mar 6, 2019

Use node buffers and decode full lines #198

Merged

xPaw closed this Jan 19, 2020

xPaw mentioned this pull request Feb 7, 2020

Detect wrong encoding thelounge/thelounge#3746

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add fallback encoding for non-UTF-8 lines #142

Add fallback encoding for non-UTF-8 lines #142

lorkki commented Apr 1, 2018

lorkki commented Apr 3, 2018

prawnsalad commented Apr 3, 2018

lorkki commented Apr 3, 2018

lorkki commented Apr 3, 2018

metsjeesus commented Apr 18, 2018 •

edited

Loading

lorkki commented Sep 26, 2018

fnutt commented Oct 6, 2018

lorkki commented Jan 31, 2019

metsjeesus commented Feb 4, 2019

xPaw commented Feb 4, 2019 •

edited

Loading

lorkki commented Feb 4, 2019

lorkki commented Feb 4, 2019

metsjeesus commented Feb 4, 2019

lorkki commented Feb 5, 2019

lorkki commented Feb 10, 2019

lorkki commented Feb 10, 2019

xPaw commented Mar 6, 2019

lorkki commented Mar 7, 2019

xPaw commented Mar 7, 2019

xPaw commented Oct 29, 2019

Add fallback encoding for non-UTF-8 lines #142

Add fallback encoding for non-UTF-8 lines #142

Conversation

lorkki commented Apr 1, 2018

lorkki commented Apr 3, 2018

prawnsalad commented Apr 3, 2018

lorkki commented Apr 3, 2018

lorkki commented Apr 3, 2018

metsjeesus commented Apr 18, 2018 • edited Loading

lorkki commented Sep 26, 2018

fnutt commented Oct 6, 2018

lorkki commented Jan 31, 2019

metsjeesus commented Feb 4, 2019

xPaw commented Feb 4, 2019 • edited Loading

lorkki commented Feb 4, 2019

lorkki commented Feb 4, 2019

metsjeesus commented Feb 4, 2019

lorkki commented Feb 5, 2019

lorkki commented Feb 10, 2019

lorkki commented Feb 10, 2019

xPaw commented Mar 6, 2019

lorkki commented Mar 7, 2019

xPaw commented Mar 7, 2019

xPaw commented Oct 29, 2019

metsjeesus commented Apr 18, 2018 •

edited

Loading

xPaw commented Feb 4, 2019 •

edited

Loading