Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add fallback encoding for non-UTF-8 lines #142

Closed
wants to merge 2 commits into from

Conversation

lorkki
Copy link

@lorkki lorkki commented Apr 1, 2018

Also adds configuration option and updates documentation. Defaults to cp1252.

Closes #7

@lorkki
Copy link
Author

lorkki commented Apr 3, 2018

For the record, this likely has the same caveat as irssi's original "recode" implementation: very rarely, multi-byte characters can end up split between buffer boundaries, which will cause a broken glyph or an incorrect fallback to the legacy encoding. The occasional broken glyphs will also happen when using only the utf8 decoder without fallback, but the fallback can make the broken behaviour more noticeable.

Fixing this would require more extensive changes to how the buffers are handled, so I didn't touch it at this time.

@prawnsalad
Copy link
Member

I'm not following why a fallback encoding is in use here. If you know the encoding to be using, then just set the encoding option instead of some fallback?

And as for closing #7, I don't see how this auto detects the encoding.

@lorkki
Copy link
Author

lorkki commented Apr 3, 2018

The idea is not to auto-detect an arbitrary encoding (which would be infeasible, and did not work in practice with jschardet), but rather to use a fallback to a known single-byte encoding for all lines that aren't valid UTF-8. This is the solution that's used by irssi and mIRC.

@lorkki
Copy link
Author

lorkki commented Apr 3, 2018

For background, the actual problem this solves is the transition effects of switching from legacy charsets to Unicode. When you're a non-English-speaker hanging out with other native speakers of the same language, you'll almost certainly know which alternative encoding to expect.

@metsjeesus
Copy link

metsjeesus commented Apr 18, 2018

I tried to use on my setup, what uses utf-8 and iso-8859-1 in lounge. By default, some but not all ü are showns as ü as expected. Changed backup to latin1, no change. Suprisingly, problem was znc on middle, what buffers and converts it to mojibikes. after buffer messages, it works as expected.

futher testing revealed, that problem might be in sqlite buffering.
Nope, still some characters are shown wrong. Its a tricky thing.

@lorkki
Copy link
Author

lorkki commented Sep 26, 2018

Is this going to go anywhere? I get priorities, but six months is pretty glacial.

@fnutt
Copy link

fnutt commented Oct 6, 2018

Pretty please, either implement this PR or find another way to do it. It's so massively annoying to see messages with ??????????? every day, if you happen to be in a channel where the alphabet consists of more than A to Z.

@lorkki
Copy link
Author

lorkki commented Jan 31, 2019

Rebased the branch to bring it back up to date with the current master. Tested that it still works as expected with The Lounge 3.0.0.

Is there something else that I could do to improve the odds of having this accepted?

@metsjeesus
Copy link

I got problems on decoding in thelounge client, when user name contains üõöä characters, if it is allowed from irc server. Other then that, this method works for me.

@xPaw
Copy link
Collaborator

xPaw commented Feb 4, 2019

I can see how using utf-8-validate is unwanted, as it compiles a native module (so it won't work in a browser for example).

What about overriding the � to our own marker (null byte or something) in the iconv library, and if the converted string contains it, fallback to another encoding?

https://github.com/ashtuchkin/iconv-lite/blob/efbbb0937ca8dda1c14e0b69958b9d6f20771f7a/lib/index.js#L15

See also ashtuchkin/iconv-lite#53

@lorkki
Copy link
Author

lorkki commented Feb 4, 2019

utf-8-validate has a fallback to a JS implementation if native isn't supported. I also tested it using Kiwi IRC back when I originally made this PR.

@lorkki
Copy link
Author

lorkki commented Feb 4, 2019

Also, good point about the nicknames, I'm not sure how to best resolve that. Unicode nicknames are not in IRCv3, but there has apparently been talk of allowing them. Some servers do, but I have no idea if any notable IRC networks are included.

I suppose I could try to revise this so that the line is initially decoded and parsed as UTF-8, and maybe go for something like what xPaw suggests to keep the per-line overhead reasonable. Would this sound more like something that the maintainers would be willing to integrate?

@metsjeesus
Copy link

If you recieve data unicode format, you lose information, because � does not contain the bits to fix it. I managed some time ago to keep inbound and outbound connections on different encodings, what allowed to fix the problems and what i think should be a correct way to do it.

@lorkki
Copy link
Author

lorkki commented Feb 5, 2019

@metsjeesus: Receiving byte strings from the network and interpreting them as characters are two completely different things. What is done here is essentially the same thing as decoding the raw byte string twice using different encodings and picking the correct result. There is no loss, because both operations use the same source data.

When receiving text from a TCP socket, there's a rare chance with
multibyte encodings that characters may end up split between two receive
buffers. This can cause spurious decode errors if the buffers are
decoded before splitting them into lines.

This commit changes TCP / TLS transport behaviour so that incoming data
is kept in byte buffers until complete lines can be formed.
Adds option to specify a fallback encoding, which defaults to 'cp1252'.

As long as the default encoding is UTF-8 and a fallback encoding is
set, any lines that are invalid UTF-8 will be decoded using the
fallback encoding. Otherwise decoding behaviour remains unchanged.
@lorkki
Copy link
Author

lorkki commented Feb 10, 2019

@xPaw: Unfortunately, the default character can't be overridden for any of the encodings that are covered by Node.js - including UTF-8 - so that approach is not possible with iconv-lite.

Because there was no feedback from the maintainers, I ended up revising this branch so that the (rare) problem involving split multibyte characters should be fixed now, but otherwise the solution is the same. Retaining full Unicode nickname support would involve moving re-decoding from the transport object into the parsing phase, which would make the patches much more invasive.

@lorkki
Copy link
Author

lorkki commented Feb 10, 2019

The more extensive changes to fix Unicode nickname support are available here: https://github.com/lorkki/irc-framework/tree/fix-unicode-nicknames

@xPaw
Copy link
Collaborator

xPaw commented Mar 6, 2019

We've been discussing UTF-8 and fallbacks quite a bit in #ircv3 past couple of days, and here's some observations.

The way you currently implemented fallback means if the entire line can't be decoded as UTF8 it will decode it using fallback encoding for the entire line. This approach will bring in more problems (e.g. message-tags will more than likely require to be always UTF8), and in other cases user message can become completely garbled.

HexChat for example removed fallbacks from UTF8 specifically because it was causing such problems with invalid utf8 messages.

IRCCloud as another example decodes per byte, and fallbacks to latin1 only for characters that fail to decode to UTF8, not for the entire line.

@lorkki
Copy link
Author

lorkki commented Mar 7, 2019

Yep, I'm aware of that, and I've also spent some time thinking about it after I last revisited this code. The other branch I linked to makes the fallback decoding work only on the message parameters - which in hindsight is probably still too much, and should be limited to just the "trailing" part. It should be considered a rough draft, and I've only tried it in very simple cases so far.

I don't think that per-byte fallback for the entire message is really desireable either, because that will result in the behaviour of non-ASCII channel names and nicknames changing depending on whether the fallback is on and what's chosen as the fallback codec. (Potentially harmfully as well: consider a name that's interpreted as latin1 coming in and then re-encoded as UTF-8 when responding. I've personally seen things like a latin1 and UTF-8 version of the same channel name existing side by side.)

At this point, attempting to really fix those cases also seems hardly worth the added complexity and UX issues anymore. IRCv3 seems to me to be cautiously edging towards standardising on UTF-8 anyway.

Good catch re: the off-by-one error! Seems my local copy doesn't have it, so I must have messed something up in interactive rebase or just forgotten to push my latest changes.

@xPaw
Copy link
Collaborator

xPaw commented Mar 7, 2019

My opinion is that trying to guess and fallback to other encodings is just not worth the effort.

I made a separate PR for a change I noticed you made in regards to buffer: #198

Agree to close this?

@xPaw
Copy link
Collaborator

xPaw commented Oct 29, 2019

@prawnsalad I propose we close this PR and change Multiple (+ auto detected) encoding support in readme to just say irc-fw defaults to UTF-8 with ability to change it, but there are no fallbacks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Encoding autodetection
5 participants