Automated encoding detection is still not working properly #312
I've assembled some test files in #314. Luckily, it was quite fast to find enough files to draw some conclusions.

So long story short: as soon as we do not have information about the charset in the HTTP headers, detection results are unreliable.

My proposition is to simplify the current logic even further and stop guessing: it is just too easy to get fooled and decode only bullshit (which is then re-encoded to UTF-8 and becomes valid bullshit), or to fail to decode with a wrongly guessed charset. I also consider we should give more confidence to the encoding found in the document's first 1024 bytes (even if it is slower to process) than to the one found in the HTTP headers. The logic would be simplified as such: trust the charset declared in the first 1024 bytes of the document if present, otherwise the one declared in the HTTP headers, and do not guess at all (see the sketch below).

This means we could have higher chances of putting bad content in the ZIM when the source website is badly configured, since we trust the charset declared in the document header or in the HTTP headers; but then it is clearly not our fault, and guessing proved to be doing more harm than good. I also considered an alternative where we would not decode at all (or decode only fragments of the file), since the structure of the document is normally ASCII only and special characters are found in strings only. While this could work and help from a technical perspective, it means we would break the ZIM specification, which says that all content must be stored encoded in UTF-8 in the ZIM (so we must decode all strings, period). And the technical implementation is not going to be simple / straightforward.
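For illustration, here is a minimal sketch of what this simplified logic could look like. The names `declared_charset` and `choose_encoding` are hypothetical and the regex is a rough approximation; this is not the project's actual code:

```python
import re


def declared_charset(content: bytes) -> str | None:
    """Hypothetical helper: look for a charset declaration (e.g. <meta charset=...>
    or @charset "...") in the first 1024 bytes of the document."""
    match = re.search(rb"charset[\s=]*['\"]?([A-Za-z0-9_-]+)", content[:1024], re.IGNORECASE)
    return match.group(1).decode("ascii") if match else None


def choose_encoding(content: bytes, http_charset: str | None) -> str | None:
    """Proposed order: trust the in-document declaration first, then the HTTP
    header charset; never guess. None means we have no reliable information."""
    return declared_charset(content) or http_charset
```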
Before #302, all JS / JSON (and CSS) documents were supposed to be encoded in UTF-8.
This was unreliable because some sites do not use UTF-8 to encode these documents.
The PR hence modified the code to use the "automated detection" already in place for HTML documents.
With this automated detection algorithm, we try, in order:

- the charset received in the HTTP headers
- the charset declared in the first bytes of the document
- the most probable encoding according to `chardet`

Unfortunately, it looks like this automatic detection is not that reliable. This is especially visible for JS because, when no encoding is received in the HTTP headers, we usually do not have an encoding specified in the first bytes of the document either, so we are back to simply relying on `chardet`'s most probable encoding being correct.

While it gave good results on the files we tested, it seems that `chardet` also performs very poorly in other situations. E.g. it fails to properly decode https://www.cloudflare.com/vendor/onetrust/scripttemplates/202308.2.0/otBannerSdk.js, which is just UTF-8 but is detected as Windows-1252 by `chardet` after a very long heuristic (about 3 seconds on my Linux server); see the repro snippet below.
Given all these problems, it is now clear that we first need to assemble a test set of files that are known, based on our experience, to be difficult to decode, and to gather strings in those files for which we know how they should be decoded.

Then, based on this test set, we will be able to decide whether an automated approach still seems feasible (and which one), or whether it is just impossible and the most reasonable compromise is to allow specifying at the CLI which encoding to use when it is unknown (with potentially multiple encodings needed per conversion, so potentially needing pattern-matching rules, as sketched below ...).
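If we go the manual route, the pattern-matching rules could be as simple as a list of "URL glob, encoding" pairs. The rule format, the URLs and the encodings below are purely hypothetical, just to show the shape of the idea:

```python
import fnmatch

# Hypothetical forced-encoding rules, e.g. provided at the CLI; URLs and encodings are made up.
RULES = [
    ("https://example.com/legacy/*.js", "windows-1252"),
    ("https://example.com/api/*.json", "utf-8"),
]


def forced_encoding(url: str) -> str | None:
    """Return the first matching forced encoding, or None to use the normal logic."""
    for pattern, encoding in RULES:
        if fnmatch.fnmatch(url, pattern):
            return encoding
    return None
```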
Nota: my hopes for an automated solution are decreasing; while researching the web a bit, I discovered that even "big" libraries like httpx are struggling on the matter. It looks like they started with `chardet`, then switched to a fully manual heuristic (encode/httpx#1269), and are now using `charset-normalizer` (encode/httpx#1791). And, "spoiler", `charset-normalizer` does not properly decode the content at https://www.marxists.org/espanol/menu.js (which is one of our test cases).

We also need to keep in mind that bad characters exist "for real" in some documents on the web (see #221, where we have a document from https://www.solidarite-numerique.fr/tutoriels/comprendre-les-cookies/ which is mostly only UTF-8 chars, accents work as expected, ..., but contains a bad character which is impossible to decode in UTF-8), which makes this decoding task even harder. The snippet below illustrates that problem.