Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parser does not use XML encoding #59

Open
vaclavvanik opened this issue Apr 4, 2018 · 5 comments
Open

Parser does not use XML encoding #59

vaclavvanik opened this issue Apr 4, 2018 · 5 comments

Comments

@vaclavvanik
Copy link

I discovered that Parsers does not honor xml encoding. Imo when XML file is not in UTF-8, than getNode() should convert string to UTF-8.

@prewk
Copy link
Owner

prewk commented Apr 4, 2018

It's true that the nodes captured are passed as-is.

Is your opinion that the lib should read any <?xml encoding="X"?> in the beginning and then attempt to convert from X to UTF-8 on every getNode() with iconv?

What makes UTF-8 the "correct" output format?

I don't see it making a lot of sense, sorry. If you read an XML file with a specific encoding, it's up to you to decode it correctly. It's just a PHP string.

@vaclavvanik
Copy link
Author

@prewk Well when using other xml libs developer imho always works with UTF-8 strings. I suggest that your lib should convert strings to UTF-8 when source xml is not in UTF-8.

When you want to leave it as-is, then I suggest ability to detect XML encoding (maybe add getEncoding() method in XmlStringStreamer).

Thanks.

@prewk
Copy link
Owner

prewk commented Apr 5, 2018

In my opinion PHP's UTF-8 support is too much of a messy afterthought to make those assumptions. It would also break backwards compatibility for people using xml-string-streamer.

However, it's an interesting idea to be able to extract the encoding from the beginning of an XML stream. One issue would be that the nature of streaming makes it awkward to actually get that information before starting the actual streaming.

I think the correct way to do it is to use getExtractedContainer() (see https://github.com/prewk/xml-string-streamer#accessing-the-root-element-version-070) inside of the while loop and regexp the possible encoding info from there.

I'll look into it.

@jDolba
Copy link
Contributor

jDolba commented Apr 14, 2021

better late than never :) I am using this library and I resolved this some time ago already
it is not that simple - as it is always with encoding

you have to resolve/detect XML (file) encoding before * you will pass file to this XmlStringStreamer and you have to handle encoding conversions in your custom \Prewk\XmlStringStreamer\StreamInterface implementation
(or you can do it with callback in \Prewk\XmlStringStreamer\Stream\File but I believe it is not possible to easily handle multibyte streams like utf16 utf32 etc)

IMHO:

  • it is NOT responsibility of this library to detect encoding
  • it is true that default \Prewk\XmlStringStreamer\StreamInterface implementations are enough to resolve basic encodings using chunk callback, but to handle multibyte streams (utf16, 32...) properly you need another implementation

I do have this resolved but with many more small dependencies (like EncodingConverter because mb_string_* family is not enough and I do need iconv and other ways how to convert input encoding to say UTF8)

if anyone will be interested I am keen to spend some time and prepare PR here with some generalization of my solution

(* in my experience, base on how trust worthy is you file source, you should not trust even xml encoding attribute, multibytes may provide for example "UTF-16", but in reality BOM character will be "UTF-16BE" or "UTF-16LE" and your entire encoding is broken)

also worth to note that GIT does not support multibyte encoding in text files so handling tests is also fun as well :D

@prewk
Copy link
Owner

prewk commented Apr 23, 2021

Sorry for replying late and thanks a lot for your insight :)

Yeah chunk loading like this library does (and PHP in general) is pretty bad at multibyte encoding unfortunately. It's great that you've found a solution for your problem, I don't even code in PHP anymore so I'm not in the problem-space at all.

..considering I don't code in PHP anymore, it's hard for me to properly review any radical changes to the library, but if you have an idea how to make it better without breaking backwards-compatibility then I'm all ears :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants