Parser does not use XML encoding #59
Comments
It's true that the captured nodes are passed through as-is. Is it your opinion that the lib should read any What makes UTF-8 the "correct" output format? I don't see it making a lot of sense, sorry. If you read an XML file with a specific encoding, it's up to you to decode it correctly. It's just a PHP string.
@prewk Well, when using other XML libs, the developer imho always works with UTF-8 strings. I suggest that your lib should convert strings to UTF-8 when the source XML is not in UTF-8. If you want to leave it as-is, then I suggest adding the ability to detect the XML encoding (maybe add Thanks.
In my opinion PHP's UTF-8 support is too much of a messy afterthought to make those assumptions. It would also break backwards compatibility for people using xml-string-streamer. However, it's an interesting idea to be able to extract the encoding from the beginning of an XML stream. One issue would be that the nature of streaming makes it awkward to actually get that information before starting the actual streaming. I think the correct way to do it is to use I'll look into it.
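For readers following along, here is a minimal sketch of what "extracting the encoding from the beginning of an XML stream" could look like. `detectXmlEncoding()` is a hypothetical helper, not part of xml-string-streamer, and it assumes the document starts in an ASCII-compatible encoding (so UTF-16/UTF-32 input is out of scope) and that the declaration fits in the bytes you pass in.

```php
<?php
/**
 * Hypothetical helper (not part of xml-string-streamer): guess the encoding
 * from the first bytes of an XML document.
 */
function detectXmlEncoding(string $head, string $default = 'UTF-8'): string
{
    // A UTF-8 byte order mark takes precedence over the declaration.
    if (strncmp($head, "\xEF\xBB\xBF", 3) === 0) {
        return 'UTF-8';
    }
    // Look for encoding="..." or encoding='...' in the XML declaration.
    $pattern = '/^<\?xml[^>]*?\bencoding\s*=\s*(["\'])([A-Za-z][A-Za-z0-9._-]*)\1/';
    if (preg_match($pattern, $head, $m) === 1) {
        return strtoupper($m[2]);
    }
    // No declaration found: fall back to the caller's default.
    return $default;
}
```

One way to use it would be to read the first few hundred bytes of the file yourself, for example with `file_get_contents($path, false, null, 0, 256)`, before handing the file to the streamer. That sidesteps the awkwardness of asking for the encoding mid-stream, at the cost of opening the file twice.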
Better late than never :) I am using this library and resolved this some time ago already. You have to detect the XML (file) encoding before* you pass the file to XmlStringStreamer, and you have to handle encoding conversions in your custom IMHO:
I do have this resolved, but with many more small dependencies (like EncodingConverter, because the mb_string_* family is not enough and I need iconv and other ways to convert the input encoding to, say, UTF-8). If anyone is interested, I am keen to spend some time and prepare a PR here with some generalization of my solution. (* In my experience, based on how trustworthy your file source is, you should not trust even also worth noting that Git does not support multibyte encoding in text files, so handling tests is fun as well :D
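As a rough illustration of the conversion step described above (not the commenter's actual solution, which is not shown in this thread), each string returned by `getNode()` could be converted once the source encoding is known. `nodeToUtf8()` is a hypothetical name, and the sketch assumes the `iconv` extension is available and that the source encoding is a single-byte, ASCII-compatible one such as ISO-8859-1.

```php
<?php
/**
 * Hypothetical helper: convert one captured node string to UTF-8.
 * Assumes $sourceEncoding was detected beforehand by the caller.
 */
function nodeToUtf8(string $node, string $sourceEncoding): string
{
    // Nothing to do when the source is already UTF-8.
    if (strcasecmp($sourceEncoding, 'UTF-8') === 0) {
        return $node;
    }
    // iconv() returns false on failure (e.g. unknown encoding or invalid bytes).
    $converted = @iconv($sourceEncoding, 'UTF-8', $node);
    if ($converted === false) {
        throw new RuntimeException("Cannot convert node from {$sourceEncoding} to UTF-8");
    }
    return $converted;
}
```

Converting whole nodes after capture avoids splitting a character across chunk boundaries, but it does not help with encodings in which the markup itself is not ASCII-compatible, since the node would then not be found in the first place.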
Sorry for replying late and thanks a lot for your insight :) Yeah, chunk loading like this library does (and PHP in general) is pretty bad at multibyte encoding, unfortunately. It's great that you've found a solution for your problem. Considering I don't code in PHP anymore, I'm not in the problem space at all and it's hard for me to properly review any radical changes to the library, but if you have an idea how to make it better without breaking backwards compatibility then I'm all ears :)
I discovered that Parsers does not honor the XML encoding. Imo when the XML file is not in UTF-8, then getNode() should convert the string to UTF-8.