Tag with shortclosing and with greater than in attribute parsing problem #60

Open
stefanpausch opened this issue Jul 6, 2018 · 4 comments


stefanpausch commented Jul 6, 2018

Problem: If an unescaped greater-than sign (>), i.e. one not written as an HTML entity, appears inside an attribute of a short closing tag, the streamer cannot find the next matching tag and reads until the end of the file. This causes trouble when parsing the returned node with SimpleXML.

Running xml-string-streamer 0.11.0

Example code:

<?xml version="1.0" encoding="UTF-8"?>
<root>
    <child addressLine3="teststring>"/>
    <child addressLine3="teststring"/>
    <child addressLine3="teststring"/>
</root>

With the example code, the streamer finds the first child and returns ALL of the XML after it in one node (it does not find the 2nd and 3rd child).

If I configure the StringWalker with expectGT and the short closing tag, the parser does not find any nodes at all:

'expectGT' => true,
'tagsWithAllowedGT' => array(
    array("<", "/>"),
),

From what I have read, an unescaped greater-than sign is usually treated as an invalid character, but within attribute values it is allowed. DOMReader and SimpleXML parse the file without any problems:

SimpleXMLElement Object
(
    [child] => Array
        (
            [0] => SimpleXMLElement Object
                (
                    [@attributes] => Array
                        (
                            [addressLine3] => teststring>
                        )
                )
            [1] => SimpleXMLElement Object
                (
                    [@attributes] => Array
                        (
                            [addressLine3] => teststring
                        )
                )
            [2] => SimpleXMLElement Object
                (
                    [@attributes] => Array
                        (
                            [addressLine3] => teststring
                        )
                )
        )
)
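
For anyone who wants to reproduce the SimpleXML side of this report, here is a minimal sketch using only PHP's bundled SimpleXML extension. It loads the sample document from the report and collects the three attribute values. The variable names are illustrative and not taken from the original code.

```php
<?php
// The sample document from the report; the first attribute value
// contains a raw ">" character.
$xml = <<<'XML'
<?xml version="1.0" encoding="UTF-8"?>
<root>
    <child addressLine3="teststring>"/>
    <child addressLine3="teststring"/>
    <child addressLine3="teststring"/>
</root>
XML;

$root = simplexml_load_string($xml);

// Collect the addressLine3 attribute of every <child> element.
$values = [];
foreach ($root->child as $child) {
    $values[] = (string) $child['addressLine3'];
}

print_r($values);
```

All three children are found, and the first value keeps its trailing ">", which matches the dump above.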

Owner

prewk commented Jul 6, 2018

Hi @stefanpausch!

My understanding was that both < and > were required to be escaped, but it seems you are correct about specifically > being exempt.

Could you perhaps make use of the UniqueNode parser instead? It's more resilient to "weird XML" in some regards (but only works if the node name is unique).

@stefanpausch
Author

Hello @prewk,

Thanks a lot for the very fast reply! UniqueNode is a solution for my specific XML file. (I tested it and it works. Thanks a lot!)

I have another XML file in which child nodes have the same names as their parents. In that case UniqueNode wouldn't be usable, although I haven't run into the same problem with those files yet.

Will you provide a fix for this problem in the default XMLStringStreamer in a future release?

By the way, the problem may only appear with short closing tags and not with normally closed ones.

Owner

prewk commented Jul 6, 2018

Yeah it's definitely a bug, so I will keep this issue open. I can't make any promises on when a fix can be made, though. Maybe you can preprocess your XML file as a quickfix with some clever sed/awk regexp or something for now.
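
The preprocessing idea could also be done in PHP instead of sed/awk. The following is only a sketch under stated assumptions, not a fix in the library: `escapeGtInAttributes` is a hypothetical helper name, and it assumes that no comments or CDATA sections contain tag-like text and that the chunk passed in contains whole tags. It rewrites a raw ">" to `&gt;` inside quoted attribute values and leaves everything else alone.

```php
<?php
/**
 * Hypothetical helper: escape raw ">" inside quoted attribute values.
 * Assumes comments and CDATA sections contain no tag-like text.
 */
function escapeGtInAttributes(string $xml): string
{
    // A start tag or short closing tag: "<", a name start character,
    // then any mix of quoted strings and characters other than ">" or quotes.
    $tagPattern = '/<[A-Za-z_](?:[^>"\']|"[^"]*"|\'[^\']*\')*>/';

    $result = preg_replace_callback($tagPattern, function (array $tag): string {
        // Within one tag, touch only the quoted attribute values.
        return preg_replace_callback('/"[^"]*"|\'[^\']*\'/', function (array $quoted): string {
            return str_replace('>', '&gt;', $quoted[0]);
        }, $tag[0]);
    }, $xml);

    // Fall back to the untouched input if the regex engine reports an error.
    return $result ?? $xml;
}

echo escapeGtInAttributes('<child addressLine3="teststring>"/>'), "\n";
// prints: <child addressLine3="teststring&gt;"/>
```

Since this works on a string in memory, a large file would have to be fed through it piece by piece (for example line by line, if no tag spans several lines) and written to a temporary file before streaming.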

@stefanpausch
Author

Take your time! My problem is currently solved with UniqueNode. I will adapt my code to the different XML types I am parsing and switch back to the default XMLStringStreamer when your solution is ready. No need to rush things :)
