Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add HTML parsing features #11
base: main
Are you sure you want to change the base?
Add HTML parsing features #11
Changes from 14 commits
67cd354
8c15b01
6a8cb4d
b8af7e5
b8bedbf
bb84284
eb121c0
92d9136
4538750
cd6e190
60c987f
f7616c0
e818b5c
b1f83c7
29f7d0a
eeb5f0a
fcdd93d
2d77cdc
e0f2540
bc61a6a
b3a57be
62fdc48
3dd8fb5
b2bc8bb
2e3278d
55badec
c446d6f
1b5b3e0
2ee59bf
891c25f
2b77947
d5b6869
894b79f
590ac97
4468c8e
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we need to support full HTML docs and HTML fragments, then this method should:
$html
is a full DOM page, thenThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that in PHP partial HTML is more common than a full document. Except you are implementing it as some kind of middleware to parse the whole HTML response.
But in general it should support both if possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed this by using the more general HTML parser, then adding a step where we check if the input HTML is a Page/Doc and selecting the
body
from that. As I was already replacing based on a HTML fragment (the body), supporting fragments as input was rather simple.Large diffs are not rendered by default.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we add the most simple fragments here first? And add the more complex/combined ones to the end?
So some of these single-element examples.
I can also imagine that example which shouldn'T be replaced:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added those more simple fragments to the top and even moved the fragment dataset to be the "primary" source.
This feedback clued me off to a flaw in my "early return" logic. So now I'm making sure I check for existence of "Text" nodes rather than children in general. Since your first example has no actual children but does have Text nodes I can work on that was being skipped with my original logic.
Only issue this highlighted is that currently the code does affect the script tag contents. I'll circle back later today/tomorrow to look into that aspect of things. Seems like I'll need to get creative about creating some sort of exclude list to skip Text nodes inside elements like
script
.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same for
<style>
I think. But like said, this will be experimental - so It's also fine to add a->wip()
todo()
or whatever Pest test for such things. To highlight what has to be fixed/fulfilled to get it out of experimental mode.How about passing Text only without any nodes?
Or without surrounding elements but one somewhere in the middle?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I haven't gotten back around to creating an exclude list concept, however I'm going to start taking that on next. Just wanted to report back on my findings related to passing those "plain text" examples. Unfortunately it seems that passing text not wrapped in a node is interpreted in a variety of ambiguous ways.
For instance, the first example technically should have those braces escaped as
<
and>
- otherwise browsers (and PHP DOM) will automatically parse the P tag and add a close. So I've opted to test with the escaped codes and this line is interpreted as XML. So internally it's wrapped with a_root
and as such the XPath to find the inner text nodes fails. Ultimately the result is the text gets returned untouched and Emoji's stay emojis. The second example is basically the same too - just without needing to escape the HTML tag since thecode
tag has a close.Long story short, it got complicated really fast lol. So I think I need to take a step back and focus on the HTML parsing situation and get a handle on how various inputs are being parsed. The DOM library I picked for us uses Symfony/dom-crawler as it's core which in turn uses Mastermind\HTML5 library to compensate for PHP's core dom having some failings.
All that said, my next step will be to create my own test repo to assess the overall situation for PHP's HTML parsing abilities. Since I think we will want a more clear picture about what behaviour I'm seeing from each mechanic. For instance, how is PHP core
ext-dom
treating things, vs howMastermind\HTML5
parser, etc.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More info here: https://github.com/mallardduck/php-html-parsing-report/actions/runs/3191495563/jobs/5207874729
https://github.com/mallardduck/php-html-parsing-report/tree/main/tests/__snapshots__
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So in general I can say that the symfony Dom crawler is an outstanding package and one of if not the best solution for HTML in PHP.
Like said before: don't escalate this too far. With tests that are meant to fail we at least have a good list of things needed for a final release. And even if you fix some of those, take the chance for more PRs. 😉🎃
Regarding the plain text thing: how about a detector with some cases/match arms?
<div>
, do the job and afterwards take$div->innerHTML
Again, these are just ideas - I'm also personally more happy to merge several smaller and focused PRs than one beast of a PR. As it's also easier to review these smaller ones.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for that perspective! Always appreciate the sage advice and I think you nailed it. I'll start to refocus this PR to cover less of the edge cases and focus on the most common uses. Then I can adjust the tests a bit to give us a guiding star for future enhancements.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And one last thing: personally I would prefer datasets only for the same thing. Meaning really the same test case.
it does not replace emojis in HTML attributes
it replaces emojis in text nodes
So I'm more happy with a lot of copy'n'paste test cases with better/easier descriptions than one test case running through a massive dataset and no one really knows what's tested with that data-point and what is still missing or not covered. So every expectation to the lib should be one test case - and we can bullet proof that expectation by adding multiple variants of it. Like
<a>
and<img/>
for the attributes. And later someone adds<button>
to it.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've refactored the tests now and followed your advice of individual tests. Great idea as this allows me to more obviously target edge cases. So I've taken advantage of that by adding some explicit expectations to tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good news!
I've caught the root of the issue I've been chasing that I thought was something I did wrong. Specifically the way HTML attributes were being cast/converted to their HTML entities equivalent instead of staying unicode.
It turns out that this is caused by DOMDocument itself. I thought it was being caused by either: a) the xpath I found to use, or b) the HTML5 transforming library I opted to choose. In the end, reading the comments on PHP's docs for DOMDocument was my salvation! Specifically the 15 year old comment from
xuanbn
that said:With that new info I added an additional test that shows how the HTML being input determines how DOMDocument treats things up on calling
saveHtml()
. So I think this gets us to a lot better place where I've created a working solution, but we show that it's experimental and highlight the issue with a (skipped) failing test.