Add HTML parsing features #11

mallardduck · 2022-10-02T21:17:39Z

This is a WIP PR to address #2 - however I will resubmit/force-push this branch once we have #10 merged in to fix PHP 8.1 support fully.

As it stands now, here are the remaining concerns to address:

~~How to flag as experimental? Is there an PHPDoc annotation I can apply to the class?~~ Might be addressed, waiting for confirmation.
Based on the initial issue it sounded like a new method on Twemoji but I opted for a new Replacer class. This way the new functionality doesn't directly modify the base Twemoji behaviour. (If this works well for HTML I can work on a follow up PR to address adding a Markdown Replacer class)
How should the HTML replacer filter nodes for emoji replacement? I need to do more testing around this and add more test data to the pest datasets. This will help us be more informed on how PHP will filter things out best. Solved: Now using XPath to filter only the Text type nodes from the document. In the DOM the Text node type represents the inner Text of an element. So by filtering the page first to the body, then selecting all text elements we can safely catch all emoji.

src/HtmlReplacer.php

mallardduck · 2022-10-02T23:48:30Z

src/HtmlReplacer.php

+        }
+    }
+
+    public function parse(string $html): string


If we need to support full HTML docs and HTML fragments, then this method should:

Immediately determine if the input $html is a full DOM page, then

either use HtmlPage (used here) and work based on the Body, or

use the HtmlPageCrawler to parse the fragment.

I think that in PHP partial HTML is more common than a full document. Except you are implementing it as some kind of middleware to parse the whole HTML response.
But in general it should support both if possible.

Addressed this by using the more general HTML parser, then adding a step where we check if the input HTML is a Page/Doc and selecting the body from that. As I was already replacing based on a HTML fragment (the body), supporting fragments as input was rather simple.

src/HtmlReplacer.php

Gummibeer · 2022-10-04T14:42:33Z

tests/Datasets/HtmlContent.php

+HTML,
+]);
+
+dataset('html-fragments', [


Could we add the most simple fragments here first? And add the more complex/combined ones to the end?

<p>🚀</p>

<img src="" alt="🎉"/>

<a href="" title="🎈">Link ⛓️</a>

So some of these single-element examples.

I can also imagine that example which shouldn'T be replaced:

<script>document.innerHTML = '🤷‍♂️';</script>

Added those more simple fragments to the top and even moved the fragment dataset to be the "primary" source.

This feedback clued me off to a flaw in my "early return" logic. So now I'm making sure I check for existence of "Text" nodes rather than children in general. Since your first example has no actual children but does have Text nodes I can work on that was being skipped with my original logic.

Only issue this highlighted is that currently the code does affect the script tag contents. I'll circle back later today/tomorrow to look into that aspect of things. Seems like I'll need to get creative about creating some sort of exclude list to skip Text nodes inside elements like script.

Same for <style> I think. But like said, this will be experimental - so It's also fine to add a ->wip() todo() or whatever Pest test for such things. To highlight what has to be fixed/fulfilled to get it out of experimental mode.

How about passing Text only without any nodes?

This is some fancy-💃 Markdown/WYSIWYG text with surrounding <p> tags disabled. 🎉

Or without surrounding elements but one somewhere in the middle?

This is some fancy-💃 Markdown/WYSIWYG text with surrounding <code><p></code> tags disabled. 🎉

So I haven't gotten back around to creating an exclude list concept, however I'm going to start taking that on next. Just wanted to report back on my findings related to passing those "plain text" examples. Unfortunately it seems that passing text not wrapped in a node is interpreted in a variety of ambiguous ways.

For instance, the first example technically should have those braces escaped as < and > - otherwise browsers (and PHP DOM) will automatically parse the P tag and add a close. So I've opted to test with the escaped codes and this line is interpreted as XML. So internally it's wrapped with a _root and as such the XPath to find the inner text nodes fails. Ultimately the result is the text gets returned untouched and Emoji's stay emojis. The second example is basically the same too - just without needing to escape the HTML tag since the code tag has a close.

Long story short, it got complicated really fast lol. So I think I need to take a step back and focus on the HTML parsing situation and get a handle on how various inputs are being parsed. The DOM library I picked for us uses Symfony/dom-crawler as it's core which in turn uses Mastermind\HTML5 library to compensate for PHP's core dom having some failings.

All that said, my next step will be to create my own test repo to assess the overall situation for PHP's HTML parsing abilities. Since I think we will want a more clear picture about what behaviour I'm seeing from each mechanic. For instance, how is PHP core ext-dom treating things, vs how Mastermind\HTML5 parser, etc.

More info here: https://github.com/mallardduck/php-html-parsing-report/actions/runs/3191495563/jobs/5207874729

https://github.com/mallardduck/php-html-parsing-report/tree/main/tests/__snapshots__

So in general I can say that the symfony Dom crawler is an outstanding package and one of if not the best solution for HTML in PHP.

Like said before: don't escalate this too far. With tests that are meant to fail we at least have a good list of things needed for a final release. And even if you fix some of those, take the chance for more PRs. 😉🎃

Regarding the plain text thing: how about a detector with some cases/match arms?

full HTML document

partial HTML but starting/ending with a tag or even surrounded by one

text with HTML sprinkles - solution could be to surround it with <div>, do the job and afterwards take $div->innerHTML

Again, these are just ideas - I'm also personally more happy to merge several smaller and focused PRs than one beast of a PR. As it's also easier to review these smaller ones.

Thank you for that perspective! Always appreciate the sage advice and I think you nailed it. I'll start to refocus this PR to cover less of the edge cases and focus on the most common uses. Then I can adjust the tests a bit to give us a guiding star for future enhancements.

And one last thing: personally I would prefer datasets only for the same thing. Meaning really the same test case.

it does not replace emojis in HTML attributes

<a title="🚀">hello</a> <img alt="🖼️"/>

it replaces emojis in text nodes

<p>🎉</p> <div>💃</div>

So I'm more happy with a lot of copy'n'paste test cases with better/easier descriptions than one test case running through a massive dataset and no one really knows what's tested with that data-point and what is still missing or not covered. So every expectation to the lib should be one test case - and we can bullet proof that expectation by adding multiple variants of it. Like <a> and <img/> for the attributes. And later someone adds <button> to it.

I've refactored the tests now and followed your advice of individual tests. Great idea as this allows me to more obviously target edge cases. So I've taken advantage of that by adding some explicit expectations to tests.

Good news!

I've caught the root of the issue I've been chasing that I thought was something I did wrong. Specifically the way HTML attributes were being cast/converted to their HTML entities equivalent instead of staying unicode.

It turns out that this is caused by DOMDocument itself. I thought it was being caused by either: a) the xpath I found to use, or b) the HTML5 transforming library I opted to choose. In the end, reading the comments on PHP's docs for DOMDocument was my salvation! Specifically the 15 year old comment from xuanbn that said:

I have discovered that, to help loadHTML() process utf file correctly, the meta tag should come first, before any utf string appear. For example, this HTML file. <html> <head> <meta http-equiv="content-type" content="text/html; charset=utf-8"> <title> Vietnamese - Tiếng Việt</title> </head> <body></body> </html>

With that new info I added an additional test that shows how the HTML being input determines how DOMDocument treats things up on calling saveHtml(). So I think this gets us to a lot better place where I've created a working solution, but we show that it's experimental and highlight the issue with a (skipped) failing test.

mallardduck · 2022-10-17T17:44:26Z

tests/Unit/HtmlReplacerPageTest.php

+<!DOCTYPE html>
+<html lang="en">
+    <head>
+    	<meta http-equiv="content-type" content="text/html; charset=utf-8">


This is the "correct" meta tag to set charset to UTF8.
This will allow DOMDocument to preserve the raw Emoji in the document.

mallardduck · 2022-10-17T17:45:09Z

tests/Unit/HtmlReplacerPageTest.php

+<!DOCTYPE html>
+<html lang="en">
+    <head>
+        <meta charset="UTF-8">


This is the "incorrect" method to set charset to UTF8.
Unfortunately it will not be respected by DOMDocument and Emoji will become HTML entities.

I think this can be fixed before merge - we should just wrap all fragments in our own HTML page that uses the correct charset meta tag. This will ensure they are encoded correctly and then we can capture the fragment from our "template".

mallardduck · 2022-10-17T17:45:39Z

...acerPageTest__it_will_replace_the_Emoji_on_page,_but_not_in_head_-_NEW_Charset_Method__1.txt

+        <h1>Do a quick kickflip! <img src="https://twemoji.maxcdn.com/v/latest/72x72/1f6f9.png" alt="🛹" width="72" height="72" loading="lazy" class="twemoji"></h1>
+        <p>This is HTML text that should be replaced, but the emoji in the head should not.</p>
+        <h2>Time for a CRAB RAVE!</h2>
+        <p><img src="https://twemoji.maxcdn.com/v/latest/72x72/1f980.png" alt="🦀" width="72" height="72" loading="lazy" class="twemoji"><img src="https://twemoji.maxcdn.com/v/latest/72x72/1f980.png" alt="🦀" width="72" height="72" loading="lazy" class="twemoji"><img src="https://twemoji.maxcdn.com/v/latest/72x72/1f980.png" alt="🦀" width="72" height="72" loading="lazy" class="twemoji"><img src="https://twemoji.maxcdn.com/v/latest/72x72/1f980.png" alt="🦀" width="72" height="72" loading="lazy" class="twemoji"><img src="https://twemoji.maxcdn.com/v/latest/72x72/1f980.png" alt="🦀" width="72" height="72" loading="lazy" class="twemoji"></p>


See here, how the alt HTML attribute is still a valid Emoji.

mallardduck · 2022-10-17T17:46:43Z

tests/__snapshots__/HtmlReplacerFragmentTest__it_can_handle_text_with_an_outer_P_tag__1.txt

@@ -0,0 +1 @@
+<p>This is some fancy-<img src="https://twemoji.maxcdn.com/v/latest/72x72/1f483.png" alt="&#128131;" width="72" height="72" loading="lazy" class="twemoji"> Markdown/WYSIWYG text with surrounding <p> tags enabled. <img src="https://twemoji.maxcdn.com/v/latest/72x72/1f389.png" alt="&#127881;" width="72" height="72" loading="lazy" class="twemoji"></p></p>


However with this fragment over here it's obvious DOMDocument executes the normal behavior. The alt attribute is incorrectly encoded to html entity.

Gummibeer · 2022-10-17T18:02:19Z

Just let me know when you want me to review and merge it^^

mallardduck · 2022-10-17T23:09:48Z

@Gummibeer - Hiya, Think that I've got things in a far more functional state now. The solution I've landed on seems to handle 80% of the cases via the main "happy path" - that is works perfect without a hitch. Then there's the ones with incorrect or missing meta content-type meta tags; these cases will work just fine, however will have the correct meta tag injected.

mallardduck added 8 commits October 2, 2022 15:54

Apply 8.1 hotfixes from unmerged patch

67cd354

Initial HTML replacer code

8c15b01

remove unused property

6a8cb4d

generate new emoji bytes

b8af7e5

clean up code

b8bedbf

Add test to cover image alt/title attributes

bb84284

refactor to use XPath to solve filtering text nodes problem

eb121c0

Remove try-guy now that it's unused

92d9136

mallardduck commented Oct 2, 2022

View reviewed changes

Gummibeer assigned mallardduck Oct 4, 2022

Gummibeer linked an issue Oct 4, 2022 that may be closed by this pull request

Add HTML parser/replacer #2

Open

3 tasks

Gummibeer added enhancement Hacktoberfest https://hacktoberfest.digitalocean.com hacktoberfest-accepted labels Oct 4, 2022

mallardduck added 6 commits October 4, 2022 08:58

refactor to ensure we allow HTML fragments too

4538750

refactor tests to split up HTML pages and HTML fragments

cd6e190

Use internal tag as means of warning?

60c987f

Refactor method name to slightly better option

f7616c0

fix code styles

e818b5c

make styleCI happy

b1f83c7

mallardduck commented Oct 4, 2022

View reviewed changes

src/HtmlReplacer.php Show resolved Hide resolved

mallardduck requested a review from Gummibeer October 4, 2022 13:25

Gummibeer reviewed Oct 4, 2022

View reviewed changes

mallardduck added 5 commits October 4, 2022 11:18

Refactor to fix missed fragments and expand tests

29f7d0a

reorder code

eeb5f0a

Refactor new tests and add failing tests for current issues.

fcdd93d

fix styles

2d77cdc

track the Pest helper file

e0f2540

mallardduck force-pushed the html-parsing branch from 30c7f40 to e0f2540 Compare October 17, 2022 16:18

fix pest file styles

bc61a6a

Add tests that cover the edge case I've been chasing

b3a57be

mallardduck commented Oct 17, 2022

View reviewed changes

mallardduck added 14 commits October 17, 2022 14:12

Refactor how HTML fragments are handled

62fdc48

Ensure extra spaces are not added

3dd8fb5

Update tests with fixed results

b2bc8bb

Manually correct snapshots to desired state

2e3278d

Skip HTML fragment tests that cause errors

55badec

Refactor exception

c446d6f

remove dumper from composer file

1b5b3e0

Always use static builder method instead of new

2ee59bf

Improve fragment parsing and enable more tests

891c25f

Correct HTML pages without meta charset tag

2b77947

refactor UTF8 tag adding and enable test

d5b6869

Add test to cover when incorrect content type is corrected

894b79f

Add ext-dom to suggested

590ac97

adjust styles

4468c8e

mallardduck changed the title ~~WIP: Add Html parsing features~~ Add HTML parsing features Oct 17, 2022

mallardduck requested a review from Gummibeer October 17, 2022 23:05

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add HTML parsing features #11

Add HTML parsing features #11

mallardduck commented Oct 2, 2022 •

edited

Loading

mallardduck Oct 2, 2022

Gummibeer Oct 4, 2022

mallardduck Oct 4, 2022

Gummibeer Oct 4, 2022

mallardduck Oct 4, 2022

Gummibeer Oct 4, 2022

mallardduck Oct 5, 2022

mallardduck Oct 5, 2022

Gummibeer Oct 5, 2022

mallardduck Oct 5, 2022

Gummibeer Oct 6, 2022

mallardduck Oct 17, 2022

mallardduck Oct 17, 2022

mallardduck Oct 17, 2022

mallardduck Oct 17, 2022

mallardduck Oct 17, 2022

mallardduck Oct 17, 2022

mallardduck Oct 17, 2022

Gummibeer commented Oct 17, 2022

mallardduck commented Oct 17, 2022

		@@ -0,0 +1 @@
		<p>This is some fancy-<img src="https://twemoji.maxcdn.com/v/latest/72x72/1f483.png" alt="💃" width="72" height="72" loading="lazy" class="twemoji"> Markdown/WYSIWYG text with surrounding <p> tags enabled. <img src="https://twemoji.maxcdn.com/v/latest/72x72/1f389.png" alt="🎉" width="72" height="72" loading="lazy" class="twemoji"></p></p>

Add HTML parsing features #11

Are you sure you want to change the base?

Add HTML parsing features #11

Conversation

mallardduck commented Oct 2, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Gummibeer commented Oct 17, 2022

mallardduck commented Oct 17, 2022

mallardduck commented Oct 2, 2022 •

edited

Loading