Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add HTML parsing features #11

Open
wants to merge 35 commits into
base: main
Choose a base branch
from

Conversation

mallardduck
Copy link

@mallardduck mallardduck commented Oct 2, 2022

This is a WIP PR to address #2 - however I will resubmit/force-push this branch once we have #10 merged in to fix PHP 8.1 support fully.

As it stands now, here are the remaining concerns to address:

  1. How to flag as experimental? Is there an PHPDoc annotation I can apply to the class? Might be addressed, waiting for confirmation.
  2. Based on the initial issue it sounded like a new method on Twemoji but I opted for a new Replacer class. This way the new functionality doesn't directly modify the base Twemoji behaviour. (If this works well for HTML I can work on a follow up PR to address adding a Markdown Replacer class)
  3. How should the HTML replacer filter nodes for emoji replacement? I need to do more testing around this and add more test data to the pest datasets. This will help us be more informed on how PHP will filter things out best. Solved: Now using XPath to filter only the Text type nodes from the document. In the DOM the Text node type represents the inner Text of an element. So by filtering the page first to the body, then selecting all text elements we can safely catch all emoji.

src/HtmlReplacer.php Outdated Show resolved Hide resolved
}
}

public function parse(string $html): string
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we need to support full HTML docs and HTML fragments, then this method should:

  1. Immediately determine if the input $html is a full DOM page, then
  2. either use HtmlPage (used here) and work based on the Body, or
  3. use the HtmlPageCrawler to parse the fragment.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that in PHP partial HTML is more common than a full document. Except you are implementing it as some kind of middleware to parse the whole HTML response.
But in general it should support both if possible.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed this by using the more general HTML parser, then adding a step where we check if the input HTML is a Page/Doc and selecting the body from that. As I was already replacing based on a HTML fragment (the body), supporting fragments as input was rather simple.

@Gummibeer Gummibeer linked an issue Oct 4, 2022 that may be closed by this pull request
3 tasks
@Gummibeer Gummibeer added enhancement Hacktoberfest https://hacktoberfest.digitalocean.com hacktoberfest-accepted labels Oct 4, 2022
@mallardduck mallardduck requested a review from Gummibeer October 4, 2022 13:25
HTML,
]);

dataset('html-fragments', [
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we add the most simple fragments here first? And add the more complex/combined ones to the end?

<p>🚀</p>
<img src="" alt="🎉"/>
<a href="" title="🎈">Link ⛓️</a>

So some of these single-element examples.

I can also imagine that example which shouldn'T be replaced:

<script>document.innerHTML = '🤷‍♂️';</script>

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added those more simple fragments to the top and even moved the fragment dataset to be the "primary" source.

This feedback clued me off to a flaw in my "early return" logic. So now I'm making sure I check for existence of "Text" nodes rather than children in general. Since your first example has no actual children but does have Text nodes I can work on that was being skipped with my original logic.

Only issue this highlighted is that currently the code does affect the script tag contents. I'll circle back later today/tomorrow to look into that aspect of things. Seems like I'll need to get creative about creating some sort of exclude list to skip Text nodes inside elements like script.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same for <style> I think. But like said, this will be experimental - so It's also fine to add a ->wip() todo() or whatever Pest test for such things. To highlight what has to be fixed/fulfilled to get it out of experimental mode.

How about passing Text only without any nodes?

This is some fancy-💃 Markdown/WYSIWYG text with surrounding <p> tags disabled. 🎉

Or without surrounding elements but one somewhere in the middle?

This is some fancy-💃 Markdown/WYSIWYG text with surrounding <code><p></code> tags disabled. 🎉

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I haven't gotten back around to creating an exclude list concept, however I'm going to start taking that on next. Just wanted to report back on my findings related to passing those "plain text" examples. Unfortunately it seems that passing text not wrapped in a node is interpreted in a variety of ambiguous ways.

For instance, the first example technically should have those braces escaped as &lt; and &gt; - otherwise browsers (and PHP DOM) will automatically parse the P tag and add a close. So I've opted to test with the escaped codes and this line is interpreted as XML. So internally it's wrapped with a _root and as such the XPath to find the inner text nodes fails. Ultimately the result is the text gets returned untouched and Emoji's stay emojis. The second example is basically the same too - just without needing to escape the HTML tag since the code tag has a close.

Long story short, it got complicated really fast lol. So I think I need to take a step back and focus on the HTML parsing situation and get a handle on how various inputs are being parsed. The DOM library I picked for us uses Symfony/dom-crawler as it's core which in turn uses Mastermind\HTML5 library to compensate for PHP's core dom having some failings.

All that said, my next step will be to create my own test repo to assess the overall situation for PHP's HTML parsing abilities. Since I think we will want a more clear picture about what behaviour I'm seeing from each mechanic. For instance, how is PHP core ext-dom treating things, vs how Mastermind\HTML5 parser, etc.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So in general I can say that the symfony Dom crawler is an outstanding package and one of if not the best solution for HTML in PHP.

Like said before: don't escalate this too far. With tests that are meant to fail we at least have a good list of things needed for a final release. And even if you fix some of those, take the chance for more PRs. 😉🎃

Regarding the plain text thing: how about a detector with some cases/match arms?

  1. full HTML document
  2. partial HTML but starting/ending with a tag or even surrounded by one
  3. text with HTML sprinkles - solution could be to surround it with <div>, do the job and afterwards take $div->innerHTML

Again, these are just ideas - I'm also personally more happy to merge several smaller and focused PRs than one beast of a PR. As it's also easier to review these smaller ones.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for that perspective! Always appreciate the sage advice and I think you nailed it. I'll start to refocus this PR to cover less of the edge cases and focus on the most common uses. Then I can adjust the tests a bit to give us a guiding star for future enhancements.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And one last thing: personally I would prefer datasets only for the same thing. Meaning really the same test case.

it does not replace emojis in HTML attributes

<a title="🚀">hello</a>
<img alt="🖼️"/>

it replaces emojis in text nodes

<p>🎉</p>
<div>💃</div>

So I'm more happy with a lot of copy'n'paste test cases with better/easier descriptions than one test case running through a massive dataset and no one really knows what's tested with that data-point and what is still missing or not covered. So every expectation to the lib should be one test case - and we can bullet proof that expectation by adding multiple variants of it. Like <a> and <img/> for the attributes. And later someone adds <button> to it.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've refactored the tests now and followed your advice of individual tests. Great idea as this allows me to more obviously target edge cases. So I've taken advantage of that by adding some explicit expectations to tests.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good news!

I've caught the root of the issue I've been chasing that I thought was something I did wrong. Specifically the way HTML attributes were being cast/converted to their HTML entities equivalent instead of staying unicode.

It turns out that this is caused by DOMDocument itself. I thought it was being caused by either: a) the xpath I found to use, or b) the HTML5 transforming library I opted to choose. In the end, reading the comments on PHP's docs for DOMDocument was my salvation! Specifically the 15 year old comment from xuanbn that said:

I have discovered that, to help loadHTML() process utf file correctly, the meta tag should come first, before any utf string appear. For example, this HTML file. 

<html>
    <head>
        <meta http-equiv="content-type" content="text/html; charset=utf-8">
        <title> Vietnamese - Tiếng Việt</title>
    </head>
<body></body>
</html>

With that new info I added an additional test that shows how the HTML being input determines how DOMDocument treats things up on calling saveHtml(). So I think this gets us to a lot better place where I've created a working solution, but we show that it's experimental and highlight the issue with a (skipped) failing test.

<!DOCTYPE html>
<html lang="en">
<head>
<meta http-equiv="content-type" content="text/html; charset=utf-8">
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the "correct" meta tag to set charset to UTF8.
This will allow DOMDocument to preserve the raw Emoji in the document.

<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the "incorrect" method to set charset to UTF8.
Unfortunately it will not be respected by DOMDocument and Emoji will become HTML entities.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this can be fixed before merge - we should just wrap all fragments in our own HTML page that uses the correct charset meta tag. This will ensure they are encoded correctly and then we can capture the fragment from our "template".

<h1>Do a quick kickflip! <img src="https://twemoji.maxcdn.com/v/latest/72x72/1f6f9.png" alt="🛹" width="72" height="72" loading="lazy" class="twemoji"></h1>
<p>This is HTML text that should be replaced, but the emoji in the head should not.</p>
<h2>Time for a CRAB RAVE!</h2>
<p><img src="https://twemoji.maxcdn.com/v/latest/72x72/1f980.png" alt="🦀" width="72" height="72" loading="lazy" class="twemoji"><img src="https://twemoji.maxcdn.com/v/latest/72x72/1f980.png" alt="🦀" width="72" height="72" loading="lazy" class="twemoji"><img src="https://twemoji.maxcdn.com/v/latest/72x72/1f980.png" alt="🦀" width="72" height="72" loading="lazy" class="twemoji"><img src="https://twemoji.maxcdn.com/v/latest/72x72/1f980.png" alt="🦀" width="72" height="72" loading="lazy" class="twemoji"><img src="https://twemoji.maxcdn.com/v/latest/72x72/1f980.png" alt="🦀" width="72" height="72" loading="lazy" class="twemoji"></p>
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See here, how the alt HTML attribute is still a valid Emoji.

@@ -0,0 +1 @@
<p>This is some fancy-<img src="https://twemoji.maxcdn.com/v/latest/72x72/1f483.png" alt="&#128131;" width="72" height="72" loading="lazy" class="twemoji"> Markdown/WYSIWYG text with surrounding <p> tags enabled. <img src="https://twemoji.maxcdn.com/v/latest/72x72/1f389.png" alt="&#127881;" width="72" height="72" loading="lazy" class="twemoji"></p></p>
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However with this fragment over here it's obvious DOMDocument executes the normal behavior. The alt attribute is incorrectly encoded to html entity.

@Gummibeer
Copy link
Member

Just let me know when you want me to review and merge it^^

@mallardduck mallardduck changed the title WIP: Add Html parsing features Add HTML parsing features Oct 17, 2022
@mallardduck mallardduck requested a review from Gummibeer October 17, 2022 23:05
@mallardduck
Copy link
Author

@Gummibeer - Hiya, Think that I've got things in a far more functional state now. The solution I've landed on seems to handle 80% of the cases via the main "happy path" - that is works perfect without a hitch. Then there's the ones with incorrect or missing meta content-type meta tags; these cases will work just fine, however will have the correct meta tag injected.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Hacktoberfest https://hacktoberfest.digitalocean.com hacktoberfest-accepted
Development

Successfully merging this pull request may close these issues.

Add HTML parser/replacer
2 participants