Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Neutralize or remove illegal self-closing tags #18

Merged
merged 2 commits into from
Sep 1, 2023

Conversation

Kdecherf
Copy link

Given default settings balance=1 and keep_bad=6 and the following input HTML code:

<p>Hello world</p>
<figure />
<p>Lorem Ipsum</p>

hl_tag and hl_balance incorrectly return this, which can lead to broken structure:

<p>Hello world</p>
<figure>
<p>Lorem Ipsum</p>
</figure>

So this change lets hl_tag neutralize or remove "self-closing" tags for which an ending tag is mandatory according to the HTML spec.

Then, the input HTML code above would now be transformed to the following code:

<p>Hello world</p>
<p>Lorem Ipsum</p>

As a side note, keep_bad=5 would make HTMLawed to return the following:

<p>Hello world</p>
&lt;figure /&gt;
<p>Lorem Ipsum</p>

Copy link
Member

@jtojnar jtojnar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please add tests?

htmLawed.php Outdated
@@ -966,7 +966,7 @@ function hl_tag($t)
if ($t == '>') {
return '&gt;';
}
if (!preg_match('`^<(/?)([a-zA-Z][^\s>]*)([^>]*?)\s?>$`m', $t, $m)) { // Get tag with element name and attributes
if (!preg_match('`^<(/?)([a-zA-Z][^\s>]*)([^>]*?)\s?(/?)>$`m', $t, $m)) { // Get tag with element name and attributes
Copy link
Member

@jtojnar jtojnar Aug 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Huh, should not this be \s* instead of \s??

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, in case of unquoted attributes, the space is mandatory, see the note in https://developer.mozilla.org/en-US/docs/Glossary/Void_element#self-closing_tags. So, for example, <hr size=5/> should get parsed the same as <hr size="5/">.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, I'll check that 👍

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added references to the MDN in the commit message and I've updated the pattern (see 2295cf7):

2023-08-12-121846_597x288

However, it seems that there's a bug (unrelated to my PR) regarding the trailing slash without a space where HTMLawed still consider it as a self-closing tag and add the missing space.

htmLawed.php Outdated
@@ -966,7 +966,7 @@ function hl_tag($t)
if ($t == '>') {
return '&gt;';
}
if (!preg_match('`^<(/?)([a-zA-Z][^\s>]*)([^>]*?)\s?>$`m', $t, $m)) { // Get tag with element name and attributes
if (!preg_match('`^<(/?)([a-zA-Z][^\s>]*)([^>]*?)\s?(/?)>$`m', $t, $m)) { // Get tag with element name and attributes
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should use named capture group, index 4 is just confusing.

Suggested change
if (!preg_match('`^<(/?)([a-zA-Z][^\s>]*)([^>]*?)\s?(/?)>$`m', $t, $m)) { // Get tag with element name and attributes
if (!preg_match('`^<(/?)([a-zA-Z][^\s>]*)([^>]*?)\s?(?P<selfclosing>/?)>$`m', $t, $m)) { // Get tag with element name and attributes

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually I'm afraid it would make it harder to keep up with upstream if we move to named capture groups. Are you suggesting this only for the self-closing tag or for all capture groups of this line as well?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes only for our group.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done 2295cf7

@Kdecherf
Copy link
Author

Could you please add tests?

This is planned, this is why I left the PR as a draft for now :)

Given default settings `balance=1` and `keep_bad=6` and the following
input HTML code:

    <p>Hello world</p>
    <figure />
    <p>Lorem Ipsum</p>

`hl_tag` and `hl_balance` incorrectly return this, which can lead to
broken structure:

    <p>Hello world</p>
    <figure>
    <p>Lorem Ipsum</p>
    </figure>

So this change lets `hl_tag` neutralize or remove "self-closing" tags
for which an ending tag is mandatory according to the HTML spec.

Then, the input HTML code above would now be transformed to the
following code:

    <p>Hello world</p>
    <p>Lorem Ipsum</p>

As a side note, `keep_bad=5` would make HTMLawed to return the
following:

    <p>Hello world</p>
    &lt;figure /&gt;
    <p>Lorem Ipsum</p>

References:
- https://developer.mozilla.org/en-US/docs/Glossary/Void_element
- https://developer.mozilla.org/en-US/docs/Glossary/Void_element#self-closing_tags

Signed-off-by: Kevin Decherf <[email protected]>
@Kdecherf Kdecherf marked this pull request as ready for review August 12, 2023 10:37
@Kdecherf Kdecherf requested review from jtojnar and j0k3r August 12, 2023 10:37
Copy link

@j0k3r j0k3r left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

@j0k3r j0k3r merged commit 3721c9d into fossar:master Sep 1, 2023
5 checks passed
@Kdecherf Kdecherf deleted the feat/prune-selfclosed-tags branch September 18, 2023 15:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants