Neutralize or remove illegal self-closing tags #18
Could you please add tests?
htmLawed.php
Outdated
@@ -966,7 +966,7 @@ function hl_tag($t)
     if ($t == '>') {
         return '>';
     }
-    if (!preg_match('`^<(/?)([a-zA-Z][^\s>]*)([^>]*?)\s?>$`m', $t, $m)) { // Get tag with element name and attributes
+    if (!preg_match('`^<(/?)([a-zA-Z][^\s>]*)([^>]*?)\s?(/?)>$`m', $t, $m)) { // Get tag with element name and attributes
Huh, shouldn't this be `\s*` instead of `\s?`?
Actually, in the case of unquoted attributes, the space is mandatory; see the note in https://developer.mozilla.org/en-US/docs/Glossary/Void_element#self-closing_tags. So, for example, `<hr size=5/>` should get parsed the same as `<hr size="5/">`.
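To make the concern above concrete, here is a minimal sketch that runs the pattern exactly as it appears in the diff against both spellings. The helper name `split_tag` is hypothetical and only exists for this illustration; it is not part of htmLawed.

```php
<?php
// Pattern copied from the diff: one optional whitespace, then an optional slash.
$pattern = '`^<(/?)([a-zA-Z][^\s>]*)([^>]*?)\s?(/?)>$`m';

// Hypothetical helper (not an htmLawed function): split a tag into the
// pieces captured by the pattern, or return null when it does not match.
function split_tag(string $pattern, string $tag): ?array
{
    if (!preg_match($pattern, $tag, $m)) {
        return null;
    }
    return ['element' => $m[2], 'attributes' => $m[3], 'slash' => $m[4]];
}

// With and without a space between the unquoted value and the slash.
foreach (['<hr size=5 />', '<hr size=5/>'] as $tag) {
    echo $tag, ' => ', json_encode(split_tag($pattern, $tag)), PHP_EOL;
}
```

Because the whitespace before the slash is optional in this pattern, both inputs yield the same attribute string and a captured slash, whereas the MDN note says the slash in the second form belongs to the attribute value.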
Good point, I'll check that 👍
I've added references to MDN in the commit message and updated the pattern (see 2295cf7).
However, there seems to be a bug (unrelated to my PR) with a trailing slash that has no space before it: htmLawed still considers the tag self-closing and adds the missing space.
htmLawed.php
Outdated
@@ -966,7 +966,7 @@ function hl_tag($t)
     if ($t == '>') {
         return '>';
     }
-    if (!preg_match('`^<(/?)([a-zA-Z][^\s>]*)([^>]*?)\s?>$`m', $t, $m)) { // Get tag with element name and attributes
+    if (!preg_match('`^<(/?)([a-zA-Z][^\s>]*)([^>]*?)\s?(/?)>$`m', $t, $m)) { // Get tag with element name and attributes
Maybe we should use a named capture group; index 4 is just confusing.
-    if (!preg_match('`^<(/?)([a-zA-Z][^\s>]*)([^>]*?)\s?(/?)>$`m', $t, $m)) { // Get tag with element name and attributes
+    if (!preg_match('`^<(/?)([a-zA-Z][^\s>]*)([^>]*?)\s?(?P<selfclosing>/?)>$`m', $t, $m)) { // Get tag with element name and attributes
Actually I'm afraid it would make it harder to keep up with upstream if we move to named capture groups. Are you suggesting this only for the self-closing tag or for all capture groups of this line as well?
Yes, only for our group.
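As a small illustration of why naming only the new group should not disturb the rest of the line: in PHP, `preg_match` exposes a named group under its name and under its numeric position, so the existing indexes 1 to 3 keep their meaning. The variable names below are chosen for this sketch and are not taken from htmLawed.

```php
<?php
// Suggested pattern: only the new trailing-slash group is named.
$pattern = '`^<(/?)([a-zA-Z][^\s>]*)([^>]*?)\s?(?P<selfclosing>/?)>$`m';

preg_match($pattern, '<figure />', $m);

// Numeric indexes 1 to 3 are unchanged, so code reading them keeps working.
$closingSlash = $m[1];
$element = $m[2];
$attributes = $m[3];

// The named group is available by name and also at position 4.
$selfClosing = $m['selfclosing'] === '/';

var_dump($element, $selfClosing, $m[4] === $m['selfclosing']);
```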
done 2295cf7
This is planned, this is why I left the PR as a draft for now :)
Given default settings `balance=1` and `keep_bad=6` and the following input HTML code:

    <p>Hello world</p>
    <figure />
    <p>Lorem Ipsum</p>

`hl_tag` and `hl_balance` incorrectly return this, which can lead to broken structure:

    <p>Hello world</p>
    <figure>
    <p>Lorem Ipsum</p>
    </figure>

So this change lets `hl_tag` neutralize or remove "self-closing" tags for which an ending tag is mandatory according to the HTML spec.

Then, the input HTML code above would now be transformed to the following code:

    <p>Hello world</p>
    <p>Lorem Ipsum</p>

As a side note, `keep_bad=5` would make HTMLawed to return the following:

    <p>Hello world</p>
    <figure />
    <p>Lorem Ipsum</p>

References:
- https://developer.mozilla.org/en-US/docs/Glossary/Void_element
- https://developer.mozilla.org/en-US/docs/Glossary/Void_element#self-closing_tags

Signed-off-by: Kevin Decherf <[email protected]>
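The idea described in the commit message can be sketched in isolation as follows. This is not the PR's implementation: `filter_self_closing`, its `$neutralize` flag, and the `VOID_ELEMENTS` constant are hypothetical names, and the sketch simplifies by treating every non-void element as one that may not be self-closed. The void-element list is the one from the HTML spec.

```php
<?php
// Void elements per the HTML spec; only these may carry a trailing slash.
const VOID_ELEMENTS = ['area', 'base', 'br', 'col', 'embed', 'hr', 'img',
    'input', 'link', 'meta', 'source', 'track', 'wbr'];

// Hypothetical helper (not an htmLawed function): leave legal tags alone,
// and either escape or drop a self-closing tag on a non-void element.
function filter_self_closing(string $tag, bool $neutralize): string
{
    $pattern = '`^<(/?)([a-zA-Z][^\s>]*)([^>]*?)\s?(?P<selfclosing>/?)>$`m';
    if (!preg_match($pattern, $tag, $m) || $m['selfclosing'] === '') {
        return $tag; // not a tag, or no trailing slash: nothing to do
    }
    if (in_array(strtolower($m[2]), VOID_ELEMENTS, true)) {
        return $tag; // e.g. <br /> or <hr />: the slash is allowed
    }
    // Illegal self-closing tag: escape the angle brackets or remove the tag.
    return $neutralize ? str_replace(['<', '>'], ['&lt;', '&gt;'], $tag) : '';
}

echo filter_self_closing('<figure />', true), PHP_EOL;
echo filter_self_closing('<br />', true), PHP_EOL;
```

The boolean flag only stands in for the "neutralize or remove" choice; how that choice maps onto htmLawed's `keep_bad` values is left to the library's own logic.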
Force-pushed from a8330cc to 2295cf7.
Looks good!