Adds Markdown support, separates smiley parsing from BBC parsing, adds SMF\Parser base class and subclasses #8349

Sesquipedalian · 2024-12-01T21:55:56Z

This PR adds support for Markdown syntax. (It also does some other stuff; see "Other Changes".)

The specific flavour of Markdown this supports is GitHub Flavoured Markdown, which is CommonMark plus a tables extension and a strikethrough extension.

Main features:

- Markdown syntax is supported everywhere that BBCode is supported.
- Markdown and BBCode can be mixed together arbitrarily within the same content.
- Some BBCode features have been added or adjusted in order to ensure feature parity between BBCode and Markdown syntax support:
  - New [h1] to [h6] BBCodes have been added.
  - Code has been added to adjust heading elements in generated HTML output so that headings in user-supplied content don't break the overall document outline of the HTML page.
  - The first row of BBCode tables is now rendered as a table header row when converted to HTML.
  - The [tt] BBCode has been un-deprecated and defined to output inline <code> elements.
  - The [list] BBCode will now output an <ol> or a <ul> as appropriate according to the value of the BBCode's type attribute. (Previously, we always output a <ul> even when the CSS made the list appear as if it were ordered.)
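
To illustrate the heading adjustment: in the expected output shown later in this thread, a level 1 heading inside a post becomes `<h5 class="bbc_h1">`, level 2 becomes `<h6 class="bbc_h2">`, and deeper levels fall back to `<div>` elements. A minimal sketch of that mapping follows. The function name and the `$offset` parameter are hypothetical and are not taken from the PR; the offset of 4 is simply what the expected output for a forum post implies.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper, not SMF's actual code: renders a user-supplied heading
 * shifted down by $offset levels so that it nests below the page's own headings.
 */
function renderShiftedHeading(int $level, string $html, int $offset): string
{
    if ($level < 1 || $level > 6) {
        throw new InvalidArgumentException('Heading level must be between 1 and 6.');
    }

    $shifted = $level + $offset;

    // HTML has no <h7>, so fall back to a <div>. The class keeps the original level.
    $tag = $shifted <= 6 ? 'h' . $shifted : 'div';

    return sprintf('<%1$s class="bbc_h%2$d">%3$s</%1$s>', $tag, $level, $html);
}

echo renderShiftedHeading(1, 'BBC Heading 1', 4), "\n";
echo renderShiftedHeading(3, 'BBC Heading 3', 4), "\n";
```

Because the class name records the original level, CSS can still style a `bbc_h3` as a third-level heading even though it is emitted as a `<div>`.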

Minor features:

- Paragraphs of text in the HTML output are always wrapped in semantically correct <p> elements when Markdown support is enabled. This is better for accessibility, among other benefits.
- Adds two options to allow admins to clean up errant line breaks in user input in order to produce more consistent formatting in the output:
  - Collapse extra blank lines: Enabling this setting will remove unnecessary blank lines between non-empty content elements.
  - Clean up line breaks inside paragraphs: Enabling this setting will remove single line breaks inside paragraphs.
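
As a rough illustration of what the two line break options do, here is a naive sketch. Both function names are hypothetical, and the regular expressions are an assumption about the general idea rather than the PR's implementation. A real version would also have to leave code blocks, lists, and other line-sensitive Markdown constructs alone.

```php
<?php
declare(strict_types=1);

/** Hypothetical: collapses any run of blank lines into a single blank line. */
function collapseBlankLines(string $text): string
{
    // Three or more consecutive newlines mean at least two blank lines.
    return preg_replace('/\n{3,}/', "\n\n", $text) ?? $text;
}

/** Hypothetical: replaces single line breaks inside a paragraph with a space. */
function joinSoftLineBreaks(string $text): string
{
    // Match a newline that is neither preceded nor followed by another newline.
    return preg_replace('/(?<!\n)\n(?!\n)/', ' ', $text) ?? $text;
}

$input = "derp derp\nderpy doo\n\n\n\nnext paragraph";

echo joinSoftLineBreaks(collapseBlankLines($input)), "\n";
```

The sketch assumes line endings have already been normalized to `\n`.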

Other changes:

- Implements an abstract SMF\Parser base class.
- Moves SMF\BBCodeParser to SMF\Parsers\BBCodeParser.
- Moves smiley parsing out of SMF\Parsers\BBCodeParser into SMF\Parsers\SmileyParser.
- Replaces calls to BBCodeParser::parse() throughout the codebase with calls to Parser::transform(), which is able to leverage all three of SMF\Parsers\BBCodeParser, SMF\Parsers\MarkdownParser, and SMF\Parsers\SmileyParser in order to intelligently handle the process of transforming markup (i.e. BBCode, Markdown, HTML, and smiley codes) from one form to another as needed.
- Improves the semantics of the HTML generated from various BBCodes (e.g. <strong>, <em>, etc.).
- Improves our handling of tab characters in user supplied data. This was necessary because tabs can have syntactic significance in Markdown.
- Fixes a bug where we were not respecting the value of Config::$modSettings['enablePostHTML'] when BBCode was disabled.

More info:

- SMF\Parsers\MarkdownParser::parse() can be used to generate the following types of output:
  - HTML output that strictly adheres to the baseline GFM spec.
  - HTML output that matches the output that would be produced by the equivalent BBCode, which means customizations to BBCode output are automatically carried over to the equivalent Markdown output. This is the default output type.
  - The equivalent BBCode (i.e. it can convert Markdown into BBCode).
- Our MarkdownParser is more accurate (and thus restrictive) when identifying absolute URIs than the spec. For example, the spec will accept obviously invalid URIs like a+b+c:d, made-up-scheme://foo,bar, or http://../. SMF is capable of much more robust URI validation, so we use it.
- Whereas Markdown hypothetically allows any arbitrary HTML to be embedded into documents, SMF in fact restricts the allowed HTML in user input just like it always has. Thus, users that paste Markdown containing disallowed HTML into their posts will see that HTML turned into plain text; it will not be rendered.
- Although a CommonMark extension exists for PHP itself, we don't use it because:
  1. It is not installed by default in most versions of PHP.
  2. It doesn't support tables or strikethrough text, which we want for the sake of better feature parity with BBCode.
  3. We need to do some special handling in order to make BBCode and Markdown play nicely with each other, and it was much harder to accomplish that using the PHP extension than it was to just write our own parser.
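
To make the point about URI validation concrete, the sketch below contrasts the spec's permissive definition of an absolute URI (a scheme of 2 to 32 characters, a colon, then anything except spaces, control characters, `<`, and `>`) with a stricter check. The stricter function is a hypothetical stand-in, not SMF's validator: it assumes a small allowlist of schemes and a plausible host name, which is enough to reject the three examples mentioned above.

```php
<?php
declare(strict_types=1);

/** Absolute URI as the CommonMark/GFM autolink rule defines it. */
function matchesSpecAbsoluteUri(string $uri): bool
{
    return preg_match('/^[A-Za-z][A-Za-z0-9+.\-]{1,31}:[^\x00-\x20\x7F<>]*$/', $uri) === 1;
}

/**
 * Hypothetical stricter check (NOT SMF's actual validator): requires a known
 * scheme and a syntactically plausible host name.
 */
function isPlausibleWebUri(string $uri, array $schemes = ['http', 'https', 'ftp']): bool
{
    if (!matchesSpecAbsoluteUri($uri)) {
        return false;
    }

    $parts = parse_url($uri);

    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return false;
    }

    if (!in_array(strtolower($parts['scheme']), $schemes, true)) {
        return false;
    }

    // Each dot-separated label must start and end with a letter or digit.
    $label = '[a-z0-9](?:[a-z0-9-]*[a-z0-9])?';

    return preg_match('/^' . $label . '(?:\.' . $label . ')*$/i', $parts['host']) === 1;
}

foreach (['a+b+c:d', 'made-up-scheme://foo,bar', 'http://../', 'http://example.com/'] as $uri) {
    printf("%-26s spec: %-3s strict: %s\n", $uri, matchesSpecAbsoluteUri($uri) ? 'yes' : 'no', isPlausibleWebUri($uri) ? 'yes' : 'no');
}
```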

Sesquipedalian · 2024-12-01T22:20:33Z

Hm. I used to have a complete unit testing file for this back when I was working on it during the summer, but now I can't find it. Anyone who wants one can just grab a copy of the GFM spec, extract all the examples it gives, and run them through SMF\Parsers\MarkdownParser::parse() with the output type set to OUTPUT_HTML_STRICT.
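
For anyone who wants to rebuild such a test file, here is a minimal sketch of the extraction step. It relies on the layout of the spec's source text, where each example is delimited by lines of 32 backticks, a line containing only a period separates the Markdown input from the expected HTML, and tab characters are written as `→`. The function names are hypothetical, and `$render` is a stand-in callable for whatever wraps the real MarkdownParser call with the output type set to OUTPUT_HTML_STRICT, since the exact signature is not shown in this thread.

```php
<?php
declare(strict_types=1);

/**
 * Extracts the example blocks from the source text of the CommonMark/GFM spec.
 *
 * @return array<int, array{markdown: string, html: string}>
 */
function extractSpecExamples(string $specText): array
{
    $specText = str_replace("\r\n", "\n", $specText);

    // Examples are delimited by lines of 32 backticks. The opening line may
    // name an extension after the word "example". A lone "." splits the block.
    $fence = str_repeat('`', 32);
    $pattern = '/^' . $fence . ' example[^\n]*\n(.*?)^\.\n(.*?)^' . $fence . '$/ms';

    preg_match_all($pattern, $specText, $matches, PREG_SET_ORDER);

    $examples = [];

    foreach ($matches as $match) {
        // The spec writes tab characters as "→" so that they are visible.
        $examples[] = [
            'markdown' => str_replace('→', "\t", $match[1]),
            'html' => str_replace('→', "\t", $match[2]),
        ];
    }

    return $examples;
}

/**
 * Runs every example through $render and returns the indexes of the failures.
 *
 * @return array<int, int>
 */
function findFailingExamples(array $examples, callable $render): array
{
    $failed = [];

    foreach ($examples as $i => $example) {
        if ($render($example['markdown']) !== $example['html']) {
            $failed[] = $i;
        }
    }

    return $failed;
}
```

A test would then read the spec file, call `extractSpecExamples()` on its contents, and assert that `findFailingExamples()` returns an empty array.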

Sesquipedalian · 2024-12-01T22:25:51Z

As an additional unit test, specifically focussed on ensuring that BBCode and Markdown play nicely together, try out this raw post text:

(Note: change the occurrences of &#x60; to backtick characters before running the test.)

<div>
<a href="http://example.com">example</a>
</div>




[table][tr][td]BBCode[/td]
[td]Table[/td]
[/tr]
[tr][td]foo[/td]
[td]bar[/td]
[/tr]
[/table]


| Markdown | Table |
| ------------ | ------ |
| baz | qux|


derp derp
derpy doo

> Lorem ipsum dolor
sit amet.
>
> - Qui *quodsi iracundia*
> - aliquando id

1. Lorem ipsum dolor
  sit amet.
2. Qui *quodsi iracundia*
  aliquando id

Lorem ipsum dolor
sit amet.
Qui *quodsi iracundia*
aliquando id


[center]derp *foo**bar**baz* derp[/center]

[code]
some    tab


*code*

[/code]


[linktext](http://example.com)

[reflink]

[tt]inline code[/tt]

[code=php]$foo = 'bar';[/code]


[php]	$foo = 'bar';[/php]

Derp	derp derp derp derp derp derp derp derp derp derp derp derp derp derp derp derp derp derp derp derp derp derp derp derp derp	tabs.
Derp0000

&#x60;&#x60;&#x60;PHP
$baz = true;
&#x60;&#x60;&#x60;

[h1]BBC Heading 1[/h1]
derp
[h2]BBC Heading 2[/h2]
derp
[h3]BBC Heading 3[/h3]
derp
[h4]BBC Heading 4[/h4]
derp
[h5]BBC Heading 5[/h5]
derp
[h6]BBC Heading 6[/h6]
derp

# Markdown Heading 1
derp
## Markdown Heading 2
derp
### Markdown Heading 3
derp
#### Markdown Heading 4
derp
##### Markdown Heading 5
derp
###### Markdown Heading 6
derp

[reflink]: http://example.com

The expected HTML output when that is put into a forum post is:

<p>
    &lt;div&gt;
    <a href="http://example.com" class="bbc_link" target="_blank" rel="noopener">example</a>

    &lt;/div&gt;
</p>
<table class="bbc_table">
    <thead>
        <tr>
            <th>
                <p>BBCode</p>
            </th>
            <th>
                <p>Table</p>
            </th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td>
                <p>foo</p>
            </td>
            <td>
                <p>bar</p>
            </td>
        </tr>
    </tbody>
</table>
<table class="bbc_table">
    <thead>
        <tr>
            <th>Markdown</th>
            <th>Table</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td>baz</td>
            <td>qux</td>
        </tr>
    </tbody>
</table>
<p>derp derp
derpy doo</p>
<blockquote class="bbc_standard_quote">
    <cite>Quote</cite>

    <p>Lorem ipsum dolor
    sit amet.</p>
    <ul class="bbc_list" style="list-style-type: disc;">
        <li>
            Qui 
            <em>quodsi iracundia</em>
        </li>
        <li>aliquando id</li>
    </ul>
</blockquote>
<ol class="bbc_list" style="list-style-type: decimal;">
    <li>Lorem ipsum dolor
    sit amet.</li>
    <li>
        Qui 
        <em>quodsi iracundia</em>

        aliquando id
    </li>
</ol>
<p>
    Lorem ipsum dolor
    sit amet.
    Qui 
    <em>quodsi iracundia</em>

    aliquando id
</p>
<div class="centertext">
    <div class="inline-block">
        <p>
            derp 
            <em>
                foo
                <strong>bar</strong>
                baz
            </em>
             derp
        </p>
    </div>
</div>
<div class="codeheader">
    <span class="code">Code</span>
    <a class="codeoperation smf_select_text">Select</a>
    <a class="codeoperation smf_expand_code hidden" data-shrink-txt="Shrink" data-expand-txt="Expand">Expand</a>
</div>
<code class="bbc_code">
    some&nbsp; &nbsp; tab
    <br>
    <br>
    <br>
    *code*
    <br>
    <br>
</code>
<p>
    <a href="http://example.com" class="bbc_link" target="_blank" rel="noopener">linktext</a>
</p>
<p>
    <a href="http://example.com" class="bbc_link" target="_blank" rel="noopener">reflink</a>
</p>
<p>
    <code class="bbc_tt">inline code</code>
</p>
<div class="codeheader">
    <span class="code">Code</span>
     (PHP) 
    <a class="codeoperation smf_select_text">Select</a>
    <a class="codeoperation smf_expand_code hidden" data-shrink-txt="Shrink" data-expand-txt="Expand">Expand</a>
</div>
<code class="bbc_code">
    <span style="color: #0000BB">$foo </span>
    <span style="color: #007700">= </span>
    <span style="color: #DD0000">&#039;bar&#039;</span>
    <span style="color: #007700">;</span>
    <span style="color: #0000BB"></span>
</code>
<p>
    <code class="phpcode">
        <span style="color: #0000BB">
            <span style="white-space: pre-wrap;"></span>
        </span>
        <span style="color: #0000BB">$foo </span>
        <span style="color: #007700">= </span>
        <span style="color: #DD0000">&#039;bar&#039;</span>
        <span style="color: #007700">;</span>
        <span style="color: #0000BB"></span>
    </code>
</p>
<p>
    Derp
    <span style="white-space: pre-wrap;"></span>
    derp derp derp derp derp derp derp derp derp derp derp derp derp derp derp derp derp derp derp derp derp derp derp derp derp
    <span style="white-space: pre-wrap;"></span>
    tabs.
    Derp0000
</p>
<div class="codeheader">
    <span class="code">Code</span>
     (PHP) 
    <a class="codeoperation smf_select_text">Select</a>
    <a class="codeoperation smf_expand_code hidden" data-shrink-txt="Shrink" data-expand-txt=""></a>
</div>
<code class="bbc_code">
    <span style="color: #0000BB">$baz </span>
    <span style="color: #007700">= </span>
    <span style="color: #0000BB">true</span>
    <span style="color: #007700">;</span>
    <span style="color: #0000BB"></span>
</code>
<h5 class="bbc_h1">BBC Heading 1
</h5>
<p>derp</p>
<h6 class="bbc_h2">BBC Heading 2
</h6>
<p>derp</p>
<div class="bbc_h3">BBC Heading 3
</div>
<p>derp</p>
<div class="bbc_h4">BBC Heading 4
</div>
<p>derp</p>
<div class="bbc_h5">BBC Heading 5
</div>
<p>derp</p>
<div class="bbc_h6">BBC Heading 6
</div>
<p>derp</p>
<h5 class="bbc_h1">Markdown Heading 1</h5>

<p>derp</p>
<h6 class="bbc_h2">Markdown Heading 2</h6>

<p>derp</p>
<div class="bbc_h3">Markdown Heading 3</div>

<p>derp</p>
<div class="bbc_h4">Markdown Heading 4</div>

<p>derp</p>
<div class="bbc_h5">Markdown Heading 5</div>

<p>derp</p>
<div class="bbc_h6">Markdown Heading 6</div>

<p>derp</p>

jdarwood007 · 2024-12-02T04:12:36Z

I know you have already done lots of work, but it feels a bit backward to add code that parses BBC and then Markdown everywhere it is needed. Anyone using our code would also need to update theirs to parse Markdown.

I wonder if it's appropriate to add a 'parser' class that would have an enum (at least in 8.1+, but constants for 8.0) that defines whether you want BBC, markdown, or both.

Something like:

\SMF\Parser::parse($body, Parser::PARSE_BBC | Parser::PARSE_MARKDOWN);

Backward-compatible function calls would just default to using BBC. Mod authors who support 3.0 code natively can just call the parser and get both automatically.

If we support more/different options in the future, it will be easier to just add it.
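
A toy sketch of the flag-based idea, to make the suggestion concrete. Everything here is hypothetical: the class name, the constants, and the two stand-in passes (which handle only `[b]...[/b]` and `*...*`) are illustrations, not SMF's API. Note that this toy hard-codes the order of the passes.

```php
<?php
declare(strict_types=1);

/** Toy facade illustrating flag-based dispatch. The names are hypothetical. */
final class ToyParser
{
    public const PARSE_BBC = 1 << 0;
    public const PARSE_MARKDOWN = 1 << 1;

    public static function parse(string $body, int $flags = self::PARSE_BBC): string
    {
        if ($flags & self::PARSE_BBC) {
            // Stand-in for a real BBCode pass: only handles [b]...[/b].
            $body = preg_replace('~\[b\](.*?)\[/b\]~s', '<strong>$1</strong>', $body) ?? $body;
        }

        if ($flags & self::PARSE_MARKDOWN) {
            // Stand-in for a real Markdown pass: only handles *...*.
            $body = preg_replace('~\*(.+?)\*~s', '<em>$1</em>', $body) ?? $body;
        }

        return $body;
    }
}

// Legacy-style call: BBC only, because that is the default.
echo ToyParser::parse('[b]bold[/b] and *emphasis*'), "\n";

// Native 3.0-style call: both markup types.
echo ToyParser::parse('[b]bold[/b] and *emphasis*', ToyParser::PARSE_BBC | ToyParser::PARSE_MARKDOWN), "\n";
```

Adding another markup type later would mean adding one more constant and one more conditional pass, without changing existing callers.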

Sesquipedalian · 2024-12-02T07:03:13Z

An abstract class and an interface might be a good idea, yes.

However, when I was writing this I found that it did not work well to conflate BBCode parsing and Markdown parsing into a single call. Most of the time it is best to parse BBCode and then Markdown, but sometimes it is best to do things in the opposite order. So we need the flexibility to call them in the order that we need. There's also the problem that both parsers need various parameters passed to them, but those sets of parameters have nothing to do with each other. Attempting to munge them all together as one big set of parameters to a single function call would be nasty in a number of different ways.

Nevertheless, you have raised a very important point that I hadn't thought about. The approach I've taken so far in this PR would indeed mean that "Anyone using our code would also need to update theirs to parse markdown," which doesn't work well at all with our goals regarding backward compatibility in 3.0. You are quite right that this is a problem that needs to be addressed. I'll have to think some more about the best way to do so... 🤔

jdarwood007 · 2024-12-02T12:58:50Z

Well, even without mods supporting 3.0 natively, the code is backward compatible out of the box. We just wouldn't be enabling a new feature for them. Mods would have to be built as native 3.0 code to get the MD support.

The feature is complete as is and doesn't need an overhaul if you don't want to do one. I would personally have wanted to extend the messages table with a 'format' column, where 1/default = modern BBC and 2 = Markdown, and then have an option on the posting page to switch between BBC and MD. Some sort of 'version' or other column would also be added to indicate how we handle HTML entities, as we need to clean up the various times we clean and then unclean entries to validate things during input. It would be versioned to the SMF version the input was written on, which would let our parser know how to handle cleanup and processing as we change the input sanitization rules or make other fixes. A scheduled task could even be written to 'upgrade' old posts to newer versions, reducing the amount of parsing that old posts go through for security checks.

Sesquipedalian · 2024-12-06T20:55:00Z

I took another crack at abstracting our parsing in order to handle the process of transforming one markup type into another. I seem to have solved it this time.

jdarwood007 · 2024-12-06T23:14:44Z

I like it. I didn't expect the smilies to become a parser, but that makes complete sense. There's a triple win here (Markdown, Parser customization, and smiley parsing separation).

Sesquipedalian · 2024-12-07T22:34:57Z

This was planned for Alpha 4, but I don't want to worry about conflicts, so if no one objects, I'd like to merge it soon.

jdarwood007 · 2024-12-08T04:07:54Z

Sources/Parser.php

+		// Allow mods access before parsing.
+		$smileys = !empty($input_types & self::INPUT_SMILEYS);
+
+		IntegrationHook::call('integrate_pre_parsebbc', [&$string, &$smileys, &$options['cache_id'], &$options['parse_tags']]);


Do we call the pre_parsebbc here or in the BBC logic?

Sesquipedalian · 2024-12-08

I moved these two hooks here so that hooked functions could have access to the input string before any changes were made and after all changes were made, which is what these hooks were expected to do back when BBCode parsing was the only type of parsing. If I had left them in BBCodeParser::parse(), hooked functions would have been called in the midst of the overall parsing process, which would have had negative effects depending on what the hooked function tried to do.

If I were creating the hooks from scratch I would have named them integrate_pre_parse and integrate_post_parse (without the trailing bbc). But since the hooks already existed with these names, they continue to have those names.

jdarwood007 · 2024-12-08

That makes sense in terms of keeping compatibility with older code.

dragomano · 2024-12-08T08:10:07Z

You can add new things while leaving the old ones:

/* @deprecated since 3.0 */
IntegrationHook::call('integrate_pre_parsebbc', [&$string, &$smileys, &$options['cache_id'], &$options['parse_tags']]);

IntegrationHook::call('integrate_pre_parse', [&$string, &$smileys, &$options['cache_id'], &$options['parse_tags']]);
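
A self-contained sketch of how firing both hook names with the same by-reference arguments could behave. `ToyHooks` is a minimal hypothetical registry written only for this illustration; SMF's real IntegrationHook class resolves and loads its callables differently. `transformWithHooks()` is likewise a made-up stand-in for the start of the transform process.

```php
<?php
declare(strict_types=1);

/** Minimal stand-in for a hook registry. Hypothetical, not SMF's implementation. */
final class ToyHooks
{
    /** @var array<string, array<int, callable>> */
    private static array $listeners = [];

    public static function add(string $name, callable $listener): void
    {
        self::$listeners[$name][] = $listener;
    }

    public static function call(string $name, array $args): void
    {
        foreach (self::$listeners[$name] ?? [] as $listener) {
            // $args holds references, so listeners can modify the caller's variables.
            call_user_func_array($listener, $args);
        }
    }
}

function transformWithHooks(string $string): string
{
    // Legacy name first so that existing mods keep working, then the new name.
    foreach (['integrate_pre_parsebbc', 'integrate_pre_parse'] as $hook) {
        ToyHooks::call($hook, [&$string]);
    }

    return $string;
}

ToyHooks::add('integrate_pre_parsebbc', function (string &$s): void {
    $s .= ' [legacy]';
});

ToyHooks::add('integrate_pre_parse', function (string &$s): void {
    $s .= ' [new]';
});

echo transformWithHooks('post'), "\n";
```

One thing to watch with this approach: a mod that registers for both names would run twice on the same string.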

live627 · 2024-12-09T03:43:54Z

Themes/default/scripts/jquery.sceditor.smf.js

+					editor.toggleSourceMode();
+					editor.toggleSourceMode();
+					editor.sourceEditorCaret({start: caretPos, end: caretPos});
+				}


What does this code do? I'm having trouble understanding what behavior it adds or modifies.

Sesquipedalian · 2024-12-10

Without the lines that toggle the mode and then set the caret position, I found that the caret ended up jumping to an unexpected position (the start of the line, IIRC).

Sesquipedalian added New feature PM BBC Theme Profile Fields Editor Posting Polls Web feeds RSS, ATOM, etc. Administrative Drafts labels Dec 1, 2024

Sesquipedalian added this to the 3.0 Alpha 4 milestone Dec 1, 2024

Sesquipedalian force-pushed the 3.0/markdown branch from 3e13f00 to 767734a Compare December 2, 2024 18:21

Sesquipedalian added 12 commits December 6, 2024 13:42

Respects enablePostHTML setting even when BBCode is disabled

0fe1189

Signed-off-by: Jon Stovell <[email protected]>

Improves semantics of BBC output

f29f244

Signed-off-by: Jon Stovell <[email protected]>

Restores the tt BBCode to full status

92a6ca0

Also fixes WYSIWYG bugs with the php BBCode Signed-off-by: Jon Stovell <[email protected]>

Marks first row of BBC tables as table header

816826d

Signed-off-by: Jon Stovell <[email protected]>

Adds support for h1-h6 BBCode tags

731e67a

This is complicated, because we need to adjust the HTML tags in the output depending on the context in which the HTML is shown. Signed-off-by: Jon Stovell <[email protected]>

Moves hard-coded tab substitute string into a constant

1683803

Also changes the content of the tab substitute string to something better and unique. Signed-off-by: Jon Stovell <[email protected]>

Improves handling and display of tab characters in posts, etc.

01b2a4c

Signed-off-by: Jon Stovell <[email protected]>

For the sake of Markdown, convinces SCEditor not to delete tabs

5620163

Signed-off-by: Jon Stovell <[email protected]>

Implements SMF\MarkdownParser

a04186a

Signed-off-by: Jon Stovell <[email protected]>

Adds support for Markdown in posts and PMs

1e7010d

Signed-off-by: Jon Stovell <[email protected]>

Adds support for Markdown in registration agreement and privacy policy

3d4ccd2

Signed-off-by: Jon Stovell <[email protected]>

Adds support for Markdown in user warnings and notices

43328d7

Signed-off-by: Jon Stovell <[email protected]>

Sesquipedalian added 4 commits December 6, 2024 13:42

Adds support for Markdown in signatures

b92e780

Signed-off-by: Jon Stovell <[email protected]>

Adds support for Markdown in polls

6ddff82

Signed-off-by: Jon Stovell <[email protected]>

Adds support for Markdown in moderator notes

58d256e

Signed-off-by: Jon Stovell <[email protected]>

Prevents WYSIWYG autolinking inside Markdown links in editor

5c8fe7e

Signed-off-by: Jon Stovell <[email protected]>

Sesquipedalian force-pushed the 3.0/markdown branch from 767734a to 4333c16 Compare December 6, 2024 20:42

Sesquipedalian changed the title ~~Adds Markdown support~~ Adds Markdown support, separates smiley parsing from BBC parsing, adds SMF\Parser base class and subclasses Dec 6, 2024

Sesquipedalian marked this pull request as draft December 6, 2024 21:08

Sesquipedalian marked this pull request as ready for review December 6, 2024 21:09

Sesquipedalian force-pushed the 3.0/markdown branch 3 times, most recently from 4a04c89 to 8f5f02f Compare December 6, 2024 23:14

Sesquipedalian force-pushed the 3.0/markdown branch from 8f5f02f to 4b2284c Compare December 6, 2024 23:31

Sesquipedalian added 2 commits December 6, 2024 17:10

Implements abstract SMF\Parser class & sub-classes

4a06d05

Signed-off-by: Jon Stovell <[email protected]>

Prevents bypassing disabled BBC setting via Markdown

95d6b70

Signed-off-by: Jon Stovell <[email protected]>

Sesquipedalian force-pushed the 3.0/markdown branch from 4b2284c to 95d6b70 Compare December 7, 2024 02:45

Sesquipedalian requested review from live627, jdarwood007 and BrickOzp December 7, 2024 22:35

jdarwood007 reviewed Dec 8, 2024

Sesquipedalian merged commit ba5d10a into SimpleMachines:release-3.0 Dec 8, 2024
6 checks passed

Sesquipedalian deleted the 3.0/markdown branch December 8, 2024 19:17

Sesquipedalian removed request for live627 and BrickOzp December 8, 2024 19:59

live627 reviewed Dec 9, 2024
