Adds Markdown support, separates smiley parsing from BBC parsing, adds SMF\Parser base class and subclasses #8349

Merged

Member

@Sesquipedalian Sesquipedalian commented Dec 1, 2024

This PR adds support for Markdown syntax. (It also does some other stuff; see "Other Changes".)

The specific flavour of Markdown this supports is GitHub Flavoured Markdown, which is CommonMark plus a tables extension and a strikethrough extension.

Main features:

  • Markdown syntax is supported everywhere that BBCode is supported.
  • Markdown and BBCode can be mixed together arbitrarily within the same content.
  • Some BBCode features have been added or adjusted in order to ensure feature parity between BBCode and Markdown syntax support:
    • New [h1] to [h6] BBCodes have been added.
    • Code has been added to adjust heading elements in generated HTML output so that headings in user-supplied content don't break the overall document outline of the HTML page.
    • The first row of BBCode tables is now rendered as a table header row when converted to HTML.
    • The [tt] BBCode has been un-deprecated and defined to output inline <code> elements.
    • The [list] BBCode will now output an <ol> or a <ul> as appropriate according to the value of the BBCode's type attribute. (Previously, we always output a <ul> even when the CSS made the list appear as if it were ordered.)

Minor features:

  • Paragraphs of text in the HTML output are always wrapped in semantically correct <p> elements when Markdown support is enabled. This is better for accessibility, among other benefits.
  • Adds two options to allow admins to clean up errant line breaks in user input in order to produce more consistent formatting in the output:
    • Collapse extra blank lines: Enabling this setting will remove unnecessary blank lines between non-empty content elements.
    • Clean up line breaks inside paragraphs: Enabling this setting will remove single line breaks inside paragraphs.
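As a rough illustration of what the two cleanup options describe, here is a minimal sketch using plain regular expressions. The function names are hypothetical and this is not the code in the PR; a real implementation would also have to leave code blocks, lists, and other line-sensitive markup alone.

```php
<?php

/**
 * Hypothetical helper: collapses runs of blank lines down to one blank line.
 */
function collapseBlankLines(string $text): string
{
    // Normalize line endings so the pattern only has to deal with "\n".
    $text = str_replace(["\r\n", "\r"], "\n", $text);

    // A newline followed by two or more (possibly whitespace-only) empty
    // lines becomes exactly one blank line.
    return preg_replace('/\n(?:[ \t]*\n){2,}/', "\n\n", $text) ?? $text;
}

/**
 * Hypothetical helper: removes single line breaks inside paragraphs.
 */
function joinLinesInParagraphs(string $text): string
{
    $text = str_replace(["\r\n", "\r"], "\n", $text);

    // A newline with no newline directly before or after it sits inside a
    // paragraph; replace it with a space. Blank-line separators are kept.
    return preg_replace('/(?<!\n)\n(?!\n)/', ' ', $text) ?? $text;
}

echo collapseBlankLines("foo\n\n\n\nbar"), PHP_EOL;
echo joinLinesInParagraphs("derp derp\nderpy doo\n\nnext paragraph"), PHP_EOL;
```

Keeping the two transformations separate mirrors the fact that they are exposed as two independent admin settings.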

Other changes:

  • Implements an abstract SMF\Parser base class.
  • Moves SMF\BBCodeParser to SMF\Parsers\BBCodeParser.
  • Moves smiley parsing out of SMF\Parsers\BBCodeParser into SMF\Parsers\SmileyParser.
  • Replaces calls to BBCodeParser::parse() throughout the codebase with calls to Parser::transform(), which is able to leverage all three of SMF\Parsers\BBCodeParser, SMF\Parsers\MarkdownParser, and SMF\Parsers\SmileyParser in order to intelligently handle the process of transforming markup (i.e. BBCode, Markdown, HTML, and smiley codes) from one form to another as needed.
  • Improves the semantics of the HTML generated from various BBCodes (e.g. <strong>, <em>, etc.).
  • Improves our handling of tab characters in user-supplied data. This was necessary because tabs can have syntactic significance in Markdown.
  • Fixes a bug where we were not respecting the value of Config::$modSettings['enablePostHTML'] when BBCode was disabled.

More info:

  • SMF\Parsers\MarkdownParser::parse() can be used to generate the following types of output:
    • HTML output that strictly adheres to the baseline GFM spec.
    • HTML output that matches the output that would be produced by the equivalent BBCode, which means customizations to BBCode output are automatically carried over to the equivalent Markdown output. This is the default output type.
    • The equivalent BBCode (i.e. it can convert Markdown into BBCode).
  • Our MarkdownParser is more accurate (and thus more restrictive) than the spec when identifying absolute URIs. For example, the spec will accept obviously invalid URIs like a+b+c:d, made-up-scheme://foo,bar, or http://../. SMF is capable of much more robust URI validation, so we use it.
  • Whereas Markdown hypothetically allows any arbitrary HTML to be embedded into documents, SMF in fact restricts the allowed HTML in user input just like it always has. Thus, users that paste Markdown containing disallowed HTML into their posts will see that HTML turned into plain text; it will not be rendered.
  • Although a CommonMark extension exists for PHP itself, we don't use it because:
    1. It is not installed by default in most versions of PHP.
    2. It doesn't support tables or strikethrough text, which we want for the sake of better feature parity with BBCode.
    3. We need to do some special handling in order to make BBCode and Markdown play nicely with each other, and it was much harder to accomplish that using the PHP extension than it was to just write our own parser.
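To make the point about URI validation concrete, here is a minimal sketch of a check that is stricter than the spec's "scheme, colon, anything" rule. It is an assumption-laden illustration, not SMF's actual validation: the function name, the scheme allow-list, and the simplified host pattern are all hypothetical.

```php
<?php

/**
 * Hypothetical stricter check: known scheme plus a plausible host name.
 *
 * @param string   $uri            Candidate absolute URI.
 * @param string[] $allowedSchemes Lowercase schemes to accept (assumed list).
 */
function looksLikeUsableUri(string $uri, array $allowedSchemes = ['http', 'https', 'ftp']): bool
{
    // Split into scheme and remainder using the generic URI scheme syntax.
    if (!preg_match('~^([a-z][a-z0-9+.-]*):(.*)$~i', $uri, $m)) {
        return false;
    }

    [, $scheme, $rest] = $m;

    // Reject made-up schemes that the spec's rule would happily accept.
    if (!in_array(strtolower($scheme), $allowedSchemes, true)) {
        return false;
    }

    // These schemes are hierarchical, so require "//" followed by a host.
    if (!preg_match('~^//([^/?#]*)~', $rest, $hostMatch)) {
        return false;
    }

    // Simplified host rule: dot-separated labels of letters, digits, and
    // hyphens, with an optional port. Empty labels such as ".." fail.
    $label = '[a-z0-9](?:[a-z0-9-]*[a-z0-9])?';

    return (bool) preg_match('~^' . $label . '(?:\.' . $label . ')*(?::\d+)?$~i', $hostMatch[1]);
}

foreach (['a+b+c:d', 'made-up-scheme://foo,bar', 'http://../', 'http://example.com'] as $candidate) {
    echo $candidate, ' => ', looksLikeUsableUri($candidate) ? 'accepted' : 'rejected', PHP_EOL;
}
```

Under these assumptions the three examples from the spec are rejected, while an ordinary `http://example.com` link is accepted.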

@Sesquipedalian
Member Author

Sesquipedalian commented Dec 1, 2024

Hm. I used to have a complete unit testing file for this back when I was working on it during the summer, but now I can't find it. Anyone who wants one can just grab a copy of the GFM spec, extract all the examples it gives, and run them through SMF\Parsers\MarkdownParser::parse() with the output type set to OUTPUT_HTML_STRICT.
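For anyone who wants to rebuild such a test file, the approach described above could be sketched roughly as follows. This assumes the plain-text form of the spec, in which each example is fenced by a line of 32 backticks followed by `example`, with a line containing only `.` separating the Markdown from the expected HTML, and `→` standing in for tab characters. The function names are hypothetical, and the renderer is passed in as a callable because the exact signature of `SMF\Parsers\MarkdownParser::parse()` is not shown in this thread.

```php
<?php

/**
 * Hypothetical helper: pulls (markdown, html) pairs out of the spec text.
 *
 * @return array<int, array{markdown: string, html: string}>
 */
function extractSpecExamples(string $specText): array
{
    $fence = str_repeat('`', 32);

    // Opening fence (optionally followed by extension names), the Markdown
    // source, a line with a single dot, the expected HTML, closing fence.
    $pattern = '/^' . $fence . ' example[^\n]*\n(.*?)^\.\n(.*?)^' . $fence . '$/ms';

    preg_match_all($pattern, str_replace("\r\n", "\n", $specText), $matches, PREG_SET_ORDER);

    $examples = [];

    foreach ($matches as $match) {
        $examples[] = [
            // The spec writes tabs as a visible arrow; convert them back.
            'markdown' => str_replace('→', "\t", $match[1]),
            'html' => str_replace('→', "\t", $match[2]),
        ];
    }

    return $examples;
}

/**
 * Hypothetical runner: returns the 1-based numbers of the failing examples.
 *
 * @param callable(string): string $render Wrapper around the parser under test.
 * @return int[]
 */
function runSpecExamples(array $examples, callable $render): array
{
    $failures = [];

    foreach ($examples as $index => $example) {
        if ($render($example['markdown']) !== $example['html']) {
            $failures[] = $index + 1;
        }
    }

    return $failures;
}
```

The `$render` callable would wrap a call to the Markdown parser with the output type set to `OUTPUT_HTML_STRICT`. An exact string comparison is the simplest possible check; normalizing insignificant whitespace in the HTML before comparing may be needed in practice.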

@Sesquipedalian
Member Author

Sesquipedalian commented Dec 1, 2024

As an additional unit test, specifically focussed on ensuring that BBCode and Markdown play nicely together, try out this raw post text:

(Note, change the occurrences of &#x60; to backtick characters before running the test.)

<div>
<a href="http://example.com">example</a>
</div>




[table][tr][td]BBCode[/td]
[td]Table[/td]
[/tr]
[tr][td]foo[/td]
[td]bar[/td]
[/tr]
[/table]


| Markdown | Table |
| ------------ | ------ |
| baz | qux|


derp derp
derpy doo

> Lorem ipsum dolor
sit amet.
>
> - Qui *quodsi iracundia*
> - aliquando id

1. Lorem ipsum dolor
  sit amet.
2. Qui *quodsi iracundia*
  aliquando id

Lorem ipsum dolor
sit amet.
Qui *quodsi iracundia*
aliquando id


[center]derp *foo**bar**baz* derp[/center]

[code]
some    tab


*code*

[/code]


[linktext](http://example.com)

[reflink]

[tt]inline code[/tt]

[code=php]$foo = 'bar';[/code]


[php]	$foo = 'bar';[/php]

Derp	derp derp derp derp derp derp derp derp derp derp derp derp derp derp derp derp derp derp derp derp derp derp derp derp derp	tabs.
Derp0000

&#x60;&#x60;&#x60;PHP
$baz = true;
&#x60;&#x60;&#x60;

[h1]BBC Heading 1[/h1]
derp
[h2]BBC Heading 2[/h2]
derp
[h3]BBC Heading 3[/h3]
derp
[h4]BBC Heading 4[/h4]
derp
[h5]BBC Heading 5[/h5]
derp
[h6]BBC Heading 6[/h6]
derp

# Markdown Heading 1
derp
## Markdown Heading 2
derp
### Markdown Heading 3
derp
#### Markdown Heading 4
derp
##### Markdown Heading 5
derp
###### Markdown Heading 6
derp

[reflink]: http://example.com

The expected HTML output when that is put into a forum post is:

                                                <p>
                                                    &lt;div&gt;
                                                    <a href="http://example.com" class="bbc_link" target="_blank" rel="noopener">example</a>

                                                    &lt;/div&gt;
                                                </p>
                                                <table class="bbc_table">
                                                    <thead>
                                                        <tr>
                                                            <th>
                                                                <p>BBCode</p>
                                                            </th>
                                                            <th>
                                                                <p>Table</p>
                                                            </th>
                                                        </tr>
                                                    </thead>
                                                    <tbody>
                                                        <tr>
                                                            <td>
                                                                <p>foo</p>
                                                            </td>
                                                            <td>
                                                                <p>bar</p>
                                                            </td>
                                                        </tr>
                                                    </tbody>
                                                </table>
                                                <table class="bbc_table">
                                                    <thead>
                                                        <tr>
                                                            <th>Markdown</th>
                                                            <th>Table</th>
                                                        </tr>
                                                    </thead>
                                                    <tbody>
                                                        <tr>
                                                            <td>baz</td>
                                                            <td>qux</td>
                                                        </tr>
                                                    </tbody>
                                                </table>
                                                <p>derp derp
                                                derpy doo</p>
                                                <blockquote class="bbc_standard_quote">
                                                    <cite>Quote</cite>

                                                    <p>Lorem ipsum dolor
                                                    sit amet.</p>
                                                    <ul class="bbc_list" style="list-style-type: disc;">
                                                        <li>
                                                            Qui 
                                                            <em>quodsi iracundia</em>
                                                        </li>
                                                        <li>aliquando id</li>
                                                    </ul>
                                                </blockquote>
                                                <ol class="bbc_list" style="list-style-type: decimal;">
                                                    <li>Lorem ipsum dolor
                                                    sit amet.</li>
                                                    <li>
                                                        Qui 
                                                        <em>quodsi iracundia</em>

                                                        aliquando id
                                                    </li>
                                                </ol>
                                                <p>
                                                    Lorem ipsum dolor
                                                    sit amet.
                                                    Qui 
                                                    <em>quodsi iracundia</em>

                                                    aliquando id
                                                </p>
                                                <div class="centertext">
                                                    <div class="inline-block">
                                                        <p>
                                                            derp 
                                                            <em>
                                                                foo
                                                                <strong>bar</strong>
                                                                baz
                                                            </em>
                                                             derp
                                                        </p>
                                                    </div>
                                                </div>
                                                <div class="codeheader">
                                                    <span class="code">Code</span>
                                                    <a class="codeoperation smf_select_text">Select</a>
                                                    <a class="codeoperation smf_expand_code hidden" data-shrink-txt="Shrink" data-expand-txt="Expand">Expand</a>
                                                </div>
                                                <code class="bbc_code">
                                                    some&nbsp; &nbsp; tab
                                                    <br>
                                                    <br>
                                                    <br>
                                                    *code*
                                                    <br>
                                                    <br>
                                                </code>
                                                <p>
                                                    <a href="http://example.com" class="bbc_link" target="_blank" rel="noopener">linktext</a>
                                                </p>
                                                <p>
                                                    <a href="http://example.com" class="bbc_link" target="_blank" rel="noopener">reflink</a>
                                                </p>
                                                <p>
                                                    <code class="bbc_tt">inline code</code>
                                                </p>
                                                <div class="codeheader">
                                                    <span class="code">Code</span>
                                                     (PHP) 
                                                    <a class="codeoperation smf_select_text">Select</a>
                                                    <a class="codeoperation smf_expand_code hidden" data-shrink-txt="Shrink" data-expand-txt="Expand">Expand</a>
                                                </div>
                                                <code class="bbc_code">
                                                    <span style="color: #0000BB">$foo </span>
                                                    <span style="color: #007700">= </span>
                                                    <span style="color: #DD0000">&#039;bar&#039;</span>
                                                    <span style="color: #007700">;</span>
                                                    <span style="color: #0000BB"></span>
                                                </code>
                                                <p>
                                                    <code class="phpcode">
                                                        <span style="color: #0000BB">
                                                            <span style="white-space: pre-wrap;"></span>
                                                        </span>
                                                        <span style="color: #0000BB">$foo </span>
                                                        <span style="color: #007700">= </span>
                                                        <span style="color: #DD0000">&#039;bar&#039;</span>
                                                        <span style="color: #007700">;</span>
                                                        <span style="color: #0000BB"></span>
                                                    </code>
                                                </p>
                                                <p>
                                                    Derp
                                                    <span style="white-space: pre-wrap;"></span>
                                                    derp derp derp derp derp derp derp derp derp derp derp derp derp derp derp derp derp derp derp derp derp derp derp derp derp
                                                    <span style="white-space: pre-wrap;"></span>
                                                    tabs.
                                                    Derp0000
                                                </p>
                                                <div class="codeheader">
                                                    <span class="code">Code</span>
                                                     (PHP) 
                                                    <a class="codeoperation smf_select_text">Select</a>
                                                    <a class="codeoperation smf_expand_code hidden" data-shrink-txt="Shrink" data-expand-txt=""></a>
                                                </div>
                                                <code class="bbc_code">
                                                    <span style="color: #0000BB">$baz </span>
                                                    <span style="color: #007700">= </span>
                                                    <span style="color: #0000BB">true</span>
                                                    <span style="color: #007700">;</span>
                                                    <span style="color: #0000BB"></span>
                                                </code>
                                                <h5 class="bbc_h1">BBC Heading 1
                                                </h5>
                                                <p>derp</p>
                                                <h6 class="bbc_h2">BBC Heading 2
                                                </h6>
                                                <p>derp</p>
                                                <div class="bbc_h3">BBC Heading 3
                                                </div>
                                                <p>derp</p>
                                                <div class="bbc_h4">BBC Heading 4
                                                </div>
                                                <p>derp</p>
                                                <div class="bbc_h5">BBC Heading 5
                                                </div>
                                                <p>derp</p>
                                                <div class="bbc_h6">BBC Heading 6
                                                </div>
                                                <p>derp</p>
                                                <h5 class="bbc_h1">Markdown Heading 1</h5>

                                                <p>derp</p>
                                                <h6 class="bbc_h2">Markdown Heading 2</h6>

                                                <p>derp</p>
                                                <div class="bbc_h3">Markdown Heading 3</div>

                                                <p>derp</p>
                                                <div class="bbc_h4">Markdown Heading 4</div>

                                                <p>derp</p>
                                                <div class="bbc_h5">Markdown Heading 5</div>

                                                <p>derp</p>
                                                <div class="bbc_h6">Markdown Heading 6</div>

                                                <p>derp</p>

[Three screenshots of the rendered post, taken 2024-12-01]
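The heading portion of the expected output above shows user headings being pushed down the document outline: level 1 becomes `<h5>`, level 2 becomes `<h6>`, and anything deeper falls back to a `<div>`, with the original level preserved in the `bbc_hN` class. A minimal sketch of that mapping follows. The function name and the fixed offset of 4 are assumptions inferred from this one sample; the PR notes that the real adjustment depends on the context in which the HTML is shown.

```php
<?php

/**
 * Hypothetical helper: renders a user heading shifted down the outline.
 *
 * @param int    $level   Heading level requested by the user (1 to 6).
 * @param string $content Already-escaped inner HTML of the heading.
 * @param int    $offset  How many levels to shift by (assumed value).
 */
function renderAdjustedHeading(int $level, string $content, int $offset = 4): string
{
    // Clamp to the range that HTML headings actually support.
    $level = max(1, min(6, $level));

    // Shift the level; once it would pass <h6>, use a plain <div> instead.
    $target = $level + $offset;
    $tag = $target <= 6 ? 'h' . $target : 'div';

    // The class keeps the original level so CSS can still style it.
    return sprintf('<%1$s class="bbc_h%2$d">%3$s</%1$s>', $tag, $level, $content);
}

echo renderAdjustedHeading(1, 'Markdown Heading 1'), PHP_EOL;
echo renderAdjustedHeading(3, 'Markdown Heading 3'), PHP_EOL;
```

Keeping the original level in the class name means the visual hierarchy can be preserved through CSS even when the element itself is no longer a heading.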

@jdarwood007
Member

I know you have already done lots of work, but it feels a bit backward to add code that parses BBC and then Markdown everywhere it is needed. Anyone using our code would also need to update theirs to parse markdown.

I wonder if it's appropriate to add a 'parser' class that would have an enum (at least in 8.1+, but constants for 8.0) that defines whether you want BBC, markdown, or both.

Something like:

\SMF\Parser::parse($body, \SMF\Parser::PARSE_BBC | \SMF\Parser::PARSE_MARKDOWN);

Backward-compatible function calls would just default to BBC. Mod authors who support 3.0 natively can just call the parser and get both automatically.

If we support more/different options in the future, it will be easier to just add it.
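The flag-based idea proposed here could be sketched like this. The class and constant names are hypothetical stand-ins for the proposal, not the API that was eventually merged, and the parsing stages are represented only by their names.

```php
<?php

/**
 * Hypothetical sketch of a parser entry point selected by bit flags.
 */
final class ParserFlagsSketch
{
    // Each markup type gets its own bit so that flags can be combined.
    public const PARSE_BBC = 1 << 0;
    public const PARSE_MARKDOWN = 1 << 1;

    /**
     * Returns the names of the stages that would run for the given flags.
     * Defaulting to BBC only keeps legacy callers working unchanged.
     *
     * @return string[]
     */
    public static function stages(int $flags = self::PARSE_BBC): array
    {
        $stages = [];

        if ($flags & self::PARSE_BBC) {
            $stages[] = 'bbc';
        }

        if ($flags & self::PARSE_MARKDOWN) {
            $stages[] = 'markdown';
        }

        return $stages;
    }
}

// A legacy-style call gets BBC only; a native 3.0 caller opts in to both.
echo implode(', ', ParserFlagsSketch::stages()), PHP_EOL;
echo implode(', ', ParserFlagsSketch::stages(ParserFlagsSketch::PARSE_BBC | ParserFlagsSketch::PARSE_MARKDOWN)), PHP_EOL;
```

Adding another markup type later would only require a new constant and one more branch. A sketch like this fixes the order in which the stages run, so it says nothing about cases where a different order is needed.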

@Sesquipedalian
Member Author

Sesquipedalian commented Dec 2, 2024

An abstract class and an interface might be a good idea, yes.

However, when I was writing this I found that it did not work well to conflate BBCode parsing and Markdown parsing into a single call. Most of the time it is best to parse BBCode and then Markdown, but sometimes it is best to do things in the opposite order. So we need the flexibility to call them in the order that we need. There's also the problem that both parsers need various parameters passed to them, but those sets of parameters have nothing to do with each other. Attempting to munge them all together as one big set of parameters to a single function call would be nasty in a number of different ways.

Nevertheless, you have raised a very important point that I hadn't thought about. The approach I've taken so far in this PR would indeed mean that "Anyone using our code would also need to update theirs to parse markdown," which doesn't work well at all with our goals regarding backward compatibility in 3.0. You are quite right that this is a problem that needs to be addressed. I'll have to think some more about the best way to do so... 🤔

@jdarwood007
Member

Well, even without mods supporting 3.0 natively, the code is backward compatible out of the box; those mods just wouldn't enable the new feature. Mods would have to ship native 3.0 code to get Markdown support.

The feature is complete as is and doesn't need an overhaul if you don't want to do one. Personally, I would have wanted to extend the messages table with a 'format' column (1/default = modern BBC, 2 = Markdown) and then have an option on the posting page to switch between BBC and Markdown. Some sort of 'version' column would also be added to indicate how we handle HTML entities, since we need to clean that up given the various times we clean and then unclean entries to validate things during input. It would be versioned to the SMF version the input was written on, which would let our parser know how to handle cleanup and processing as we change the input sanitation rules or make other fixes. A scheduled task could even be written to 'upgrade' old posts to newer versions, reducing the amount of parsing that old posts go through for security checks.

Also fixes WYSIWYG bugs with the php BBCode
Signed-off-by: Jon Stovell <[email protected]>
This is complicated, because we need to adjust the HTML tags in the output depending on the context in which the HTML is shown.

Signed-off-by: Jon Stovell <[email protected]>
Also changes the content of the tab substitute string to something better and unique.
Signed-off-by: Jon Stovell <[email protected]>
@Sesquipedalian Sesquipedalian changed the title from "Adds Markdown support" to "Adds Markdown support, separates smiley parsing from BBC parsing, adds SMF\Parser base class and subclasses" Dec 6, 2024
@Sesquipedalian
Member Author

I took another crack at abstracting our parsing in order to handle the process of transforming one markup type into another. I seem to have solved it this time.

@Sesquipedalian Sesquipedalian marked this pull request as draft December 6, 2024 21:08
@Sesquipedalian Sesquipedalian marked this pull request as ready for review December 6, 2024 21:09
@Sesquipedalian Sesquipedalian force-pushed the 3.0/markdown branch 3 times, most recently from 4a04c89 to 8f5f02f December 6, 2024 23:14
@jdarwood007
Member

I like it. I didn't expect the smilies to become a parser, but that makes complete sense. There's a triple win here (Markdown, Parser customization, and smiley parsing separation).

@Sesquipedalian
Member Author

Sesquipedalian commented Dec 7, 2024

This was planned for Alpha 4, but I don't want to worry about conflicts, so if no one objects, I'd like to merge it soon.

// Allow mods access before parsing.
$smileys = !empty($input_types & self::INPUT_SMILEYS);

IntegrationHook::call('integrate_pre_parsebbc', [&$string, &$smileys, &$options['cache_id'], &$options['parse_tags']]);
Member


Do we call the pre_parsebbc here or in the BBC logic?

Member Author

@Sesquipedalian Sesquipedalian Dec 8, 2024


I moved these two hooks here so that hooked functions could have access to the input string before any changes were made and after all changes were made, which is what these hooks were expected to do back when BBCode parsing was the only type of parsing. If I had left them in BBCodeParser::parse(), hooked functions would have been called in the midst of the overall parsing process, which would have had negative effects depending on what the hooked function tried to do.

If I were creating the hooks from scratch I would have named them integrate_pre_parse and integrate_post_parse (without the trailing bbc). But since the hooks already existed with these names, they continue to have those names.

Member


That makes sense in terms of keeping compatibility with older code.

@dragomano
Contributor

You can add new things while leaving the old ones:

/* @deprecated since 3.0 */
IntegrationHook::call('integrate_pre_parsebbc', [&$string, &$smileys, &$options['cache_id'], &$options['parse_tags']]);

IntegrationHook::call('integrate_pre_parse', [&$string, &$smileys, &$options['cache_id'], &$options['parse_tags']]);
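The suggestion above, firing the legacy hook name and a new name side by side, can be illustrated with a small self-contained sketch. The dispatcher below is a hypothetical stand-in and not SMF's `IntegrationHook` class; it relies only on the fact that `call_user_func_array()` passes array elements that are references through to by-reference parameters.

```php
<?php

/**
 * Hypothetical hook dispatcher; not SMF's IntegrationHook.
 *
 * @param array<string, callable[]> $registry Callbacks keyed by hook name.
 * @param string                    $name     Hook to fire.
 * @param array                     $args     Arguments; entries created with & stay references.
 */
function callHook(array $registry, string $name, array $args): void
{
    foreach ($registry[$name] ?? [] as $callback) {
        call_user_func_array($callback, $args);
    }
}

/**
 * Hypothetical wrapper showing where the pre-parse hooks would fire.
 */
function transformWithHooks(array $registry, string $string): string
{
    // Fire the legacy name first so existing mods keep working, then the
    // new name for code written against the newer naming scheme.
    foreach (['integrate_pre_parsebbc', 'integrate_pre_parse'] as $hook) {
        callHook($registry, $hook, [&$string]);
    }

    // ... the actual BBCode, Markdown, and smiley transformation would run here ...

    return $string;
}

$registry = [
    'integrate_pre_parsebbc' => [function (string &$s): void { $s = trim($s); }],
    'integrate_pre_parse' => [function (string &$s): void { $s = strtoupper($s); }],
];

echo transformWithHooks($registry, '  hello  '), PHP_EOL;
```

One consequence worth noting is that a mod registered on both names would run twice per call, so a real deprecation path would need to account for that.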

@Sesquipedalian Sesquipedalian merged commit ba5d10a into SimpleMachines:release-3.0 Dec 8, 2024
6 checks passed
@Sesquipedalian Sesquipedalian deleted the 3.0/markdown branch December 8, 2024 19:17
Comment on lines +924 to +927
editor.toggleSourceMode();
editor.toggleSourceMode();
editor.sourceEditorCaret({start: caretPos, end: caretPos});
}
Contributor


What does this code do? I'm having trouble understanding what behavior it adds or modifies.

Member Author


Without the lines that toggle the mode and then set the caret position, I found that the caret ended up jumping to an unexpected position (the start of the line, IIRC).
