mdformat inappropriately replaces HTML entity references with the corresponding Unicode character #261
Comments
Thanks for the issue! Yeah, HTML entities are replaced mainly because the parser does so by default, so it was the path of least resistance. I'm not opposed to changing that, but if we make the change we must test it extensively for regressions. Having each entity as a single code point is probably less prone to bugs, because it is more likely to go through all of the processing (character escaping, word wrapping, etc.) as that same code point, whereas a multi-character representation is more likely to be accidentally broken. The current behavior could also be seen as a feature of sorts: if your text accidentally contains a valid entity, you'll be able to notice it before rendering and add an escape. Also, Markdown source that is correctly "mdformatted" will more closely resemble the rendered output.
And yeah, changing this will need changes in the parser, not only mdformat. I haven't looked into the linked extension that deeply but I assume something like that is needed. I don't think adding a
I appreciate your quick reply! I was testing over the weekend and I think that the current behaviour isn't as consistent as would be desired. White space entity references show some anomalies, the most serious is the replacement of
A count of the number of named entity references in my Markdown documents, only the minority written by me, found the most frequently used entity reference is
Other anomalies include the replacement of named entity references with numeric entity references. A very artificial example:
Input
Some white space ‏ ‎ ‍ ‌
     
Space is stripped from the front and back...
Output
Some white space
 
 Space is stripped from the front and back...
Note: What can't be seen, because white space is invisible, is that
Looking into the white space handling, the
As the comments indicate, this is trying to un-parse to recover the Markdown source, based on an assumption that the source specified a numeric entity reference rather than a named entity reference. That's a bit problematic, because the source may have specified the white space in a different way.
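To illustrate why un-parsing is lossy, here is a minimal sketch using only the Python standard library. It assumes `html.unescape` is a fair stand-in for the parser's entity decoding step, which is not confirmed for mdformat; the helper name `same_after_decoding` is made up for this example.

```python
import html

# Four ways the same no-break space could have been written in the source.
spellings = ["&nbsp;", "&#160;", "&#xA0;", "\u00a0"]

# Assumption: html.unescape approximates the parser's entity decoding.
decoded = {html.unescape(s) for s in spellings}


def same_after_decoding(sources):
    """Return True if every source spelling decodes to the same string."""
    return len({html.unescape(s) for s in sources}) == 1
```

Once all four spellings have collapsed to one character, nothing in the decoded text says which spelling the author used, so any un-parse step has to guess.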
As a hack to work around the entity replacement until markdown-it-py is enhanced, I'm considering adding a configurable option which replaces certain Unicode characters with HTML entities, based on the HTML entities in my existing Markdown files (see below). Feedback on this approach would be welcome. It's easy to generate a candidate list of HTML entities to replace Unicode characters:
$ find . -name "*.md" -exec grep --extended-regexp --only-matching "&\w+;" {} \; | sort | uniq -c | sort --reverse
40
12 μ
12 Ω
9 ×
7 °
3 ⋅
3 ±
2 →
1 ≈
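The replacement hack described above could be sketched roughly as follows. This is not the proposed mdformat option, only an illustration using the standard library's `html.entities.name2codepoint` table (HTML 4 names only). The function names and the choice of entity names are assumptions made for the example.

```python
from html.entities import name2codepoint


def build_table(entity_names):
    """Map each chosen character to its named entity reference.

    entity_names is a hypothetical user-configured list such as
    ["nbsp", "mu"]; unknown names raise KeyError.
    """
    return {chr(name2codepoint[name]): f"&{name};" for name in entity_names}


def restore_entities(text, entity_names):
    """Replace every occurrence of the configured characters with entities."""
    table = str.maketrans(build_table(entity_names))
    return text.translate(table)
```

For example, `restore_entities("5\u00a0\u03bcm", ["nbsp", "mu"])` gives `"5&nbsp;&mu;m"`. The sketch replaces every matching character, including ones the author typed literally, so it cannot restore the original source exactly.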
So, looking into this, it rapidly became apparent that it will be much easier to disable the substitution than it will be to reverse substitutions.
Draft fix committed as jamesquilty@14f4122, which calls
Describe the problem
HTML entity references are valid Markdown according to the CommonMark Spec https://spec.commonmark.org/0.30/#entity-and-numeric-character-references and are useful for specifying punctuation characters which are otherwise difficult or impossible to enter via the keyboard, and are particularly helpful when specifying special whitespace characters which would otherwise be invisible.
Unfortunately, mdformat replaces entity references with their corresponding Unicode characters, which it probably should not, because they are valid Markdown under the CommonMark Spec. The replacement appears to be an artefact of the CommonMark Spec parsing requirements, and causes problems: it produces a formatted Markdown document containing Unicode characters which are difficult to edit and may be difficult to identify. For example, the display difference between —, – and - will vary depending on context, may be minimal in some contexts, and may cause searches to fail in ways which are really confusing. Whitespace characters are, of course, invisible, which can be problematic in itself, and it's not clear to me that whitespace entity references which happened to occur at the beginning or end of a line and were replaced by Unicode whitespace characters will not then be subsequently removed by mdformat.
I would suggest that mdformat should not change entity and numeric references.
This should be easy to do, in principle, by disabling entity and numeric reference substitution in markdown-it-py when it's called... if the markdown-it#manage-rules facility has been ported? If it's not available, then perhaps porting the markdown-it-html-entities plugin will be necessary.
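The two concerns above (look-alike dashes and whitespace lost at line edges) can be shown with a small standard-library sketch. It assumes `html.unescape` approximates the parser's decoding and uses `str.strip()` as a stand-in for whatever trimming the formatter performs. Neither is confirmed mdformat behaviour, and `decode` is a hypothetical helper.

```python
import html


def decode(source):
    # Assumption: html.unescape approximates the parser's entity decoding.
    return html.unescape(source)


# Three visually similar but distinct characters after decoding.
dashes = decode("&mdash; &ndash; -")

# A plain hyphen typed into a search box will not match a decoded en dash.
hyphen_found = "-" in decode("&ndash;")

# Whitespace entities at the line edges become Unicode whitespace, which a
# generic strip (a stand-in for the formatter's trimming) then removes.
line = decode("&emsp;indented text&nbsp;")
stripped = line.strip()
```

Here `line` and `stripped` differ by two characters that no editor will show, which is the kind of silent loss the issue is worried about.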
Link to your repository or website
No response
Steps to reproduce
The version of Python you're using
No response
Your operating system
No response
Versions of your packages
No response
Additional context
The reasoning behind the CommonMark Spec requiring parsers to replace most entity references with their corresponding Unicode characters has been lost from the 0.30 spec, but can be seen in the 0.23 spec change log https://spec.commonmark.org/0.23/changes.html and in commonmark/commonmark-spec#137 (comment) by John MacFarlane from 2014:
A couple of years later, John raised commonmark/commonmark-spec#442 to reconsider this:
markdown-it has a number of issues related to entities, and those which ask about disabling the entity reference replacement are typically advised to use .disable('entity') or the markdown-it-html-entities plugin: