Always reformat tab characters as space characters #262

Open
jamesquilty opened this issue Sep 7, 2021 · 10 comments
Labels
enhancement New feature or request

Comments

@jamesquilty

jamesquilty commented Sep 7, 2021

Description / Summary

The tab character (0x09) is a pest.

Currently, mdformat seeks to "apply consistent white space across the board" (Formatting Style: Whitespace) and does the right thing when tab characters appear as leading white space for indentation: it eliminates the pest by replacing them with the appropriate number of space characters for indentation. Line-trailing tabs are also eliminated.

Unfortunately, tab characters in heading and paragraph bodies are left in place, even though HTML white space collapse will apply there when the HTML is rendered for display. I think that tab characters should be eliminated in this context too, by collapsing them into a single space character, because tabs cause problems.

I believe there are three contexts where tab characters might appear and there's a case for elimination in each:

| Context | Action |
| --- | --- |
| 1. Line-leading white space for Indentation | Eliminate, replacing with the appropriate number of space characters. This is the current behaviour. |
| 2. Wherever HTML white space collapse applies | Eliminate, replacing with a single space character. This is the proposed enhancement. |
| 3. Wherever HTML white space collapse will not apply, e.g. Code spans, Fenced code blocks | Either (a) preserve, allowing the HTML renderer to determine how to display tab characters in <code> or <pre> blocks or (b) expand to the appropriate number of space characters. |
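
As a rough illustration of context 2, the collapse could be sketched in Python as below. The helper name `collapse_tabs` and its regular expression are hypothetical and are not taken from mdformat. The sketch assumes that only white space runs containing at least one tab should be touched, so runs of plain spaces are left as they are.

```python
import re

# Hypothetical helper, not part of mdformat: matches any run of spaces and
# tabs that contains at least one tab character.
_TAB_RUN = re.compile(r"[ \t]*\t[ \t]*")


def collapse_tabs(text: str) -> str:
    """Replace every space/tab run that contains a tab with a single space."""
    return _TAB_RUN.sub(" ", text)


print(repr(collapse_tabs("a\tb")))      # a single tab becomes one space
print(repr(collapse_tabs("a \t\t b")))  # a mixed run becomes one space
print(repr(collapse_tabs("a  b")))      # no tab present, left unchanged
```

Something like this would only be safe where HTML white space collapse applies; code spans and fenced code blocks (context 3) would need to bypass it.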

I'd propose that always eliminating tab characters and replacing them with the appropriate number of space characters is the way to "apply consistent white space across the board" and that the current mixed treatment of tab characters is inconsistent with mdformat's style goals. Mixed tabs and spaces are seldom good.

There might be an open question with regard to (3), above, because CSS (for example, the tab-size property) might change the width of tab characters rendered in <code>, <pre> or other HTML blocks.

Value / benefit

  • Consistent treatment of white space introduced by tab characters.
  • Markdown source will more closely resemble the rendered output.
  • Avoids display problems in editors caused by differing tab width settings and different invisible white space.
  • Produces output which will not violate Markdownlint rule MD010 - No Hard tabs.

Implementation details

I think that modifying the TextWrapper instance attributes here

https://github.com/executablebooks/mdformat/blob/a856f538e2dcb81a83e9013fb073f16cd6e53972/src/mdformat/renderer/_context.py#L330-L336

to

        expand_tabs=True,
        tabsize=1

will achieve the desired white space collapse of a tab character to a space character, but it won't help in collapsing multiple tab-and-space character runs into a single space. The replace_whitespace instance attribute would seem to affect all white space characters and not just tab characters.
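
The behaviour described above can be checked against the standard library `textwrap.TextWrapper` directly. This is a minimal sketch: the `width` value and the sample strings are assumptions chosen for illustration, and the wrappers here are not configured the way mdformat configures its own.

```python
import textwrap

text = "one\ttwo\t\tthree \t four"

# expand_tabs=True with tabsize=1 turns each tab into exactly one space,
# but a run of several tabs and spaces keeps its full length.
tab_wrapper = textwrap.TextWrapper(
    width=70, expand_tabs=True, tabsize=1, replace_whitespace=False
)
expanded = tab_wrapper.fill(text)
print(repr(expanded))

# replace_whitespace=True maps every white space character (tab, newline,
# vertical tab, form feed, carriage return) to a space, not only tabs.
ws_wrapper = textwrap.TextWrapper(
    width=70, expand_tabs=False, replace_whitespace=True
)
replaced = ws_wrapper.fill("one\ttwo\nthree")
print(repr(replaced))
```

In the first result the double tab and the space-tab-space run survive as two and three spaces respectively, which is the gap noted above: collapsing runs would need a separate step.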

Tasks to complete

No response

jamesquilty added the enhancement (New feature or request) label Sep 7, 2021
@jamesquilty
Author

The CommonMark Spec 0.20: Preprocessing used to specify:

Tabs in lines are immediately expanded to spaces, with a tab stop of 4 characters:

but this was changed from version 0.21 onward. I'm not sure what the motivation was for the change, but there are two relevant issues on the CommonMark GitHub Project: commonmark-spec#386 and commonmark-spec#318.
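
For a concrete picture of what "a tab stop of 4 characters" means, the quoted rule can be approximated for a single line with Python's built-in `str.expandtabs(4)`. This is only an illustration of tab-stop expansion, not a claim about how any CommonMark parser implements it, and the sample strings are made up.

```python
# With a tab stop of 4, each tab advances to the next column that is a
# multiple of 4, so the number of spaces depends on where the tab occurs.
samples = ["\tfoo", "a\tb", "ab\tc", "abcd\te"]
expanded = [s.expandtabs(4) for s in samples]

for before, after in zip(samples, expanded):
    print(repr(before), "->", repr(after))
```

The tab in `"a\tb"` becomes three spaces while the one in `"abcd\te"` becomes four, which is why "the appropriate number of space characters" depends on the column.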

It's probably worth noting that any tab characters that ever find their way into my Markdown documents are introduced by copy-and-paste and aren't there intentionally.

@jgopel

jgopel commented Feb 3, 2023

Has any progress been made on this? I'm very interested in this feature and I'd be open to making a PR if there's interest.

@jamesquilty
Author

@jgopel, I'm still interested in this feature. I don't think that it's been implemented independently of this Issue.

@jgopel

jgopel commented Feb 10, 2023

@hukkin Would you be interested in merging this if I were to make a PR for it?

@hoijui

This comment was marked as off-topic.

@jamesquilty

This comment was marked as off-topic.

@hoijui

This comment was marked as off-topic.

@hukkin
Owner

hukkin commented Oct 18, 2024

Hey, yeah this feature is welcome provided we test extensively to make sure rendered output never changes in some obscure corner case. Fenced code blocks should not be touched.

@hukkin
Owner

hukkin commented Nov 18, 2024

#466 does the conversion for text inlines. I think this should be done to code spans too, so I'll leave this open.

@jamesquilty
Author

@hukkin, Thanks for accepting this and opening a PR! Just doing (2), above, will be a big step forward.

I think that it might create an odd situation if code spans and fenced code blocks were treated differently. My comment (above) links to [then-current] prior discussion on the topic in the commonmark-spec repository. I'm not sure whether the handling of tabs remains an open question.

4 participants