Too much blankspace gets stripped #91

hugopeixoto · 2020-09-15T03:01:39Z

Coming from forem/forem#8457, I think that ReverseMarkdown strips too much blankspace in some scenarios. Here's a failing test:

  it 'keeps whitespace surrounding links' do
    result = ReverseMarkdown.convert("a\n<a href='1'>link</a>\nis good\nbut blankspace is better")
    expect(result).to eq "a [link](1) is good but blankspace is better\n\n"
  end

The output is a[link](1)is good but blankspace is better\n\n. This happens because the text converter calls remove_border_newlines, and the fact that the middle line is a link means that it will be its own nokogiri node, and the three nodes will be joined with no whitespace. I tried changing remove_border_lines to squeeze instead of removing everything, but this doesn't work: it keeps whitespace in scenarios where it shouldn't:

first<p>second</p>third becomes first\n\nsecond\n\n third\n\n.

I still want to investigate this further, but I decided to post this now to share my findings.

The text was updated successfully, but these errors were encountered:

hugopeixoto · 2020-09-15T12:43:30Z

I was curious as to why this didn't happen with <strong> tags. The following test does succeed:

it 'keeps whitespace surrounding bolds' do
  result = ReverseMarkdown.convert("a\n<strong>potato</strong>\nis good\nbut blankspace is better")
  expect(result).to eq "a **potato** is good but blankspace is better\n\n"
end

After some puts debuggering, I see that this one also gets its blankspace trimmed in the first pass, but then cleaner.tidy adds it back again in clean_tag_borders:

converters.convert: "\n\na**potato**is good but blankspace is better\n\n"
cleaner.tidy: "a **potato** is good but blankspace is better\n\n"

So it seems that we're losing information during convert, and assuming that ** would always be preceeded by a blankspace, which gets added in tidy (clean_tag_borders). The same technique wouldn't work for links, since the surrounding blankspace doesn't need to be there for markdown links.

My previous attempt of changing remove_border_lines to use squeeze doesn't work because it would break this test:

it 'squeezes whitespace when using paragraph tags' do
  result = ReverseMarkdown.convert("first paragraph\n<p>second paragraph</p>\nthird\nin separate lines")
  expect(result).to eq "first paragraph\n\nsecond paragraph\n\nthird in separate lines\n\n"
end

with the following diff:

 first paragraph

 second paragraph

-third in separate lines
+ third in separate lines

hugopeixoto mentioned this issue Sep 28, 2020

RSS/HTML import doesn't handle newlines forem/forem#8457

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Too much blankspace gets stripped #91

Too much blankspace gets stripped #91

hugopeixoto commented Sep 15, 2020

hugopeixoto commented Sep 15, 2020

Too much blankspace gets stripped #91

Too much blankspace gets stripped #91

Comments

hugopeixoto commented Sep 15, 2020

hugopeixoto commented Sep 15, 2020