Wrong tokenization of UTF-8 sequences for `escape_pre` #34

etdsoft · 2017-01-26T12:23:01Z

When the formatter processes input, RedCloth calls either #escape or #escape_pre, depending on the context.

When the input contains multi-byte UTF-8 characters such as:

Lorem ipsum dolor sit amet, €, ä, Ã, È consectetur adipiscing elit.

RedCloth behaves differently in the #escape / #escape_pre scenarios.

#escape

Input is tokenized by word, as expected, and UTF-8 sequences are respected:

...
#escape() - "dolor"
#escape() - " "
#escape() - "sit"
#escape() - " "
#escape() - "amet"
#escape() - ","
#escape() - " "
#escape() - "€"
#escape() - ","
#escape() - " "
#escape() - "ä"
...

#escape_pre

Instead of tokenizing by word, here RedCloth tries to tokenize by character, but without accounting for multi-byte UTF-8 sequences.

Multi-byte UTF-8 sequences are split and each byte is sent to the escaping function individually:

...
#escape_pre() - "a"
#escape_pre() - "m"
#escape_pre() - "e"
#escape_pre() - "t"
#escape_pre() - ","
#escape_pre() - " "
#escape_pre() - "\xE2"
#escape_pre() - "\x82"
#escape_pre() - "\xAC"
#escape_pre() - ","
#escape_pre() - " "
#escape_pre() - "\xC3"
#escape_pre() - "\xA4"
#escape_pre() - ","
#escape_pre() - " "
...

\xE2\x82\xAC corresponds to € and \xC3\xA4 corresponds to ä.
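The correspondence can be checked in plain Ruby without RedCloth. This is only a small sketch: the variable names are made up for illustration, and the byte values are the ones from the trace above.

```ruby
# Byte values taken from the #escape_pre trace above.
euro_bytes     = [0xE2, 0x82, 0xAC]
a_umlaut_bytes = [0xC3, 0xA4]

# pack("C*") builds a binary string; tagging it as UTF-8 restores the character.
euro     = euro_bytes.pack("C*").force_encoding(Encoding::UTF_8)
a_umlaut = a_umlaut_bytes.pack("C*").force_encoding(Encoding::UTF_8)

puts euro.valid_encoding?     # the three bytes together are valid UTF-8
puts euro == "€"
puts a_umlaut == "ä"

# A single byte from the middle of the sequence is not valid on its own.
fragment = [0x82].pack("C*").force_encoding(Encoding::UTF_8)
puts fragment.valid_encoding?
```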

The impact depends on how the formatter implements #escape_pre. In the RedCloth::Formatters::LATEX case there is no visible consequence: the method just returns the passed argument, and the pieces eventually get concatenated again, which restores the multi-byte sequence. A formatter that actually needs to do some processing inside #escape_pre to prepare the input breaks completely.
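One possible stopgap on the Ruby side, until the tokenizer itself is fixed, is to buffer the fragments until they form valid UTF-8 again. The sketch below is an assumption-laden illustration, not RedCloth API: `Utf8Reassembler` and `feed` are hypothetical names, and wiring it into a formatter's #escape_pre (returning an empty string for incomplete fragments) is untested against RedCloth itself.

```ruby
# Hypothetical helper, not part of RedCloth: collects byte fragments until
# they form valid UTF-8, then releases the completed text.
class Utf8Reassembler
  def initialize
    @pending = String.new # String.new without arguments is a binary (ASCII-8BIT) string
  end

  # Returns the completed text once the buffered bytes are valid UTF-8,
  # otherwise an empty string.
  def feed(fragment)
    @pending << fragment.b
    candidate = @pending.dup.force_encoding(Encoding::UTF_8)
    return "" unless candidate.valid_encoding?

    @pending = String.new
    candidate
  end
end

reassembler = Utf8Reassembler.new
# Fragments as they arrive in the trace above, one per #escape_pre call.
fragments = ["t", ",", " ", "\xE2", "\x82", "\xAC", ","]
output = fragments.map { |fragment| reassembler.feed(fragment) }
puts output.join
```

Note that this naive version never gives up on a buffer: a genuinely invalid byte in the input would hold back everything after it, so a real workaround would need a limit (UTF-8 sequences are at most four bytes long).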

Steps to reproduce

I've prepared the smallest formatter I could get to work to show this issue:

require 'redcloth'

module SimpleFormatter
  include RedCloth::Formatters::Base

  def bc_close(opts); ''; end

  def bc_open(opts); ''; end

  def code(opts); opts[:text] || ''; end

  def escape(text)
    puts "#escape() - #{text.inspect}"
    text
  end

  def escape_pre(text)
    puts "#escape_pre() - #{text.inspect}"
    text
  end

  def p(opts)
    opts[:text]
  end

  def method_missing(method, *opts)
    puts method
    # puts opts
    ''
  end
end

If you exercise it with something like:

input = <<EOI
Lorem ipsum dolor sit amet, €, ä, Ã, È consectetur adipiscing elit.

bc. Lorem ipsum dolor sit amet, €, ä, Ã, È consectetur adipiscing elit.

p. Lorem ipsum dolor sit amet, €, ä, Ã, È consectetur adipiscing elit.

bc.. Lorem ipsum dolor sit amet, €, ä, Ã, È consectetur adipiscing elit.
EOI

puts RedCloth.new(input).to(SimpleFormatter)

You'll see the wrong tokenization in the bc. and bc.. parts of the input.

Pointers

As far as I can tell, the input is tokenized in the C extension rather than in Ruby, and I suspect it does some sort of "for each byte in the input". Ideally that should be changed to "for each character in the input".
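To make the byte-versus-character distinction concrete, here is a sketch in Ruby of what "for each character" would involve: the lead byte of a UTF-8 sequence encodes how many bytes belong to it. The function names are hypothetical, this is not RedCloth's tokenizer, and it does no validation of continuation bytes or overlong forms.

```ruby
# Number of bytes in the UTF-8 sequence that starts with lead_byte, or nil
# if the byte cannot start a sequence (for example a continuation byte).
def utf8_sequence_length(lead_byte)
  if lead_byte < 0x80
    1 # 0xxxxxxx: ASCII
  elsif (lead_byte & 0xE0) == 0xC0
    2 # 110xxxxx
  elsif (lead_byte & 0xF0) == 0xE0
    3 # 1110xxxx
  elsif (lead_byte & 0xF8) == 0xF0
    4 # 11110xxx
  end
end

# Groups an array of byte values into one string per character.
def split_utf8_characters(bytes)
  chars = []
  i = 0
  while i < bytes.length
    len = utf8_sequence_length(bytes[i]) || 1 # pass stray bytes through one at a time
    chars << bytes[i, len].pack("C*").force_encoding(Encoding::UTF_8)
    i += len
  end
  chars
end

input = "amet, €, ä"
puts input.bytes.length # what a "for each byte" loop iterates over
puts input.chars.length # what a "for each character" loop should iterate over
p split_utf8_characters(input.bytes)
```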

At the moment I don't have a clue how to do this, so any pointers would be tremendously appreciated.
