Wrong tokenization of UTF-8 sequences for escape_pre #34

Open
etdsoft opened this issue Jan 26, 2017 · 0 comments

When input is about to be parsed by the formatter, either #escape or #escape_pre is called, depending on the context.

When the input has UTF-8 characters like:

Lorem ipsum dolor sit amet, €, ä, Ã, È consectetur adipiscing elit.

RedCloth behaves differently in the #escape / #escape_pre scenarios.

#escape

Input is tokenized by word, as expected, and UTF-8 sequences are respected:

...
#escape() - "dolor"
#escape() - " "
#escape() - "sit"
#escape() - " "
#escape() - "amet"
#escape() - ","
#escape() - " "
#escape() - "€"
#escape() - ","
#escape() - " "
#escape() - "ä"
...

#escape_pre

Instead of tokenizing by word, here RedCloth tries to tokenize by character, but without accounting for multi-byte UTF-8 sequences.

Multi-byte UTF-8 sequences are split and each byte is sent to the escaping function individually:

...
#escape_pre() - "a"
#escape_pre() - "m"
#escape_pre() - "e"
#escape_pre() - "t"
#escape_pre() - ","
#escape_pre() - " "
#escape_pre() - "\xE2"
#escape_pre() - "\x82"
#escape_pre() - "\xAC"
#escape_pre() - ","
#escape_pre() - " "
#escape_pre() - "\xC3"
#escape_pre() - "\xA4"
#escape_pre() - ","
#escape_pre() - " "
...

\xE2\x82\xAC corresponds to € and \xC3\xA4 corresponds to ä.
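To make the byte-to-character mapping concrete, here is a minimal Ruby sketch that rebuilds the characters from the fragments shown in the log. It does not use RedCloth; the `reassemble` helper and the variable names are made up for illustration.

```ruby
# Fragments exactly as they appear in the #escape_pre log above.
# String#b returns a binary (ASCII-8BIT) copy, so each one is a raw byte.
euro_fragments  = ["\xE2".b, "\x82".b, "\xAC".b]
a_uml_fragments = ["\xC3".b, "\xA4".b]

# A lone fragment is not valid UTF-8 on its own.
lone_fragment_valid =
  euro_fragments.first.dup.force_encoding(Encoding::UTF_8).valid_encoding?

# Hypothetical helper: join the fragments and re-tag the result as UTF-8.
def reassemble(fragments)
  fragments.join.force_encoding(Encoding::UTF_8)
end

euro  = reassemble(euro_fragments)
a_uml = reassemble(a_uml_fragments)

puts lone_fragment_valid # prints false
puts euro == "€"         # prints true
puts a_uml == "ä"        # prints true
```

This is also why a pass-through #escape_pre appears to work: the individual bytes are invalid in isolation, but concatenating them unchanged restores the original sequence.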

Depending on how the formatter implements #escape_pre(), this can have no consequence, as in the RedCloth::Formatters::LATEX case (where we just return the passed argument, which eventually gets concatenated again, restoring the multi-byte sequence), or it can completely break the formatter (if you actually need to do some processing inside #escape_pre to prepare the input).
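One possible formatter-side workaround, sketched here under assumptions rather than tested against RedCloth, is to buffer the fragments inside #escape_pre until they form valid UTF-8 and only then process them. `Utf8Reassembler` is a hypothetical helper, not part of RedCloth, and returning an empty string for incomplete fragments relies on the return values being concatenated as described above.

```ruby
# Hypothetical helper, not part of RedCloth: collects the fragments passed
# to #escape_pre and releases text only once the buffer is valid UTF-8.
class Utf8Reassembler
  MAX_SEQUENCE_BYTES = 4 # longest possible UTF-8 sequence

  def initialize
    @pending = ''.b # binary buffer for an incomplete sequence
  end

  # Returns the completed text as a UTF-8 string, or '' while a
  # multi-byte sequence is still incomplete.
  def push(fragment)
    @pending << fragment.b
    candidate = @pending.dup.force_encoding(Encoding::UTF_8)

    # Keep waiting for more bytes, but never hold more than 4: at that
    # point the input is malformed and is flushed as-is.
    incomplete = !candidate.valid_encoding? &&
                 @pending.bytesize < MAX_SEQUENCE_BYTES
    return '' if incomplete

    @pending = ''.b
    candidate
  end
end

buffer    = Utf8Reassembler.new
fragments = ['t', ',', ' ', "\xE2", "\x82", "\xAC", ',', ' ', "\xC3", "\xA4"]
output    = fragments.map { |fragment| buffer.push(fragment) }

p output      # ["t", ",", " ", "", "", "€", ",", " ", "", "ä"]
p output.join # "t, €, ä"
```

Inside a formatter, #escape_pre would call `push` on an instance held by the formatter and run its real processing only on non-empty results. This hides the symptom; it does not fix the tokenizer.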

Steps to reproduce

I've prepared the smallest formatter I could get to work to show this issue:

require 'redcloth'

module SimpleFormatter
  include RedCloth::Formatters::Base

  def bc_close(opts); ''; end

  def bc_open(opts); ''; end

  def code(opts); opts[:text] || ''; end

  def escape(text)
    puts "#escape() - #{text.inspect}"
    text
  end

  def escape_pre(text)
    puts "#escape_pre() - #{text.inspect}"
    text
  end

  def p(opts)
    opts[:text]
  end


  def method_missing(method, *opts)
    puts method
    # puts opts
    ''
  end
end

If you exercise it with something like:

input = <<EOI
Lorem ipsum dolor sit amet, €, ä, Ã, È consectetur adipiscing elit.

bc. Lorem ipsum dolor sit amet, €, ä, Ã, È consectetur adipiscing elit.

p. Lorem ipsum dolor sit amet, €, ä, Ã, È consectetur adipiscing elit.

bc.. Lorem ipsum dolor sit amet, €, ä, Ã, È consectetur adipiscing elit.
EOI

puts RedCloth.new(input).to(SimpleFormatter)

You'll see the wrong tokenization in the bc. and bc.. parts of the input.

Pointers

As far as I can tell, the tokenization of the input happens in C land and not Ruby, and I suspect it's doing some sort of "for each byte in the input". So ideally that should be changed to "for each character in the input".
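To illustrate the difference between "for each byte" and "for each character", here is a rough Ruby sketch of character-aware tokenization that reads the sequence length from the lead byte. It is not RedCloth's scanner code; it assumes the scanner really does walk the input byte by byte, and both method names are invented for this example.

```ruby
# Length in bytes of the UTF-8 sequence that starts with lead byte `byte`.
def utf8_sequence_length(byte)
  if byte < 0x80
    1 # 0xxxxxxx: ASCII
  elsif byte >> 5 == 0b110
    2 # 110xxxxx: lead byte of a 2-byte sequence
  elsif byte >> 4 == 0b1110
    3 # 1110xxxx: lead byte of a 3-byte sequence
  elsif byte >> 3 == 0b11110
    4 # 11110xxx: lead byte of a 4-byte sequence
  else
    1 # stray continuation or invalid byte: emit it on its own
  end
end

# Splits `input` into one token per character instead of one per byte.
def each_utf8_token(input)
  bytes  = input.b
  tokens = []
  index  = 0
  while index < bytes.bytesize
    length = utf8_sequence_length(bytes.getbyte(index))
    tokens << bytes.byteslice(index, length).force_encoding(Encoding::UTF_8)
    index += length
  end
  tokens
end

p each_utf8_token('amet, €, ä')
# ["a", "m", "e", "t", ",", " ", "€", ",", " ", "ä"]
```

The same lead-byte check could in principle be applied in the C scanner, so that a whole sequence is handed to #escape_pre in one call.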

At the moment I don't have a clue how to do this, so any pointers would be tremendously appreciated.
