When a given input is about to be parsed by the formatter, either #escape or #escape_pre will be called, depending on the context.
When the input contains multi-byte UTF-8 characters such as € or ä, RedCloth behaves differently in the #escape and #escape_pre scenarios.
#escape
Input is tokenized by word, as expected, and UTF-8 sequences are respected.
#escape_pre
Instead of tokenizing by word, here RedCloth tries to tokenize by character, but without accounting for multi-byte UTF-8 sequences.
Multi-byte UTF-8 sequences are split and each byte is sent to the escaping function individually: \xE2\x82\xAC corresponds to € and \xC3\xA4 corresponds to ä.
Depending on how the formatter implements #escape_pre, this can have no consequence, as in the RedCloth::Formatters::LATEX case (where we just return the passed argument, which eventually gets concatenated again, restoring the multi-byte sequence), or it can completely break the formatter (if you actually need to do some processing inside #escape_pre to prepare the input).
Steps to reproduce
I've prepared the smallest formatter I could get to work to show this issue, SimpleFormatter. If you exercise it with something like:
input = <<EOI
Lorem ipsum dolor sit amet, €, ä, Ã, È consectetur adipiscing elit.
bc. Lorem ipsum dolor sit amet, €, ä, Ã, È consectetur adipiscing elit.
p. Lorem ipsum dolor sit amet, €, ä, Ã, È consectetur adipiscing elit.
bc.. Lorem ipsum dolor sit amet, €, ä, Ã, È consectetur adipiscing elit.
EOI
puts RedCloth.new(input).to(SimpleFormatter)
You'll see the wrong tokenization in the bc. and bc.. parts of the input.
Pointers
As far as I can tell, the tokenization of the input happens in C land and not Ruby, and I suspect it's doing some sort of "for each byte in the input". So ideally that should be changed to "for each character in the input".
At the moment I don't have a clue about how to do this, so any pointers would be tremendously appreciated.
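To make the "for each byte" versus "for each character" distinction concrete, here is a minimal sketch in plain Ruby. It does not use RedCloth or its C scanner. The two tokenizers and the pass-through escape are hypothetical stand-ins, assumed only for illustration of the behaviour described in this report.

```ruby
# Illustration only: plain Ruby, no RedCloth involved.
text = "\u20AC \u00E4" # "€ ä", written with escapes so the literal is always UTF-8

# Hypothetical byte-oriented tokenizer: one token per byte ("for each byte").
byte_tokens = (0...text.bytesize).map { |i| text.byteslice(i, 1) }

# Character-oriented tokenizer: one token per UTF-8 character ("for each character").
char_tokens = text.chars

# Tokens that are not valid UTF-8 on their own, i.e. fragments of a split character.
broken_tokens = byte_tokens.reject(&:valid_encoding?)

# A pass-through escape: joining the untouched tokens restores the original bytes.
# force_encoding only re-tags the joined bytes as UTF-8; it does not transcode them.
restored = byte_tokens.map { |token| token }.join.force_encoding(Encoding::UTF_8)

puts byte_tokens.map(&:inspect).join(" ")
puts char_tokens.map(&:inspect).join(" ")
puts "broken tokens: #{broken_tokens.length}"
puts "restored == text: #{restored == text}"
```

With this input the byte-oriented split yields six tokens, five of which are invalid UTF-8 on their own, while the character-oriented split yields three. An escape that returns its argument unchanged goes unnoticed because the bytes are joined back together. An escape that inspects or transforms each token would be working on fragments rather than characters.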