Handle unicode text #131

feynmanliang · 2016-07-19T14:48:03Z

Using theanets.recurrent.Text on a UTF-8 encoded corpus raises an error:

---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
/home/fl350/bachbot/scripts/theanet/theanet.py in <module>()
     24 with codecs.open(path, 'r', 'utf-8') as handle:
     25     file_data = handle.read().lower()
---> 26     text = theanets.recurrent.Text(file_data[:int(VAL_FRACTION*len(file_data))])
     27     text_val = theanets.recurrent.Text(file_data[int(VAL_FRACTION*len(file_data)):])
     28

/home/fl350/theanets/theanets/recurrent.py in __init__(self, text, alpha, min_count, unknown)
     89                 collections.Counter(text).items()
     90                 if char != unknown and count >= min_count)))
---> 91         print type(r'[^{}]'.format(re.escape(self.alpha)).encode('utf8'))
     92         self.text = re.sub(r'[^{}]'.format(re.escape(self.alpha)).encode('utf8'), unknown, text)
     93         assert unknown not in self.alpha

UnicodeEncodeError: 'ascii' codec can't encode character u'\x83' in position 85: ordinal not in range(128)

This is fixed by this PR.
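For context, the traceback is consistent with Python 2 implicitly encoding a unicode alphabet to ASCII when it is formatted into a byte-string template. The following is a minimal sketch of the same substitution done with unicode throughout. It is not the code from this PR: `replace_rare_chars` and its `'?'` default are hypothetical names chosen for illustration.

```python
import collections
import re


def replace_rare_chars(text, min_count=2, unknown=u'?'):
    """Hypothetical helper, not part of theanets.

    Keeps characters that occur at least min_count times and replaces
    every other character with unknown.
    """
    alpha = u''.join(sorted(
        char for char, count in collections.Counter(text).items()
        if char != unknown and count >= min_count))
    # The template and the escaped alphabet are both unicode, so formatting
    # the pattern never triggers an implicit ASCII encode.
    pattern = u'[^{}]'.format(re.escape(alpha))
    return alpha, re.sub(pattern, unknown, text)


# u'\u00e9' is e-acute and u'\u00e0' is a-grave; '!' occurs only once.
alpha, cleaned = replace_rare_chars(u'\u00e9\u00e9\u00e0\u00e0!')
```

Here the two accented characters survive and the single `!` is replaced by the placeholder. The sketch assumes at least one character meets `min_count`, since an empty alphabet would produce an invalid pattern.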

coveralls · 2016-07-19T15:37:29Z

Coverage decreased (-0.1%) to 94.768% when pulling eaca433 on feynmanliang:text-handle-utf into b637b01 on lmjohns3:master.

lmjohns3 · 2016-07-20T02:34:48Z

This can get pretty tricky with text encodings. My preference is to always operate with unicode, because then iterating over a string is guaranteed to iterate over a "letter" instead of iterating over parts of multi-byte characters. That said, I haven't been very careful about enforcing this!
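A small illustration of that point, as a sketch rather than anything taken from theanets: iterating over a unicode string yields whole characters, while the UTF-8 encoded bytes of the same text are longer and can be cut in the middle of a character. The sample word is an arbitrary choice.

```python
text = u'na\u00efve'           # five characters: n, a, i-diaeresis, v, e
data = text.encode('utf-8')    # six bytes: the accented letter takes two

chars = list(text)             # one entry per character
n_bytes = len(data)

# Slicing the byte sequence can split a multi-byte character in half,
# which makes the slice undecodable.
try:
    data[:3].decode('utf-8')
    split_ok = True
except UnicodeDecodeError:
    split_ok = False
```

Slicing `text` itself, as the train/validation split in this thread does, cannot produce such a broken fragment.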

This is additionally complicated by the fact that Py2 and Py3 have different defaults for handling strings. I personally use Py3 but I try to test everything with Py2 as well (see the Travis config).

Which version of Python are you using? Can you try using a "unicode" object instead of a UTF-8 encoded byte sequence to see if this problem persists? Can you add a test to run a unicode object through the recurrent infrastructure and add it to this PR? Also, this PR breaks an existing test; please fix it.

feynmanliang · 2016-07-20T11:00:07Z

Thanks for taking a look. I will push some changes soon to address the issues.

feynmanliang · 2016-07-20T13:42:44Z

I'm using Python 2.7.3.
I can reproduce it with the following code (assuming path points to a UTF-8 encoded file):

import codecs
import theanets

# path and TRAIN_FRACTION are defined elsewhere in the script.
with codecs.open(path, 'r', 'utf-8') as handle:
    file_data = handle.read().lower()
    text = theanets.recurrent.Text(file_data[:int(TRAIN_FRACTION*len(file_data))])
    text_val = theanets.recurrent.Text(file_data[int(TRAIN_FRACTION*len(file_data)):])

Or, using an explicit unicode object:

with open(path, 'r') as handle:
    file_data = unicode(handle.read(), 'utf-8').lower()
    text = theanets.recurrent.Text(file_data[:int(TRAIN_FRACTION*len(file_data))])
    text_val = theanets.recurrent.Text(file_data[int(TRAIN_FRACTION*len(file_data)):])

Handle unicode text

eaca433

feynmanliang added 2 commits July 19, 2016 16:43

Fix for 3.4

d107a0b

Fixes utf8

27c016d

Use unicode object

22360a2
