Handle unicode text #131
This can get pretty tricky with text encodings. My preference is to always operate with unicode, because then iterating over a string is guaranteed to yield one "letter" (code point) at a time instead of pieces of multi-byte characters. That said, I haven't been very careful about enforcing this! This is additionally complicated by the fact that Py2 and Py3 have different defaults for handling strings. I personally use Py3, but I try to test everything with Py2 as well (see the Travis config).

Which version of Python are you using? Can you try using a "unicode" object instead of a UTF-8 encoded byte sequence to see if this problem persists? Can you add a test that runs a unicode object through the recurrent infrastructure and add it to this PR? Also, this PR breaks an existing test; please fix it.
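To illustrate the point about iteration, here is a minimal sketch (not taken from the theanets code base) comparing a unicode string with its UTF-8 encoded bytes. The sample word and the variable names are assumptions chosen for the example; the byte slicing is written so that it behaves the same on Py2 and Py3.

```python
# -*- coding: utf-8 -*-
# Hypothetical sample: "héllo", where é is one code point but two UTF-8 bytes.
text = u"h\u00e9llo"
encoded = text.encode("utf-8")

# Iterating over the unicode object yields one code point per step.
text_units = list(text)

# Slicing the byte sequence one byte at a time splits é into two pieces.
byte_units = [encoded[i:i + 1] for i in range(len(encoded))]

print(len(text_units), len(byte_units))
```

The unicode version has five items while the byte version has six, so any per-character alphabet built from the byte sequence would contain fragments of é rather than the character itself.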
Thanks for taking a look. I will push some changes soon to address the issues.
    with codecs.open(path, 'r', 'utf-8') as handle:
        file_data = handle.read().lower()
    text = theanets.recurrent.Text(file_data[:int(TRAIN_FRACTION*len(file_data))])
    text_val = theanets.recurrent.Text(file_data[int(TRAIN_FRACTION*len(file_data)):])

or using:

    with open(path, 'r') as handle:
        file_data = unicode(handle.read(), 'utf-8').lower()
    text = theanets.recurrent.Text(file_data[:int(TRAIN_FRACTION*len(file_data))])
    text_val = theanets.recurrent.Text(file_data[int(TRAIN_FRACTION*len(file_data)):])
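Both variants above are tied to Py2 idioms. As a rough sketch of a version that reads the corpus as unicode on either interpreter, the standard library's io.open can be used. The helper name load_split, the value of TRAIN_FRACTION, and the temporary demo file are assumptions made for this example; the theanets.recurrent.Text calls are left out so the snippet runs on its own.

```python
import io
import os
import tempfile

TRAIN_FRACTION = 0.9  # assumed value; the thread does not state it


def load_split(path, fraction=TRAIN_FRACTION):
    """Read a UTF-8 file as unicode text and split it into two parts."""
    # io.open decodes to unicode text on both Python 2 and Python 3.
    with io.open(path, "r", encoding="utf-8") as handle:
        data = handle.read().lower()
    cut = int(fraction * len(data))
    return data[:cut], data[cut:]


# Small demonstration with a temporary file standing in for the corpus.
fd, demo_path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
with io.open(demo_path, "w", encoding="utf-8") as handle:
    handle.write(u"H\u00e9llo W\u00f6rld")

train, val = load_split(demo_path, fraction=0.5)
os.remove(demo_path)
```

Because the split index is computed on the decoded text, the cut can never land in the middle of a multi-byte character, which is a risk when slicing the raw byte sequence.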
Attempting to use theanets.recurrent.Text on a UTF-8 encoded corpus used to give an error. This is fixed by this PR.