Issue with train.py - charset errors. #44
Comments
Hi, I'm having the same problem when running train.py on new data.
This might not be the right solution, but here is a patch for that.
Yes, @neofob's patch changes the encoding the utils use to read the training sets, but it should match the encoding you used to write the training data (i.e., if your training files are encoded in UTF-8, they should be read as UTF-8). Although this allows training to run, I'm not sure char-rnn works with UTF-8 at all, since I just get gibberish back from the model when it is trained this way (karpathy/char-rnn#113).
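To illustrate the point about matching read and write encodings, here is a minimal sketch. `read_text` is a hypothetical helper, not the repository's `utils.py`, and the choice of UTF-8 is an assumption about how the training files were written.

```python
import bz2
import os
import tempfile


def read_text(path, encoding="utf-8"):
    """Read a plain or bz2-compressed text file with an explicit encoding.

    Hypothetical helper for illustration; not part of the repository.
    """
    opener = bz2.open if path.endswith(".bz2") else open
    # Passing encoding= avoids falling back to the platform default,
    # which is cp1252 in the traceback reported in this issue.
    with opener(path, mode="rt", encoding=encoding) as f:
        return f.read()


# Round trip: write as UTF-8, then read back as UTF-8.
sample = "She said \u201chello\u201d"
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.bz2")
    with bz2.open(path, mode="wt", encoding="utf-8") as f:
        f.write(sample)
    restored = read_text(path, encoding="utf-8")
```

Reading with the same encoding that was used for writing returns the original text unchanged. Whether the model then trains well on that text is a separate question.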
Any news? Same problem here. The @neofob patch doesn't work for me; I guess it's because I am using the same @pender reddit dataset (https://github.com/pender/chatbot-rnn).
You just need to make sure the data you're training on is encoded in ANSI. If your parser must read and write in a different encoding, just save the output text file as ANSI and it should be usable. Certain characters clearly cannot be mapped, but the percentage of those characters seems too small to make a difference.
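A rough sketch of the "save as ANSI" step, assuming "ANSI" here means Windows-1252 (the `cp1252` codec named in the traceback) and that the source file is UTF-8. `save_as_cp1252` is a hypothetical name. Characters with no cp1252 equivalent are replaced with `?` rather than raising an error.

```python
import os
import tempfile


def save_as_cp1252(src_path, dst_path, src_encoding="utf-8"):
    """Re-encode a text file as cp1252, replacing unmappable characters."""
    with open(src_path, "r", encoding=src_encoding) as src:
        text = src.read()
    # errors="replace" writes "?" for anything cp1252 cannot represent.
    with open(dst_path, "w", encoding="cp1252", errors="replace") as dst:
        dst.write(text)


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "input_utf8.txt")
    dst = os.path.join(tmp, "input_ansi.txt")
    with open(src, "w", encoding="utf-8") as f:
        f.write("caf\u00e9 \u4e2d")  # "é" maps to cp1252, the CJK char does not
    save_as_cp1252(src, dst)
    with open(dst, "rb") as f:
        converted = f.read()
```

In this example the accented letter survives as a single byte and the unmappable character becomes `?`, which is the lossy trade-off described above.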
@neofob @zhou-daniel-dz I tried to figure out how to make char-rnn work with UTF-8, but there is no simple patch. The problem may just be a bad or wrong format of the bz2 or txt file, or a file renamed from zst.
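If the concern is that a file's extension may not match its real format, one way to check is to look at the leading magic bytes. This is a sketch only: `sniff_format` is a hypothetical helper, and it recognises just the bz2 and Zstandard signatures.

```python
import bz2
import os
import tempfile

BZ2_MAGIC = b"BZh"                 # bzip2 stream header
ZSTD_MAGIC = b"\x28\xb5\x2f\xfd"   # Zstandard frame magic number


def sniff_format(path):
    """Guess the container format from the first bytes of the file."""
    with open(path, "rb") as f:
        head = f.read(4)
    if head.startswith(BZ2_MAGIC):
        return "bz2"
    if head == ZSTD_MAGIC:
        return "zstd"
    return "unknown"


with tempfile.TemporaryDirectory() as tmp:
    real_bz2 = os.path.join(tmp, "real.bz2")
    renamed_zst = os.path.join(tmp, "renamed.bz2")  # zstd data, wrong extension
    plain_txt = os.path.join(tmp, "plain.txt")
    with open(real_bz2, "wb") as f:
        f.write(bz2.compress(b"hello"))
    with open(renamed_zst, "wb") as f:
        f.write(ZSTD_MAGIC + b"\x00\x00\x00\x00")  # header only, not a valid frame
    with open(plain_txt, "wb") as f:
        f.write(b"hello")
    results = {
        "real": sniff_format(real_bz2),
        "renamed": sniff_format(renamed_zst),
        "plain": sniff_format(plain_txt),
    }
```

A file that reports `zstd` despite a `.bz2` extension would need to be decompressed with a Zstandard tool before being used as training input.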
Any thoughts? I am using Windows.
Preprocessing file 2/6 (reddit-parse/output\output 1.bz2)...
Traceback (most recent call last):
  File "train.py", line 190, in <module>
    main()
  File "train.py", line 49, in main
    train(args)
  File "train.py", line 55, in train
    data_loader = TextLoader(args.data_dir, args.batch_size, args.seq_length)
  File "D:\bot\utils.py", line 39, in __init__
    self._preprocess(self.input_files[i], self.tensor_file_template.format(i))
  File "D:\bot\utils.py", line 107, in _preprocess
    data = file_reference.read()
  File "D:\python\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 23267: character maps to <undefined>
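The error above can be reproduced in isolation. Byte 0x9d has no mapping in Python's `cp1252` codec, and one common way it ends up in text is as the last byte of a UTF-8 encoded right double quotation mark. That this is the character in the reporter's data is an assumption; `try_decode` is a hypothetical helper.

```python
def try_decode(raw, encoding):
    """Return (text, None) on success or (None, error) on a decode failure."""
    try:
        return raw.decode(encoding), None
    except UnicodeDecodeError as exc:
        return None, exc


# U+201D (right double quotation mark) is three bytes in UTF-8: e2 80 9d.
raw = "\u201d".encode("utf-8")

as_cp1252, cp1252_error = try_decode(raw, "cp1252")  # fails on byte 0x9d
as_utf8, utf8_error = try_decode(raw, "utf-8")       # succeeds

if cp1252_error is not None:
    bad_byte = raw[cp1252_error.start]
    print(f"cp1252 failed on byte {bad_byte:#x} at position {cp1252_error.start}")
```

This suggests the input contains bytes that are not valid cp1252, which is consistent with UTF-8 text being read through the Windows default encoding.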