Issue with train.py - chatset errors. #44

ghost · 2018-05-17T07:51:11Z

Any thoughts? I am using Windows.

Preprocessing file 2/6 (reddit-parse/output\output 1.bz2)...
Traceback (most recent call last):
  File "train.py", line 190, in <module>
    main()
  File "train.py", line 49, in main
    train(args)
  File "train.py", line 55, in train
    data_loader = TextLoader(args.data_dir, args.batch_size, args.seq_length)
  File "D:\bot\utils.py", line 39, in __init__
    self._preprocess(self.input_files[i], self.tensor_file_template.format(i))
  File "D:\bot\utils.py", line 107, in _preprocess
    data = file_reference.read()
  File "D:\python\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 23267: character maps to <undefined>
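The traceback shows the file being decoded with cp1252, in which byte 0x9d has no mapping. The following is a minimal sketch of that failure. It assumes (the thread does not confirm this) that the training data is really UTF-8, where 0x9d appears as the last byte of a right curly quote.

```python
# Assumption: the training data is UTF-8. U+201D (right double quotation
# mark) encodes to the bytes E2 80 9D, so it contains the 0x9d byte
# named in the traceback.
raw = "\u201d".encode("utf-8")

try:
    raw.decode("cp1252")  # same codec as in the traceback
    cp1252_ok = True
except UnicodeDecodeError:
    cp1252_ok = False  # 0x9d is undefined in Python's cp1252 table

# Decoding with the encoding the bytes were written in succeeds.
utf8_text = raw.decode("utf-8")
print(cp1252_ok, repr(utf8_text))
```

If the data is in some other encoding, the same byte could come from a different character, so this only illustrates one plausible cause.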


sashasmirnova · 2018-05-27T18:38:25Z

Hi, I'm having the same problem when running train.py on new data.

neofob · 2018-05-31T14:42:00Z

This might not be the right solution but...here is a patch for that.
neofob@1f56cb9

zhou-daniel-dz · 2018-07-30T02:02:01Z

Yeah, @neofob's patch changes the encoding the utils use to read the training sets, but this should also match the encoding you used to write the training data (i.e., if your training files are encoded in utf-8, they should be read as utf-8).
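To illustrate the point about matching encodings, here is a small sketch that writes a bz2 text file as UTF-8 and reads it back twice: once with the same encoding and once with a different one. The file name and sample text are made up for the demonstration and are not taken from the repository.

```python
import bz2
import os
import tempfile

sample = "He said \u201chello\u201d \u2026"

# Hypothetical file name, used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "sample.bz2")

# Write the training text as UTF-8 ...
with bz2.open(path, mode="wt", encoding="utf-8") as f:
    f.write(sample)

# ... and read it back with the same encoding.
with bz2.open(path, mode="rt", encoding="utf-8") as f:
    round_trip = f.read()

# Reading with a different encoding does not fail here, but it
# silently yields different text.
with bz2.open(path, mode="rt", encoding="latin-1") as f:
    mismatched = f.read()

print(round_trip == sample, mismatched == sample)
```

A mismatch that raises no error can be harder to notice than the crash in the original report, because training runs but on corrupted text.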

Although this allows training to run, I'm not sure the char-rnn works with utf-8 encodings at all, since I am just getting gibberish back from the model when it is trained this way. (karpathy/char-rnn#113)

geroale · 2018-08-30T11:37:29Z

Any news? Same problem here.

The @neofob patch doesn't work for me. I guess it's because the errors="ignore" or errors="replace" parameter of bz2.open is not working.

I am using the same @pender reddit dataset (https://github.com/pender/chatbot-rnn)
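One way to check whether the `errors` parameter of `bz2.open` is honored is the sketch below. It relies on documented standard-library behavior: `encoding` and `errors` apply only in text mode, and the default mode is binary. The file is created just for this test, and whether the patch in question used text mode is not something the thread states.

```python
import bz2
import os
import tempfile

# Hypothetical file, created only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "bad.bz2")

# Write raw bytes that are not valid UTF-8 (a lone 0x9d byte).
with bz2.open(path, mode="wb") as f:
    f.write(b"ok \x9d end")

# Text mode: errors="replace" turns the bad byte into U+FFFD.
with bz2.open(path, mode="rt", encoding="utf-8", errors="replace") as f:
    replaced = f.read()

# Binary mode (the default, also selected by mode="r"): the errors
# argument is rejected with a ValueError instead of being applied.
try:
    bz2.open(path, mode="r", errors="replace")
    binary_rejected = False
except ValueError:
    binary_rejected = True

print(repr(replaced), binary_rejected)
```

If the replacement character shows up here but the training script still fails, the file is probably being opened through a different code path than the one that was patched.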

zhou-daniel-dz · 2018-08-30T16:06:45Z

You just need to make sure the data you're training on is encoded in ANSI.

If your parser must read and write in a different encoding, just save the output text file as ANSI and it should be useable. Clearly certain characters cannot be mapped, but the percentage of those characters seems too small to make a difference.
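The "save as ANSI" step could also be scripted rather than done in an editor. This sketch assumes that "ANSI" here means cp1252 (the codec named in the original traceback) and that the source file is UTF-8. The function name and file paths are hypothetical.

```python
import os
import tempfile


def transcode_to_cp1252(src_path, dst_path, src_encoding="utf-8"):
    """Rewrite a text file as cp1252, replacing unmappable characters with '?'."""
    with open(src_path, "r", encoding=src_encoding, newline="") as src, \
         open(dst_path, "w", encoding="cp1252", errors="replace", newline="") as dst:
        for line in src:
            dst.write(line)


# Demonstration on a throwaway file (paths are hypothetical).
workdir = tempfile.mkdtemp()
src_file = os.path.join(workdir, "utf8.txt")
dst_file = os.path.join(workdir, "ansi.txt")

with open(src_file, "w", encoding="utf-8", newline="") as f:
    f.write("caf\u00e9 \u4e2d\n")

transcode_to_cp1252(src_file, dst_file)

with open(dst_file, "rb") as f:
    converted = f.read()

print(converted)
```

Characters with no cp1252 equivalent are written as `?`, which matches the comment's caveat that some characters cannot be mapped.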

remotejob · 2018-11-12T04:33:28Z

@neofob @zhou-daniel-dz I'm trying to figure out how to make char-rnn work with utf-8, but this simple patch in utils.py:

    if input_file.endswith(".bz2"):
        file_reference = bz2.open(input_file, mode='rt', encoding="utf-8", errors="replace")
    elif input_file.endswith(".txt"):
        file_reference = io.open(input_file, mode='rt', encoding="utf-8", errors="replace")

doesn't work for me. Is it probably not enough?

breadbrowser · 2022-04-27T11:50:30Z

No, it's just a bad or wrong format

breadbrowser · 2022-04-27T11:51:45Z

of the bz2 or txt file, or of a file renamed from zst.

egg82 added a commit to egg82/chatbot-rnn that referenced this issue Mar 23, 2019

Fixed pender/chatbot-rnn pender#44

d65bceb
