Error with reddit-parser #59
Comments
I personally could never get it to work, so I ended up just programming my own. |
Could you send me the code? I have no idea how to do parsing. |
Let me take a look at your dataset and I might be able to help you out. My previous parser was for the Cornell corpus, but I can look into making one for your case. I emailed the person on the link you provided so I can find out how that data is structured. |
Ok. Thanks a lot. You're awesome. |
Thanks, we will keep in touch. |
Thanks, that might help. The data seems to have a similar layout. |
Yep, got that. I just downloaded some testing data to look at and start coding. This parser shouldn't be too complicated. |
Thanks SO much. |
No problem. |
Some of the datasets on that website are corrupted after extraction; I didn't find any corrupted ones in the "daily" directory. I'm done with the parser, so if you run into any errors with it, do tell. It skips over deleted comments and some posts with no comments. I also included my Cornell corpus parser, in case you are interested, and a link to a pre-parsed reddit dataset.
Parser: https://drive.google.com/file/d/1YgDZrQGJXZybXAo_5_4SZXycBFUJ3jCo/view?usp=sharing
Pre-parsed data: https://drive.google.com/uc?id=1s77S7COjrb3lOnfqvXYfn7sW_x5U1_l9&export=download
Cornell corpus: https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html |
You're the best. What data did you pre-parse? |
It was the pre-parsed data in the README of this project. It seemed to be decent data so I sent it in case you weren't aware of it👍. |
I'm sorry, I'm getting an error while parsing. It seems to just be a simple unicode decode error; Traceback (most recent call last): |
I'm parsing RC_2017-08 (around 15 gigs) |
let me take a look |
Strange, I used "RC_2005-12" (because it was the smallest in size and I wanted to get data quickly to test) and I didn't have any problems with it. |
Huh. I'm going to check out the difference between the files. :\ |
Never mind. I got it to work by changing the encoding to latin1. I'm not sure if it will cause problems in the future, though; some people say the wrong encoding makes the output gibberish. Anyway, new error: Traceback (most recent call last): |
add print(f) just above that line to see what data is in the variable. |
['BZh91AY&SYåØ;\x90\x06Úv߀\x7f\x90\x7fÿÿúÿÿÿÿÿÿÿÿb³Gß[ì0õ\xa0\x1a\x00h\x1a\x00ÑX\x14}\x06ûX\x01è\x00¢|Ã)õ÷O\x1dÜ\x0fT¥\x00\x02¨RJÐ\x00\x00Á4\x03Az,[\x00:\x1e¤\x0e@\x02€h:\x00\x1d"h\x01vtª«°\n'] I need to get some sleep. See you tomorrow. |
I think those characters might be a consequence of changing the encoding. I'm downloading your dataset (RC_2017-08) to see. I will try to fix the problem. |
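For what it's worth, here is a minimal sketch of reading the extracted dump with an explicit UTF-8 encoding instead of the platform default; the file name and the one-JSON-object-per-line layout are assumptions based on this thread, not the parser's actual code:

```python
# Hedged sketch: read the extracted RC_2017-08 dump as UTF-8 JSON lines.
# "RC_2017-08" and the "body" field are assumptions taken from this thread.
import json

with open("RC_2017-08", encoding="utf-8", errors="replace") as dump:
    for line in dump:
        comment = json.loads(line)
        print(comment.get("body", ""))
```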
I downloaded the file (RC_2017-08) https://files.pushshift.io/reddit/comments/RC_2017-08.bz2 |
Yes, we have the same file. Did you change anything in the code from the zip you sent me? |
No changes. |
What did you enter for file name? |
Oh, and also: what Python version are you using? |
I can't find a reason why it doesn't work; it should work. How big is your file, extracted and unextracted? |
file name : "RC_2017-08" |
I think running it overnight is a good idea. I can't think of any reason why it wouldn't be working; I think it is just going to take a while to first process your datasets. I am interested to hear what it is doing in the morning. This has been a long issue, and I am glad I've been able to help so far. |
You are the reason I didn't just search for some pre parsed data ( ͡° ͜ʖ ͡°) |
Ha. I am happy to help. python is a VERY amazing language. |
Friend thinks JavaScript is better; writes an alphazero RL bot |
Javascript is a good language BUT python is way better. |
Loading vocab file...
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
Caused by op 'gradients/rnnlm_1/while/rnnlm/partitioned_multi_rnn_cell/cell_2_0/gru_cell/MatMul_1_grad/MatMul_1', defined at: ...
...which was originally created as op 'rnnlm_1/while/rnnlm/partitioned_multi_rnn_cell/cell_2_0/gru_cell/MatMul_1', defined at: ...
ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[4096,2048] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. |
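As an aside, here is a minimal sketch of what that hint is asking for in TensorFlow 1.x; the graph built here is a throwaway placeholder, not this project's actual training loop:

```python
# Hedged sketch (TensorFlow 1.x): enable the allocation report mentioned in the OOM hint.
import tensorflow as tf

run_options = tf.RunOptions(report_tensor_allocations_upon_oom=True)

with tf.Session() as sess:
    x = tf.random_normal([1024, 1024])   # stand-in computation, not the project's graph
    y = tf.matmul(x, x)
    sess.run(y, options=run_options)     # pass options to the run() call that OOMs
```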
I put output.txt and output 1.bz2-output 5.bz2 in a reddit folder inside data. Is that correct? |
Yes you had them in the right spot. This OOM error might be an easy fix. |
Show me lines 30 and 24 in train.py. |
Line 30: parser.add_argument('--batch_size', type=int, default=40,
Line 24: parser.add_argument('--num_blocks', type=int, default=2, |
Change line 30 to: parser.add_argument('--batch_size', type=int, default=10, |
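In context, the change would look something like this; the help strings are placeholders, and only the argument names and defaults come from the lines quoted above:

```python
# Hedged sketch of the relevant train.py arguments; help texts are assumptions.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--num_blocks', type=int, default=2,
                    help='number of stacked RNN blocks')  # line 24, unchanged
parser.add_argument('--batch_size', type=int, default=10,
                    help='minibatch size, lowered from 40 to reduce GPU memory use')  # line 30
```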
Loading vocab file... |
In the directory models/new_save, delete all the data in that folder. And in the data folder, delete all .npz files and all .pkl files. Then run it again (it is going to take a while to run, like before). |
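Roughly, that cleanup could look like the sketch below, assuming models/new_save and data/ are relative to the repo root and nothing else in those folders needs to be kept:

```python
# Hedged cleanup sketch: remove cached model state and preprocessed data files.
# Paths come from the comment above; adjust if the repo layout differs.
import glob
import os

for path in glob.glob("models/new_save/*"):
    if os.path.isfile(path):
        os.remove(path)

for pattern in ("data/**/*.npz", "data/**/*.pkl"):
    for path in glob.glob(pattern, recursive=True):
        os.remove(path)
```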
This time |
No vocab file found. Preprocessing...
|
No, it's normal. |
And you got rid of all the files that the program made previously? |
yes. |
I didn't delete anything and tried again and it works now! :/ |
I'm gonna keep this open until the model is fully trained, k? |
Ok, sounds good. I guess it resolved itself. If it is resolved, then don't forget to mark the issue closed. |
Glad I could help! |
How much loss is acceptable? |
Depends on how well it does in inference mode. The lower, the better. |
Glad I could help! Enjoy. |
I think it deserves a new issue, but error with training: |
I am getting an error while training on my own reddit data (2017-08) from this website: https://files.pushshift.io/reddit/comments/
Trying it the first time:
Traceback (most recent call last):
File "reddit_parse.py", line 258, in
main()
File "reddit_parse.py", line 37, in main
parse_main(args)
File "reddit_parse.py", line 91, in parse_main
args.print_subreddit, args.min_conversation_length)
File "reddit_parse.py", line 242, in write_comment_cache
output_file.write(output_string + '\n')
File "reddit_parse.py", line 151, in write
self.file_reference.write(data)
File "C:\Users\16175\AppData\Local\Programs\Python\Python36\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\U0001f602' in position 404: character maps to <undefined>
Second time, I added encoding='utf8' to the write() call on line 151:
Traceback (most recent call last):
File "reddit_parse.py", line 258, in
main()
File "reddit_parse.py", line 37, in main
parse_main(args)
File "reddit_parse.py", line 91, in parse_main
args.print_subreddit, args.min_conversation_length)
File "reddit_parse.py", line 242, in write_comment_cache
output_file.write(output_string + '\n')
File "reddit_parse.py", line 151, in write
self.file_reference.write(data, encoding='utf8')
TypeError: write() takes no keyword arguments
Python 3.6.8
Tensorflow 1.9.0
Could someone please help me?
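For reference, the usual fix for this pair of errors is to set the encoding when the output file is opened rather than passing it to write(); a minimal standalone sketch, using a placeholder path rather than the parser's actual output file:

```python
# Hedged sketch: the encoding belongs on open(), not on write().
# "output.txt" is a placeholder path, not the parser's real output file.
output_file = open("output.txt", "w", encoding="utf-8")
output_file.write("parsed comment \U0001f602\n")  # no UnicodeEncodeError even when the platform default is cp1252
output_file.close()
```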