Parse Reddit Corpus in zip format #67

wmodes · 2020-11-08T22:53:55Z

There are a few folks seeding the entirety of Reddit, but the Reddit Corpus project provides archives of individual subreddits. This gives you the very useful ability to train in a particular domain. Here is a small example: dadjokes2.corpus.zip

The only problem is that they are not in the format your reddit_parse.py expects. Each archive is a .zip bundle of five JSON files:

users.json
conversations.json
corpus.json
index.json
utterances.jsonl
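
For reference, here is a rough Python sketch of how the utterances could be pulled out of such a bundle without unzipping it first. It is not part of reddit_parse.py. The function names `iter_utterances` and `reply_pairs` are hypothetical, and the record keys `id`, `reply_to`, and `text` are assumptions about what each line of utterances.jsonl contains, so check them against a real file before relying on this.

```python
import io
import json
import zipfile


def iter_utterances(zip_path, member_suffix="utterances.jsonl"):
    """Yield one dict per non-empty line of the utterances file in the archive."""
    with zipfile.ZipFile(zip_path) as archive:
        # The file may sit inside a top-level folder, so match on the suffix.
        names = [n for n in archive.namelist() if n.endswith(member_suffix)]
        if not names:
            raise FileNotFoundError(f"no {member_suffix} found in {zip_path}")
        with archive.open(names[0]) as raw:
            for line in io.TextIOWrapper(raw, encoding="utf-8"):
                line = line.strip()
                if line:
                    yield json.loads(line)


def reply_pairs(utterances, id_key="id", parent_key="reply_to", text_key="text"):
    """Yield (parent_text, reply_text) tuples.

    The key names are assumptions, not a documented schema; pass different
    ones if the records in your archive use other field names.
    """
    records = list(utterances)
    # Index every record that has an id so parents can be looked up.
    by_id = {u[id_key]: u for u in records if u.get(id_key) is not None}
    for u in records:
        parent_id = u.get(parent_key)
        if parent_id is None:
            continue  # top-level post, nothing to pair it with
        parent = by_id.get(parent_id)
        if parent is None:
            continue  # parent is missing from the archive
        if parent.get(text_key) and u.get(text_key):
            yield parent[text_key], u[text_key]
```

Something like `list(reply_pairs(iter_utterances("dadjokes2.corpus.zip")))` would then give parent/reply text pairs, which would still need to be written out in whatever layout the training script wants.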

What is the shortest path for converting this to usable training data?

Wes
