Created by Kelsey Glenn
This project helps automate story concept creation by training a neural language model to generate plot synopses. The generator is built by fine-tuning OpenAI’s GPT-2 language model on roughly 14,000 anime plot synopses.
I train the model on anime synopsis data from MyAnimeList in the hope that the heavy use of story tropes and the otherwise “predictable” nature of anime story premises will make them well-suited to coherent generation.
I additionally create a regression model to predict MyAnimeList user scores as a proxy for story quality. The model uses Random Forest regression on NMF-generated topic-document matrices. After semi-randomly generating a large number of texts, I apply the model to assist in “curating” those of high quality.
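The scoring and curation steps described above can be sketched roughly as follows. This is a minimal illustration assuming scikit-learn, not the code in score_text.py or the notebooks: the toy synopses and scores, the TF-IDF vectorization step, the hyperparameter values, and the `curate` helper are all hypothetical.

```python
from sklearn.decomposition import NMF
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Toy stand-ins for cleaned synopses and their MyAnimeList user scores
# (hypothetical data, for illustration only).
synopses = [
    "A high school student discovers a hidden power and joins a secret academy.",
    "Two rival pilots must work together to defend Earth from alien invaders.",
    "A young chef travels the country to master forgotten recipes.",
    "A detective hunts a phantom thief through a futuristic city.",
    "Childhood friends reunite and start a band before graduation.",
    "A knight is reborn in a fantasy world and must defeat the demon king.",
]
scores = [7.4, 6.9, 7.8, 8.1, 7.2, 6.5]

# Text -> term weights -> NMF topic-document matrix -> Random Forest regression.
# The TF-IDF step and all hyperparameters here are assumptions.
score_model = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    NMF(n_components=3, init="nndsvda", random_state=0, max_iter=500),
    RandomForestRegressor(n_estimators=50, random_state=0),
)
score_model.fit(synopses, scores)


def curate(texts, model, top_k=2):
    """Return the top_k (text, predicted_score) pairs, highest score first."""
    predicted = model.predict(texts)
    ranked = sorted(zip(texts, predicted), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]


# Hypothetical generator output to be curated.
generated = [
    "A shy student joins a secret academy of rival pilots.",
    "A phantom thief opens a restaurant in a fantasy world.",
    "Friends start a band to defeat the demon king.",
]
best = curate(generated, score_model, top_k=2)
```

In this sketch, `best` holds the two generated synopses with the highest predicted scores. With real data, the number of NMF topics and the forest settings would need tuning.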
Finally, code for a small Streamlit deployment is included, which lets users generate text from their own starting seed and use the score prediction model.
A brief warning: this project includes data containing potentially explicit material and may contain inappropriate or sensitive content.
Accessory Modules
- clean_generations.py - function for cleaning generator output file
- clean_text.py - function for cleaning training data
- generator.py - functions for custom synopsis generation
- score_text.py - function for using score prediction model on a synopsis
Notebooks
- finetuning_and_generation.ipynb - GPT-2 fine-tuning and initial generation
- store_and_score.ipynb - bulk synopsis generation and scoring
- gpt-2-medium-pytorch.ipynb - **deprecated**, scrapped GPT-2 Medium model training w/ PyTorch
- hyperparameter_tuning.ipynb - tuning score prediction model hyperparameters
- score_prediction.ipynb - experiments and optimization of score prediction model
Training Data
- Anime.csv - initial dataset containing synopses, scores and other data, courtesy of [Adrian L. Ludosan](https://www.kaggle.com/aludosan/myanimelist-anime-dataset-as-20190204).
- gpt_training_data.csv - cleaned training texts stored as a single-column CSV for GPT-2