Octavian-ai/experiments

Join our Discord >> https://discord.gg/a2Z82Te

Review prediction

Introduction

The aim of this experiment is to investigate the performance of

  1. different NN approaches
  2. different graph representations of the same data

on a simple synthetic prediction task.

The Task

We model personalised recommendations as a system containing people, products and reviews. In our system every product has a style and each person has a style preference. People can make reviews of products. In our system the review score will be a function Y(...) of the person's style preference and the product's style. We call this function the opinion function, i.e.:

review_score = Y(product_style, person_style_preference)

We will generate data using this model. We will then use this synthetic data to investigate how effective various ML approaches are at learning the behaviour of this system.

If necessary we can change the opinion function Y(...) to increase or decrease the difficulty of the task.
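As a minimal sketch (in Python), assuming a categorical style representation, the simplest possible opinion function might be an exact style match. The matching rule here is purely illustrative, not the function we will actually use:

```python
def Y(product_style, person_style_preference):
    # Illustrative choice only: a positive review iff the product's style
    # matches the person's preferred style.
    return 1.0 if product_style == person_style_preference else 0.0

review_score = Y("A", "A")  # -> 1.0
```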

The Synthetic Data

The synthetic data for this task can be varied in several ways:

  1. Change which information is hidden e.g. we could hide product_style, style_preference or both.
  2. Change the representation of the key properties e.g. review scores, styles and preferences could be boolean, categorical, continuous scalars or even multi-dimensional vectors.
  3. Change how the data is represented as a graph e.g. reviews could be nodes in their own right or edges with properties, and product_style could be a property on a product node or a separate node connected to the product node by a HAS_STYLE relationship (edge).
  4. Add additional meaningless or semi-meaningless information to the training data.

We will generate different data sets to qualitatively investigate different ML approaches on the same basic system.

Evaluation Tasks

We are interested in four different evaluation tasks, depending on whether the person and the product are each included in the training set:

  • new product == unknown at training time i.e. not in training set or validation set
  • new person == unknown at training time i.e. not in training set or validation set
  • existing product == known at training time i.e. present in training set
  • existing person == known at training time i.e. present in training set

The question in each evaluation task is: how well can we predict the person's review score, given:

  1. new product and new person
  2. existing product and new person
  3. new product and existing person
  4. existing product and existing person
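As a sketch of how a held-out review could be routed to one of these four tasks, assuming each review record carries a person_id and a product_id (the helper name and record shape below are hypothetical):

```python
def evaluation_task(review, training_people, training_products):
    """Assign a held-out review to one of the four evaluation tasks."""
    known_person = review["person_id"] in training_people
    known_product = review["product_id"] in training_products
    if not known_product and not known_person:
        return 1  # new product and new person
    if known_product and not known_person:
        return 2  # existing product and new person
    if not known_product and known_person:
        return 3  # new product and existing person
    return 4      # existing product and existing person
```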

Approach

Although we have a synthetic system for which we can generate more data, we want to get into good habits for working with "real" data. We will therefore attempt to blind the ML system to the fact that we are working with synthetic data, and not rely on our ability to generate more information at will.

It will be the responsibility of the ML part of the system to split the data into train / validation / test sets. However, for each data set that we generate we will keep back a small portion to make up a "golden" test set which is only to be used at the very end of our investigation. This provides a final test of the ML predictor, one for which we haven't had the opportunity to optimise the meta-parameters.

Because of the four different evaluation tasks it will be necessary for us to keep back four different golden test sets, each of a large enough size to test the system regardless of the test/training split. We will keep the following volumes of golden test data:

  1. INDEPENDENT: A completely independent data set containing 1000 reviews
  2. NEW_PEOPLE: new people + their reviews of existing products containing approx 2000 reviews
  3. NEW_PRODUCTS: new products + reviews of them by existing people containing approx 2000 reviews
  4. EXISTING: 2000 additional reviews between existing people and products.
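For concreteness, a sketch of the implied golden set sizes, using the reviews-per-person (approx 1.5) and reviews-per-product (approx 75) figures given later; the dictionary layout is purely illustrative:

```python
REVIEWS_PER_PERSON = 1.5
REVIEWS_PER_PRODUCT = 75

GOLDEN_SETS = {
    "INDEPENDENT":  {"reviews": 1000},
    "NEW_PEOPLE":   {"reviews": 2000, "people": round(2000 / REVIEWS_PER_PERSON)},    # approx 1333 new people
    "NEW_PRODUCTS": {"reviews": 2000, "products": round(2000 / REVIEWS_PER_PRODUCT)}, # approx 27 new products
    "EXISTING":     {"reviews": 2000},
}
```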

The Data Sets

Data Set 1: A simple binary preference system

Products have a binary style and people have a binary preference.

  • All variables will be 'public' in the data set

Product Style

  • product_style will be categorical with two mutually exclusive elements (A and B).
  • The distribution of product styles will be uniform i.e. approx 50% of products will have style A and 50% will have style B.

Style Preference

  • person_style_preference will be categorical with two mutually exclusive elements (likes_A_dislikes_B | likes_B_dislikes_A).
  • The distribution of style preferences will be uniform i.e. approx 50% of people will like style A and 50% will like style B.
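A quick sketch of sampling these two public variables uniformly (the constant names are illustrative):

```python
import random

STYLES = ["A", "B"]
PREFERENCES = ["likes_A_dislikes_B", "likes_B_dislikes_A"]

product_style = random.choice(STYLES)                 # approx 50% A, 50% B
person_style_preference = random.choice(PREFERENCES)  # approx 50/50 as well
```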

Reviews and Opinion Function

  • review_score will be boolean (1 for a positive review and 0 for a negative review)
  • Each person will have made either 1 or 2 reviews. The mean number of reviews-per-person will be approx 1.5 i.e. approx 50% will have made 2 reviews and 50% will have made 1 review.
  • review_score is the dot product of the (one-hot encoded) product_style and person_style_preference, normalised to the range 0 to 1; see the sketch after the notes below.

Note: having people with 0 reviews would be useless since you cannot train or validate/test using them.

Note: fixing the number of reviews-per-person would restrict the graph structure too much and open up the problem to approaches that we aren't interested in right now.
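A minimal sketch of this opinion function, assuming one-hot encodings of the two categorical variables (in which case the dot product is already 0 or 1, so the normalisation is a no-op):

```python
import random

# Assumed one-hot encodings: the 1 marks the style / the liked style.
ENCODING = {
    "A": (1, 0), "B": (0, 1),
    "likes_A_dislikes_B": (1, 0), "likes_B_dislikes_A": (0, 1),
}

def review_score(product_style, person_style_preference):
    a, b = ENCODING[product_style], ENCODING[person_style_preference]
    return sum(x * y for x, y in zip(a, b))  # 1 = positive, 0 = negative

def n_reviews_for_person():
    return random.choice([1, 2])  # mean of approx 1.5 reviews per person
```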

Entity Ratios and Data Set Size

I basically made these up. Intuitively, the reviews-per-product and reviews-per-person parameters affect how much we can infer about the hidden variables of products and people. I like the idea of those two figures being very different, so we can see how systems cope with that distinction.

  • people:products = 50:1
  • people:reviews = 1:1.5
  • reviews:products = 75:1

Data set size: 12000 reviews / 160 products / 8000 people

N.B. because we assign the reviews randomly some products may end up with no reviews, but with approx 75 reviews per product this is very unlikely.
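The counts follow directly from the ratios; as a quick arithmetic check:

```python
N_PRODUCTS = 160
N_PEOPLE = N_PRODUCTS * 50            # people:products = 50:1 -> 8000
N_REVIEWS = int(N_PEOPLE * 1.5)       # people:reviews = 1:1.5 -> 12000
assert N_REVIEWS // N_PRODUCTS == 75  # reviews:products = 75:1
```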

Graph Schema

PERSON(id, style_preference: A|B, is_golden: True|False)
  --WROTE(is_golden: True|False)-->
REVIEW(id, score: 1|0, is_golden: True|False)
  --OF(is_golden: True|False)-->
PRODUCT(id, style: A|B, is_golden: True|False)
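As a sketch, one person/review/product triple in this schema could be written with the official Neo4j Python driver; the connection details, ids and property values below are placeholders:

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

CREATE_REVIEW = """
MERGE (p:PERSON  {id: $person_id,  style_preference: $pref, is_golden: $golden})
MERGE (t:PRODUCT {id: $product_id, style: $style, is_golden: $golden})
CREATE (p)-[:WROTE {is_golden: $golden}]->
       (:REVIEW {id: $review_id, score: $score, is_golden: $golden})
       -[:OF {is_golden: $golden}]->(t)
"""

with driver.session() as session:
    session.run(CREATE_REVIEW, person_id="person-0", pref="likes_A_dislikes_B",
                product_id="product-0", style="A", review_id="review-0",
                score=1, golden=False)
driver.close()
```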

Data generation algorithm

  1. Instantiate all products for the public data set and write them to Neo4j, keeping an array of the ids.
  2. Iteratively instantiate people, deciding (probabilistically) how many reviews each person will have made.
  3. For each review that a person has to make, randomly choose a product to review (without replacement).
  4. Calculate the review score and submit the person + their reviews to Neo4j.
  5. Read the data back out of Neo4j and validate the entity ratios.
  6. Create the golden test sets:
  • NEW_PEOPLE: create 2000/reviews_per_person new people + their reviews of randomly selected (with replacement) existing products.
  • NEW_PRODUCTS: create 2000/reviews_per_product new products and have randomly selected (with replacement) existing people review them.
  • EXISTING: randomly pick 2000 people (with replacement) and have each of them review a randomly selected (with replacement) product.
  • INDEPENDENT: repeat the basic data generation from scratch; this is easy, but best left until last to avoid confusion.
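A condensed sketch of steps 1-5 (the Neo4j writes and golden set creation are elided, and all names are illustrative):

```python
import random

N_PRODUCTS, N_PEOPLE = 160, 8000
LIKED = {"likes_A_dislikes_B": "A", "likes_B_dislikes_A": "B"}

# Step 1: instantiate products (Neo4j write elided), keeping the ids.
products = [{"id": f"product-{i}", "style": random.choice("AB")}
            for i in range(N_PRODUCTS)]

reviews = []
for i in range(N_PEOPLE):
    # Step 2: instantiate a person and decide their review count.
    pref = random.choice(list(LIKED))
    # Step 3: choose products to review without replacement.
    for product in random.sample(products, random.choice([1, 2])):
        # Step 4: score the review; submitting to Neo4j is elided.
        score = 1 if product["style"] == LIKED[pref] else 0
        reviews.append({"person_id": f"person-{i}",
                        "product_id": product["id"], "score": score})

# Step 5: validate the entity ratios (mean reviews-per-person approx 1.5).
assert abs(len(reviews) / N_PEOPLE - 1.5) < 0.05
```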
