Where are the samples of automated evaluation? #19

Open · Guaguago opened this issue May 20, 2020 · 19 comments

@Guaguago

Thanks for your reply. I have written a program to calculate perplexity using the Hugging Face transformers interface,
but I am not sure which samples are used for the perplexity calculation.

@Guaguago
Author

Are those the samples in the human_annotation/pplm_labeled_csvs directory?

@ehsan-soe

Hi,
@Guaguago Can you share how you calculate PPL?

@Guaguago
Author

Guaguago commented May 20, 2020

@ehsan-soe Hi, here is my code:

import math
import torch
from transformers import OpenAIGPTLMHeadModel, OpenAIGPTTokenizer

tokenizer = OpenAIGPTTokenizer.from_pretrained('openai-gpt')
model = OpenAIGPTLMHeadModel.from_pretrained('openai-gpt')
model.eval()

def score(sent):
    # Encode the sentence; passing the same tensor as labels makes the model
    # return the mean cross-entropy (negative log-likelihood) per token.
    indexed_tokens = tokenizer.encode(sent)
    tokens_tensor = torch.tensor([indexed_tokens])
    with torch.no_grad():
        outputs = model(tokens_tensor, labels=tokens_tensor)
    loss = outputs[0]
    # Perplexity is the exponential of the average negative log-likelihood.
    return math.exp(loss.item())

sents = ['there is a book on the desk',
         'there is a plane on the desk',
         'there is a book in the desk']
print([score(s) for s in sents])

@Guaguago
Author

@ehsan-soe Do you know how I can use this code to get the perplexity scores from the paper?

@Guaguago
Author

@dathath Sorry, I can't find any generated samples in this repository. Could you point me to their exact location, or give some instructions on how I can use this code to get the perplexity scores of PPLM?

@ehsan-soe

@Guaguago Thanks.
Perplexity is usually calculated on the test set. However, the authors may have computed perplexity on the generated text, since there is no ground-truth target here.
In that case, you would either have to generate samples yourself or the authors would have to provide the generated samples.

@dathath
Contributor

dathath commented May 20, 2020

@ehsan-soe You can compute the perplexity of the generated text with respect to another language model (GPT), which is what we do here.

@Guaguago human_annotation/pplm_labeled_csvs has the generated samples. You can read the CSVs into Python and then process the samples with GPT to compute perplexity.
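
For example, something along these lines should work (a rough sketch, assuming pandas is installed and reusing the score() function from the earlier comment; the column positions are an assumption, not the repo's actual CSV schema):

import glob
import pandas as pd

ppl_scores = []
for path in glob.glob('human_annotation/pplm_labeled_csvs/*.csv'):
    df = pd.read_csv(path)
    # Assumption: the first two columns hold the two generated texts per row.
    for text in pd.concat([df.iloc[:, 0], df.iloc[:, 1]]):
        ppl_scores.append(score(str(text)))

print(sum(ppl_scores) / len(ppl_scores))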

@ehsan-soe

@dathath Thanks. Correct me if I am wrong: perplexity is usually computed on the test set, i.e. using the NLL of the trained model on the target text, right?
However, since here you don't update the weights of GPT-2 and you don't have ground-truth text, it doesn't make sense to follow the more conventional approach?

@Guaguago
Author

@ehsan-soe I see! Thank you!
@dathath Thank you! I see that each item in the CSV file has two generated samples.
It seems that one sample is from PPLM and the other is from a baseline, and their order is randomized according to the paper. So how do I select the samples from the different models?

@dathath
Contributor

dathath commented May 22, 2020

You can use the 'parse_*.ipynb' notebooks to process the CSVs. That should give you samples from different models separately.

@Guaguago
Author

@dathath Thank you very much, I will try it! Following your earlier suggestions, I have written two programs to compute PPL and Dist-N respectively, but the scores are not the same as the paper's, so I want to check three things:

  1. The samples I used to calculate PPL and Dist-N are extracted from the CSV files under the human_annotation/pplm_labeled_csvs directory: for each CSV file, I concatenate the first two columns to get 360+360=720 samples per topic.
    Is this the same as the paper? That is, did the paper also use these samples to calculate PPL and Dist-N?

  2. For Dist-N: which tokenizer did the paper use to calculate Dist-N? Is there any extra preprocessing, such as removing stop words or dropping punctuation?

  3. Did the paper use sentence-level or corpus-level Dist-N?

@dathath
Contributor

dathath commented May 23, 2020

Are your scores in the same range as the paper?

  1. The 360 samples are for pairwise A/B testing from the ablation study -- it consists of 6 pairs, so if you take the 720 sequences you mention, you'll have four copies of each sample and be combining different modes of generation. You can use the parse script to separate out the 60 samples per topic (for each type of generation), and then measure perplexity. As it is, you're computing the average perplexity over all generation methods, but it should be of the same order.

  2. The GPT-2 tokenizer from Hugging Face.

  3. It's at the corpus level for a given topic, across all prefixes. We want a measure of the diversity of the sentences generated across different prefixes and for different samples given a specific attribute. E.g., the model can't just satisfy the attribute by generating " is very good" for every prefix, or when you sample repeatedly from a prefix.
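
For illustration, corpus-level Dist-N for one topic could be computed roughly like this (a sketch, assuming the GPT-2 tokenizer from Hugging Face and a samples list holding all generated texts for that topic; not the paper's exact script):

from transformers import GPT2Tokenizer

gpt2_tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

def corpus_dist_n(samples, n):
    # Pool the n-grams of every sample in the corpus, then divide the number
    # of distinct n-grams by the total number of n-grams.
    total, distinct = 0, set()
    for text in samples:
        tokens = gpt2_tokenizer.tokenize(text)
        ngrams = list(zip(*[tokens[i:] for i in range(n)]))
        total += len(ngrams)
        distinct.update(ngrams)
    return len(distinct) / total if total else 0.0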

dathath closed this as completed May 23, 2020
dathath reopened this May 23, 2020
@Guaguago
Author

@dathath Thank you so much! That was a really helpful clue, and I now get exactly the same Dist-1/2/3 scores as the paper. But for PPL, most of my scores are slightly higher than the paper's (about a 0-1.5 error range), so I have some questions:

  1. Are the samples used to calculate the dist scores and PPL the same?
  2. Which tokenizer does the paper use for PPL?
  3. Is there anything missing in my PPL code below?
import math
import torch
from transformers import OpenAIGPTLMHeadModel, OpenAIGPTTokenizer

tokenizer = OpenAIGPTTokenizer.from_pretrained('openai-gpt')
model = OpenAIGPTLMHeadModel.from_pretrained('openai-gpt')
model.eval()

def score(sent):
    indexed_tokens = tokenizer.encode(sent)
    tokens_tensor = torch.tensor([indexed_tokens])
    with torch.no_grad():
        outputs = model(tokens_tensor, labels=tokens_tensor)
    loss = outputs[0]
    return math.exp(loss.item())

@dathath
Contributor

dathath commented May 24, 2020

import math
import torch
# Assumption: this snippet uses the older Hugging Face GPT API (pytorch-pretrained-bert
# style), where passing lm_labels makes the model return the LM loss directly.
from pytorch_pretrained_bert import OpenAIGPTLMHeadModel, OpenAIGPTTokenizer

def score(sentence, tokenizer, model):
    tokenize_input = tokenizer.tokenize(sentence)
    tensor_input = torch.tensor([tokenizer.convert_tokens_to_ids(tokenize_input)]).cuda()
    loss = model(tensor_input, lm_labels=tensor_input)
    return math.exp(loss)

tokenizer_LM = OpenAIGPTTokenizer.from_pretrained('openai-gpt')
model_LM = OpenAIGPTLMHeadModel.from_pretrained('openai-gpt')
model_LM.eval().cuda()  # the inputs built in score() are moved to the GPU

Yes, the samples are the same. This is what we do, so it should ideally match -- you could try matching the perplexities by topic/sentiment (see the appendix). The layer-norm layers used in the Hugging Face implementation of transformers seem to have changed a little bit between versions; I suspect this might be one possible cause of the discrepancy if you're using a recent version of "pytorch-transformers".
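
For example (a sketch under the same assumptions, with a hypothetical samples list holding the generated texts for one method and topic), the per-method PPL is then the mean over samples:

# 'samples' is hypothetical: the generated texts for one method and topic.
ppls = [score(s, tokenizer_LM, model_LM) for s in samples]
print(sum(ppls) / len(ppls))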

@Guaguago
Author

Guaguago commented May 26, 2020

@dathath Is there any special treatment of the "<|endoftext|>" token and the '\n' characters within a sentence when calculating PPL?
Should I drop them before calculating PPL?

@Guaguago
Author

Guaguago commented May 27, 2020

@dathath After fixing some bugs and warnings, I found that my PPL results are now much lower than the paper's, while the dist scores match almost perfectly. I have tried different versions of transformers, but the results are unchanged. Could you tell me whether any of the steps in my process for separating out the samples for each model (listed below, with a rough code sketch after the list) are wrong?

  1. Use the a_cat, b_cat = decode(order)[1:] clue in the parse script to assign each sample to the method that generated it.

  2. Build a set() for each method to remove duplicates; for each of B/BR/BC/BCR I then get 60 samples (though for "religion" I found fewer than 60).

  3. For each sample, use sample.replace('<|endoftext|>', '') to drop the prefix token, and use GPT to assign a score.

  4. Compute the mean of the 60 PPL scores as the final PPL score for each of B/BR/BC/BCR for a given topic.
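
For reference, a minimal sketch of steps 2-4 (assuming a hypothetical samples_by_method dict mapping B/BR/BC/BCR to the raw texts extracted via the parse script, and reusing the single-argument score() function from earlier in the thread):

mean_ppl = {}
for method, raw_samples in samples_by_method.items():
    texts = set(raw_samples)                           # step 2: remove duplicates
    ppls = [score(t.replace('<|endoftext|>', ''))      # step 3: drop the prefix token, score with GPT
            for t in texts]
    mean_ppl[method] = sum(ppls) / len(ppls)           # step 4: average per method
print(mean_ppl)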

I need some help; I'd appreciate your reply.

@dathath
Contributor

dathath commented May 27, 2020

Hi,
Sorry for the late response, it's been hectic. Overall what you are doing seems reasonable to me.

  1. We don't actually drop '<|endoftext|>'.

Can you drop me (and Andrea) an email? The issue is a little hard to follow here. I will try to respond by the weekend.

@Guaguago
Author

Guaguago commented May 27, 2020

@dathath @Andrea Hi, thank you, you are so nice! This is my email: [email protected]
I need your help, please!

@Guaguago
Author

Guaguago commented Jun 2, 2020

@dathath Hi
