Generation is suspiciously slow for long sequences #23

Open
avivbrokman opened this issue Aug 11, 2023 · 4 comments

Comments

avivbrokman commented Aug 11, 2023

I am trying to use BioMedLM for generation, but I find that it is very slow at generating long sequences, even though training runs at a normal speed. I wrote a minimal program (below) to reproduce this, comparing it against GPT-2 XL (1.5B parameters) and Flan-T5 XL (3B parameters). I varied the maximum generation length and estimated the ratio of the durations of the two decoder-only models (BioMedLM divided by GPT-2):

1024 tokens: 5.9
512 tokens: 3.2
256 tokens: 1.9
128 tokens: 1.3
64 tokens: 1.01

Anecdotally, the generation speed is similar to that of Flan UL2, a 20B parameter model.

I'd like to fix this, but I don't know whether the issue is in the BioMedLM code, my software/environment versions and settings, or my hardware (an A100-80GB).

import transformers
from transformers import AutoTokenizer, AutoModelForCausalLM, AutoModelForSeq2SeqLM
from datetime import datetime

# settings
max_length = 1024

# text
text = 'SRY1 phosphorylates'

# flan-t5-xl - 3B - encoder-decoder model
checkpoint = 'google/flan-t5-xl'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

inputs = tokenizer(text, return_tensors = 'pt')

model = model.to('cuda')
inputs = inputs.to('cuda')

t0 = datetime.now()
output = model.generate(**inputs, max_length = min(512, max_length))
t1 = datetime.now()

print('flan-t5 generation length: ', len(output[0]))
print('flan-t5 duration: ', t1 - t0)

# gpt2 - 1.5B - decoder model
checkpoint = 'gpt2-xl'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

inputs = tokenizer(text, return_tensors = 'pt')

model = model.to('cuda')
inputs = inputs.to('cuda')

t2 = datetime.now()
output = model.generate(**inputs, max_length = max_length)
t3 = datetime.now()

print('GPT-2 generation length: ', len(output[0]) - inputs['input_ids'].size(1))
print('GPT-2 duration: ', t3 - t2)

# BioMedLM - 2.7B - decoder model
checkpoint = 'stanford-crfm/BioMedLM'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

inputs = tokenizer(text, return_tensors = 'pt')

model = model.to('cuda')
inputs = inputs.to('cuda')

t4 = datetime.now()
output = model.generate(**inputs, max_length = max_length)
t5 = datetime.now()

print('BioMedLM generation length: ', len(output[0]) - inputs['input_ids'].size(1))

print('BioMedLM duration: ', t5 - t4)
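
(For a cleaner comparison than raw wall-clock time, something like the sketch below could report a per-token rate; this is only an illustrative helper, assuming a CUDA device and reusing the models loaded above, and is not part of the timings reported here.)

import torch
from datetime import datetime

# Illustrative sketch (assumes a CUDA device): time generation explicitly
# and normalize by the number of newly generated tokens, so a model that
# stops early does not simply look faster. Intended for the decoder-only
# models (GPT-2 XL and BioMedLM) loaded above.
def time_generation(model, tokenizer, text, max_length):
    inputs = tokenizer(text, return_tensors = 'pt').to(model.device)
    torch.cuda.synchronize()
    start = datetime.now()
    output = model.generate(**inputs, max_length = max_length)
    torch.cuda.synchronize()
    seconds = (datetime.now() - start).total_seconds()
    new_tokens = output.shape[1] - inputs['input_ids'].shape[1]
    return new_tokens, seconds, new_tokens / seconds  # tokens, seconds, tokens/sec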
J38 (Contributor) commented Aug 16, 2023

Ok I'll do some experiments too and get back to you. Just to double check, you are giving the same prompt to GPT-2 and BioMedLM and running generate and those numbers are the ratio between the 2 models?

Just this week I have been spending a lot of time working on BioMedLM's generative abilities for downstream tasks ... I actually feel it is most useful for scenarios like reading a PubMed abstract and printing out a list of relations derived from the abstract for instance ...

J38 (Contributor) commented Aug 16, 2023

BioMedLM out of the box should just literally be running the same code as GPT-2 since it is just a GPT-2 model with different weights and different tokenizer ... it has a smaller vocabulary than GPT-2 ... we could also compare to GPT Neo 2.7B ...
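
(As a quick sanity check of that, a sketch like the one below could diff the generation-relevant config fields of the two checkpoints; a stray setting such as use_cache=False in one config would be a candidate explanation for a large gap. This is only a sketch, not something run as part of this thread.)

from transformers import AutoConfig

# Sketch: compare the generation-relevant config fields of the two checkpoints;
# any mismatch (e.g. use_cache) is a candidate explanation for the slowdown.
gpt2_config = AutoConfig.from_pretrained('gpt2-xl')
biomedlm_config = AutoConfig.from_pretrained('stanford-crfm/BioMedLM')

for field in ['model_type', 'n_layer', 'n_head', 'n_embd', 'n_positions',
              'vocab_size', 'use_cache']:
    print(field, getattr(gpt2_config, field, None), getattr(biomedlm_config, field, None))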

J38 (Contributor) commented Aug 16, 2023

And what exactly are the inputs --> outputs? Are BioMedLM and GPT-2 XL producing text of similar length or is there a difference in average output length? I don't think setting max_length necessarily determines the average length of outputs, so if one model has a tendency to produce longer responses, it could take longer.
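
(One way to take output length out of the equation entirely, sketched below under the assumption of a transformers version that supports min_new_tokens/max_new_tokens, is to force both decoder-only models to emit exactly the same number of new tokens.)

# Sketch: force exactly 256 new tokens (greedy decoding) so any remaining
# timing gap cannot be explained by output length. Run once per decoder-only
# model (GPT-2 XL and BioMedLM) with its own tokenizer's inputs.
output = model.generate(**inputs, min_new_tokens = 256, max_new_tokens = 256, do_sample = False)
print('generated tokens: ', output.shape[1] - inputs['input_ids'].shape[1])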

avivbrokman (Author) commented
> Just to double check, you are giving the same prompt to GPT-2 and BioMedLM and running generate and those numbers are the ratio between the 2 models?

Yes, to both.

> I actually feel it is most useful for scenarios like reading a PubMed abstract and printing out a list of relations derived from the abstract

I laughed when I read this, because this is exactly what I'm doing. I just wanted to provide a minimal example.

> BioMedLM out of the box should just literally be running the same code as GPT-2 since it is just a GPT-2 model with different weights and different tokenizer

This is what I expected, which is why I'm confused about the difference in speed.

> Are BioMedLM and GPT-2 XL producing text of similar length or is there a difference in average output length?

For my minimal example, they produce lengths within 2 tokens of each other, so I don't think sequence length accounts for it (my code also prints the number of generated tokens). I'm guessing this is due to a difference in special tokens.
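
(A quick way to check that guess, as a sketch only, is to print the special-token setup of both tokenizers side by side.)

from transformers import AutoTokenizer

# Sketch: compare the special-token setup of the two tokenizers; a different
# eos/pad token changes when generate() stops.
for name in ['gpt2-xl', 'stanford-crfm/BioMedLM']:
    tok = AutoTokenizer.from_pretrained(name)
    print(name, tok.special_tokens_map, 'eos_token_id =', tok.eos_token_id, 'pad_token_id =', tok.pad_token_id)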
