Finetuning BioMedLM for Medical QA #20
Comments
Could you provide some details? What type of GPU are you trying to run this on? How many GPUs? What is the GPU memory? Are you trying to generate sentence or paragraph answers? How many training examples do you have?
Sure, I'm running it on a single GPU - an Nvidia GeForce RTX 3080 Laptop GPU with 16GB of GPU memory. I'm finetuning the model for question answering using the Hugging Face Transformers library, and I'm trying to generate sentence answers. To start off, I tried using around 500 training examples. The dataset CSV is in the format described below, with an 'instruction' column (the question) and an 'output' column (the answer).
How much RAM does your laptop have? There are ways to train the model on a single GPU with CPU offloading, but you need around 50GB of RAM on the machine as well, I believe. I am hoping to update the code and post some new instructions on various fine-tuning scenarios. But it sounds like you want to process 500 prompt --> response pairs ...
Or to put it another way ... I have gotten single-GPU training working and it starts at 14GB on the GPU and ends up at 18.8GB ... and it looks like it is using 50GB of RAM on the machine ... I am not sure what would happen with the resources you have ...
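For context, single-GPU training with CPU offloading is usually done through DeepSpeed ZeRO via the Hugging Face Trainer. A minimal sketch of what that configuration might look like, assuming DeepSpeed is installed and the script is launched in a way DeepSpeed supports; the exact values are illustrative, not the repo's actual recipe:

```
# Illustrative DeepSpeed ZeRO-3 config with optimizer/parameter offload to CPU RAM.
# This trades GPU memory for system RAM, which is why ~50GB of RAM is needed.
from transformers import TrainingArguments

ds_config = {
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
    },
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}

training_args = TrainingArguments(
    output_dir="./models",
    per_device_train_batch_size=1,
    deepspeed=ds_config,  # the Trainer accepts a dict or a path to a JSON file
)
```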
The laptop has 24GB of RAM. But I'm using the Low-Rank Adaptation approach; I was able to finetune a RedPajama 3B model with the same dataset on this laptop using Low-Rank Adaptation (with and without 8-bit optimisation). So does this mean the problem is down to the GPU and RAM? And sure, thanks, updates on fine-tuning scenarios would really help.
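For reference, the usual pattern for combining 8-bit loading with LoRA looks roughly like this. It is a sketch based on the peft and bitsandbytes APIs of that period, with placeholder hyperparameters, not the script that was actually run:

```
# Illustrative 8-bit + LoRA setup (requires bitsandbytes); hyperparameters are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, TaskType, get_peft_model, prepare_model_for_int8_training

model_name = "stanford-crfm/BioMedLM"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_8bit=True,   # quantize the weights to int8 at load time
    device_map="auto",   # let accelerate place layers on the available devices
)
model = prepare_model_for_int8_training(model)  # cast norms / enable input grads for stable training

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["c_attn"],  # the fused attention projection in GPT-2-style blocks
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```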
I will see if I can get the LoRA version working; I have never tried that ...
Sure, thank you!
Hi @J38. Actually, I encountered the same problem. I used the code in
What puzzles me is that I repeated the training three times, and each time I couldn't complete a full epoch. Moreover, the iterations where the failure occurred seemed quite random. I suspect it has something to do with my dataset and tokenizer. In fact, I came across some discussions that suggested removing special characters from my dataset, but that didn't work either. Is there anything specific I can investigate? Thank you very much. @J38
@s1ghhh I encountered a similar error while working on a different dataset, and I suspect it is due to the input length. As mentioned by @J38, the model was trained with a fixed context length of 1024, so the source, target, and extra tokens have to fit within that size. Try reducing the maximum input/target lengths used for tokenization and see if it works; it worked for me.
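A quick way to verify this is to tokenize the source and target together and compare the length against the 1024-token limit mentioned above; a small sketch (the helper name and example strings are made up):

```
# Check that source + target + a few special tokens fit in BioMedLM's 1024-token context.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("stanford-crfm/BioMedLM")
MAX_CONTEXT = 1024  # fixed context length the model was trained with

def fits_in_context(source: str, target: str, extra_tokens: int = 2) -> bool:
    total = (len(tokenizer(source)["input_ids"])
             + len(tokenizer(target)["input_ids"])
             + extra_tokens)
    return total <= MAX_CONTEXT

print(fits_in_context("What causes iron-deficiency anemia?",
                      "Chronic blood loss and inadequate dietary iron intake."))
```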
Hi,
I'm trying to finetune BioMedLM for medical question answering on our custom dataset, using Hugging Face's Transformers library. Since we're looking to optimize memory usage, we're using Low-Rank Adaptation (LoRA) as well.
I'm unsure of the format of the dataset that I need to use.
Below is the one I'm using currently:
{ 'instruction': 'xyz', 'output': 'test'}, where instruction is the question and output is the answer.
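For illustration, a tiny sketch of how such a dataset.csv could be put together (the rows are invented placeholders; the real file just needs the same two columns):

```
# Hypothetical two-row dataset.csv with the 'instruction'/'output' columns used below.
import pandas as pd

rows = [
    {"instruction": "What are common symptoms of anemia?",
     "output": "Fatigue, pale skin, and shortness of breath are common symptoms."},
    {"instruction": "What is the first-line drug for type 2 diabetes?",
     "output": "Metformin is usually the first-line pharmacological treatment."},
]
pd.DataFrame(rows).to_csv("dataset.csv", index=False)
```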
Below is my code:
```
import logging
import torch
from datasets import Dataset
import pandas as pd
import gc
from transformers import DataCollatorForLanguageModeling
from transformers import TrainingArguments, Trainer, AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model, prepare_model_for_int8_training, TaskType
logging.basicConfig(level=logging.DEBUG)
#--------------------------------------------------------------------------------------------------------------
print("creating tokenizer from model")
model_name="stanford-crfm/BioMedLM"
tokenizer = AutoTokenizer.from_pretrained(model_name, add_eos_token=True)
tokenizer.pad_token_id = 0  # use token id 0 for padding; the LM collator masks pad positions out of the loss
# NOTE: the EOS token string below is empty; if a genuinely new special token is added here,
# the embeddings must also be resized with model.resize_token_embeddings(len(tokenizer)),
# otherwise out-of-range token ids trigger the CUDA indexSelect assertion shown below.
tokenizer.add_special_tokens({'eos_token': ''})
print('eos_token_id:', tokenizer.eos_token_id)
device_type = "cuda" if torch.cuda.is_available() else "cpu"
device = torch.device(device_type)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
).to(device)
model.tie_weights()  # tie the input and output embedding weights (standard for GPT-style language models)
#--------------------------------------------------------------------------------------------------------------
peft_name = 'output/biomedLM-lora'
CUTOFF_LEN = 512
def tokenize(prompt, tokenizer, add_eos_token=True):
    result = tokenizer(
        # NOTE: the string appended here is empty, so no end-of-sequence token is actually added;
        # appending tokenizer.eos_token would be the usual way to mark the end of the response.
        prompt + "",
        truncation=True,
        max_length=CUTOFF_LEN,
        padding="max_length",
    )
    return {
        "input_ids": result["input_ids"],
        "attention_mask": result["attention_mask"],
    }
print("loading data from csv")
df = pd.read_csv("dataset.csv")
dataset = Dataset.from_pandas(df)
dataset = dataset.select_columns(['instruction', 'output'])
print("splitting dataset")
dataset = dataset.train_test_split(test_size = 0.33)
train_data = dataset["train"]
val_data = dataset["test"]
def generate_prompt(data_point):
return f"""Below is an instruction that describes a task. Write a response that appropriately completes the request.
Instruction:
{data_point["instruction"]}
Response:
{data_point["output"]}"""
print("tokenizing train and val ds")
train_data = train_data.shuffle().map(lambda x: tokenize(generate_prompt(x), tokenizer))
val_data = val_data.shuffle().map(lambda x: tokenize(generate_prompt(x), tokenizer))
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["c_attn"],  # apply LoRA to the fused attention projection of the GPT-2-style blocks
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)
model = get_peft_model(model, lora_config)  # wrap the base model so only the LoRA adapter weights are trained
eval_steps = 50
save_steps = 50
logging_steps = 20
trainer = Trainer(
    model=model,
    train_dataset=train_data,
    eval_dataset=val_data,
    args=TrainingArguments(
        num_train_epochs=1,
        learning_rate=1e-5,
        logging_steps=logging_steps,
        evaluation_strategy="steps",
        save_strategy="steps",
        eval_steps=eval_steps,
        save_steps=save_steps,
        output_dir="./models",  # where checkpoints are saved
        report_to="none",
        save_total_limit=3,
        load_best_model_at_end=True,
        push_to_hub=False,
        per_device_train_batch_size=1,
        per_device_eval_batch_size=1,
    ),
    # causal-LM collator: pads the batch and copies input_ids into labels (mlm=False)
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
model.config.use_cache = False # silence the warnings. Please re-enable for inference!
print("training")
trainer.train()
print("saving model")
trainer.model.save_pretrained(peft_name)
tokenizer.save_pretrained(peft_name)
#--------------------------------------------------------------------------------------------------------------
print("cleanup")
model = None
tokenizer = None
trainer = None
gc.collect()
torch.cuda.empty_cache()
#--------------------------------------------------------------------------------------------------------------
```
This is the error I'm getting during training:
```
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [472,0,0], thread: [126,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [472,0,0], thread: [127,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
```
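For what it's worth, this assertion generally means an embedding lookup received an out-of-range index: either a token id at or beyond the size of the token-embedding table (e.g. after adding a new special token without resizing the embeddings), or an input longer than the position-embedding table. A small diagnostic sketch, reusing the `model`, `tokenizer`, and `train_data` names from the script above:

```
# Diagnose `srcIndex < srcSelectDimSize`: look for token ids outside the embedding
# table and for examples longer than the model's maximum context.
vocab_rows = model.get_input_embeddings().num_embeddings  # rows in the token-embedding matrix
max_positions = model.config.n_positions                  # 1024 for this GPT-2-style model

print("len(tokenizer):", len(tokenizer), "| embedding rows:", vocab_rows)

bad_examples = 0
for example in train_data:
    ids = example["input_ids"]
    if max(ids) >= vocab_rows or len(ids) > max_positions:
        bad_examples += 1
print("examples with out-of-range ids or over-length inputs:", bad_examples)
```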