loss being 0 is a caveat of model_max_length
#43
Comments
The comment is so helpful!
I'm still getting a loss of 0 after setting a larger `model_max_length` when using LLaVA-Onevision.
@sunfanyunn

```python
import json
import os
from tqdm import tqdm

import torch
torch.set_printoptions(profile="full", linewidth=240)
from torch.utils.data import DataLoader
from transformers import AutoProcessor, AutoTokenizer

from datasets import LazySupervisedDataset
from collators import COLLATORS
from loaders import LOADERS
from supported_models import MODEL_HF_PATH

model_id = "llava-onevision-0.5b-ov"
model_family_id = "llava-onevision"

# build the dataset from the example data shipped with the repo
dataset = LazySupervisedDataset(
    data_path='./example_data/single_image.json',  # use your own data here
    image_folder='./example_data/images',
    video_folder='./example_data/videos',
    model_family_id=model_family_id,
)

# load tokenizer/processor/config only; no model weights are needed for this check
_, tokenizer, processor, config = LOADERS[model_family_id](
    model_hf_path=MODEL_HF_PATH[model_id],
    model_local_path=MODEL_HF_PATH[model_id],
    compute_dtype=torch.float16,
).load(load_model=False)
tokenizer.model_max_length = 4096

collator = COLLATORS[model_family_id](
    config=config,
    processor=processor,
    tokenizer=tokenizer
)

dataloader = DataLoader(dataset, batch_size=2, collate_fn=collator)
batch = next(iter(dataloader))

print(batch["input_ids"])
print()
print(batch["labels"])
print()
print(tokenizer.batch_decode(batch["input_ids"], skip_special_tokens=False))
# decode only the supervised (non -100) label tokens of the second sample
print(tokenizer.decode(
    batch["labels"][1][torch.where(batch["labels"][1] != -100)[0]], skip_special_tokens=True
))
```
I realized my input images (of size 1080 x 1080) are being tokenized into 7371 tokens. @zjysteven am I missing anything obvious?
I am aware, thank you! But I think my images are represented in a single-image way even when I provide multiple images.
Oh I see the point now. I briefly browsed through Hugging Face's preprocessing code but didn't notice a place where it distinguishes between "single image" and "multi-image" with different preprocessing; it seems to me that currently all images are processed with "anyres", which results in what you saw here. Meanwhile, I do see that in the official implementation's training code, https://github.com/LLaVA-VL/LLaVA-NeXT/blob/79ef45a6d8b89b92d7a8525f077c3a3a9894a87d/llava/train/train.py#L1140-L1148, single-image and multi-image inputs are handled differently:

```python
if "image" in sources[0]:
    image_file = self.list_data_dict[i]["image"]
    if type(image_file) is list:
        image = [self.process_image(f) for f in image_file]
        # Handling multi images
        # overwrite to process with simple pad
        if len(image_file) > 1:
            image = [self.process_image(f, "pad") for f in image_file]
            image = [[im[0], im[1], "image"] for im in image]
```

Tagging @zucchini-nlp to see if she can kindly confirm this and whether she has any ideas.
Hey all! Yes, you're right, currently the HF implementation doesn't distinguish between the single vs multi-image setting. AFAIR inference in the orig impl also did not, unless I am missing something, as many things changed in the course of porting the model. I can check the inference in the original repo later. I am sorry HF doesn't support training the same way as in the paper. The reason is that I tried to reduce complexity for the model and aimed at inference-first use cases. We can add the single vs multi-image difference, but it would require padding/unpadding similar to Mllama, and in that case more code to work with. Let me see how the inference works in the original repo and I'll come back to this issue; right now I'm a bit short on bandwidth.
Thank you!
One thing I would like to share:

TL;DR: Set a large enough `model_max_length` (e.g., 2048, 4096, or even larger) when finetuning, or otherwise you are likely to see the training loss always being 0.

Today we have enabled finetuning of LLaVA-Onevision in lmms-finetune. There is a quite subtle caveat, though, that is worth mentioning.

In earlier versions of transformers (can't remember exactly, but it must be some point before 4.45.2), `model_max_length` only counts the number of text tokens without considering the vision tokens. Take LLaVA-1.5 as an example, where each image is translated into 576 tokens when sent to the LLM: if you set `model_max_length` to 128, then with a prompt including one image your input sequence length will essentially be 128 - 1 + 576 = 703.

Recent transformers implementations start to include the vision tokens in `model_max_length`, as you can see here: https://github.com/huggingface/transformers/blob/3f06f95ebe617b192251ef756518690f5bc7ff76/src/transformers/models/llava/processing_llava.py#L143-L164. Such processing requires some arguments/keywords from the processor's config, which, as of Oct 16, haven't been updated for LLaVA-1.5/1.6/Interleave/Next-Video. The latest LLaVA-Onevision, however, is fully compatible with this new change, which means that `model_max_length` will include all vision tokens. As a result, remember to set a large enough `model_max_length` when finetuning ~~LLaVA-Onevision~~ every model, or otherwise you will probably see the loss being 0 all the time, because all input tokens could be vision tokens.

I hope I have made this clear enough, but feel free to leave questions if there are any.
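To make the failure mode concrete, here is a rough sketch (it simply reuses `dataset`, `collator`, `tokenizer`, and `DataLoader` from the debugging snippet earlier in this thread, so treat it as an illustration rather than a verified recipe): when `model_max_length` is too small, truncation can leave no supervised label positions at all, and a loss computed only over `-100`-masked targets stays at 0.

```python
# Hedged sketch reusing names from the earlier snippet. With anyres preprocessing a
# single large image can consume thousands of tokens, so a small model_max_length
# can truncate away every answer token before the loss is computed.
for max_len in (512, 8192):  # deliberately too small vs. comfortably large
    tokenizer.model_max_length = max_len
    batch = next(iter(DataLoader(dataset, batch_size=2, collate_fn=collator)))
    num_supervised = (batch["labels"] != -100).sum().item()
    # 0 supervised tokens means every target is ignored, so the training loss degenerates to 0
    print(f"model_max_length={max_len}: {num_supervised} supervised label tokens")
```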