
loss being 0 is a caveat of model_max_length #43

Open
zjysteven opened this issue Oct 17, 2024 · 10 comments

@zjysteven
Owner

zjysteven commented Oct 17, 2024

TL;DR: Set a large enough model_max_length (e.g., 2048, 4096, or even larger) when finetuning; otherwise you will likely see the training loss always being 0.

Today we enabled finetuning of LLaVA-Onevision in lmms-finetune. There is quite a subtle caveat, though, that's worth mentioning.

In earlier versions of transformers (I can't remember exactly which, but it was some point before 4.45.2), model_max_length only counted the number of text tokens, without considering the vision tokens. Take LLaVA-1.5 as an example, where each image is translated into 576 tokens when sent to the LLM: if you set model_max_length to 128, then with a prompt containing one image, your actual input sequence length will essentially be 128 - 1 + 576 = 703.
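Here is that arithmetic as a tiny sketch (the numbers are the LLaVA-1.5 ones from above; the variable names are purely illustrative):

model_max_length = 128          # in old versions, this budget covered only the text tokens
image_placeholder_tokens = 1    # the single <image> placeholder in the prompt
vision_tokens_per_image = 576   # what one image expands into inside the LLM

effective_length = model_max_length - image_placeholder_tokens + vision_tokens_per_image
print(effective_length)  # 703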

Recently, transformers implementations have started to include the vision tokens in model_max_length, as you can see here: https://github.com/huggingface/transformers/blob/3f06f95ebe617b192251ef756518690f5bc7ff76/src/transformers/models/llava/processing_llava.py#L143-L164. Such processing requires some arguments/keywords from the processor's config, which, as of Oct 16, has not been updated for LLaVA-1.5/1.6/Interleave/Next-Video. The latest LLaVA-Onevision, however, is fully compatible with this new change, which means that model_max_length will include all vision tokens. As a result, remember to set a large enough model_max_length when finetuning (not just LLaVA-Onevision, but every model once it adopts this change), or you will probably see the loss being 0 all the time, since after truncation all remaining input tokens could be vision tokens.
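To see the new behavior concretely, here is a minimal sketch (the hub id llava-hf/llava-onevision-qwen2-0.5b-ov-hf, the dummy image, and the prompt are just illustrative, and this assumes a recent transformers version): the processor expands the <image> placeholder into the full set of vision tokens before tokenization, so model_max_length now has to cover them too.

from PIL import Image
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("llava-hf/llava-onevision-qwen2-0.5b-ov-hf")
dummy_image = Image.new("RGB", (1080, 1080))  # dummy image, just to count tokens
inputs = processor(text="<image>\nDescribe the image.", images=dummy_image, return_tensors="pt")
print(inputs["input_ids"].shape[-1])  # already includes every vision token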

I hope that I have made this clear enough, but feel free to leave questions if there are any.

@zjysteven zjysteven pinned this issue Oct 17, 2024
@98986oiuoy

The comment is so helpful!

@zjysteven zjysteven changed the title A caveat of model_max_length when using LLaVA-Onevision [LLaVA-Onevision] loss being 0 is a caveat of model_max_length Oct 18, 2024
@sunfanyunn

I'm still getting a loss of 0 after setting --model_max_length to 4096 or more (only with llava-onevision). Are there other reasons that could be causing this?

@zjysteven
Owner Author

@sunfanyunn model_max_length might still be too small w.r.t. your input. You can use this script to examine the output of the collator and see whether model_max_length is large enough.

import json
import os
from tqdm import tqdm

import torch
torch.set_printoptions(profile="full", linewidth=240)
from torch.utils.data import DataLoader
from transformers import AutoProcessor, AutoTokenizer

# lmms-finetune modules (run this from the repo root)
from datasets import LazySupervisedDataset
from collators import COLLATORS
from loaders import LOADERS
from supported_models import MODEL_HF_PATH

model_id = "llava-onevision-0.5b-ov"
model_family_id = "llava-onevision"

dataset = LazySupervisedDataset(
    data_path='./example_data/single_image.json', # use your own data here
    image_folder='./example_data/images',
    video_folder='./example_data/videos',
    model_family_id=model_family_id,
)

_, tokenizer, processor, config = LOADERS[model_family_id](
    model_hf_path=MODEL_HF_PATH[model_id],
    model_local_path=MODEL_HF_PATH[model_id],
    compute_dtype=torch.float16,
).load(load_model=False)
tokenizer.model_max_length = 4096  # make sure this is large enough to cover the vision tokens
collator = COLLATORS[model_family_id](
    config=config,
    processor=processor,
    tokenizer=tokenizer
)

dataloader = DataLoader(dataset, batch_size=2, collate_fn=collator)

batch = next(iter(dataloader))
# token ids fed to the model, including the expanded vision tokens
print(batch["input_ids"])
print()
# labels: positions set to -100 are ignored by the loss
print(batch["labels"])
print()
# decoded inputs
print(tokenizer.batch_decode(batch["input_ids"], skip_special_tokens=False))
# decoded supervised targets of the second sample (only the non -100 positions)
print(tokenizer.decode(
    batch["labels"][1][torch.where(batch["labels"][1] != -100)[0]], skip_special_tokens=True
))
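If model_max_length is too small, the tell-tale sign in this output is that the labels are entirely -100, i.e., every answer token got truncated away, which is exactly when the training loss stays at 0. A quick check, reusing the batch from above:

# True for any sample whose labels are all -100 (nothing left to supervise)
print((batch["labels"] == -100).all(dim=1))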

@sunfanyunn

[screenshots of the collator output]

I realized my input images (of size 1080 x 1080) are being tokenized into 7371 tokens. @zjysteven am I missing anything obvious?

@zjysteven
Owner Author

zjysteven commented Oct 24, 2024

[table from the LLaVA-OneVision paper: maximum number of visual tokens per image configuration]
This is from the LLaVA-OneVision paper, which shows that the maximum number of tokens for one image is 7290. Although yours is slightly more (probably due to some marker tokens like newline tokens), I don't think anything is wrong. This is what I meant earlier: your model_max_length may simply not be large enough.
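If you want to verify the count yourself, a rough check (reusing batch and config from the script above, and assuming the loaded config exposes image_token_index, as the HF LLaVA-family configs do) is to count how many positions in input_ids are image tokens:

num_vision_tokens = (batch["input_ids"][0] == config.image_token_index).sum().item()
print(num_vision_tokens)  # roughly 7290+ for a single 1080 x 1080 image under anyres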

@sunfanyunn

sunfanyunn commented Oct 25, 2024

I am aware, thank you! But I think my images are represented in the single-image way even when I provide multiple images.

@zjysteven
Owner Author

zjysteven commented Oct 25, 2024

Oh, I see your point now. I briefly browsed through Hugging Face's preprocessing code but didn't notice any place where "single-image" and "multi-image" inputs are distinguished with different preprocessing; it seems to me that currently all images are processed with "anyres", which results in what you saw here.

Meanwhile, I do see that in the official implementation's training code https://github.com/LLaVA-VL/LLaVA-NeXT/blob/79ef45a6d8b89b92d7a8525f077c3a3a9894a87d/llava/train/train.py#L1140-L1148 single-image and multi-image inputs are distinguished.

        if "image" in sources[0]:
            image_file = self.list_data_dict[i]["image"]
            if type(image_file) is list:
                image = [self.process_image(f) for f in image_file]
                # Handling multi images
                # overwrite to process with simple pad 
                if len(image_file) > 1:
                    image = [self.process_image(f, "pad") for f in image_file]
                    image = [[im[0], im[1], "image"] for im in image]

Tagging @zucchini-nlp to see if she can kindly confirm this or has any ideas.

@zucchini-nlp

Hey all!

Yes, you're right, the current HF implementation doesn't distinguish between the single- vs. multi-image setting. AFAIR inference in the original implementation also did not, unless I am missing something, as many things changed in the course of porting the model. I can check the inference in the original repo later.

I am sorry that HF doesn't support training the same way as in the paper. The reason is that I tried to reduce complexity for the model and aimed at inference-first use cases. We can add the single- vs. multi-image distinction, but it would require padding/unpadding similar to Mllama, in that case adding more code to work with.

Let me see how the inference works in the original repo and I'll come back to this issue; right now I'm a bit short on bandwidth.

@sunfanyunn

Thank you!

@sailfish009

One thing I would like to share:

### After finetuning
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration  # llava-onevision
from transformers import AutoProcessor, LlavaForConditionalGeneration           # llava
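For example, a minimal loading sketch after finetuning (the checkpoint path below is a placeholder):

ckpt = "path/to/your-finetuned-llava-onevision"  # placeholder path
processor = AutoProcessor.from_pretrained(ckpt)
model = LlavaOnevisionForConditionalGeneration.from_pretrained(ckpt)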

@zjysteven zjysteven changed the title [LLaVA-Onevision] loss being 0 is a caveat of model_max_length loss being 0 is a caveat of model_max_length Nov 24, 2024