
loss being 0 is a caveat of model_max_length #43

Open
zjysteven opened this issue Oct 17, 2024 · 10 comments

@zjysteven
Owner

zjysteven commented Oct 17, 2024

TL;DR: Set a large enough model_max_length (e.g., 2048, 4096, or even larger) when finetuning; otherwise you will likely see the training loss always being 0.

Today we enabled finetuning of LLaVA-Onevision in lmms-finetune. There is quite a subtle caveat, though, that's worth mentioning.

In earlier versions of transformers (I can't remember exactly which, but it was some point before 4.45.2), model_max_length only counted the number of text tokens, without considering the vision tokens. Take LLaVA-1.5 as an example, where each image is translated into 576 tokens when sent to the LLM: if you set model_max_length to 128, then with a prompt containing one image, your actual input sequence length will essentially be 128 - 1 + 576 = 703.
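Here is that arithmetic as a tiny sketch (the numbers are the LLaVA-1.5 ones from above; the variable names are purely illustrative):

model_max_length = 128          # in old versions, this budget covered only the text tokens
image_placeholder_tokens = 1    # the single <image> placeholder in the prompt
vision_tokens_per_image = 576   # what one image expands into inside the LLM

effective_length = model_max_length - image_placeholder_tokens + vision_tokens_per_image
print(effective_length)  # 703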

Recently, transformers implementations have started to include the vision tokens in model_max_length, as you can see here: https://github.com/huggingface/transformers/blob/3f06f95ebe617b192251ef756518690f5bc7ff76/src/transformers/models/llava/processing_llava.py#L143-L164. Such processing requires some arguments/keywords from the processor's config, which, as of Oct 16, has not been updated for LLaVA-1.5/1.6/Interleave/Next-Video. The latest LLaVA-Onevision, however, is fully compatible with this new change, which means that model_max_length will include all vision tokens. As a result, remember to set a large enough model_max_length when finetuning (not just LLaVA-Onevision, but every model once it adopts this change), or you will probably see the loss being 0 all the time, since after truncation all remaining input tokens could be vision tokens.
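To see the new behavior concretely, here is a minimal sketch (the hub id llava-hf/llava-onevision-qwen2-0.5b-ov-hf, the dummy image, and the prompt are just illustrative, and this assumes a recent transformers version): the processor expands the <image> placeholder into the full set of vision tokens before tokenization, so model_max_length now has to cover them too.

from PIL import Image
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("llava-hf/llava-onevision-qwen2-0.5b-ov-hf")
dummy_image = Image.new("RGB", (1080, 1080))  # dummy image, just to count tokens
inputs = processor(text="<image>\nDescribe the image.", images=dummy_image, return_tensors="pt")
print(inputs["input_ids"].shape[-1])  # already includes every vision token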

I hope that I have made this clear enough, but feel free to leave questions if there are any.

@zjysteven zjysteven pinned this issue Oct 17, 2024
@98986oiuoy

The comment is so helpful!

@zjysteven zjysteven changed the title A caveat of model_max_length when using LLaVA-Onevision [LLaVA-Onevision] loss being 0 is a caveat of model_max_length Oct 18, 2024
@sunfanyunn

I'm still getting a loss of 0 after setting --model_max_length to 4096 or more (only with llava-onevision). Are there other reasons that could be causing this?

@zjysteven
Owner Author

@sunfanyunn model_max_length might still be too small w.r.t. your input. You can use this script to examine the output of the collator and see whether model_max_length is large enough.

import json
import os
from tqdm import tqdm

import torch
torch.set_printoptions(profile="full", linewidth=240)
from torch.utils.data import DataLoader
from transformers import AutoProcessor, AutoTokenizer

# lmms-finetune modules (run this from the repo root)
from datasets import LazySupervisedDataset
from collators import COLLATORS
from loaders import LOADERS
from supported_models import MODEL_HF_PATH

model_id = "llava-onevision-0.5b-ov"
model_family_id = "llava-onevision"

dataset = LazySupervisedDataset(
    data_path='./example_data/single_image.json', # use your own data here
    image_folder='./example_data/images',
    video_folder='./example_data/videos',
    model_family_id=model_family_id,
)

_, tokenizer, processor, config = LOADERS[model_family_id](
    model_hf_path=MODEL_HF_PATH[model_id],
    model_local_path=MODEL_HF_PATH[model_id],
    compute_dtype=torch.float16,
).load(load_model=False)
tokenizer.model_max_length = 4096  # make sure this is large enough to cover the vision tokens
collator = COLLATORS[model_family_id](
    config=config,
    processor=processor,
    tokenizer=tokenizer
)

dataloader = DataLoader(dataset, batch_size=2, collate_fn=collator)

batch = next(iter(dataloader))
# token ids fed to the model, including the expanded vision tokens
print(batch["input_ids"])
print()
# labels: positions set to -100 are ignored by the loss
print(batch["labels"])
print()
# decoded inputs
print(tokenizer.batch_decode(batch["input_ids"], skip_special_tokens=False))
# decoded supervised targets of the second sample (only the non -100 positions)
print(tokenizer.decode(
    batch["labels"][1][torch.where(batch["labels"][1] != -100)[0]], skip_special_tokens=True
))
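If model_max_length is too small, the tell-tale sign in this output is that the labels are entirely -100, i.e., every answer token got truncated away, which is exactly when the training loss stays at 0. A quick check, reusing the batch from above:

# True for any sample whose labels are all -100 (nothing left to supervise)
print((batch["labels"] == -100).all(dim=1))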

@sunfanyunn

[screenshots of the collator output]

I realized my input images (of size 1080 x 1080) are being tokenized into 7371 tokens. @zjysteven am I missing anything obvious?

@zjysteven
Owner Author

zjysteven commented Oct 24, 2024

[table from the LLaVA-OneVision paper: maximum number of visual tokens per image configuration]
This is from the LLaVA-OneVision paper, which shows that the maximum number of tokens for one image is 7290. Although yours is slightly more (probably due to some marker tokens like newline tokens), I don't think anything is wrong. This is what I meant earlier: your model_max_length may simply not be large enough.
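If you want to verify the count yourself, a rough check (reusing batch and config from the script above, and assuming the loaded config exposes image_token_index, as the HF LLaVA-family configs do) is to count how many positions in input_ids are image tokens:

num_vision_tokens = (batch["input_ids"][0] == config.image_token_index).sum().item()
print(num_vision_tokens)  # roughly 7290+ for a single 1080 x 1080 image under anyres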

@sunfanyunn

sunfanyunn commented Oct 25, 2024

I am aware, thank you! But I think my images are represented in the single-image way even when I provide multiple images.

@zjysteven
Owner Author

zjysteven commented Oct 25, 2024

Oh, I see your point now. I briefly browsed through Hugging Face's preprocessing code but didn't notice any place where "single-image" and "multi-image" inputs are distinguished with different preprocessing; it seems to me that currently all images are processed with "anyres", which results in what you saw here.

Meanwhile, I do see that in the official implementation's training code https://github.com/LLaVA-VL/LLaVA-NeXT/blob/79ef45a6d8b89b92d7a8525f077c3a3a9894a87d/llava/train/train.py#L1140-L1148 single-image and multi-image inputs are distinguished.

        if "image" in sources[0]:
            image_file = self.list_data_dict[i]["image"]
            if type(image_file) is list:
                image = [self.process_image(f) for f in image_file]
                # Handling multi images
                # overwrite to process with simple pad 
                if len(image_file) > 1:
                    image = [self.process_image(f, "pad") for f in image_file]
                    image = [[im[0], im[1], "image"] for im in image]

Tagging @zucchini-nlp to see if she can kindly confirm this or has any ideas.

@zucchini-nlp

Hey all!

Yes, you're right, the current HF implementation doesn't distinguish between the single- vs. multi-image setting. AFAIR inference in the original implementation also did not, unless I am missing something, as many things changed in the course of porting the model. I can check the inference in the original repo later.

I am sorry that HF doesn't support training the same way as in the paper. The reason is that I tried to reduce complexity for the model and aimed at inference-first use cases. We can add the single- vs. multi-image distinction, but it would require padding/unpadding similar to Mllama, in that case adding more code to work with.

Let me see how the inference works in the original repo and I'll come back to this issue; right now I'm a bit short on bandwidth.

@sunfanyunn

Thank you!

@sailfish009

One thing I would like to share:

### After finetuning
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration  # llava-onevision
from transformers import AutoProcessor, LlavaForConditionalGeneration           # llava
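For example, a minimal loading sketch after finetuning (the checkpoint path below is a placeholder):

ckpt = "path/to/your-finetuned-llava-onevision"  # placeholder path
processor = AutoProcessor.from_pretrained(ckpt)
model = LlavaOnevisionForConditionalGeneration.from_pretrained(ckpt)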

@zjysteven zjysteven changed the title [LLaVA-Onevision] loss being 0 is a caveat of model_max_length loss being 0 is a caveat of model_max_length Nov 24, 2024