
Is it possible to resume from checkpoint in run_mntp.py? #145

Open

yallk opened this issue Sep 12, 2024 · 1 comment

yallk commented Sep 12, 2024

I wanted to resume from a checkpoint because an issue occurred during MNTP training and it was interrupted. However, when I resumed training, I received the following message saying that no checkpoint index could be found:

[rank0]: ValueError: Can't find a checkpoint index (pytorch_model.bin.index.json or model.safetensors.index.json) in XXX

When I looked into the run_mntp.py source,

    # Initialize our Trainer
    trainer = MNTPTrainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset if training_args.do_train else None,
        eval_dataset=eval_dataset if training_args.do_eval else None,
        tokenizer=tokenizer,
        data_collator=data_collator,
        compute_metrics=compute_metrics
        if training_args.do_eval and not is_torch_tpu_available()
        else None,
        preprocess_logits_for_metrics=preprocess_logits_for_metrics
        if training_args.do_eval and not is_torch_tpu_available()
        else None,
    )

    trainer.add_callback(StopTrainingCallback(custom_args.stop_after_n_steps))

    # Training
    if training_args.do_train:
        checkpoint = None
        if training_args.resume_from_checkpoint is not None:
            checkpoint = training_args.resume_from_checkpoint
        elif last_checkpoint is not None:
            checkpoint = last_checkpoint
        train_result = trainer.train(resume_from_checkpoint=checkpoint)
        trainer.save_model()  # Saves the tokenizer too for easy upload
        metrics = train_result.metrics

The source stores the LLM in the model variable and the PeftModel in the model.model variable.

Looking at trainer.train, it takes model as an argument, so it seems to train all of the parameters of the LLM plus the PEFT adapter. Please let me know if I understand this correctly.

If so, is there a way to train only the PEFT part separately, and is there a way to restart from a checkpoint when training is interrupted?
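
For reference, a minimal sketch (not taken from run_mntp.py; the model name and LoRA settings below are illustrative) of how one can verify that only the PEFT adapter parameters are trainable once the base model is wrapped, while the base weights stay frozen:

    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model

    # Illustrative base model; run_mntp.py configures its own model and LoRA settings.
    base = AutoModelForCausalLM.from_pretrained("gpt2")

    # Wrapping with a LoRA adapter freezes the base weights; only the adapter
    # parameters remain trainable.
    peft_model = get_peft_model(
        base, LoraConfig(task_type="CAUSAL_LM", r=8, lora_alpha=16, lora_dropout=0.05)
    )

    # Reports the number of trainable parameters versus the total parameter count.
    peft_model.print_trainable_parameters()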

yallk closed this as completed Sep 12, 2024
yallk reopened this Sep 12, 2024
vaibhavad (Collaborator) commented

Hi @yallk,

The script currently does not support resuming from a checkpoint. I remember that PEFT and Hugging Face's default resume-from-checkpoint mechanism were not compatible when we were running the experiments. I am not sure whether that has changed since. Feel free to raise a PR if you have a solution.
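
One possible workaround, sketched under the assumption that the interrupted run saved a PEFT-style checkpoint (adapter config plus adapter weights rather than a full-model index) and with placeholder paths: reload the adapter into a freshly built base model before calling trainer.train(), instead of passing resume_from_checkpoint. This restores only the adapter parameters, not the optimizer or scheduler state.

    from transformers import AutoModelForCausalLM
    from peft import PeftModel

    # Placeholder paths -- substitute the actual base model and checkpoint directory.
    base = AutoModelForCausalLM.from_pretrained("path/to/base_model")
    model = PeftModel.from_pretrained(base, "output_dir/checkpoint-XXX", is_trainable=True)

    # Pass this model to MNTPTrainer as usual and call trainer.train() without
    # resume_from_checkpoint: the adapter weights are restored, but the optimizer
    # and scheduler start fresh.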
