
Is it possible to resume from checkpoint in run_mntp.py? #145

Open

yallk opened this issue Sep 12, 2024 · 1 comment

yallk commented Sep 12, 2024

I wanted to resume from a checkpoint because an issue occurred during MNTP training and it was interrupted. However, when I resumed training, I received the following message saying that no checkpoint index could be found:

[rank0]: ValueError: Can't find a checkpoint index (pytorch_model.bin.index.json or model.safetensors.index.json) in XXX

When I looked into the run_mntp.py source,

    # Initialize our Trainer
    trainer = MNTPTrainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset if training_args.do_train else None,
        eval_dataset=eval_dataset if training_args.do_eval else None,
        tokenizer=tokenizer,
        data_collator=data_collator,
        compute_metrics=compute_metrics
        if training_args.do_eval and not is_torch_tpu_available()
        else None,
        preprocess_logits_for_metrics=preprocess_logits_for_metrics
        if training_args.do_eval and not is_torch_tpu_available()
        else None,
    )

    trainer.add_callback(StopTrainingCallback(custom_args.stop_after_n_steps))

    # Training
    if training_args.do_train:
        checkpoint = None
        if training_args.resume_from_checkpoint is not None:
            checkpoint = training_args.resume_from_checkpoint
        elif last_checkpoint is not None:
            checkpoint = last_checkpoint
        train_result = trainer.train(resume_from_checkpoint=checkpoint)
        trainer.save_model()  # Saves the tokenizer too for easy upload
        metrics = train_result.metrics

The source stores the LLM in the model variable and the PeftModel in the model.model variable.

Looking at trainer.train, it takes model as an argument, so it seems to train all of the parameters of the LLM plus the PEFT adapter. Please let me know if I understand this correctly.

If so, is there a way to train only the PEFT part separately, and is there a way to restart from a checkpoint when training is interrupted?
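
For reference, a minimal sketch (not taken from run_mntp.py; the model name and LoRA settings below are illustrative) of how one can verify that only the PEFT adapter parameters are trainable once the base model is wrapped, while the base weights stay frozen:

    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model

    # Illustrative base model; run_mntp.py configures its own model and LoRA settings.
    base = AutoModelForCausalLM.from_pretrained("gpt2")

    # Wrapping with a LoRA adapter freezes the base weights; only the adapter
    # parameters remain trainable.
    peft_model = get_peft_model(
        base, LoraConfig(task_type="CAUSAL_LM", r=8, lora_alpha=16, lora_dropout=0.05)
    )

    # Reports the number of trainable parameters versus the total parameter count.
    peft_model.print_trainable_parameters()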

yallk closed this as completed Sep 12, 2024
yallk reopened this Sep 12, 2024
vaibhavad (Collaborator) commented

Hi @yallk,

The script currently does not support resuming from a checkpoint. I remember that PEFT and Hugging Face's default resume-from-checkpoint mechanism were not compatible when we were running the experiments. I am not sure whether that has changed since. Feel free to raise a PR if you have a solution.
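
One possible workaround, sketched under the assumption that the interrupted run saved a PEFT-style checkpoint (adapter config plus adapter weights rather than a full-model index) and with placeholder paths: reload the adapter into a freshly built base model before calling trainer.train(), instead of passing resume_from_checkpoint. This restores only the adapter parameters, not the optimizer or scheduler state.

    from transformers import AutoModelForCausalLM
    from peft import PeftModel

    # Placeholder paths -- substitute the actual base model and checkpoint directory.
    base = AutoModelForCausalLM.from_pretrained("path/to/base_model")
    model = PeftModel.from_pretrained(base, "output_dir/checkpoint-XXX", is_trainable=True)

    # Pass this model to MNTPTrainer as usual and call trainer.train() without
    # resume_from_checkpoint: the adapter weights are restored, but the optimizer
    # and scheduler start fresh.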
