I wanted to restart from a checkpoint because MNTP training was interrupted by an issue.
However, when I resumed training, I got the following error saying there is no index in the checkpoint:
[rank0]: ValueError: Can't find a checkpoint index (pytorch_model.bin.index.json or model.safetensors.index.json) in XXX
When I looked into the run_mntp.py source, I found the following:
```python
# Initialize our Trainer
trainer = MNTPTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset if training_args.do_train else None,
    eval_dataset=eval_dataset if training_args.do_eval else None,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics
    if training_args.do_eval and not is_torch_tpu_available()
    else None,
    preprocess_logits_for_metrics=preprocess_logits_for_metrics
    if training_args.do_eval and not is_torch_tpu_available()
    else None,
)
trainer.add_callback(StopTrainingCallback(custom_args.stop_after_n_steps))

# Training
if training_args.do_train:
    checkpoint = None
    if training_args.resume_from_checkpoint is not None:
        checkpoint = training_args.resume_from_checkpoint
    elif last_checkpoint is not None:
        checkpoint = last_checkpoint
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
    trainer.save_model()  # Saves the tokenizer too for easy upload
    metrics = train_result.metrics
```
The script stores the LLM in the `model` variable and the PeftModel in `model.model`.
Since `trainer.train` then receives `model`, it looks like it trains all of the LLM + PEFT parameters. Please let me know if my understanding is correct.
If so, is there a way to train only the PEFT part, and is there a way to restart from a checkpoint after an interruption?
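For context on the error itself: the Trainer refuses to resume because the checkpoint folder contains only adapter weights, not the sharded-model index files it looks for. The sketch below is only an illustration (the index file names are taken from the error message; the adapter file names are the ones PEFT typically writes, which is an assumption here, and the helper is not part of run_mntp.py):

```python
import os
import tempfile

# Index files the Trainer looks for when resuming (from the error message),
# versus the adapter-only files a PEFT checkpoint typically contains
# (assumed names, for illustration).
INDEX_FILES = ("pytorch_model.bin.index.json", "model.safetensors.index.json")
ADAPTER_FILES = ("adapter_model.safetensors", "adapter_model.bin")

def describe_checkpoint(ckpt_dir: str) -> str:
    """Classify a checkpoint directory by the weight files it contains."""
    files = set(os.listdir(ckpt_dir))
    if any(name in files for name in INDEX_FILES):
        return "full-model checkpoint (resumable by the Trainer)"
    if any(name in files for name in ADAPTER_FILES):
        return "adapter-only checkpoint (no index file, so resuming raises ValueError)"
    return "no recognizable weight files"
```

Running this on the failing checkpoint directory should confirm whether only adapter files were saved.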
The script currently does not support resuming from a checkpoint. I remember PEFT and Hugging Face's default checkpoint resuming not being compatible when we were running the experiments. I am not sure whether that has changed since. Feel free to raise a PR if you have a solution.