
LoRA fine-tuned model: continuing training with --resume_from_checkpoint fails with out-of-memory; training works normally without --resume_from_checkpoint #2505

Open
xyz515 opened this issue Nov 26, 2024 · 1 comment

xyz515 commented Nov 26, 2024

Describe the bug
Below is the LoRA fine-tuning script. If --resume_from_checkpoint is used to load the previously fine-tuned checkpoint and continue training, the run fails with an out-of-memory error. GPU: A800 with 40 GB of memory.
If --resume_from_checkpoint is not passed, fine-tuning runs normally.

nproc_per_node=8
max_length=8300
model_id_or_path=./Qwen2-7B-Instruct
model_type=qwen2-7b-instruct

NPROC_PER_NODE=$nproc_per_node \
MASTER_PORT=29500 \
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
swift sft \
    --model_id_or_path $model_id_or_path \
    --model_type $model_type \
    --model_revision master \
    --sft_type lora \
    --tuner_backend peft \
    --template_type AUTO \
    --dtype bf16 \
    --output_dir output \
    --ddp_backend nccl \
    --custom_train_dataset_path train.json \
    --lora_rank 16 \
    --lora_alpha 64 \
    --lora_dropout_p 0.05 \
    --lora_target_modules DEFAULT \
    --custom_val_dataset_path test_2.json \
    --train_dataset_sample -1 \
    --num_train_epochs 1 \
    --max_length $max_length \
    --check_dataset_strategy warning \
    --gradient_checkpointing true \
    --batch_size 1 \
    --weight_decay 0.1 \
    --learning_rate 3e-4 \
    --gradient_accumulation_steps $(expr 128 / $nproc_per_node) \
    --max_grad_norm 0.5 \
    --warmup_ratio 0.03 \
    --eval_steps 1000 \
    --save_steps 1000 \
    --save_total_limit 10 \
    --logging_steps 10 \
    --use_flash_attn true \
    --save_only_model true \
    --lazy_tokenize true \
    --deepspeed 'default-zero2' \
    --resume_from_checkpoint output/qwen2-7b-instruct/v14-20241123-110642/checkpoint-8790 \
    --resume_only_model true

Error screenshot
[image]
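
A quick way to narrow down what resume is actually loading (a hedged diagnostic suggestion, not from the original report; the checkpoint path is the one used in the script above) is to inspect the checkpoint directory. With --save_only_model true it should contain only the adapter weights, so multi-gigabyte optimizer state files here would be unexpected:

# List the checkpoint contents and their sizes; with --save_only_model true,
# only adapter/model weight files should be present.
ls -lh output/qwen2-7b-instruct/v14-20241123-110642/checkpoint-8790
du -sh output/qwen2-7b-instruct/v14-20241123-110642/checkpoint-8790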

Your hardware and system info
torch 2.1.2+cu121
ms-swift 2.6.0.post2
transformers 4.42.0
NVIDIA-SMI 525.147.05 Driver Version: 525.147.05 CUDA Version: 12.1
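
To compare peak usage between the two launch variants, per-GPU memory can be sampled while each run starts (a standard nvidia-smi invocation, added here as a suggestion rather than taken from the report):

# Print per-GPU memory usage every 2 seconds while the training job launches
nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv -l 2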

Additional context
On a machine with A800-80G GPUs, fine-tuning with --resume_from_checkpoint works; it just uses somewhat more GPU memory than running without --resume_from_checkpoint.
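
One possible workaround on the 40 GB cards (a hedged sketch, not verified in this thread) is to skip the resume path entirely: merge the saved LoRA adapter into the base model with swift export, then launch a fresh LoRA run on the merged weights. Since --save_only_model true was set, no optimizer state was saved, so nothing is lost by restarting rather than resuming:

# Hedged sketch: merge the checkpoint-8790 adapter into the base model.
# ms-swift 2.x writes the merged weights to a sibling directory
# (checkpoint-8790-merged by default).
swift export \
    --ckpt_dir output/qwen2-7b-instruct/v14-20241123-110642/checkpoint-8790 \
    --merge_lora true

A new run can then point --model_id_or_path at the merged directory instead of passing --resume_from_checkpoint.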

@wolfworld6

I'm running into the same problem. Has it been resolved?
