
LoRA fine-tuned model: continuing training with --resume_from_checkpoint fails with out-of-memory; training works normally without --resume_from_checkpoint #2505

Open
xyz515 opened this issue Nov 26, 2024 · 1 comment

xyz515 commented Nov 26, 2024

Describe the bug
Below is the LoRA fine-tuning script. If --resume_from_checkpoint is used to load the previously fine-tuned checkpoint and continue training, the run fails with an out-of-memory error. GPU: A800 with 40 GB of memory.
If --resume_from_checkpoint is not passed, fine-tuning runs normally.

nproc_per_node=8
max_length=8300
model_id_or_path=./Qwen2-7B-Instruct
model_type=qwen2-7b-instruct

NPROC_PER_NODE=$nproc_per_node \
MASTER_PORT=29500 \
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
swift sft \
    --model_id_or_path $model_id_or_path \
    --model_type $model_type \
    --model_revision master \
    --sft_type lora \
    --tuner_backend peft \
    --template_type AUTO \
    --dtype bf16 \
    --output_dir output \
    --ddp_backend nccl \
    --custom_train_dataset_path train.json \
    --lora_rank 16 \
    --lora_alpha 64 \
    --lora_dropout_p 0.05 \
    --lora_target_modules DEFAULT \
    --custom_val_dataset_path test_2.json \
    --train_dataset_sample -1 \
    --num_train_epochs 1 \
    --max_length $max_length \
    --check_dataset_strategy warning \
    --gradient_checkpointing true \
    --batch_size 1 \
    --weight_decay 0.1 \
    --learning_rate 3e-4 \
    --gradient_accumulation_steps $(expr 128 / $nproc_per_node) \
    --max_grad_norm 0.5 \
    --warmup_ratio 0.03 \
    --eval_steps 1000 \
    --save_steps 1000 \
    --save_total_limit 10 \
    --logging_steps 10 \
    --use_flash_attn true \
    --save_only_model true \
    --lazy_tokenize true \
    --deepspeed 'default-zero2' \
    --resume_from_checkpoint output/qwen2-7b-instruct/v14-20241123-110642/checkpoint-8790 \
    --resume_only_model true

Error screenshot
[image]
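
A quick way to narrow down what resume is actually loading (a hedged diagnostic suggestion, not from the original report; the checkpoint path is the one used in the script above) is to inspect the checkpoint directory. With --save_only_model true it should contain only the adapter weights, so multi-gigabyte optimizer state files here would be unexpected:

# List the checkpoint contents and their sizes; with --save_only_model true,
# only adapter/model weight files should be present.
ls -lh output/qwen2-7b-instruct/v14-20241123-110642/checkpoint-8790
du -sh output/qwen2-7b-instruct/v14-20241123-110642/checkpoint-8790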

Your hardware and system info
torch 2.1.2+cu121
ms-swift 2.6.0.post2
transformers 4.42.0
NVIDIA-SMI 525.147.05 Driver Version: 525.147.05 CUDA Version: 12.1
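
To compare peak usage between the two launch variants, per-GPU memory can be sampled while each run starts (a standard nvidia-smi invocation, added here as a suggestion rather than taken from the report):

# Print per-GPU memory usage every 2 seconds while the training job launches
nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv -l 2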

Additional context
On a machine with A800-80G GPUs, fine-tuning with --resume_from_checkpoint works; it just uses somewhat more GPU memory than running without --resume_from_checkpoint.
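
One possible workaround on the 40 GB cards (a hedged sketch, not verified in this thread) is to skip the resume path entirely: merge the saved LoRA adapter into the base model with swift export, then launch a fresh LoRA run on the merged weights. Since --save_only_model true was set, no optimizer state was saved, so nothing is lost by restarting rather than resuming:

# Hedged sketch: merge the checkpoint-8790 adapter into the base model.
# ms-swift 2.x writes the merged weights to a sibling directory
# (checkpoint-8790-merged by default).
swift export \
    --ckpt_dir output/qwen2-7b-instruct/v14-20241123-110642/checkpoint-8790 \
    --merge_lora true

A new run can then point --model_id_or_path at the merged directory instead of passing --resume_from_checkpoint.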

@wolfworld6

I'm running into the same problem. Has it been resolved?
