System Info / 系統信息
Peft v0.13.2
Transformers v4.44.0
Accelerate v0.33.0

Who can help? / 谁可以帮助到您?
No response

Reproduction / 复现过程
I am trying to train a model with finetune.py from glm4-9b combined with xLoRA from PEFT. finetune.py has not been modified in any way. Here is my xlora.yaml:
data_config:
  train_file: train.jsonl
  val_file: dev.jsonl
  test_file: dev.jsonl
  num_proc: 1
combine: True
freezeV: True
max_input_length: 512
max_output_length: 512
training_args:
  # see `transformers.Seq2SeqTrainingArguments`
  output_dir: ./output_1026
  max_steps: 20000
  # needed to be fit for the dataset
  learning_rate: 3e-4
  # settings for data loading
  per_device_train_batch_size: 1
  dataloader_num_workers: 16
  remove_unused_columns: false
  # settings for saving checkpoints
  save_strategy: steps
  save_steps: 5
  # settings for logging
  log_level: info
  logging_strategy: steps
  logging_steps: 5
  # settings for evaluation
  per_device_eval_batch_size: 4
  eval_strategy: steps
  eval_steps: 2000
  # settings for optimizer
  adam_epsilon: 1e-6
  # uncomment the following line to detect nan or inf values
  # debug: underflow_overflow
  predict_with_generate: true
  # see `transformers.GenerationConfig`
  generation_config:
    max_new_tokens: 512
  # set your absolute deepspeed path here
  # deepspeed: configs/ds_zero_3.json
peft_config:
  peft_type: XLORA
  task_type: CAUSAL_LM
  hidden_size: 4096
  xlora_depth: 1
  adapters: {
    "adapter_0": "/home/hs/hs/finetune_demo/output_MechanicsMaterials_New/checkpoint-5000/",
    "adapter_1": "/home/hs/hs/finetune_demo/output_biology/checkpoint-4000/",
  }
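For reference, the peft_config section above maps onto PEFT's X-LoRA API roughly as follows. This is only a sketch of what finetune.py ends up doing, not its actual code; the Hugging Face model id and dtype below are assumptions, and the adapter paths are simply the ones from the YAML:

```python
import torch
from transformers import AutoModelForCausalLM
from peft import XLoraConfig, get_peft_model

# Load the GLM-4 base model (the HF id is an assumption; a local glm-4-9b path works the same way).
model = AutoModelForCausalLM.from_pretrained(
    "THUDM/glm-4-9b-chat",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)

# Mirror of the peft_config section in xlora.yaml: two pre-trained LoRA adapters
# are registered and an xLoRA scaling classifier of depth 1 is trained to mix them.
xlora_config = XLoraConfig(
    task_type="CAUSAL_LM",
    hidden_size=4096,
    xlora_depth=1,
    adapters={
        "adapter_0": "/home/hs/hs/finetune_demo/output_MechanicsMaterials_New/checkpoint-5000/",
        "adapter_1": "/home/hs/hs/finetune_demo/output_biology/checkpoint-4000/",
    },
)
model = get_peft_model(model, xlora_config)
model.print_trainable_parameters()
```

Wrapping the model this way freezes the listed adapters and trains only the xLoRA scaling head, which matches the "Froze 160 adapters" line in the log further down.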
adapter_0 and adapter_1 in the YAML above are LoRA adapters that I trained with glm4-9b and finetune.py. Training with xLoRA runs and saves checkpoints, but resuming training from the last checkpoint fails with an error. Here is the complete error output:
Loading checkpoint shards: 100%|██████████| 10/10 [00:01<00:00, 6.82it/s]
100%|██████████| 2/2 [00:06<00:00, 3.22s/it]
Froze 160 adapters.
LoRA -> xLoRA complete: Swapped 40 LoRA layers (out of 971 modules).
trainable params: 67,145,732 || all params: 9,472,667,652 || trainable%: 0.7088
Map: 100%|██████████| 14803/14803 [00:28<00:00, 528.34 examples/s]
train_dataset: Dataset({
    features: ['input_ids', 'labels'],
    num_rows: 14803
})
Map: 100%|██████████| 2/2 [00:00<00:00, 187.78 examples/s]
val_dataset: Dataset({
    features: ['input_ids', 'output_ids'],
    num_rows: 2
})
Map: 100%|██████████| 2/2 [00:00<00:00, 189.77 examples/s]
test_dataset: Dataset({
    features: ['input_ids', 'output_ids'],
    num_rows: 2
})
Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
max_steps is given, it will override any value given in num_train_epochs
resume checkpoint from checkpoint-20
Loading model from ./output_new/checkpoint-20.
Multiple active adapters detected will only consider the first adapter
[2024-10-15 18:47:30,968] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
/home/zhangjunyi/anaconda3/lib/python3.11/site-packages/transformers/trainer.py:3098: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  torch.load(os.path.join(checkpoint, OPTIMIZER_NAME), map_location=map_location)
[rank0]: Traceback (most recent call last):
[rank0]:   /home/zhangjunyi/hs_test/finetune_demo/finetune.py:615 in main
[rank0]:     trainer.train(resume_from_checkpoint=checkpoint_directory)
[rank0]:   /home/zhangjunyi/anaconda3/lib/python3.11/site-packages/transformers/trainer.py:1938 in train
[rank0]:     return inner_training_loop(
[rank0]:   /home/zhangjunyi/anaconda3/lib/python3.11/site-packages/transformers/trainer.py:2126 in _inner_training_loop
[rank0]:     self._load_optimizer_and_scheduler(resume_from_checkpoint)
[rank0]:   /home/zhangjunyi/anaconda3/lib/python3.11/site-packages/transformers/trainer.py:3097 in _load_optimizer_and_scheduler
[rank0]:     self.optimizer.load_state_dict(
[rank0]:         torch.load(os.path.join(checkpoint, OPTIMIZER_NAME), map_location=map_location)
[rank0]:     )
[rank0]:   /home/zhangjunyi/anaconda3/lib/python3.11/site-packages/accelerate/optimizer.py:107 in load_state_dict
[rank0]:     self.optimizer.load_state_dict(state_dict)
[rank0]:   /home/zhangjunyi/anaconda3/lib/python3.11/site-packages/torch/_compile.py:31 in inner
[rank0]:     return disable_fn(*args, **kwargs)
[rank0]:   /home/zhangjunyi/anaconda3/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:600 in _fn
[rank0]:     return fn(*args, **kwargs)
[rank0]:   /home/zhangjunyi/anaconda3/lib/python3.11/site-packages/torch/optim/optimizer.py:854 in load_state_dict
[rank0]:     raise ValueError(
[rank0]: ValueError: loaded state dict contains a parameter group that doesn't match the size of optimizer's group
E1015 18:47:35.719000 139827737793152 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 4075482) of binary: /home/zhangjunyi/anaconda3/bin/python
Traceback (most recent call last):
  File "/home/zhangjunyi/anaconda3/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/zhangjunyi/anaconda3/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
  File "/home/zhangjunyi/anaconda3/lib/python3.11/site-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/home/zhangjunyi/anaconda3/lib/python3.11/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/home/zhangjunyi/anaconda3/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/zhangjunyi/anaconda3/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError
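The ValueError above is raised by torch.optim.Optimizer.load_state_dict: the optimizer rebuilt on resume ends up with parameter groups whose sizes do not match those recorded in the checkpoint's optimizer.pt. A purely illustrative way to compare the two sides (assuming the checkpoint path from the log above, and trainer standing for the Seq2SeqTrainer built in finetune.py):

```python
import torch

# Parameter-group sizes recorded in the checkpoint saved before the restart
saved_state = torch.load("./output_new/checkpoint-20/optimizer.pt", map_location="cpu")
print([len(g["params"]) for g in saved_state["param_groups"]])

# For comparison, inside finetune.py (after the model has been re-wrapped with
# xLoRA and the trainer is built), the freshly created optimizer can be inspected:
#   trainer.create_optimizer()
#   print([len(g["params"]) for g in trainer.optimizer.param_groups])
```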
This is the checkpoint that was saved while training with xLoRA.
Expected behavior / 期待表现
Any guidance on resolving this xLoRA checkpoint-resume issue would be greatly appreciated. If anyone has run into a similar problem, or has insight into the specific settings or steps needed to get checkpoint resumption working with xLoRA, your advice would be invaluable. It would also be very helpful if any maintainers or community members familiar with xLoRA could offer support. Thank you very much!
Please first try the steps in https://zhipu-ai.feishu.cn/wiki/QanjwjOuaiWMZ6kdVZfcNZwCnBh?fromScene=spaceOverview and see whether that works; append yes to the end of the command.
Thank you for the reply. I followed the demo and appended yes to the command, and I also tried specifying trainer.train(resume_from_checkpoint="/home/zhangjunyi/hs_test/finetune_demo/output_new/checkpoint-20") directly in finetune.py. Neither approach resumes training.
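For completeness, the second attempt corresponds to a change along these lines inside finetune.py (a sketch only; the checkpoint path is the one quoted above):

```python
# Bypass the "yes" auto-resume logic and point the trainer at the last
# xLoRA checkpoint explicitly.
trainer.train(
    resume_from_checkpoint="/home/zhangjunyi/hs_test/finetune_demo/output_new/checkpoint-20"
)
```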