I am using an A10 24 GB GPU to train a LoRA model on video caption data. The model is EasyAnimateV5-12b-zh-InP. However, when running the script, I encounter the following error:
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 18.00 MiB. GPU 0 has a total capacity of 22.08 GiB of which 1.50 MiB is free. Process 802290 has 22.07 GiB memory in use. Of the allocated memory 21.82 GiB is allocated by PyTorch, and 8.17 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
I then tried reducing video_sample_size to 64, but the issue persists.
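For reference, the OOM message itself suggests setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True. A minimal sketch of applying it (assuming the variable is set before torch is imported, e.g. at the very top of scripts/train_lora.py; it only mitigates fragmentation and does not shrink the model's memory footprint):

```python
import os

# Allocator hint from the OOM message: use expandable segments to reduce
# fragmentation. It must take effect before the first CUDA allocation,
# so set it before importing torch.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import torch  # noqa: E402  (imported after the environment variable is set)
```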
After this, I changed the model to EasyAnimateV5-7b-zh-InP, but the training still failed. The log is as follows:
root@7891e0d6917c:~/workspace/EasyAnimate# sh scripts/train_lora.sh
The following values were not passed to accelerate launch and had defaults used instead:
        --num_processes was set to a value of 1
        --num_machines was set to a value of 1
        --dynamo_backend was set to a value of 'no'
To avoid this warning pass in values for each of the problematic parameters or run accelerate config.
/usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/flash.py:211: FutureWarning: torch.library.impl_abstract was renamed to torch.library.register_fake. Please use that instead; we will remove torch.library.impl_abstract in a future version of PyTorch.
@torch.library.impl_abstract("xformers_flash::flash_fwd")
/usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/flash.py:344: FutureWarning: torch.library.impl_abstract was renamed to torch.library.register_fake. Please use that instead; we will remove torch.library.impl_abstract in a future version of PyTorch.
@torch.library.impl_abstract("xformers_flash::flash_bwd")
/usr/local/lib/python3.10/dist-packages/albumentations/__init__.py:24: UserWarning: A new version of Albumentations is available: 1.4.22 (you have 1.4.21). Upgrade using: pip install -U albumentations. To disable automatic update checks, set the environment variable NO_ALBUMENTATIONS_UPDATE to 1.
check_for_updates()
12/17/2024 01:19:39 - INFO - __main__ - Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda
Mixed precision type: bf16
Init rng with seed 42. Process_index is 0
The config attributes {'snr_shift_scale': 1.0} were passed to DDPMScheduler, but are not expected and will be ignored. Please verify your scheduler_config.json configuration file.
{'variance_type', 'dynamic_thresholding_ratio', 'thresholding'} was not found in config. Values will be initialized to default values.
Init BertTokenizer
Init T5Tokenizer
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████| 2/2 [00:02<00:00, 1.22s/it]
The config attributes {'sample_size': 256} were passed to AutoencoderKLMagvit, but are not expected and will be ignored. Please verify your config.json configuration file.
missing keys: 0;
unexpected keys: 0;
[] []
loaded 3D transformer's pretrained weights from models/Diffusion_Transformer/EasyAnimateV5-7b-zh-InP/transformer ...
missing keys: 0;
unexpected keys: 0;
[]
All Parameters: 6813.097536 M
attn1 Parameters: 1359.40608 M
create LoRA network. base dim (rank): 128, alpha: 64
neuron dropout: p=None
create LoRA for Text Encoder: 144 modules.
create LoRA for U-Net: 402 modules.
enable LoRA for U-Net
12/17/2024 01:21:13 - INFO - root - Add network parameters
loading annotations from /root/workspace/EasyAnimate/easyanimate/video_caption/datasets/panda_70m/train_panda_70m.json ...
data scale: 55
12/17/2024 01:21:40 - INFO - __main__ - ***** Running training *****
12/17/2024 01:21:40 - INFO - __main__ - Num examples = 55
12/17/2024 01:21:40 - INFO - __main__ - Num Epochs = 100
12/17/2024 01:21:40 - INFO - __main__ - Instantaneous batch size per device = 1
12/17/2024 01:21:40 - INFO - __main__ - Total train batch size (w. parallel, distributed & accumulation) = 1
12/17/2024 01:21:40 - INFO - __main__ - Gradient Accumulation steps = 1
12/17/2024 01:21:40 - INFO - __main__ - Total optimization steps = 5500
Steps: 0%| | 0/5500 [00:00<?, ?it/s]xx top 10 [17, 24, 34, 33, 28, 40, 25, 27, 6, 44] 0
division by zero
division by zero
division by zero
[... the "division by zero" message repeats 55 times in total ...]
Traceback (most recent call last):
File "/root/workspace/EasyAnimate/scripts/train_lora.py", line 2065, in
main()
File "/root/workspace/EasyAnimate/scripts/train_lora.py", line 1521, in main
pixel_values, texts = batch['pixel_values'].cpu(), batch['text']
TypeError: 'NoneType' object is not subscriptable
Steps: 0%| | 0/5500 [00:01<?, ?it/s]
Traceback (most recent call last):
File "/usr/local/bin/accelerate", line 8, in
sys.exit(main())
File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/accelerate_cli.py", line 48, in main
args.func(args)
File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 1168, in launch_command
simple_launcher(args)
File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 763, in simple_launcher
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', 'scripts/train_lora.py', '--pretrained_model_name_or_path=models/Diffusion_Transformer/EasyAnimateV5-7b-zh-InP', '--train_data_dir=/root/workspace/EasyAnimate/easyanimate/video_caption/datasets/panda_70m', '--train_data_meta=/root/workspace/EasyAnimate/easyanimate/video_caption/datasets/panda_70m/train_panda_70m.json', '--config_path', 'config/easyanimate_video_v5_magvit_multi_text_encoder.yaml', '--image_sample_size=1024', '--video_sample_size=256', '--token_sample_size=512', '--video_sample_stride=3', '--video_sample_n_frames=49', '--train_batch_size=1', '--video_repeat=1', '--gradient_accumulation_steps=1', '--dataloader_num_workers=8', '--num_train_epochs=100', '--checkpointing_steps=100', '--learning_rate=1e-04', '--seed=42', '--low_vram', '--output_dir=output_dir', '--gradient_checkpointing', '--mixed_precision=bf16', '--adam_weight_decay=5e-3', '--adam_epsilon=1e-10', '--vae_mini_batch=1', '--max_grad_norm=0.05', '--random_hw_adapt', '--training_with_video_token_length', '--not_sigma_loss', '--enable_bucket', '--uniform_sampling', '--train_mode=inpaint']' returned non-zero exit status 1.
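The run prints one "division by zero" line per dataset sample (55 of them) before the traceback, which suggests that none of the clips referenced by train_panda_70m.json could be decoded, so the dataloader hands the training loop a None batch and batch['pixel_values'] fails. A quick readability check over the annotation file might narrow this down. This is only a sketch: it assumes each entry carries a file_path key relative to --train_data_dir (adjust to the actual schema) and uses decord for decoding:

```python
import json
import os

from decord import VideoReader  # pip install decord if it is not already present

DATA_DIR = "/root/workspace/EasyAnimate/easyanimate/video_caption/datasets/panda_70m"
META = os.path.join(DATA_DIR, "train_panda_70m.json")

with open(META) as f:
    annotations = json.load(f)

readable = 0
for item in annotations:
    # "file_path" is an assumption about the annotation schema; adjust if needed.
    path = os.path.join(DATA_DIR, item.get("file_path", ""))
    try:
        vr = VideoReader(path)
        if len(vr) == 0:
            print(f"{path}: decoded but contains 0 frames")
        else:
            readable += 1
    except Exception as exc:  # decord raises on missing or corrupt files
        print(f"{path}: {exc}")

print(f"readable clips: {readable} / {len(annotations)}")
```

If this reports zero readable clips, the failures point at the dataset paths or the video files themselves rather than at GPU memory or the model size.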