Parallel Inference with xDiT unsuccessful #129

Closed
BestKuan opened this issue Dec 16, 2024 · 15 comments

@BestKuan

BestKuan commented Dec 16, 2024

Hello, I have a problem: I can't run parallel inference on a machine with 8 L40S GPUs (48GB of VRAM each). The run fails with an out-of-memory error on rank 0. Single-card inference succeeds, although it takes significantly longer.

@feifeibear
Contributor

We will check this issue ASAP.

@ximo2002

(HunyuanVideo) root@dd22:~/project/HunyuanVideo# torchrun --nproc_per_node=8 sample_video.py --video-size 1280 720 --video-length 129 --infer-steps 50 --prompt "A cat walks on the grass, realistic style." --flow-reverse --seed 42 --ulysses-degree 8 --ring-degree 1 --save-path ./results
W1216 08:53:53.827000 140297779558208 torch/distributed/run.py:779]
W1216 08:53:53.827000 140297779558208 torch/distributed/run.py:779] *****************************************
W1216 08:53:53.827000 140297779558208 torch/distributed/run.py:779] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W1216 08:53:53.827000 140297779558208 torch/distributed/run.py:779] *****************************************
Namespace(model='HYVideo-T/2-cfgdistill', latent_channels=16, precision='bf16', rope_theta=256, vae='884-16c-hy', vae_precision='fp16', vae_tiling=True, text_encoder='llm', text_encoder_precision='fp16', text_states_dim=4096, text_len=256, tokenizer='llm', prompt_template='dit-llm-encode', prompt_template_video='dit-llm-encode-video', hidden_state_skip_layer=2, apply_final_norm=False, text_encoder_2='clipL', text_encoder_precision_2='fp16', text_states_dim_2=768, tokenizer_2='clipL', text_len_2=77, denoise_type='flow', flow_shift=7.0, flow_reverse=True, flow_solver='euler', use_linear_quadratic_schedule=False, linear_schedule_end=25, model_base='ckpts', dit_weight='ckpts/hunyuan-video-t2v-720p/transformers/mp_rank_00_model_states.pt', model_resolution='540p', load_key='module', use_cpu_offload=False, batch_size=1, infer_steps=50, disable_autocast=False, save_path='./results', save_path_suffix='', name_suffix='', num_videos=1, video_size=[1280, 720], video_length=129, prompt='A cat walks on the grass, realistic style.', seed_type='auto', seed=42, neg_prompt=None, cfg_scale=1.0, embedded_cfg_scale=6.0, reproduce=False, ulysses_degree=8, ring_degree=1)
2024-12-16 08:53:56.565 | INFO | hyvideo.inference:from_pretrained:153 - Got text-to-video model root path: ckpts
DEBUG 12-16 08:53:56 [parallel_state.py:179] world_size=8 rank=1 local_rank=-1 distributed_init_method=env:// backend=nccl
[... the identical Namespace(...) dump and the "Got text-to-video model root path: ckpts" line are printed by each of the 8 ranks, followed by parallel_state.py DEBUG lines for ranks 0, 4, 5, 7, 6, 2, and 3 ...]
2024-12-16 08:54:02.117 | INFO | hyvideo.inference:from_pretrained:188 - Building model...
[... "Building model..." repeated by each of the 8 ranks ...]
[rank3]: Traceback (most recent call last):
[rank3]: File "/root/project/HunyuanVideo/sample_video.py", line 58, in <module>
[rank3]: main()
[rank3]: File "/root/project/HunyuanVideo/sample_video.py", line 25, in main
[rank3]: hunyuan_video_sampler = HunyuanVideoSampler.from_pretrained(models_root_path, args=args)
[rank3]: File "/root/project/HunyuanVideo/hyvideo/inference.py", line 193, in from_pretrained
[rank3]: model = load_model(
[rank3]: File "/root/project/HunyuanVideo/hyvideo/modules/__init__.py", line 17, in load_model
[rank3]: model = HYVideoDiffusionTransformer(
[rank3]: File "/root/miniconda3/envs/HunyuanVideo/lib/python3.10/site-packages/diffusers/configuration_utils.py", line 665, in inner_init
[rank3]: init(self, *args, **init_kwargs)
[rank3]: File "/root/project/HunyuanVideo/hyvideo/modules/models.py", line 561, in __init__
[rank3]: [
[rank3]: File "/root/project/HunyuanVideo/hyvideo/modules/models.py", line 562, in <listcomp>
[rank3]: MMSingleStreamBlock(
[rank3]: File "/root/project/HunyuanVideo/hyvideo/modules/models.py", line 291, in __init__
[rank3]: self.linear2 = nn.Linear(
[rank3]: File "/root/miniconda3/envs/HunyuanVideo/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 99, in __init__
[rank3]: self.weight = Parameter(torch.empty((out_features, in_features), **factory_kwargs))
[rank3]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 90.00 MiB. GPU 3 has a total capacity of 23.64 GiB of which 6.81 MiB is free. Including non-PyTorch memory, this process has 23.63 GiB memory in use. Of the allocated memory 23.18 GiB is allocated by PyTorch, and 15.36 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[... identical torch.OutOfMemoryError tracebacks follow for ranks 1, 2, 0, 5, 6, 7, and 4, differing only in the GPU index ...]
W1216 08:54:04.627000 140297779558208 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 2879711 closing signal SIGTERM
W1216 08:54:04.628000 140297779558208 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 2879712 closing signal SIGTERM
W1216 08:54:04.628000 140297779558208 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 2879713 closing signal SIGTERM
W1216 08:54:04.628000 140297779558208 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 2879715 closing signal SIGTERM
W1216 08:54:04.628000 140297779558208 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 2879716 closing signal SIGTERM
W1216 08:54:04.628000 140297779558208 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 2879717 closing signal SIGTERM
W1216 08:54:04.629000 140297779558208 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 2879718 closing signal SIGTERM
E1216 08:54:05.550000 140297779558208 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 3 (pid: 2879714) of binary: /root/miniconda3/envs/HunyuanVideo/bin/python
Traceback (most recent call last):
File "/root/miniconda3/envs/HunyuanVideo/bin/torchrun", line 33, in <module>
sys.exit(load_entry_point('torch==2.4.0', 'console_scripts', 'torchrun')())
File "/root/miniconda3/envs/HunyuanVideo/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
return f(*args, **kwargs)
File "/root/miniconda3/envs/HunyuanVideo/lib/python3.10/site-packages/torch/distributed/run.py", line 901, in main
run(args)
File "/root/miniconda3/envs/HunyuanVideo/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/root/miniconda3/envs/HunyuanVideo/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/root/miniconda3/envs/HunyuanVideo/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

sample_video.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2024-12-16_08:54:04
host : dd22
rank : 3 (local_rank: 3)
exitcode : 1 (pid: 2879714)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

(HunyuanVideo) root@dd22:~/project/HunyuanVideo#
I also hit this multi-GPU error on 8x 4090 cards: it reports insufficient VRAM.
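
For reference, the allocator hint printed in the traceback can be tried either as export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True before torchrun, or at the very top of sample_video.py as in the sketch below. Note that this only mitigates fragmentation; it cannot recover a genuine capacity shortfall like the one above, where nearly all of the card's 24 GiB is already allocated.

import os

# Must be set before the first CUDA allocation in this process; placing it
# before the torch import is the safest way to guarantee that.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch  # imported after the env var so the allocator picks it up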

@feifeibear
Contributor

@BestKuan @ximo2002 could you provide the script you used to run the parallel version? What is the resolution of the video?

@ximo2002

> Could you provide the script you used to run the parallel version? What is the resolution of the video?

torchrun --nproc_per_node=8 sample_video.py \
    --video-size 1280 720 \
    --video-length 129 \
    --infer-steps 50 \
    --prompt "A cat walks on the grass, realistic style." \
    --flow-reverse \
    --seed 42 \
    --ulysses-degree 8 \
    --ring-degree 1 \
    --save-path ./results

It's the latest version, by the way.

@jash101

jash101 commented Dec 16, 2024

I'm facing the same issue. I'm using a g6.12xlarge instance on AWS, which has 4 L4 GPUs (24GB of VRAM each).

The command I run is:

torchrun --nproc_per_node=4 sample_video.py \
    --video-size 1280 720 \
    --video-length 129 \
    --infer-steps 50 \
    --prompt "A cat walks on the grass, realistic style." \
    --flow-reverse \
    --seed 42 \
    --ulysses-degree 4 \
    --ring-degree 1 \
    --save-path ./results

I also tried 2x2 and 1x4 (ulysses-degree x ring-degree), but I'm still getting an out-of-memory error.

@feifeibear
Contributor

> I'm facing the same issue. I'm using a g6.12xlarge instance on AWS, which has 4 L4 GPUs (24GB of VRAM each). [...] I also tried 2x2 and 1x4, but I'm still getting an out-of-memory error.

I suppose you cannot run it successfully with 1 GPU either? Currently, per-GPU VRAM usage should be the same as in the single-GPU version.
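
If it helps to verify that, a small debugging helper (hypothetical, not part of the repo) could print per-rank allocator statistics right after model construction; with pure sequence parallelism, every rank should report roughly the same footprint:

import torch
import torch.distributed as dist

def log_vram(tag: str) -> None:
    # Report how much of this rank's GPU is currently in use.
    free, total = torch.cuda.mem_get_info()
    rank = dist.get_rank() if dist.is_initialized() else 0
    print(f"[rank {rank}] {tag}: "
          f"allocated={torch.cuda.memory_allocated() / 2**30:.2f} GiB, "
          f"free={free / 2**30:.2f} / {total / 2**30:.2f} GiB")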

@ximo2002

> I suppose you cannot run it successfully with 1 GPU either? Currently, per-GPU VRAM usage should be the same as in the single-GPU version.

So can it run on 4090s, if there are 8 cards?

@HenryBao91

The same error with 8x L20, but I can run successfully on a single L20.

@BestKuan
Author

BestKuan commented Dec 17, 2024

@feifeibear thanks for your reply! This is my command script:

export TOKENIZERS_PARALLELISM=false

export NPROC_PER_NODE=4
export ULYSSES_DEGREE=2
export RING_DEGREE=2
export CUDA_VISIBLE_DEVICES=0,1,2,3
torchrun --nproc_per_node=$NPROC_PER_NODE sample_video.py \
	--video-size  544 960 \
	--video-length 129 \
	--infer-steps 50 \
	--prompt "A baby walks on the grass, realistic style." \
	--seed 42 \
	--embedded-cfg-scale 6.0 \
	--flow-shift 7.0 \
	--flow-reverse \
	--ulysses-degree=$ULYSSES_DEGREE \
	--ring-degree=$RING_DEGREE \
	--save-path ./results

and here is my log file:
log_4cards.txt
I can run successfully on a single card with the same video-size and video-length.

@jash101

jash101 commented Dec 17, 2024

> I suppose you cannot run it successfully with 1 GPU either? Currently, per-GPU VRAM usage should be the same as in the single-GPU version.

@feifeibear thanks for your reply. I changed to a g6e.12xlarge (4x 48GB); while I'm able to run single-GPU inference at 544x960, I'm still unable to run parallel inference.

@Jeff123z

I simply ran git clone https://github.com/tencent/HunyuanVideo and read through the code. What I'm curious about is the parallel inference implementation.
sample_video.py, core code:
def main():
    args = parse_args()
    print(args)
    models_root_path = Path(args.model_base)
    models_root_path = "/home/sw4sever3/xxxx/hunyuan"
    # Create save folder to save the samples
    save_path = args.save_path if args.save_path_suffix == "" else f'{args.save_path}_{args.save_path_suffix}'
    if not os.path.exists(args.save_path):
        os.makedirs(save_path, exist_ok=True)

    # Load models
    hunyuan_video_sampler = HunyuanVideoSampler.from_pretrained(models_root_path, args=args)
    # ^ this line is where the model is loaded

    # Get the updated args
    args = hunyuan_video_sampler.args

    # Start sampling
    # TODO: batch inference check
    outputs = hunyuan_video_sampler.predict()

    .......

Then look at the concrete implementation of HunyuanVideoSampler.from_pretrained():

def from_pretrained(cls, pretrained_model_path, args, device=None, **kwargs):

    # ==================== Initialize Distributed Environment ================
    if args.ulysses_degree > 1 or args.ring_degree > 1:

        init_distributed_environment(rank=dist.get_rank(), world_size=dist.get_world_size())

        initialize_model_parallel(
            sequence_parallel_degree=dist.get_world_size(),
            ring_degree=args.ring_degree,
            ulysses_degree=args.ulysses_degree,
        )
        device = torch.device(f"cuda:{os.environ['LOCAL_RANK']}")

    else:
        if device is None:
            device = "cuda" if torch.cuda.is_available() else "cpu"

    parallel_args = {"ulysses_degree": args.ulysses_degree, "ring_degree": args.ring_degree}

    # ======================== Get the args path =============================

    # Disable gradient
    torch.set_grad_enabled(False)

    # =========================== Build main model ===========================
    logger.info("Building model...")

    # initialize_megatron_env()

    factor_kwargs = {"device": device, "dtype": PRECISION_TO_TYPE[args.precision]}

    in_channels = args.latent_channels
    out_channels = args.latent_channels

    model = load_model(
        args,
        in_channels=in_channels,
        out_channels=out_channels,
        factor_kwargs=factor_kwargs,
    )
    # load_model() is essentially the constructor of HYVideoDiffusionTransformer.
    # I don't see where the model is sharded according to the --ulysses-degree and
    # --ring-degree set in the launch script, such that loading would already be
    # distributed. As far as I can tell, this line still loads the entire model
    # onto each single GPU, which is what produces the insufficient-VRAM error.

    model = model.to(device)  # this is the original code: it moves the full model
                              # directly onto a single GPU -> OOM error!
    model = Inference.load_state_dict(args, model, pretrained_model_path)
    model.eval()

    # ============================= Build extra models ========================
    ........
    return model
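
If the analysis above is right, one generic pattern that avoids materializing randomly initialized weights on the GPU during construction is to build the module graph on the meta device first. This is a sketch only, with stand-in layer sizes and a dummy state dict; whether the repo's loader supports assign-loading is an assumption:

import torch
import torch.nn as nn

# Construct on the meta device so nn.Linear allocates no real storage.
with torch.device("meta"):
    model = nn.Sequential(nn.Linear(4096, 16384), nn.GELU(), nn.Linear(16384, 4096))

# Materialize parameters from a (here: dummy) CPU state dict without ever
# holding a second full copy on the GPU.
cpu_state = {k: torch.zeros_like(v, device="cpu") for k, v in model.state_dict().items()}
model.load_state_dict(cpu_state, assign=True)  # parameters now live on CPU
model = model.to("cuda:0")  # note: each rank would still hold the full model

This alone does not shard the model across ranks; it only removes the construction-time GPU allocation that the tracebacks above are failing in.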

@xibosun
Contributor

xibosun commented Dec 19, 2024

> @feifeibear thanks for your reply. I changed to a g6e.12xlarge (4x 48GB); while I'm able to run single-GPU inference at 544x960, I'm still unable to run parallel inference.

@jash101 did you run single-GPU inference with the --use-cpu-offload flag? I'm not able to run single-GPU inference when CPU offload is disabled.

@jash101

jash101 commented Dec 19, 2024

> @jash101 did you run single-GPU inference with the --use-cpu-offload flag? I'm not able to run single-GPU inference when CPU offload is disabled.

@xibosun yes, I used the command from the README for single-GPU inference:

python3 sample_video.py \
    --video-size 720 1280 \
    --video-length 129 \
    --infer-steps 50 \
    --prompt "A cat walks on the grass, realistic style." \
    --flow-reverse \
    --use-cpu-offload \
    --save-path ./results

@xibosun
Contributor

xibosun commented Dec 19, 2024

The OOM issue arises because multi-GPU inference does not yet support CPU offloading, so it is expected that multi-GPU inference consumes more GPU memory than a single-GPU run that has offloading enabled.

Nevertheless, we are actively exploring alternative strategies such as FSDP to mitigate memory demands during multi-GPU inference.
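
To illustrate the FSDP direction, here is a minimal, hedged sketch (generic nn.Linear blocks standing in for the DiT, launched under torchrun; this is not the repo's actual code). FSDP keeps only a shard of each wrapped module's parameters on every rank and gathers them on the fly during the forward pass:

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import CPUOffload, FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import ModuleWrapPolicy

# Run with: torchrun --nproc_per_node=<N> this_script.py
dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = nn.Sequential(*[nn.Linear(4096, 4096) for _ in range(8)])  # stand-in blocks
model = FSDP(
    model,
    auto_wrap_policy=ModuleWrapPolicy({nn.Linear}),  # shard and gather per layer
    device_id=torch.cuda.current_device(),
    cpu_offload=CPUOffload(offload_params=True),     # optionally park shards in host RAM
)
model.eval()
with torch.no_grad():
    out = model(torch.randn(1, 4096, device="cuda"))
dist.destroy_process_group()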

@jash101

jash101 commented Dec 20, 2024

> The OOM issue arises because multi-GPU inference does not yet support CPU offloading [...] we are actively exploring alternative strategies such as FSDP to mitigate memory demands during multi-GPU inference.

Thanks for pointing this out; that makes sense. I tested without CPU offload on a single GPU and it gives an OOM error. With a smaller resolution it runs on both a single GPU and in parallel, so there is no real issue with the repo.
Thanks!
