Disabling pipe.vae.enable_tiling leads to RuntimeError: Calculated padded input size per channel #561

Open

liming-ai opened this issue Nov 28, 2024 · 1 comment

@liming-ai
System Info

Torch: 2.1.0
CUDA: 12.2
diffusers: 0.32.0.dev0

Information

  • The official example scripts
  • My own modified scripts

Reproduction

Thanks for your contributions and efforts!

I am running inference on a single H100. When I turn off all of the diffusers optimizations:

# pipe.enable_sequential_cpu_offload()
# pipe.vae.enable_tiling()
# pipe.vae.enable_slicing()

or disable only pipe.vae.enable_tiling(), I get the following error:

Traceback (most recent call last):
  File "/home/tiger/code/run.py", line 16, in <module>
    video = pipe(
  File "/usr/local/lib/python3.9/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/tiger/.local/lib/python3.9/site-packages/diffusers/pipelines/cogvideo/pipeline_cogvideox_image2video.py", line 776, in __call__
    latents, image_latents = self.prepare_latents(
  File "/home/tiger/.local/lib/python3.9/site-packages/diffusers/pipelines/cogvideo/pipeline_cogvideox_image2video.py", line 381, in prepare_latents
    image_latents = [retrieve_latents(self.vae.encode(img.unsqueeze(0)), generator) for img in image]
  File "/home/tiger/.local/lib/python3.9/site-packages/diffusers/pipelines/cogvideo/pipeline_cogvideox_image2video.py", line 381, in <listcomp>
    image_latents = [retrieve_latents(self.vae.encode(img.unsqueeze(0)), generator) for img in image]
  File "/home/tiger/.local/lib/python3.9/site-packages/diffusers/utils/accelerate_utils.py", line 46, in wrapper
    return method(self, *args, **kwargs)
  File "/home/tiger/.local/lib/python3.9/site-packages/diffusers/models/autoencoders/autoencoder_kl_cogvideox.py", line 1232, in encode
    h = self._encode(x)
  File "/home/tiger/.local/lib/python3.9/site-packages/diffusers/models/autoencoders/autoencoder_kl_cogvideox.py", line 1204, in _encode
    x_intermediate, conv_cache = self.encoder(x_intermediate, conv_cache=conv_cache)
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/tiger/.local/lib/python3.9/site-packages/diffusers/models/autoencoders/autoencoder_kl_cogvideox.py", line 807, in forward
    hidden_states, new_conv_cache[conv_cache_key] = down_block(
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/tiger/.local/lib/python3.9/site-packages/diffusers/models/autoencoders/autoencoder_kl_cogvideox.py", line 439, in forward
    hidden_states, new_conv_cache[conv_cache_key] = resnet(
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/tiger/.local/lib/python3.9/site-packages/diffusers/models/autoencoders/autoencoder_kl_cogvideox.py", line 304, in forward
    hidden_states, new_conv_cache["conv1"] = self.conv1(hidden_states, conv_cache=conv_cache.get("conv1"))
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/tiger/.local/lib/python3.9/site-packages/diffusers/models/autoencoders/autoencoder_kl_cogvideox.py", line 144, in forward
    output = self.conv(inputs)
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/tiger/.local/lib/python3.9/site-packages/diffusers/models/autoencoders/autoencoder_kl_cogvideox.py", line 62, in forward
    output_chunks.append(super().forward(input_chunk))
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/conv.py", line 610, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/conv.py", line 605, in _conv_forward
    return F.conv3d(
RuntimeError: Calculated padded input size per channel: (1 x 2402 x 2402). Kernel size: (3 x 3 x 3). Kernel size can't be greater than actual input size
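
For context, this failure mode is easy to reproduce in isolation: a 3×3×3 convolution cannot run on an input whose temporal dimension (1, for a single image) is smaller than the temporal kernel size. A minimal sketch of the same error (the shapes are illustrative, not the pipeline's actual ones):

import torch
import torch.nn.functional as F

# A single image treated as a depth-1 "video": (batch, channels, frames, height, width)
x = torch.randn(1, 3, 1, 64, 64)

# 3x3x3 kernel with spatial-only padding, as in a causal 3D conv whose temporal
# padding would normally be supplied separately. The temporal size (1) is smaller
# than the temporal kernel size (3), so F.conv3d raises:
# RuntimeError: Calculated padded input size per channel: (1 x 66 x 66).
# Kernel size: (3 x 3 x 3). Kernel size can't be greater than actual input size
w = torch.randn(8, 3, 3, 3, 3)
out = F.conv3d(x, w, padding=(0, 1, 1))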

If I keep only pipe.vae.enable_tiling() enabled and turn off enable_sequential_cpu_offload() and enable_slicing(), the code runs successfully, taking about 2h35m to generate a video:

# pipe.enable_sequential_cpu_offload()
pipe.vae.enable_tiling()
# pipe.vae.enable_slicing()

The full code is here:

import torch
from diffusers import CogVideoXImageToVideoPipeline
from diffusers.utils import export_to_video, load_image

prompt = "A little girl is riding a bicycle at high speed. Focused, detailed, realistic."
image = load_image(image="image.webp")  # 1024x1024
pipe = CogVideoXImageToVideoPipeline.from_pretrained(
    "THUDM/CogVideoX1.5-5B-I2V",
    torch_dtype=torch.bfloat16,
).to("cuda")

# pipe.enable_sequential_cpu_offload()
# pipe.vae.enable_tiling()
# pipe.vae.enable_slicing()

video = pipe(
    prompt=prompt,
    image=image,
    num_videos_per_prompt=1,
    num_inference_steps=50,
    num_frames=49,
    guidance_scale=6,
    generator=torch.Generator(device="cuda").manual_seed(42),
).frames[0]

export_to_video(video, "output.mp4", fps=8)

Expected behavior

Hopefully the pipeline can run successfully with pipe.vae.enable_tiling() disabled.

@zRzRzRzRzRzRzR
Member
For the CogVideoX1.5 model, you should set the number of frames to 81 and use a supported resolution, for example height 768 and width 1360.
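
Applied to the reproduction script above, the suggestion would look like the call below (a sketch; num_frames, height, and width are standard CogVideoXImageToVideoPipeline arguments, and the values follow the comment above):

video = pipe(
    prompt=prompt,
    image=image,
    num_videos_per_prompt=1,
    num_inference_steps=50,
    num_frames=81,   # CogVideoX1.5 expects 81 frames
    height=768,      # supported resolution for CogVideoX1.5
    width=1360,
    guidance_scale=6,
    generator=torch.Generator(device="cuda").manual_seed(42),
).frames[0]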

zRzRzRzRzRzRzR self-assigned this Nov 29, 2024