
CogX fails on macOS requesting a 10 TB buffer. #9972

Open
Vargol opened this issue Nov 20, 2024 · 7 comments
Labels
bug Something isn't working

Comments

@Vargol

Vargol commented Nov 20, 2024

Describe the bug

Tried to run the THUDM/CogVideoX1.5-5B model using Diffusers from git (20th Nov, approx 8:30am GMT)
The script failed with

    hidden_states = F.scaled_dot_product_attention(
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Invalid buffer size: 10973.48 GB

While these are big models, I suspect the CUDA users out there are not using 10 TB of RAM :-)

Reproduction

import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

torch.mps.set_per_process_memory_fraction(0.0)

prompt = "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical atmosphere of this unique musical performance."

pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX1.5-5B",
    torch_dtype=torch.bfloat16
).to("mps")


#pipe.enable_sequential_cpu_offload()
pipe.vae.enable_tiling()
pipe.vae.enable_slicing()

video = pipe(
    prompt=prompt,
    num_videos_per_prompt=1,
    num_inference_steps=50,
    num_frames=81,
    guidance_scale=6,
    generator=torch.Generator(device="mps").manual_seed(42),
).frames[0]

export_to_video(video, "output.mp4", fps=8)

Logs

The full output was

$ python cogx.py 
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████| 4/4 [00:37<00:00,  9.25s/it]
Loading pipeline components...: 100%|█████████████████████████████████████████████████████| 5/5 [00:39<00:00,  7.91s/it]
  0%|                                                                                            | 0/50 [00:18<?, ?it/s]
Traceback (most recent call last):
  File "/Volumes/SSD2TB/AI/cog/cogx.py", line 19, in <module>
    video = pipe(
            ^^^^^
  File "/Volumes/SSD2TB/AI/cog/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/Volumes/SSD2TB/AI/cog/lib/python3.11/site-packages/diffusers/pipelines/cogvideo/pipeline_cogvideox.py", line 710, in __call__
    noise_pred = self.transformer(
                 ^^^^^^^^^^^^^^^^^
  File "/Volumes/SSD2TB/AI/cog/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1740, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Volumes/SSD2TB/AI/cog/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Volumes/SSD2TB/AI/cog/lib/python3.11/site-packages/diffusers/models/transformers/cogvideox_transformer_3d.py", line 503, in forward
    hidden_states, encoder_hidden_states = block(
                                           ^^^^^^
  File "/Volumes/SSD2TB/AI/cog/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1740, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Volumes/SSD2TB/AI/cog/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Volumes/SSD2TB/AI/cog/lib/python3.11/site-packages/diffusers/models/transformers/cogvideox_transformer_3d.py", line 132, in forward
    attn_hidden_states, attn_encoder_hidden_states = self.attn1(
                                                     ^^^^^^^^^^^
  File "/Volumes/SSD2TB/AI/cog/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1740, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Volumes/SSD2TB/AI/cog/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Volumes/SSD2TB/AI/cog/lib/python3.11/site-packages/diffusers/models/attention_processor.py", line 530, in forward
    return self.processor(
           ^^^^^^^^^^^^^^^
  File "/Volumes/SSD2TB/AI/cog/lib/python3.11/site-packages/diffusers/models/attention_processor.py", line 2297, in __call__
    hidden_states = F.scaled_dot_product_attention(
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Invalid buffer size: 10973.48 GB


### System Info

- 🤗 Diffusers version: 0.32.0.dev0
- Platform: macOS-15.1.1-arm64-arm-64bit
- Running on Google Colab?: No
- Python version: 3.11.10
- PyTorch version (GPU?): 2.6.0.dev20241115 (False)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Huggingface_hub version: 0.26.2
- Transformers version: 4.46.2
- Accelerate version: 1.1.1
- PEFT version: not installed
- Bitsandbytes version: not installed
- Safetensors version: 0.4.5
- xFormers version: not installed
- Accelerator: Apple M3
- Using GPU in script?: Yes
- Using distributed or parallel set-up in script?: No


### Who can help?

@pcuenca
@Vargol Vargol added the bug Something isn't working label Nov 20, 2024
@sayakpaul
Member

I am not sure this is a diffusers-specific problem, though. My instinct is that if you generated random tensors matching CogVideoX's shapes and ran them through F.scaled_dot_product_attention(), it would fail the same way.

@Vargol
Author

Vargol commented Nov 20, 2024

Loading pipeline components...: 100%|██████████████████████| 5/5 [00:38<00:00,  7.71s/it]
  0%|                                                                                            | 0/50 [00:00<?, ?it/s]
QUERY: torch.Size([2, 48, 247726, 64])
KEY: torch.Size([2, 48, 247726, 64])
VALUE: torch.Size([2, 48, 247726, 64])
ATTENTION_MASK: None
  0%|                                                                                            | 0/50 [00:12<?, ?it/s]
Traceback (most recent call last):
  File "/Volumes/SSD2TB/AI/cog/cogx.py", line 19, in <module>
    video = pipe(

I have no idea if values of that shape going into F.scaled... make sense.
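A quick back-of-the-envelope check (a sketch, assuming the mps backend materialises the full seq x seq attention score matrix per head in bf16, i.e. 2 bytes per element; flash-style kernels avoid this allocation) reproduces the number in the error message exactly:

```python
# Attention-score allocation implied by the Q/K/V shapes printed above,
# assuming the backend materialises one (seq x seq) bf16 matrix per head.
batch, heads, seq, head_dim = 2, 48, 247726, 64

score_bytes = batch * heads * seq * seq * 2  # 2 bytes per bf16 element
print(f"{score_bytes / 2**30:.2f} GB")  # -> 10973.48 GB, matching the RuntimeError
```

So the 10 TB figure is simply the quadratic cost of a ~248k-token sequence, which suggests the sequence length itself is wrong.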

@sayakpaul
Member

Your initial error logs suggest that it gets stuck at F.scaled_dot_product_attention().

@a-r-r-o-w
Member

a-r-r-o-w commented Nov 20, 2024

That is definitely way too big. Could you explicitly specify height=768 and width=1360? If you don't, the sample_height and sample_width from the transformer config (300x300, which is also required to compute the RoPE dimensions correctly) are used to calculate the defaults, which won't work as expected here and gives you a 2400x2400 resolution.
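The effect on sequence length can be sketched with some rough arithmetic (the breakdown below is an assumption: 8x VAE spatial downsampling plus 2x2 transformer patches for a 16x spatial reduction, 11 temporal latent tokens for 81 frames, and 226 text tokens; it is hypothetical, but it reproduces both sequence lengths seen in the logs):

```python
# Hypothetical token-count model for CogVideoX1.5: 16x spatial reduction
# (8x VAE + 2x2 patches), 11 temporal latent tokens for 81 frames,
# plus 226 text-embedding tokens.
def seq_len(height, width, temporal_tokens=11, text_tokens=226):
    return (height // 16) * (width // 16) * temporal_tokens + text_tokens

print(seq_len(2400, 2400))  # default derived from sample_height/width -> 247726
print(seq_len(768, 1360))   # explicit height/width -> 45106
```

Passing height=768 and width=1360 to pipe(...) therefore cuts the sequence length, and the quadratic attention cost, by roughly 30x.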

@a-r-r-o-w
Member

We have something planned that should reduce memory requirements on Mac and other devices, coming very soon, and it should also be easy to use API-wise. It would be awesome if you could help us test it (I can ping you when the PR is out).

cc @DN6 as Mac devices are good potential candidate for testing out our SplitInferenceModule hooks

@Vargol
Author

Vargol commented Nov 20, 2024

That's an improvement, but it still wants a 364 GB buffer:

python cogx.py 
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████|
Loading pipeline components...: 100%|███████████████████████████████| 5/5 [00:35<00:00,  7.18s/it]
  0%|                                                                                            | 0/50 [00:00<?, ?it/s]
QUERY: torch.Size([2, 48, 45106, 64])
KEY: torch.Size([2, 48, 45106, 64])
VALUE: torch.Size([2, 48, 45106, 64])
ATTENTION_MASK: None
  0%|                                                                                            | 0/50 [00:02<?, ?it/s]
Traceback (most recent call last):

...

  File "/Volumes/SSD2TB/AI/cog/lib/python3.11/site-packages/diffusers/models/attention_processor.py", line 2302, in __call__
    hidden_states = F.scaled_dot_product_attention(
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Invalid buffer size: 363.81 GB

@a-r-r-o-w
Member

This is actually a well-known problem on Mac devices: mps lacks efficient kernel implementations for many operations.

Until the PR I mentioned above is out, I'm not sure there is an easy way to make this run on Macs. For now, you could try the 1.0 versions at 720 x 480 x 49 frames, which should further lower the buffer allocation.

I hope I'm not bothering you with too many technical details, but you can significantly reduce memory usage with a wrapper class that chunks the inference across the batch_size and num_heads dimensions. This can serve as a useful example: https://github.com/huggingface/diffusers/blame/f6f7afa1d7c6f45f8568c5603b1e6300d4583f04/src/diffusers/pipelines/free_noise_utils.py#L37. I will try to get the easy-to-use API in ASAP so that end-users can ignore the technical details and it "just works".
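The head-chunking idea can be sketched as follows (a minimal illustration only, not the diffusers SplitInferenceModule itself): attention is independent per head, so slicing Q/K/V along the head dimension and running scaled_dot_product_attention per slice produces the same result while only chunk_size score matrices exist at once.

```python
import torch
import torch.nn.functional as F

def sdpa_chunked_heads(q, k, v, chunk_size=8):
    # q, k, v: (batch, heads, seq, head_dim). Heads attend independently,
    # so processing them in slices changes peak memory, not the result.
    outs = [
        F.scaled_dot_product_attention(
            q[:, i : i + chunk_size],
            k[:, i : i + chunk_size],
            v[:, i : i + chunk_size],
        )
        for i in range(0, q.shape[1], chunk_size)
    ]
    return torch.cat(outs, dim=1)
```

Note that at 2400x2400 even a single head's 247726 x 247726 bf16 score matrix is ~114 GB, so head chunking alone is not enough at the default resolution; it has to be combined with explicit height/width (and, as suggested above, chunking across the batch dimension too).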
