Fast and Memory-Efficient Video Diffusion Using Streamlined Inference

Official implementation of the NeurIPS 2024 paper "Fast and Memory-Efficient Video Diffusion Using Streamlined Inference".

Zheng Zhan*, Yushu Wu*, Yifan Gong, Zichong Meng, Zhenglun Kong, Changdi Yang, Geng Yuan, Pu Zhao, Wei Niu, and Yanzhi Wang
Northeastern University, Harvard University, University of Georgia
38th Conference on Neural Information Processing Systems (NeurIPS 2024)

This repo contains a simulation of the Feature Slicer (Sec. 4.1) and Operator Grouping (Sec. 4.2), which together reduce the inference memory footprint of spatial-temporal models.
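The general idea of slicing a feature into chunks and pushing each chunk through a whole group of consecutive operators can be illustrated with a toy, pure-Python sketch. It is not the code in this repo: the names `split_chunks`, `run_group`, and `run_group_sliced` are hypothetical, the "feature" is a flat list of floats, and the equivalence only holds because every operator here is elementwise (each output value depends on a single input value).

```python
from typing import Callable, List, Sequence

Op = Callable[[List[float]], List[float]]


def split_chunks(x: Sequence[float], num_chunks: int) -> List[List[float]]:
    """Split x into num_chunks contiguous slices of nearly equal size."""
    size, rem = divmod(len(x), num_chunks)
    chunks, start = [], 0
    for i in range(num_chunks):
        end = start + size + (1 if i < rem else 0)
        chunks.append(list(x[start:end]))
        start = end
    return chunks


def run_group(x: List[float], ops: Sequence[Op]) -> List[float]:
    """Apply the operators one after another; every intermediate is as large as x."""
    for op in ops:
        x = op(x)
    return x


def run_group_sliced(x: List[float], ops: Sequence[Op], num_chunks: int) -> List[float]:
    """Run the whole operator group on one chunk at a time.

    Intermediates are only chunk-sized; just the final output is full-sized.
    """
    out: List[float] = []
    for chunk in split_chunks(x, num_chunks):
        out.extend(run_group(chunk, ops))
    return out


# A hypothetical group of three elementwise operators: scale, shift, ReLU.
ops = [
    lambda v: [2.0 * a for a in v],
    lambda v: [a - 3.0 for a in v],
    lambda v: [max(a, 0.0) for a in v],
]

feature = [float(i) for i in range(14)]  # e.g. one value per frame
full = run_group(feature, ops)
sliced = run_group_sliced(feature, ops, num_chunks=7)
print(full == sliced)
```

Operators that mix information across the sliced dimension (for example attention over that dimension) cannot be chunked this naively, which is why the choice of slicing dimension matters in a real model.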

Tested Devices

NVIDIA A100-SXM4-80GB
NVIDIA A100-PCIE-40GB
NVIDIA A6000

Supported Pipelines

Stable Video Diffusion
AnimateDiff

Quick Start

Install

git clone https://github.com/wuyushuwys/FMEDiffusion
cd FMEDiffusion
# if you use conda
conda create -n fme python=3.10 -y
conda activate fme

pip install torch==2.4.1 torchvision==0.19.1 --index-url https://download.pytorch.org/whl/cu121
pip install pynvml  # for memory-footprint benchmark
pip install .

Usage

import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_gif
# import our module wrapper
from fme import FMEWrapper

# load pipeline
pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid", torch_dtype=torch.float16, variant="fp16"
)
pipe.to('cuda')

# initialize wrapper
helper = FMEWrapper(num_temporal_chunk=7, num_spatial_chunk=7, num_frames=pipe.unet.config.num_frames)
# wrap pipeline
helper.wrap(pipe)

# Inference as normal
image = load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/svd/rocket.png")
image = image.resize((1024, 576))

generator = torch.manual_seed(42)
# no decode_chunk_size required!
frames = pipe(image, generator=generator).frames[0]

export_to_gif(frames, "generated_fme.gif", fps=7)
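To compare peak memory with and without the wrapper, one option is PyTorch's own allocator statistics. The sketch below is an assumption about how you might measure it, not the benchmark used by this repo (which installs `pynvml` for that purpose): `bytes_to_gib` and `run_and_report_peak` are hypothetical helpers, and `torch.cuda.max_memory_allocated()` only counts memory allocated through PyTorch, so its numbers can be lower than device-level readings.

```python
try:
    import torch
except ImportError:  # lets the unit-conversion helper be used without PyTorch
    torch = None


def bytes_to_gib(num_bytes: int) -> float:
    """Convert a byte count to GiB (1 GiB = 1024**3 bytes)."""
    return num_bytes / 1024 ** 3


def run_and_report_peak(fn):
    """Call fn() on the current CUDA device and return (result, peak GiB).

    Requires PyTorch with a CUDA device; the peak is the allocator's
    high-water mark since the reset, not total device memory in use.
    """
    torch.cuda.reset_peak_memory_stats()
    result = fn()
    torch.cuda.synchronize()
    return result, bytes_to_gib(torch.cuda.max_memory_allocated())


# Example (needs a GPU and the `pipe`, `image`, `generator` objects from above):
# output, peak = run_and_report_peak(lambda: pipe(image, generator=generator))
# print(f"peak allocated: {peak:.2f} GiB")
```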

Notes

Peak memory measured with this repo may not exactly match the values reported in the paper.

For SVD (num_frames=14, resolution 576x1024), the paper reports an original peak memory of 39.49 GB, which the proposed method reduces to 23.42 GB. Running the example above, you may instead observe a peak of around 24.49 GB with the method, and up to 40.39 GB without it.
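As a quick check of what these figures mean in relative terms, the short calculation below uses only the four numbers quoted above. The helper name `reduction` is made up for illustration. Both the paper's figures and the locally observed ones correspond to a saving of roughly 40% of peak memory.

```python
def reduction(baseline_gb: float, optimized_gb: float):
    """Return (absolute saving in GB, saving as a percentage of the baseline)."""
    saved = baseline_gb - optimized_gb
    return saved, 100.0 * saved / baseline_gb


paper_saved, paper_pct = reduction(39.49, 23.42)  # values reported in the paper
local_saved, local_pct = reduction(40.39, 24.49)  # values you may observe locally

print(f"paper: {paper_saved:.2f} GB saved ({paper_pct:.1f}%)")
print(f"local: {local_saved:.2f} GB saved ({local_pct:.1f}%)")
```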
