
Merge changes #141

Merged (12 commits, Jan 26, 2024)
2 changes: 2 additions & 0 deletions docs/source/en/_toctree.yml
@@ -228,6 +228,8 @@
    title: UNet3DConditionModel
  - local: api/models/unet-motion
    title: UNetMotionModel
  - local: api/models/uvit2d
    title: UViT2DModel
  - local: api/models/vq
    title: VQModel
  - local: api/models/autoencoderkl
4 changes: 2 additions & 2 deletions docs/source/en/api/loaders/single_file.md
@@ -30,8 +30,8 @@ To learn more about how to load single file weights, see the [Load different Sta

## FromOriginalVAEMixin

-[[autodoc]] loaders.single_file.FromOriginalVAEMixin
+[[autodoc]] loaders.autoencoder.FromOriginalVAEMixin

## FromOriginalControlnetMixin

-[[autodoc]] loaders.single_file.FromOriginalControlnetMixin
+[[autodoc]] loaders.controlnet.FromOriginalControlNetMixin
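
For context, these mixins provide the `from_single_file` entry points on the corresponding model classes. A rough sketch of how they are typically used follows; both checkpoint URLs are illustrative assumptions, not part of this page.

```python
# Sketch: the mixins documented above back `from_single_file` on the VAE and
# ControlNet model classes. Both checkpoint URLs are illustrative assumptions.
from diffusers import AutoencoderKL, ControlNetModel

vae = AutoencoderKL.from_single_file(
    "https://huggingface.co/stabilityai/sd-vae-ft-mse-original/blob/main/vae-ft-mse-840000-ema-pruned.safetensors"
)
controlnet = ControlNetModel.from_single_file(
    "https://huggingface.co/lllyasviel/ControlNet-v1-1/blob/main/control_v11p_sd15_canny.pth"
)
```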
2 changes: 1 addition & 1 deletion docs/source/en/api/models/unet-motion.md
@@ -22,4 +22,4 @@ The abstract from the paper is:
[[autodoc]] UNetMotionModel

## UNet3DConditionOutput
-[[autodoc]] models.unet_3d_condition.UNet3DConditionOutput
+[[autodoc]] models.unets.unet_3d_condition.UNet3DConditionOutput
2 changes: 1 addition & 1 deletion docs/source/en/api/models/unet.md
@@ -22,4 +22,4 @@ The abstract from the paper is:
[[autodoc]] UNet1DModel

## UNet1DOutput
-[[autodoc]] models.unet_1d.UNet1DOutput
+[[autodoc]] models.unets.unet_1d.UNet1DOutput
6 changes: 3 additions & 3 deletions docs/source/en/api/models/unet2d-cond.md
@@ -22,10 +22,10 @@ The abstract from the paper is:
[[autodoc]] UNet2DConditionModel

## UNet2DConditionOutput
-[[autodoc]] models.unet_2d_condition.UNet2DConditionOutput
+[[autodoc]] models.unets.unet_2d_condition.UNet2DConditionOutput

## FlaxUNet2DConditionModel
-[[autodoc]] models.unet_2d_condition_flax.FlaxUNet2DConditionModel
+[[autodoc]] models.unets.unet_2d_condition_flax.FlaxUNet2DConditionModel

## FlaxUNet2DConditionOutput
-[[autodoc]] models.unet_2d_condition_flax.FlaxUNet2DConditionOutput
+[[autodoc]] models.unets.unet_2d_condition_flax.FlaxUNet2DConditionOutput
2 changes: 1 addition & 1 deletion docs/source/en/api/models/unet2d.md
@@ -22,4 +22,4 @@ The abstract from the paper is:
[[autodoc]] UNet2DModel

## UNet2DOutput
-[[autodoc]] models.unet_2d.UNet2DOutput
+[[autodoc]] models.unets.unet_2d.UNet2DOutput
2 changes: 1 addition & 1 deletion docs/source/en/api/models/unet3d-cond.md
@@ -22,4 +22,4 @@ The abstract from the paper is:
[[autodoc]] UNet3DConditionModel

## UNet3DConditionOutput
-[[autodoc]] models.unet_3d_condition.UNet3DConditionOutput
+[[autodoc]] models.unets.unet_3d_condition.UNet3DConditionOutput
39 changes: 39 additions & 0 deletions docs/source/en/api/models/uvit2d.md
@@ -0,0 +1,39 @@
<!--Copyright 2024 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# UVit2DModel

The [U-ViT](https://hf.co/papers/2301.11093) model is a vision transformer (ViT)-based UNet. It combines elements of a ViT (all inputs, such as the time, conditions, and noisy image patches, are treated as tokens) with elements of a UNet (long skip connections between the shallow and deep layers). The skip connections are important for predicting pixel-level features, and an additional 3x3 convolutional block is applied before the final output to improve image quality.
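
As a quick orientation, the sketch below loads a pretrained `UVit2DModel` and inspects its configuration; the `amused/amused-256` repository and its `transformer` subfolder are assumptions used for illustration, not something this page prescribes.

```python
# Minimal sketch: load a pretrained UVit2DModel and inspect its configuration.
# The "amused/amused-256" checkpoint and its "transformer" subfolder are assumed here.
from diffusers import UVit2DModel

model = UVit2DModel.from_pretrained("amused/amused-256", subfolder="transformer")
print(model.config)  # prints the model hyperparameters (hidden size, number of layers, etc.)
```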

The abstract from the paper is:

*Currently, applying diffusion models in pixel space of high resolution images is difficult. Instead, existing approaches focus on diffusion in lower dimensional spaces (latent diffusion), or have multiple super-resolution levels of generation referred to as cascades. The downside is that these approaches add additional complexity to the diffusion framework. This paper aims to improve denoising diffusion for high resolution images while keeping the model as simple as possible. The paper is centered around the research question: How can one train a standard denoising diffusion models on high resolution images, and still obtain performance comparable to these alternate approaches? The four main findings are: 1) the noise schedule should be adjusted for high resolution images, 2) It is sufficient to scale only a particular part of the architecture, 3) dropout should be added at specific locations in the architecture, and 4) downsampling is an effective strategy to avoid high resolution feature maps. Combining these simple yet effective techniques, we achieve state-of-the-art on image generation among diffusion models without sampling modifiers on ImageNet.*

## UVit2DModel

[[autodoc]] UVit2DModel

## UVit2DConvEmbed

[[autodoc]] models.unets.uvit_2d.UVit2DConvEmbed

## UVitBlock

[[autodoc]] models.unets.uvit_2d.UVitBlock

## ConvNextBlock

[[autodoc]] models.unets.uvit_2d.ConvNextBlock

## ConvMlmLayer

[[autodoc]] models.unets.uvit_2d.ConvMlmLayer
111 changes: 111 additions & 0 deletions docs/source/en/api/pipelines/animatediff.md
@@ -25,13 +25,16 @@ The abstract of the paper is the following:
| Pipeline | Tasks | Demo
|---|---|:---:|
| [AnimateDiffPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/animatediff/pipeline_animatediff.py) | *Text-to-Video Generation with AnimateDiff* |
| [AnimateDiffVideoToVideoPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/animatediff/pipeline_animatediff_video2video.py) | *Video-to-Video Generation with AnimateDiff* |

## Available checkpoints

Motion Adapter checkpoints can be found under [guoyww](https://huggingface.co/guoyww/). These checkpoints are meant to work with any model based on Stable Diffusion 1.4/1.5.

## Usage example

### AnimateDiffPipeline

AnimateDiff works with a MotionAdapter checkpoint and a Stable Diffusion model checkpoint. The MotionAdapter is a collection of Motion Modules responsible for adding coherent motion across image frames. These modules are applied after the ResNet and attention blocks in the Stable Diffusion UNet.

The following example demonstrates how to use a *MotionAdapter* checkpoint with Diffusers for inference with a Stable Diffusion 1.4/1.5 based model.
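
A condensed sketch of that workflow is shown below; the adapter and base model repositories mirror the video-to-video example later on this page, while the prompt and scheduler settings are illustrative assumptions.

```python
import torch
from diffusers import AnimateDiffPipeline, DDIMScheduler, MotionAdapter
from diffusers.utils import export_to_gif

# Load the motion adapter and an SD 1.5 based finetuned model
adapter = MotionAdapter.from_pretrained("guoyww/animatediff-motion-adapter-v1-5-2", torch_dtype=torch.float16)
model_id = "SG161222/Realistic_Vision_V5.1_noVAE"
pipe = AnimateDiffPipeline.from_pretrained(model_id, motion_adapter=adapter, torch_dtype=torch.float16)
pipe.scheduler = DDIMScheduler.from_pretrained(
    model_id,
    subfolder="scheduler",
    clip_sample=False,
    timestep_spacing="linspace",
    beta_schedule="linear",
    steps_offset=1,
)

# Reduce memory usage: slice VAE decoding and offload submodules to CPU when idle
pipe.enable_vae_slicing()
pipe.enable_model_cpu_offload()

output = pipe(
    prompt="masterpiece, best quality, a panda playing a guitar, on a boat, in the ocean",
    negative_prompt="bad quality, worse quality",
    num_frames=16,
    guidance_scale=7.5,
    num_inference_steps=25,
    generator=torch.Generator("cpu").manual_seed(42),
)
export_to_gif(output.frames[0], "animation.gif")
```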
@@ -98,6 +101,114 @@ AnimateDiff tends to work better with finetuned Stable Diffusion models. If you

</Tip>

### AnimateDiffVideoToVideoPipeline

AnimateDiff can also start from an existing video and generate a visually similar one, or apply edits to its style, characters, background, and more, letting you seamlessly explore creative possibilities.

```python
import imageio
import requests
import torch
from diffusers import AnimateDiffVideoToVideoPipeline, DDIMScheduler, MotionAdapter
from diffusers.utils import export_to_gif
from io import BytesIO
from PIL import Image

# Load the motion adapter
adapter = MotionAdapter.from_pretrained("guoyww/animatediff-motion-adapter-v1-5-2", torch_dtype=torch.float16)
# load SD 1.5 based finetuned model
model_id = "SG161222/Realistic_Vision_V5.1_noVAE"
pipe = AnimateDiffVideoToVideoPipeline.from_pretrained(model_id, motion_adapter=adapter, torch_dtype=torch.float16)
scheduler = DDIMScheduler.from_pretrained(
    model_id,
    subfolder="scheduler",
    clip_sample=False,
    timestep_spacing="linspace",
    beta_schedule="linear",
    steps_offset=1,
)
pipe.scheduler = scheduler

# enable memory savings
pipe.enable_vae_slicing()
pipe.enable_model_cpu_offload()

# helper function to load videos
def load_video(file_path: str):
    images = []

    if file_path.startswith(('http://', 'https://')):
        # If the file_path is a URL
        response = requests.get(file_path)
        response.raise_for_status()
        content = BytesIO(response.content)
        vid = imageio.get_reader(content)
    else:
        # Assuming it's a local file path
        vid = imageio.get_reader(file_path)

    for frame in vid:
        pil_image = Image.fromarray(frame)
        images.append(pil_image)

    return images

video = load_video("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/animatediff-vid2vid-input-1.gif")

output = pipe(
    video=video,
    prompt="panda playing a guitar, on a boat, in the ocean, high quality",
    negative_prompt="bad quality, worse quality",
    guidance_scale=7.5,
    num_inference_steps=25,
    strength=0.5,
    generator=torch.Generator("cpu").manual_seed(42),
)
frames = output.frames[0]
export_to_gif(frames, "animation.gif")
```

Here are some sample outputs:

<table>
<tr>
<th align=center>Source Video</th>
<th align=center>Output Video</th>
</tr>
<tr>
<td align=center>
raccoon playing a guitar
<br />
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/animatediff-vid2vid-input-1.gif"
alt="racoon playing a guitar"
style="width: 300px;" />
</td>
<td align=center>
panda playing a guitar
<br/>
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/animatediff-vid2vid-output-1.gif"
alt="panda playing a guitar"
style="width: 300px;" />
</td>
</tr>
<tr>
<td align=center>
closeup of margot robbie, fireworks in the background, high quality
<br />
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/animatediff-vid2vid-input-2.gif"
alt="closeup of margot robbie, fireworks in the background, high quality"
style="width: 300px;" />
</td>
<td align=center>
closeup of tony stark, robert downey jr, fireworks
<br/>
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/animatediff-vid2vid-output-2.gif"
alt="closeup of tony stark, robert downey jr, fireworks"
style="width: 300px;" />
</td>
</tr>
</table>

## Using Motion LoRAs

Motion LoRAs are a collection of LoRAs that work with the `guoyww/animatediff-motion-adapter-v1-5-2` checkpoint. These LoRAs are responsible for adding specific types of motion to the animations.
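
A brief sketch of attaching one of these LoRAs to an `AnimateDiffPipeline` is shown below; the `guoyww/animatediff-motion-lora-zoom-out` repository name is an assumption based on the adapter family above.

```python
# Sketch: attach a Motion LoRA to an AnimateDiffPipeline built as in the examples above.
# The "guoyww/animatediff-motion-lora-zoom-out" repository name is an assumption.
import torch
from diffusers import AnimateDiffPipeline, MotionAdapter

adapter = MotionAdapter.from_pretrained("guoyww/animatediff-motion-adapter-v1-5-2", torch_dtype=torch.float16)
pipe = AnimateDiffPipeline.from_pretrained(
    "SG161222/Realistic_Vision_V5.1_noVAE", motion_adapter=adapter, torch_dtype=torch.float16
)
pipe.load_lora_weights("guoyww/animatediff-motion-lora-zoom-out", adapter_name="zoom-out")
```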
2 changes: 2 additions & 0 deletions docs/source/ja/_toctree.yml
@@ -11,4 +11,6 @@
- sections:
  - local: tutorials/tutorial_overview
    title: 概要
  - local: tutorials/autopipeline
    title: AutoPipeline
  title: チュートリアル