diff --git a/docs/source/en/_toctree.yml b/docs/source/en/_toctree.yml
index 22613cb343ff..e218e9878599 100644
--- a/docs/source/en/_toctree.yml
+++ b/docs/source/en/_toctree.yml
@@ -75,6 +75,8 @@
       title: Outpainting
     title: Advanced inference
   - sections:
+    - local: using-diffusers/cogvideox
+      title: CogVideoX
     - local: using-diffusers/sdxl
       title: Stable Diffusion XL
     - local: using-diffusers/sdxl_turbo
@@ -129,6 +131,8 @@
       title: T2I-Adapters
     - local: training/instructpix2pix
       title: InstructPix2Pix
+    - local: training/cogvideox
+      title: CogVideoX
     title: Models
   - isExpanded: false
     sections:
diff --git a/docs/source/en/training/cogvideox.md b/docs/source/en/training/cogvideox.md
new file mode 100644
index 000000000000..657e58bfd5eb
--- /dev/null
+++ b/docs/source/en/training/cogvideox.md
@@ -0,0 +1,291 @@
+# CogVideoX
+
+CogVideoX is a text-to-video generation model focused on creating videos that are more coherent and better aligned with a prompt. It achieves this with several techniques:
+
+- a 3D variational autoencoder that compresses videos spatially and temporally, improving the compression rate and the fidelity of the reconstructed video.
+- an expert transformer block that helps align the text and video modalities, and a 3D full attention module that captures and creates spatially and temporally accurate videos.
+
+Evaluation across video instruction dimensions found that CogVideoX performs well on consistent themes, dynamic information, consistent backgrounds, object information, smooth motion, color, scene, appearance style, and temporal style, but it cannot achieve good results on human action, spatial relationships, and multiple objects.
+
+Finetuning with Diffusers can help make up for these weaknesses.
+
+## Data Preparation
+
+The training scripts accept data in two formats.
+
+The first format is suited for small-scale training, and the second format uses a CSV file, which is more appropriate for streaming data in large-scale training. In the future, Diffusers will support the `
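+As a minimal sketch, assuming the first format uses two plain-text files, `prompts.txt` with one caption per line and `videos.txt` with the matching relative video path on the same line (the filenames and layout here are an assumption for illustration, not the scripts' documented interface), the pairing can be sanity-checked before training:
+
+```python
+# Hypothetical check for a two-file layout: prompts.txt holds one caption per
+# line and videos.txt holds the matching video path on the same line. The
+# filenames and directory structure are assumptions, not the scripts' API.
+from pathlib import Path
+
+root = Path("my_dataset")  # hypothetical dataset directory
+prompts = (root / "prompts.txt").read_text().splitlines()
+videos = (root / "videos.txt").read_text().splitlines()
+
+# Every caption must line up with exactly one video path.
+assert len(prompts) == len(videos), "caption/video count mismatch"
+for caption, video in zip(prompts, videos):
+    assert (root / video).is_file(), f"missing video file: {video}"
+    print(f"{video} -> {caption[:60]}")
+```
+
+For the second format, the same check would read the caption and video path columns from the CSV with Python's built-in `csv.DictReader` instead.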