We introduce a new task in multimodal AI: generating audible video, i.e., video with temporally aligned audio, from textual descriptions using a latent diffusion model. To support this task, we have built TAVGBench, a large-scale benchmark of 1.7 million audio-video pairs, each annotated with a rich textual description that aligns with both the audio and the visual content. This collection provides a solid foundation for training and evaluating text-to-audible-video generation models.
The annotation pipeline for TAVGBench is designed to produce high-quality, consistent data: each audio and video clip is paired with a detailed textual description, and every annotation passes through multiple stages of labeling and validation to ensure accuracy and relevance.
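For illustration only, the sketch below shows one way such paired annotations could be represented and loaded. The field names (`clip_id`, `video_path`, `audio_caption`, `video_caption`) and the JSON Lines layout are assumptions made for this example, not the actual TAVGBench schema.

```python
import json
from dataclasses import dataclass

@dataclass
class TAVGEntry:
    """One hypothetical TAVGBench-style entry: a clip paired with its captions."""
    clip_id: str
    video_path: str       # path to the audible-video clip (assumed field)
    audio_caption: str    # text describing the audio track (assumed field)
    video_caption: str    # text describing the visual content (assumed field)

def load_entries(annotation_file: str) -> list[TAVGEntry]:
    """Load annotations from a JSON Lines file (assumed format)."""
    entries = []
    with open(annotation_file, "r", encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            entries.append(TAVGEntry(
                clip_id=record["clip_id"],
                video_path=record["video_path"],
                audio_caption=record["audio_caption"],
                video_caption=record["video_caption"],
            ))
    return entries

if __name__ == "__main__":
    # Print the first few entries from a hypothetical annotation file.
    for entry in load_entries("tavgbench_annotations.jsonl")[:3]:
        print(entry.clip_id, "|", entry.video_caption, "|", entry.audio_caption)
```

In practice, the audio and video captions could be concatenated into a single prompt for text-to-audible-video generation or kept separate for modality-specific conditioning; this loader is only a starting point under the stated assumptions.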
The video and audio captions in TAVGBench are open-sourced and available for download here.
To showcase our approach, we provide a video demonstration of results produced by our text-to-audible-video generation model, giving a concrete sense of what the task and model can achieve.