Annotation files are hosted on HuggingFace.
Please download the multi-shot videos from OneDrive or HuggingFace.
We are excited to release a new video-text benchmark for multi-shot video understanding. This release contains a 134k version of our dataset. It includes detailed long summaries (human annotated + GPTV generated) for 134k videos and shot captions (human annotated) for 188k video shots. Additionally, we annotate question-answering pairs for benchmarking multi-shot video understanding.
Our 134k multi-shot videos come with detailed textual descriptions, consisting of 43k human-annotated and 90k GPTV-generated summaries, and cover over 548k video shots. The files under data/annotations/ are:
- 20k_{train/test/val}.json: The 20k-version release. The 134k version keeps the same testing/validation split.
- 90k_gptv_train.json: The 90k subset of the 134k release, whose text summaries are generated by GPTV with long visual tokens.
- 43k_human_train.json: The 43k subset of the 134k release, whose text summaries are written and corrected by human annotators, paired with 188k human-annotated video shot captions and narration captions.
- 134k_full_train.json: The full 134k release, covering 548k video shots.
- {testing/val}_qa.json: Multi-shot question-answering pairs, manually annotated and verified. We collect and annotate QA pairs on the testing and validation videos for benchmarking, covering temporal-related, holistic-understanding and audio-related aspects.
- 20k_meta.csv: Metadata of the originally released 20k multi-shot videos, including categories, original YouTube ID and start-end timestamps of each multi-shot video.
- 134k_meta.csv: Metadata of the latest released 134k multi-shot videos, including categories, original YouTube ID and start-end timestamps of each multi-shot video.
- 114k_meta.csv: Metadata of the 114k multi-shot videos added in this update, for users who already have the previous 20k version, including categories, original YouTube ID and start-end timestamps of each multi-shot video.
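The exact column names of the metadata CSV files are not listed here, so it can help to print the header and a few rows before writing any parsing code. The sketch below uses only the Python standard library. The helper name `peek_meta` is hypothetical, and nothing about the column layout is assumed.

```python
import csv
from itertools import islice


def peek_meta(path, n=3):
    """Return (column_names, first_n_rows) of a metadata CSV.

    No column names are assumed: they are read from the file's header row.
    """
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        rows = list(islice(reader, n))
        return reader.fieldnames, rows


if __name__ == "__main__":
    # Path taken from the file list above; adjust to your checkout.
    columns, rows = peek_meta("data/annotations/134k_meta.csv")
    print(columns)
    for row in rows:
        print(row)
```

Once the header is known, the same `csv.DictReader` loop can be used to look up the YouTube ID and timestamps for each video.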
Annotations are in JSON format, with each video as a JSON object:
- video, image_id, nvid: Video file name.
- id: Unique video ID.
- whole_caption: Video summary.
- whole_ASR: Full-video ASR from Whisper Large-v2.
- video_names: Array of video shot names.
- audio_captions: Array of narration captions per shot.
- captions: Array of video captions per shot.
- ASR: Array of ASR outputs from Whisper Large-v2 per shot.
Example:
[
  {
    "video": "video_name.mp4",
    "image_id": "video_name.mp4",
    "id": 0,
    "whole_caption": "summary",
    "whole_ASR": "ASR output",
    "nvid": "video_name.mp4",
    "video_names": ["shot_name1.mp4", "shot_name2.mp4"],
    "audio_captions": ["narration1", "narration2"],
    "captions": ["caption1", "caption2"],
    "ASR": ["ASR shot1", "ASR shot2"]
  },
  ...
]
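As a minimal sketch of how this structure might be consumed, the snippet below loads an annotation file and walks through the per-shot fields of one video. It assumes the per-shot arrays (`video_names`, `captions`, `audio_captions`, `ASR`) are index-aligned, which the example suggests but which is not stated explicitly. It also assumes some fields may be absent in some files, so missing arrays are treated as empty. The helper names `load_annotations` and `iter_shots` are hypothetical.

```python
import json
from itertools import zip_longest


def load_annotations(path):
    """Load one annotation file: a JSON array with one object per video."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


def iter_shots(record):
    """Yield one dict per shot, pairing the per-shot arrays by index.

    Assumption: the arrays are index-aligned. Missing or shorter arrays
    are padded with None rather than raising an error.
    """
    names = record.get("video_names", [])
    captions = record.get("captions", [])
    narrations = record.get("audio_captions", [])
    asr = record.get("ASR", [])
    for name, cap, nar, speech in zip_longest(names, captions, narrations, asr):
        yield {"shot": name, "caption": cap, "narration": nar, "asr": speech}


if __name__ == "__main__":
    # File name taken from the list above; adjust the path to your checkout.
    videos = load_annotations("data/annotations/43k_human_train.json")
    first = videos[0]
    print(first["video"], "-", first["whole_caption"])
    for shot in iter_shots(first):
        print(shot["shot"], "|", shot["caption"])
```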
We provide cached multi-shot videos on OneDrive and HuggingFace. They take around 160GB of disk space, and you need to extract the video shots yourself.
Alternatively, you can download the videos yourself:
- Access Information: YouTube video ID, chapter ID, and start-end timestamps from HD-VILA-100M are in ./data/annotations/134k_meta.csv. If you only need the videos added in the update, use ./data/annotations/114k_meta.csv.
- Download Scripts: Use our Python script ./data/scripts/download_videos.py to download videos. Ensure you have the necessary permissions.
- Video Preparation: Use our code in ./data/scripts/process_videos.py to prepare video clips and single-shot videos. As a prerequisite, please run data/scripts/get_existing_data.py so that all the downloaded raw videos are available for processing.
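To illustrate what cutting a multi-shot clip from a raw video by its start-end timestamps can look like, here is a small sketch that builds an ffmpeg command. It is not the logic of process_videos.py, which should be used for actual preparation. The function name `build_trim_command` is hypothetical, the timestamps are assumed to be in seconds, and ffmpeg is assumed to be installed and on your PATH.

```python
import subprocess


def build_trim_command(src, dst, start_sec, end_sec):
    """Build an ffmpeg command that cuts [start_sec, end_sec] out of src.

    Assumptions: timestamps are in seconds; the clip is re-encoded
    (libx264 + aac) so the cut does not depend on keyframe positions.
    """
    if end_sec <= start_sec:
        raise ValueError("end_sec must be greater than start_sec")
    return [
        "ffmpeg", "-y",            # overwrite dst if it already exists
        "-i", str(src),
        "-ss", f"{start_sec:.3f}",  # clip start, in seconds
        "-to", f"{end_sec:.3f}",    # clip end, in seconds
        "-c:v", "libx264",
        "-c:a", "aac",
        str(dst),
    ]


if __name__ == "__main__":
    # Hypothetical file names and timestamps, for illustration only.
    cmd = build_trim_command("raw_video.mp4", "clip.mp4", 12.5, 20.0)
    subprocess.run(cmd, check=True)
```

Re-encoding is slower than stream copying but gives cuts that do not snap to keyframes. If speed matters more than precision, replacing the codec options with `-c copy` is a common alternative.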
We uphold the rights of individuals and copyright holders. If you are featured in any of our video annotations or hold copyright to a video and wish to have its annotation removed from our dataset, please reach out to us. Send an email to [email protected] with the subject line beginning with Shot2Story-optout, or raise an issue with the same title format. We commit to reviewing your request promptly and taking suitable action.
Our text annotations are licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) License. They are available strictly for non-commercial research.
Please note, our dataset does not include the original videos. Users must refer to HD-VILA-100M for video access. By downloading our annotations, you agree to these terms. Respect for video copyright holders is paramount. Ensure your use of the videos aligns with the original source's terms.
We extend our thanks to the teams behind HD-VILA-100M and Whisper. Our work builds upon their valuable contributions. Please acknowledge these resources in your work.