A personal paper list on Video Moment Retrieval (VMR), also known as Natural Language Video Localization (NLVL), Temporal Sentence Grounding in Videos (TSGV), or Natural Language Query (NLQ).
- Keywords: moment retrieval, temporal grounding, video/language/moment grounding/localization, sentence grounding, etc.
Related surveys:
- A Survey of Research on Video Moment Retrieval (视频片段检索研究综述). Journal of Software (软件学报), 2020
- A survey on temporal sentence grounding in videos. arXiv, 2021
- The Elements of Temporal Sentence Grounding in Videos: A Survey and Future Directions. arXiv, 2022
Dataset | Video Source | Domain |
---|---|---|
TACoS | Kitchen | Cooking |
Charades-STA | Homes | Indoor Activity |
ActivityNet Captions | YouTube | Open |
DiDeMo | Flickr | Open |
MAD (CVPR 2022) | Movie | Open |
More detailed statistics, taken from this paper:
Dataset | # Videos | # Video-language pairs (train) | # Pairs (val) | # Pairs (test) | Vocab Size |
---|---|---|---|---|---|
ActivityNet Captions | 14926 | 37421 | 17505 | 17031 | 15406 |
TACoS | 127 | 10146 | 4589 | 4083 | 2255 |
DiDeMo | 10642 | 33005 | 4180 | 4021 | 7523 |
Charades-STA | 6670 | 12404 | - | 3720 | 1289 |
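
To compare the datasets by overall size, the split counts in the table can be summed per dataset. The snippet below is only a minimal sketch: the numbers are copied from the table, while the names `SPLITS`, `total_pairs`, and `train_fraction` are hypothetical helpers introduced for illustration and do not belong to any dataset toolkit.

```python
# Video-language pair counts per split, copied from the table above.
# None marks a split that the table does not list (Charades-STA has no val split here).
SPLITS = {
    "ActivityNet Captions": {"train": 37421, "val": 17505, "test": 17031},
    "TACoS": {"train": 10146, "val": 4589, "test": 4083},
    "DiDeMo": {"train": 33005, "val": 4180, "test": 4021},
    "Charades-STA": {"train": 12404, "val": None, "test": 3720},
}


def total_pairs(splits):
    """Sum the pair counts over all listed splits, skipping missing ones."""
    return sum(n for n in splits.values() if n is not None)


def train_fraction(splits):
    """Share of all listed pairs that fall into the training split."""
    return splits["train"] / total_pairs(splits)


if __name__ == "__main__":
    for name, splits in SPLITS.items():
        print(f"{name}: {total_pairs(splits)} pairs, "
              f"{train_fraction(splits):.1%} in train")
```

Under these counts, ActivityNet Captions is the largest of the four and Charades-STA the smallest in terms of annotated pairs.
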
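In practice, the first three datasets in the video-source table (TACoS, Charades-STA, and ActivityNet Captions) are the most widely used. Features are usually pre-extracted as follows:
- Visual: 1) 3D ConvNets, e.g. C3D, I3D; 2) 2D ConvNets, e.g. VGG
- Text: 1) pretrained word embeddings, e.g. GloVe; 2) pretrained language models, e.g. BERT
- MAD (newer): both visual and text features are extracted with CLIP

Extracted features can be downloaded from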