Final Project for UVa Vision and Language course [Homepage]
Our project enriches a sequence-to-sequence video captioning model with an action recognition model to emphasize the main action displayed in a video clip. For more details, please refer to our report. [final report]
This model introduces the action feature into the decoder GRU by using a gated GRU variant called GFRU (first introduced in: Lei Li, Yongfeng Zhang, Li Chen. Generate Neural Template Explanations for Recommendation. CIKM'20. [Paper]).
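As a rough illustration of the idea, the sketch below shows one way a GFRU-style cell could gate a caption-context GRU branch against an action-feature GRU branch inside the decoder. The class and parameter names are illustrative, not taken from our code or the NETE implementation.

```python
import torch
import torch.nn as nn

class GatedFusionGRUCell(nn.Module):
    """Illustrative GFRU-style cell: two GRU sub-cells whose hidden states
    are mixed by a learned gate, so the decoder can weigh the caption
    context against the action-recognition feature at every step."""

    def __init__(self, input_size, hidden_size, action_size):
        super().__init__()
        self.caption_cell = nn.GRUCell(input_size, hidden_size)
        self.action_cell = nn.GRUCell(action_size, hidden_size)
        # Per-dimension gate over the two candidate hidden states.
        self.gate = nn.Linear(2 * hidden_size, hidden_size)

    def forward(self, word_emb, action_feat, h_prev):
        h_cap = self.caption_cell(word_emb, h_prev)
        h_act = self.action_cell(action_feat, h_prev)
        g = torch.sigmoid(self.gate(torch.cat([h_cap, h_act], dim=-1)))
        return g * h_cap + (1.0 - g) * h_act

# Toy usage: batch of 4, 300-d word embeddings, 512-d hidden, 256-d action feature.
cell = GatedFusionGRUCell(input_size=300, hidden_size=512, action_size=256)
h = torch.zeros(4, 512)
h = cell(torch.randn(4, 300), torch.randn(4, 256), h)
print(h.shape)  # torch.Size([4, 512])
```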
For video captioning, we use the MSR-VTT dataset. Since the VTT Challenge has ended, all the data (including the test set) is available online. We downloaded and preprocessed it here: https://drive.google.com/drive/folders/1l-P5NVzGQtYWHCsbcqHrzfcxnmaVecoT?usp=sharing (NOTE: the link is restricted to people with a UVa email).
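If you are working from the raw MSR-VTT annotations rather than our processed files, a minimal sketch like the one below groups captions by video. It assumes the standard `train_val_videodatainfo.json` layout with top-level `videos` and `sentences` keys; the file names and keys in the processed Drive folder may differ.

```python
import json
from collections import defaultdict

# Assumption: the standard MSR-VTT annotation file with top-level
# "videos" and "sentences" keys; adjust the path and keys to match
# the processed files in the Drive folder.
with open("train_val_videodatainfo.json") as f:
    anno = json.load(f)

captions = defaultdict(list)
for sent in anno["sentences"]:
    captions[sent["video_id"]].append(sent["caption"])

splits = {v["video_id"]: v.get("split", "train") for v in anno["videos"]}
print(len(captions), "videos,", sum(len(c) for c in captions.values()), "captions")
```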
We reference model structures and data processing settings from the following GitHub repos:
- For video captioning (S2VT): https://github.com/xiadingZ/video-caption.pytorch
- For action recognition (ST-GCN): https://github.com/open-mmlab/mmskeleton
- Model Fusion (GFRU): https://github.com/lileipisces/NETE
Related papers and resources:
- Venugopalan, Subhashini, et al. "Sequence to Sequence - Video to Text." Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015. [Paper]
- Yan, Sijie, Yuanjun Xiong, and Dahua Lin. "Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition." arXiv preprint arXiv:1801.07455, 2018. [Paper]
- Awesome Video Captioning: https://github.com/tgc1997/Awesome-Video-Captioning