Multi-Task Video Captioning with a Stepwise Multimodal Encoder
Content Provider | MDPI |
---|---|
Author | Liu, Zihao; Wu, Xiaoyu; Yu, Ying |
Copyright Year | 2022 |
Description | Video captioning aims to generate a grammatical and accurate sentence that describes a video. Recent methods have mainly tackled this problem by considering multiple modalities, yet they have neglected the differences between modalities and the importance of shrinking the gap between video and text. This paper proposes a multi-task video-captioning method with a Stepwise Multimodal Encoder. The encoder can flexibly digest multiple modalities by assigning a proper encoding depth to each modality. We also exploit both video-to-text (V2T) and text-to-video (T2V) flows by adding an auxiliary task of video–text semantic matching. We achieve state-of-the-art performance on two widely known datasets: (1) on MSVD, our method achieves an 18% improvement in CIDEr; (2) on MSR-VTT, our method achieves a 6% improvement in CIDEr. |
Starting Page | 2639 |
e-ISSN | 2079-9292 |
DOI | 10.3390/electronics11172639 |
Journal | Electronics |
Issue Number | 17 |
Volume Number | 11 |
Language | English |
Publisher | MDPI |
Publisher Date | 2022-08-23 |
Access Restriction | Open |
Subject Keyword | Electronics; Multimodal Fusion; Video Captioning; Multi-task Learning; Computer Vision; Transformer; Artificial Intelligence |
Content Type | Text |
Resource Type | Article |