Abstract
Video captioning aims to generate natural language descriptions for a given video clip. Existing methods mainly focus on end-to-end representation learning via word-by-word comparison between predicted captions and ground-truth texts. Although significant progress has been made, such supervised approaches neglect the semantic alignment between visual and linguistic entities, which may negatively affect the generated captions. In this work, we propose a hierarchical modular network to bridge video representations and linguistic semantics at four granularities before generating captions: entity, verb, predicate, and sentence. Each level is implemented by one module that embeds the corresponding semantics into video representations. Additionally, we present a reinforcement learning module based on the scene graph of captions to better measure sentence similarity. Extensive experimental results show that the proposed method performs favorably against state-of-the-art models on three widely used benchmark datasets: the Microsoft Research Video Description Corpus (MSVD), MSR-Video to Text (MSR-VTT), and Video-and-TEXt (VATEX).
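The sketch below is a rough, hypothetical illustration of the four-granularity idea described in the abstract (entity, verb, predicate, sentence): each level is one module that projects video features into its own semantic space before caption generation. The class names, feature dimensions, mean-pooling, and concatenation fusion are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a hierarchical modular network for video captioning.
# All names, dimensions, and the fusion strategy are assumptions for illustration.
import torch
import torch.nn as nn


class SemanticModule(nn.Module):
    """Projects pooled video features into one semantic space (e.g. entity or verb)."""

    def __init__(self, video_dim: int, semantic_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(video_dim, semantic_dim),
            nn.ReLU(),
            nn.Linear(semantic_dim, semantic_dim),
        )

    def forward(self, video_feats: torch.Tensor) -> torch.Tensor:
        # video_feats: (batch, num_frames, video_dim) -> (batch, semantic_dim)
        return self.proj(video_feats.mean(dim=1))


class HierarchicalModularNetwork(nn.Module):
    """Embeds entity-, verb-, predicate-, and sentence-level semantics
    into video representations that could condition a caption decoder."""

    def __init__(self, video_dim: int = 2048, semantic_dim: int = 512):
        super().__init__()
        self.levels = nn.ModuleDict({
            name: SemanticModule(video_dim, semantic_dim)
            for name in ("entity", "verb", "predicate", "sentence")
        })

    def forward(self, video_feats: torch.Tensor) -> torch.Tensor:
        # Concatenate the four semantic embeddings into one conditioning vector.
        return torch.cat([m(video_feats) for m in self.levels.values()], dim=-1)


if __name__ == "__main__":
    model = HierarchicalModularNetwork()
    feats = torch.randn(2, 32, 2048)   # 2 clips, 32 frames each
    print(model(feats).shape)          # torch.Size([2, 2048])
```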
| Original language | English |
| --- | --- |
| Pages (from-to) | 1049-1064 |
| Number of pages | 16 |
| Journal | IEEE Transactions on Pattern Analysis and Machine Intelligence |
| Volume | 46 |
| Issue number | 2 |
| DOIs | |
| Publication status | Published - 2024 Feb 1 |
Bibliographical note
Publisher Copyright: © 1979-2012 IEEE.
All Science Journal Classification (ASJC) codes
- Software
- Computer Vision and Pattern Recognition
- Computational Theory and Mathematics
- Artificial Intelligence
- Applied Mathematics