Learning Hierarchical Modular Networks for Video Captioning

Guorong Li, Hanhua Ye, Yuankai Qi, Shuhui Wang, Laiyun Qing, Qingming Huang, Ming-Hsuan Yang

Research output: Contribution to journal › Article › peer-review

6 Citations (Scopus)

Abstract

Video captioning aims to generate natural language descriptions for a given video clip. Existing methods mainly focus on end-to-end representation learning via word-by-word comparison between predicted captions and ground-truth texts. Although significant progress has been made, such supervised approaches neglect the semantic alignment between visual and linguistic entities, which may negatively affect the generated captions. In this work, we propose a hierarchical modular network that bridges video representations and linguistic semantics at four granularities before generating captions: entity, verb, predicate, and sentence. Each level is implemented by one module that embeds the corresponding semantics into video representations. Additionally, we present a reinforcement learning module based on the scene graph of captions to better measure sentence similarity. Extensive experimental results show that the proposed method performs favorably against state-of-the-art models on three widely used benchmark datasets, including the Microsoft Research Video Description Corpus (MSVD), MSR-Video to Text (MSR-VTT), and Video-and-TEXt (VATEX).
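
As a rough illustration of the architecture described in the abstract, the sketch below projects pooled video features into four semantic spaces (entity, verb, predicate, sentence), one module per granularity, and fuses them into a single representation that a caption decoder could consume. All class names, dimensions, the temporal pooling, and the fusion step are assumptions made for illustration; this is not the authors' implementation.

    # Minimal sketch (assumed design, not the paper's code): one module per
    # linguistic granularity, each embedding video features into that level's
    # semantic space, followed by a simple fusion for a downstream decoder.
    import torch
    import torch.nn as nn

    class SemanticLevelModule(nn.Module):
        """Embeds video features into one granularity's semantic space."""
        def __init__(self, video_dim: int, semantic_dim: int):
            super().__init__()
            self.project = nn.Sequential(
                nn.Linear(video_dim, semantic_dim),
                nn.ReLU(),
                nn.Linear(semantic_dim, semantic_dim),
            )

        def forward(self, video_feats: torch.Tensor) -> torch.Tensor:
            # video_feats: (batch, num_frames, video_dim) -> (batch, semantic_dim)
            pooled = video_feats.mean(dim=1)  # simple temporal pooling for illustration
            return self.project(pooled)

    class HierarchicalSemanticEncoder(nn.Module):
        """Bridges video features and linguistic semantics at four granularities."""
        def __init__(self, video_dim: int = 1024, semantic_dim: int = 512):
            super().__init__()
            self.levels = nn.ModuleDict({
                name: SemanticLevelModule(video_dim, semantic_dim)
                for name in ("entity", "verb", "predicate", "sentence")
            })
            self.fuse = nn.Linear(4 * semantic_dim, semantic_dim)

        def forward(self, video_feats: torch.Tensor) -> torch.Tensor:
            embeddings = [m(video_feats) for m in self.levels.values()]
            # Fused representation would feed a caption decoder in a full model.
            return self.fuse(torch.cat(embeddings, dim=-1))

    if __name__ == "__main__":
        encoder = HierarchicalSemanticEncoder()
        dummy_video = torch.randn(2, 20, 1024)  # 2 clips, 20 frames, 1024-d features
        print(encoder(dummy_video).shape)       # torch.Size([2, 512])
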

Original language: English
Pages (from-to): 1049-1064
Number of pages: 16
Journal: IEEE Transactions on Pattern Analysis and Machine Intelligence
Volume: 46
Issue number: 2
DOIs
Publication status: Published - 2024 Feb 1

Bibliographical note

Publisher Copyright:
© 1979-2012 IEEE.

All Science Journal Classification (ASJC) codes

  • Software
  • Computer Vision and Pattern Recognition
  • Computational Theory and Mathematics
  • Artificial Intelligence
  • Applied Mathematics
