Space-Time Memory Networks for Video Object Segmentation With User Guidance

Seoung Wug Oh, Joon Young Lee, Ning Xu, Seon Joo Kim

Research output: Contribution to journal › Article › peer-review

12 Citations (Scopus)


We propose a novel and unified solution for user-guided video object segmentation tasks. In this work, we consider two scenarios of user-guided segmentation: semi-supervised and interactive segmentation. Due to the nature of the problem, available cues - video frame(s) with object masks (or scribbles) - become richer with the intermediate predictions (or additional user inputs). However, the existing methods make it impossible to fully exploit this rich source of information. We resolve the issue by leveraging memory networks and learning to read relevant information from all available sources. In the semi-supervised scenario, the previous frames with object masks form an external memory, and the current frame as the query is segmented using the information in the memory. Similarly, to work with user interactions, the frames that are given user inputs form the memory that guides segmentation. Internally, the query and the memory are densely matched in the feature space, covering all the space-time pixel locations in a feed-forward fashion. The abundant use of the guidance information allows us to better handle challenges such as appearance changes and occlusions. We validate our method on the latest benchmark sets and achieve state-of-the-art performance along with a fast runtime.
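The dense query-to-memory matching described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the similarity function or the normalization, so the dot-product similarity, the softmax weighting, the key/value split, and every name below (`memory_read`, `query_key`, `memory_key`, `memory_value`) are assumptions made for illustration.

```python
import numpy as np


def memory_read(query_key, memory_key, memory_value):
    """Soft, dense read from a space-time memory (illustrative sketch).

    query_key:    (H, W, Ck)     key features of the current (query) frame
    memory_key:   (T, H, W, Ck)  key features of the T memory frames
    memory_value: (T, H, W, Cv)  value features of the T memory frames
    returns:      (H, W, Cv)     value retrieved for every query pixel
    """
    H, W, Ck = query_key.shape
    Cv = memory_value.shape[-1]

    # Flatten so that each row is one pixel location.
    q = query_key.reshape(H * W, Ck)   # (HW, Ck)
    k = memory_key.reshape(-1, Ck)     # (THW, Ck), all space-time locations
    v = memory_value.reshape(-1, Cv)   # (THW, Cv)

    # Assumed similarity: dot product between every query pixel and
    # every memory pixel across all memory frames.
    sim = q @ k.T                      # (HW, THW)

    # Assumed normalization: softmax over the memory locations,
    # with the row maximum subtracted for numerical stability.
    sim -= sim.max(axis=1, keepdims=True)
    weights = np.exp(sim)
    weights /= weights.sum(axis=1, keepdims=True)

    # Each query pixel receives a weighted sum of memory values.
    return (weights @ v).reshape(H, W, Cv)
```

Because the memory is flattened over time as well as space, adding a frame (an intermediate prediction or a newly annotated frame) only appends rows to `k` and `v`; the read itself stays a single feed-forward matrix operation.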

Original language: English
Pages (from-to): 442-455
Number of pages: 14
Journal: IEEE Transactions on Pattern Analysis and Machine Intelligence
Issue number: 1
Publication status: Published - 2022 Jan

Bibliographical note

Publisher Copyright:
© 2020 IEEE.

All Science Journal Classification (ASJC) codes

  • Software
  • Computer Vision and Pattern Recognition
  • Computational Theory and Mathematics
  • Artificial Intelligence
  • Applied Mathematics

