Eidetic 3D LSTM: A model for video prediction and beyond

Yunbo Wang, Lu Jiang, Ming Hsuan Yang, Li Jia Li, Mingsheng Long, Li Fei-Fei

    Research output: Contribution to conference › Paper › peer-review

    106 Citations (Scopus)


    Spatiotemporal predictive learning, though long considered a promising self-supervised feature learning method, has seldom shown its effectiveness beyond future video prediction. The reason is that it is difficult to learn good representations for both short-term frame dependencies and long-term high-level relations. We present a new model, Eidetic 3D LSTM (E3D-LSTM), that integrates 3D convolutions into RNNs. The encapsulated 3D-Conv makes the local perceptrons of RNNs motion-aware and enables the memory cell to store better short-term features. For long-term relations, we make the present memory state interact with its historical records via a gate-controlled self-attention module. We describe this memory transition mechanism as eidetic because it can effectively recall stored memories across multiple timestamps, even after long periods of disturbance. We first evaluate the E3D-LSTM network on widely used future video prediction datasets and achieve state-of-the-art performance. We then show that the E3D-LSTM network also performs well on early activity recognition, where the goal is to infer what is happening or what will happen after observing only a limited number of video frames. This task aligns well with video prediction in modeling action intentions and tendencies.
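The abstract does not give the memory update equations, so the following is only a minimal sketch of one plausible reading of "gate-controlled self-attention" over historical memory states. It assumes that a recall gate acts as the attention query, that the stored memory states act as both keys and values, and that the recalled content is added to the previous memory and layer-normalized. The function names (`recall`, `eidetic_update`), the flattened `(positions, channels)` layout, and the random stand-ins for gate outputs are illustrative assumptions; in the actual model the gates would be produced by 3D convolutions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Normalize each position over its channel dimension.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def recall(query, history):
    """Attend from the current query over all stored memory states.

    query:   (n, d)      recall-gate output, n positions with d channels
    history: (tau, n, d) the last tau memory states
    returns: (n, d)      attention-weighted mix of historical memories
    """
    tau, n, d = history.shape
    keys = history.reshape(tau * n, d)   # keys and values are the same records
    weights = softmax(query @ keys.T)    # (n, tau * n), each row sums to 1
    return weights @ keys

def eidetic_update(recall_gate, input_gate, candidate, history):
    # Assumed form: new memory = gated input + normalized (previous + recalled).
    recalled = recall(recall_gate, history)
    return input_gate * candidate + layer_norm(history[-1] + recalled)

# Random arrays stand in for gate outputs (hypothetical sizes).
rng = np.random.default_rng(0)
tau, n, d = 4, 6, 8
history = rng.normal(size=(tau, n, d))
recall_gate = rng.normal(size=(n, d))
input_gate = 1.0 / (1.0 + np.exp(-rng.normal(size=(n, d))))  # sigmoid
candidate = np.tanh(rng.normal(size=(n, d)))

c_new = eidetic_update(recall_gate, input_gate, candidate, history)
print(c_new.shape)
```

Because the attention weights span every stored timestamp, in this sketch a state written many steps earlier can still contribute directly to the new memory, which is the behavior the abstract calls eidetic.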

    Original language: English
    Publication status: Published - 2019 Jan 1
    Event: 7th International Conference on Learning Representations, ICLR 2019 - New Orleans, United States
    Duration: 2019 May 6 → 2019 May 9


    Conference: 7th International Conference on Learning Representations, ICLR 2019
    Country/Territory: United States
    City: New Orleans

    All Science Journal Classification (ASJC) codes

    • Education
    • Computer Science Applications
    • Linguistics and Language
    • Language and Linguistics

