AOBERT: All-modalities-in-One BERT for multimodal sentiment analysis

Kyeonghun Kim, Sanghyun Park

Research output: Contribution to journal › Article › peer-review

21 Citations (Scopus)


Multimodal sentiment analysis uses multiple modalities, such as text, vision, and speech, to predict sentiment. Because each modality has its own characteristics, methods have been developed for fusing their features. However, traditional fusion methods lose some intra-modality and inter-modality information, so the overall characteristics of the modalities are not guaranteed to be preserved. To solve this problem, we introduce a single-stream transformer, All-modalities-in-One BERT (AOBERT). The model is pre-trained on two tasks simultaneously: Multimodal Masked Language Modeling (MMLM) and Alignment Prediction (AP). These two pre-training tasks allow the model to learn the dependencies and relationships between modalities. AOBERT achieved state-of-the-art results on the CMU-MOSI, CMU-MOSEI, and UR-FUNNY datasets. Furthermore, ablation studies covering combinations of modalities, the effects of MMLM and AP, and fusion methods confirmed the effectiveness of the proposed model.
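As a rough illustration of the two pre-training tasks named in the abstract, the sketch below builds one single-stream training example in plain Python. It is not the authors' implementation. The function name `build_example`, the special tokens, the segment-id scheme, the masking and mismatch probabilities, and the choice to mask only text tokens and to create misaligned examples by swapping in text from another clip are all assumptions made for illustration; the abstract does not specify these details.

```python
import random

MASK = "[MASK]"  # assumed placeholder token, as in standard BERT


def build_example(text, vision, speech, distractor_text, rng,
                  mask_prob=0.15, misalign_prob=0.5):
    """Build one single-stream pre-training example (illustrative only)."""
    # Alignment Prediction (assumed form): with some probability, swap in
    # text from another clip so the modalities no longer belong together.
    aligned = rng.random() >= misalign_prob
    tokens = list(text) if aligned else list(distractor_text)

    # Masked modeling (assumed form): hide a random subset of text tokens
    # and record the originals, keyed by their position in the final sequence.
    mlm_targets = {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            mlm_targets[i + 1] = tok  # +1 accounts for the leading [CLS]
            tokens[i] = MASK

    # Single stream: all modalities are concatenated into one sequence,
    # with a segment id marking which modality each position came from.
    sequence = (["[CLS]"] + tokens + ["[SEP]"]
                + list(vision) + ["[SEP]"] + list(speech))
    segment_ids = ([0] * (len(tokens) + 2)
                   + [1] * (len(vision) + 1)
                   + [2] * len(speech))
    return {
        "sequence": sequence,
        "segment_ids": segment_ids,
        "mlm_targets": mlm_targets,   # targets for the masked-modeling loss
        "aligned": int(aligned),      # label for the alignment-prediction loss
    }


if __name__ == "__main__":
    example = build_example(
        text=["the", "movie", "was", "great"],
        vision=["v0", "v1"],          # stand-ins for visual feature vectors
        speech=["s0", "s1"],          # stand-ins for acoustic feature vectors
        distractor_text=["it", "rained", "all", "day"],
        rng=random.Random(0),
    )
    print(example)
```

In a real model the vision and speech entries would be feature vectors projected into the transformer's embedding space, and the two labels would feed two loss terms that are summed during pre-training; here they are left as plain Python values to keep the sketch self-contained.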

Original language: English
Pages (from-to): 37-45
Number of pages: 9
Journal: Information Fusion
Publication status: Published - 2023 Apr

Bibliographical note

Publisher Copyright:
© 2022 Elsevier B.V.

All Science Journal Classification (ASJC) codes

  • Software
  • Signal Processing
  • Information Systems
  • Hardware and Architecture

