Perfect Match: Self-Supervised Embeddings for Cross-Modal Retrieval

Soo Whan Chung, Joon Son Chung, Hong Goo Kang

Research output: Contribution to journal › Article › peer-review

10 Citations (Scopus)


This paper proposes a new strategy for learning effective cross-modal joint embeddings using self-supervision. We set up the problem as one of cross-modal retrieval, where the objective is to find the most relevant data in one domain given an input in another. The method builds on recent advances in learning representations from cross-modal self-supervision using contrastive or binary cross-entropy loss functions. To investigate the robustness of the proposed learning strategy across multi-modal applications, we perform experiments on two applications: audio-visual synchronisation and cross-modal biometrics. The audio-visual synchronisation task requires temporal correspondence between modalities to obtain a joint representation of phonemes and visemes, and the cross-modal biometrics task requires common speaker representations given their face images and audio tracks. Experiments show that the performance of systems trained using the proposed method far exceeds that of existing methods on both tasks, whilst allowing significantly faster training.
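The retrieval setting described in the abstract can be illustrated with a minimal sketch. It assumes that embeddings from both modalities already live in a shared space and that relevance is measured by cosine similarity; the abstract specifies neither, and this is not the authors' training method. The vectors and the names `cosine_similarity`, `retrieve`, `audio_query`, and `face_candidates` are hypothetical and chosen only for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query, candidates):
    """Return candidate indices ranked from most to least similar to the query."""
    scores = [cosine_similarity(query, c) for c in candidates]
    return sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)


# Hypothetical hand-written embeddings; a trained model would produce these.
audio_query = [0.9, 0.1, 0.0]      # embedding of a query in one modality
face_candidates = [                # embeddings of candidates in the other modality
    [0.0, 1.0, 0.0],
    [1.0, 0.2, 0.1],
    [0.1, 0.0, 1.0],
]

ranking = retrieve(audio_query, face_candidates)
print(ranking)  # indices of candidates, best match first
```

With these toy vectors the second candidate is ranked first because it points in nearly the same direction as the query. In practice the quality of the ranking depends entirely on how the joint embedding was learned.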

Original language: English
Article number: 9067055
Pages (from-to): 568-576
Number of pages: 9
Journal: IEEE Journal on Selected Topics in Signal Processing
Issue number: 3
Publication status: Published - 2020 Mar

Bibliographical note

Funding Information:
Manuscript received September 16, 2019; revised March 24, 2020; accepted April 5, 2020. Date of publication April 14, 2020; date of current version June 24, 2020. This work was supported by Naver Corporation. The guest editor coordinating the review of this manuscript and approving it for publication was Dr. Xiaodong He. (Corresponding author: Hong Goo Kang.) Soo-Whan Chung and Hong-Goo Kang are with the Department of Electrical and Electronic Engineering, Yonsei University, Seoul 03722, South Korea (e-mail:;

Publisher Copyright:
© 2007-2012 IEEE.

All Science Journal Classification (ASJC) codes

  • Signal Processing
  • Electrical and Electronic Engineering

