Abstract
This paper proposes a new strategy for learning powerful cross-modal embeddings for audio-to-video synchronisation. Here, we set up the problem as one of cross-modal retrieval, where the objective is to find the most relevant audio segment given a short video clip. The method builds on recent advances in learning representations from cross-modal self-supervision. The main contributions of this paper are as follows: (1) we propose a new learning strategy in which the embeddings are learnt via a multi-way matching problem, as opposed to the binary classification (matching or non-matching) problem used in recent work; (2) we demonstrate that the performance of this method far exceeds the existing baselines on the synchronisation task; (3) we apply the self-supervised learnt embeddings to visual speech recognition, and show that the performance matches that of representations learnt end-to-end in a fully supervised manner.
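To make the multi-way matching idea concrete, here is a minimal sketch rather than the authors' implementation. It assumes PyTorch, assumes the video and audio encoders have already produced fixed-dimensional embeddings, and assumes negative squared Euclidean distance as the similarity score: one video embedding is scored against N candidate audio segments (one aligned, N-1 temporally shifted), and a softmax cross-entropy over the N candidates replaces the per-pair binary classifier.

```python
# Minimal sketch of a multi-way matching loss (PyTorch assumed;
# encoder architectures and the distance-to-logit mapping are
# assumptions, not the paper's exact implementation).
import torch
import torch.nn.functional as F

def multiway_matching_loss(video_feat: torch.Tensor,
                           audio_feats: torch.Tensor,
                           target_idx: torch.Tensor) -> torch.Tensor:
    """
    video_feat:  (B, D)    embedding of each video clip
    audio_feats: (B, N, D) embeddings of N candidate audio segments
    target_idx:  (B,)      index of the correctly aligned segment
    """
    # Negative squared Euclidean distance as a similarity logit
    # (an assumption; any monotone distance-to-logit map would do).
    diff = audio_feats - video_feat.unsqueeze(1)   # (B, N, D)
    logits = -diff.pow(2).sum(dim=-1)              # (B, N)
    # Softmax cross-entropy over the N candidates: the multi-way
    # matching objective, as opposed to a binary match/no-match one.
    return F.cross_entropy(logits, target_idx)
```

At inference, the same scores support the retrieval view of synchronisation described above: slide the audio window over candidate offsets and pick the offset whose segment embedding is closest to the video embedding (the argmax over the logits).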
Original language | English
---|---
Title of host publication | 2019 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2019 - Proceedings
Publisher | Institute of Electrical and Electronics Engineers Inc.
Pages | 3965-3969
Number of pages | 5
ISBN (Electronic) | 9781479981311
DOIs |
Publication status | Published - May 2019
Event | 44th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2019 - Brighton, United Kingdom; Duration: 12 May 2019 → 17 May 2019
Publication series
Name | ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
---|---
Volume | 2019-May
ISSN (Print) | 1520-6149
Conference
Conference | 44th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2019
---|---
Country/Territory | United Kingdom
City | Brighton
Period | 12 May 2019 → 17 May 2019
Bibliographical note
Publisher Copyright: © 2019 IEEE.
All Science Journal Classification (ASJC) codes
- Software
- Signal Processing
- Electrical and Electronic Engineering