Temporal filtering of visual speech for audio-visual speech recognition in acoustically and visually challenging environments

Jong Seok Lee, Cheol Hoon Park

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

1 Citation (Scopus)

Abstract

The use of visual information of speech has been shown to be effective for compensating for the performance degradation of acoustic speech recognition in noisy environments. However, visual noise is ignored in most audio-visual speech recognition systems, even though it can be introduced into visual speech signals during acquisition or transmission. In this paper, we present a new temporal filtering technique for extracting noise-robust visual features. In the proposed method, a carefully designed band-pass filter is applied to the temporal pixel value sequences of lip region images in order to remove unwanted temporal variations caused by visual noise, illumination conditions, or speakers' appearances. We demonstrate that the method improves not only visual speech recognition performance for clean and noisy images but also audio-visual speech recognition performance in both acoustically and visually noisy conditions.
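The core operation described in the abstract is a band-pass filter applied independently to each pixel's value trajectory over time within the cropped lip region. The following is a minimal sketch of that idea in Python, assuming a Butterworth band-pass design applied with zero-phase filtering; the frame rate, passband edges, and filter order are illustrative assumptions, not the values used in the paper.

# Sketch: temporal band-pass filtering of lip-region pixel sequences.
# The cutoff frequencies and filter order below are placeholders for
# illustration only; the paper's actual filter design is not reproduced here.
import numpy as np
from scipy.signal import butter, filtfilt

def temporal_bandpass(lip_frames, fps=30.0, low_hz=1.0, high_hz=10.0, order=4):
    """Filter each pixel's temporal trajectory in a (T, H, W) lip-ROI sequence.

    lip_frames : ndarray of shape (T, H, W), grayscale lip-region images.
    fps        : video frame rate in Hz (assumed).
    low_hz, high_hz : passband edges in Hz (assumed values).
    """
    nyquist = fps / 2.0
    b, a = butter(order, [low_hz / nyquist, high_hz / nyquist], btype="band")
    # filtfilt runs the filter forward and backward along the time axis,
    # giving a zero-phase band-pass of every pixel value sequence.
    return filtfilt(b, a, lip_frames.astype(np.float64), axis=0)

if __name__ == "__main__":
    frames = np.random.rand(75, 32, 32)   # 75 frames of a 32x32 lip ROI
    filtered = temporal_bandpass(frames)
    print(filtered.shape)                 # (75, 32, 32)

The intent of such a filter is to pass the temporal frequency range typical of articulatory lip motion while suppressing slow drifts (illumination, appearance) and fast fluctuations (frame-level visual noise).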

Original language: English
Title of host publication: Proceedings of the 9th International Conference on Multimodal Interfaces, ICMI'07
Pages: 220-227
Number of pages: 8
DOIs
Publication status: Published - 2007
Event: 9th International Conference on Multimodal Interfaces, ICMI 2007 - Nagoya, Japan
Duration: 2007 Nov 12 - 2007 Nov 15

Publication series

Name: Proceedings of the 9th International Conference on Multimodal Interfaces, ICMI'07

Other

Other: 9th International Conference on Multimodal Interfaces, ICMI 2007
Country/Territory: Japan
City: Nagoya
Period: 07/11/12 - 07/11/15

All Science Journal Classification (ASJC) codes

  • Artificial Intelligence
  • Computer Graphics and Computer-Aided Design
  • Computer Vision and Pattern Recognition
  • Human-Computer Interaction

