An Asynchronous Hidden Markov Model for Audio-Visual Speech Recognition
Neural Information Processing Systems
An EM algorithm to train the model is presented, as well as a Viterbi decoder that can be used to obtain the optimal state sequence together with the alignment between the two sequences. One such task, which will be presented in this paper, is multimodal speech recognition using both a microphone and a camera that record a speaker simultaneously while they speak. It is indeed well known that seeing the speaker's face in addition to hearing their voice can often improve speech intelligibility, particularly in noisy environments [7], mainly thanks to the complementarity of the visual and acoustic signals. While in the former solution the alignment between the two sequences is decided a priori, in the latter there is no explicit learning of the joint probability of the two sequences. In fact, the model allows the streams to desynchronize by temporarily stretching one of them in order to obtain a better match between the corresponding frames. The model can thus be directly applied to the problem of audio-visual speech recognition, where, for instance, the lips sometimes start to move before any sound is heard.
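The stream-stretching idea in the decoder can be illustrated with a simplified dynamic program: at each audio frame the decoder either emits the frame jointly with the next video frame (both streams advance) or emits the audio frame alone (the video stream is held back, i.e. stretched). The sketch below is a hypothetical, heavily simplified illustration of this alignment step only; the score arrays `joint_ll` and `audio_ll` stand in for the model's emission probabilities, and the full asynchronous HMM's hidden states and transition probabilities are omitted.

```python
import numpy as np

def align_streams(joint_ll, audio_ll):
    """Toy Viterbi-style alignment of an audio stream against a video stream.

    joint_ll[t, s] : log-score of emitting audio frame t jointly with video
                     frame s (both streams advance).
    audio_ll[t]    : log-score of emitting audio frame t alone (the video
                     stream is "stretched", i.e. stays put).
    Returns the best total log-score and, per audio frame, the index of the
    video frame emitted with it (or None if the video stream was held back).
    """
    T, S = joint_ll.shape
    NEG = -np.inf
    # dp[t, s]: best score after audio frames 0..t with s video frames consumed
    dp = np.full((T, S + 1), NEG)
    back = np.zeros((T, S + 1), dtype=int)  # 1 if video advanced at step t
    dp[0, 0] = audio_ll[0]
    dp[0, 1] = joint_ll[0, 0]
    back[0, 1] = 1
    for t in range(1, T):
        for s in range(S + 1):
            stay = dp[t - 1, s] + audio_ll[t]              # hold video back
            adv = dp[t - 1, s - 1] + joint_ll[t, s - 1] if s > 0 else NEG
            if adv > stay:
                dp[t, s], back[t, s] = adv, 1
            else:
                dp[t, s] = stay
    # Backtrack from the cell where all S video frames were consumed.
    s, path = S, []
    for t in range(T - 1, -1, -1):
        if back[t, s]:
            s -= 1
            path.append(s)      # video frame s emitted with audio frame t
        else:
            path.append(None)   # audio frame t emitted alone
    path.reverse()
    return dp[T - 1, S], path
```

In the full model, the choice of advancing or holding a stream would additionally depend on the hidden state and be weighted by learned transition probabilities; here the decision is driven purely by the two score arrays.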
Dec-31-2003