Audio-visual recognition of overlapped speech for the LRS2 Dataset

Authors: Jianwei Yu, Shixiong Zhang, Jian Wu, Shahram Ghorbani, Bo Wu, Shiyin Kang, Shansong Liu, Xunying Liu, Helen Meng, Dong Yu

Abstract: Automatic recognition of overlapped speech remains a highly challenging task to date. Motivated by the bimodal nature of human speech perception, this paper investigates the use of audio-visual technologies for overlapped speech recognition. Three issues associated with the construction of audio-visual speech recognition (AVSR) systems are addressed. First, the basic architecture designs of AVSR systems, i.e. end-to-end and hybrid, are investigated. Second, purposefully designed modality fusion gates are used to robustly integrate the audio and visual features. Third, in contrast to a traditional pipelined architecture containing explicit speech separation and recognition components, a streamlined and integrated AVSR system optimized consistently using the lattice-free MMI (LF-MMI) discriminative criterion is also proposed. The proposed LF-MMI time-delay neural network (TDNN) system establishes the state of the art on the LRS2 dataset. Experiments on overlapped speech simulated from the LRS2 dataset suggest the proposed AVSR system outperforms the audio-only baseline LF-MMI DNN system by up to 29.98% absolute in word error rate (WER) reduction, and produces recognition performance comparable to a more complex pipelined system. Consistent performance improvements of 4.89% absolute in WER reduction over the baseline AVSR system using feature fusion are also obtained.
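
As a rough illustration of the modality fusion strategies mentioned in the abstract (and labelled (a)-(c) in the integrated-system figure further down), the sketch below contrasts plain feature concatenation with an audio-driven gate (A⊗V) and an audio-visual-driven gate (AV⊗V) applied to the visual stream. This is a minimal PyTorch sketch under assumed feature dimensions, with a sigmoid gate modulating the visual features elementwise; it is not the authors' exact network design.

```python
import torch
import torch.nn as nn

class ModalityFusion(nn.Module):
    """Sketch of three audio-visual fusion variants (shapes and gating are assumptions)."""
    def __init__(self, audio_dim=80, visual_dim=256, mode="av_gate"):
        super().__init__()
        self.mode = mode
        if mode == "a_gate":       # A(x)V: gate predicted from the audio features only
            self.gate = nn.Sequential(nn.Linear(audio_dim, visual_dim), nn.Sigmoid())
        elif mode == "av_gate":    # AV(x)V: gate predicted from audio and visual features
            self.gate = nn.Sequential(nn.Linear(audio_dim + visual_dim, visual_dim), nn.Sigmoid())

    def forward(self, audio, visual):
        # audio: (batch, time, audio_dim), visual: (batch, time, visual_dim)
        if self.mode == "concat":          # (a) plain feature concatenation
            gated_visual = visual
        elif self.mode == "a_gate":        # (b) audio-driven modality fusion gate
            gated_visual = self.gate(audio) * visual
        else:                              # (c) audio-visual-driven modality fusion gate
            gated_visual = self.gate(torch.cat([audio, visual], dim=-1)) * visual
        # the fused features would then feed the downstream recognition network
        return torch.cat([audio, gated_visual], dim=-1)

# toy usage: 2 utterances, 100 frames
a, v = torch.randn(2, 100, 80), torch.randn(2, 100, 256)
fused = ModalityFusion(mode="av_gate")(a, v)
print(fused.shape)  # torch.Size([2, 100, 336])
```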

Pipelined and integrated AVSR systems:


Audio-visual Speech Separation Network Architecture:


Separated audio samples of -5dB mixture:

(Samples 1–3: mixture input, audio-only separation, audio-visual separation, ground truth.)

Separated audio samples of 0dB mixture:

(Samples 1–3: mixture input, audio-only separation, audio-visual separation, ground truth.)

Separated audio samples of 5dB mixture:

(Samples 1–3: mixture input, audio-only separation, audio-visual separation, ground truth.)

Separated audio samples of 10dB mixture:

(Samples 1–3: mixture input, audio-only separation, audio-visual separation, ground truth.)

Si-SNR evaluation results of the separation system

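For reference, the Si-SNR numbers reported here use the scale-invariant SNR metric that is standard in speech separation. The snippet below is a minimal NumPy sketch of that textbook definition (zero-mean signals, projection onto the reference, 10·log10 of the energy ratio); it is not the authors' evaluation script.

```python
import numpy as np

def si_snr(estimate, reference, eps=1e-8):
    """Scale-invariant SNR (dB) between an estimated and a reference waveform."""
    est = estimate - np.mean(estimate)          # remove DC offset
    ref = reference - np.mean(reference)
    # project the estimate onto the reference to obtain the scaled target component
    target = np.dot(est, ref) / (np.dot(ref, ref) + eps) * ref
    noise = est - target                        # residual treated as distortion/noise
    return 10.0 * np.log10((np.sum(target**2) + eps) / (np.sum(noise**2) + eps))

# toy check: rescaling the estimate does not change the score (scale invariance)
ref = np.random.randn(16000)
noisy = ref + 0.1 * np.random.randn(16000)
print(si_snr(noisy, ref), si_snr(0.5 * noisy, ref))
```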

Integrated system architecture

(a) Feature concatenation (concat)
(b) Audio-driven modality fusion gate (A⊗V)
(c) Audio-visual-driven modality fusion gate (AV⊗V)

Word Error Rate (WER) of Pipelined and Integrated Systems

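The WER figures quoted on this page follow the usual definition: word-level substitutions, deletions and insertions divided by the number of reference words. The snippet below is a small self-contained sketch of that standard edit-distance computation, not the exact scoring pipeline used for these results.

```python
def wer(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)  # sub / del / ins
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1 deletion / 6 words ≈ 0.167
```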

References

[1] K. Noda, Y. Yamaguchi, K. Nakadai, et al., "Audio-visual speech recognition using deep learning," Applied Intelligence, vol. 42, no. 4, pp. 722–737, 2015.

[2] M. Kolbæk, D. Yu, Z.-H. Tan, et al., "Multitalker speech separation with utterance-level permutation invariant training of deep recurrent neural networks," IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP), vol. 25, no. 10, pp. 1901–1913, 2017.

[3] J. Wu, Y. Xu, S.-X. Zhang, L.-W. Chen, M. Yu, L. Xie, and D. Yu, "Time domain audio visual speech separation," arXiv preprint arXiv:1904.03760, 2019.