Abstract: Automatic recognition of overlapped speech remains a highly challenging task to date. Motivated by the bimodal nature of human speech perception, this paper investigates the use of audio-visual technologies for overlapped speech recognition. Three issues associated with the construction of audio-visual speech recognition (AVSR) systems are addressed. First, the basic architecture designs of AVSR systems, i.e. end-to-end and hybrid, are investigated. Second, purposefully designed modality fusion gates are used to robustly integrate the audio and visual features. Third, in contrast to a traditional pipelined architecture containing explicit speech separation and recognition components, a streamlined and integrated AVSR system optimized consistently using the lattice-free MMI (LF-MMI) discriminative criterion is also proposed. The proposed LF-MMI time-delay neural network (TDNN) system establishes the state-of-the-art for the LRS2 dataset. Experiments on overlapped speech simulated from the LRS2 dataset suggest that the proposed AVSR system outperforms the audio-only baseline LF-MMI DNN system by up to 29.98% absolute in word error rate (WER) reduction, and produces recognition performance comparable to that of a more complex pipelined system. A consistent performance improvement of 4.89% absolute in WER reduction over the baseline AVSR system using feature fusion is also obtained.
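As a rough illustration of the modality fusion gate mentioned in the abstract, the sketch below shows one common form of gated audio-visual feature fusion, in which a learned sigmoid gate scales the visual contribution before it is added to the audio stream. The class name ModalityFusionGate, the feature dimensionalities and the additive gating form are illustrative assumptions, not the paper's exact design.

```python
# Illustrative (assumed) gated audio-visual fusion layer in PyTorch.
import torch
import torch.nn as nn

class ModalityFusionGate(nn.Module):  # hypothetical name/design
    def __init__(self, audio_dim, visual_dim, fused_dim):
        super().__init__()
        self.proj_a = nn.Linear(audio_dim, fused_dim)             # project audio features
        self.proj_v = nn.Linear(visual_dim, fused_dim)            # project visual features
        self.gate = nn.Linear(audio_dim + visual_dim, fused_dim)  # gate computed from both modalities

    def forward(self, a, v):
        # a: (batch, frames, audio_dim), v: (batch, frames, visual_dim), frame-synchronous
        g = torch.sigmoid(self.gate(torch.cat([a, v], dim=-1)))   # per-dimension gate in [0, 1]
        return self.proj_a(a) + g * self.proj_v(v)                # gated additive fusion

# Example: fuse 80-dim acoustic frames with 256-dim lip embeddings (dimensions assumed).
fusion = ModalityFusionGate(audio_dim=80, visual_dim=256, fused_dim=256)
fused = fusion(torch.randn(4, 100, 80), torch.randn(4, 100, 256))
print(fused.shape)  # torch.Size([4, 100, 256])
```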
The overlapped speech is simulated from the test set of LRS2 [1].
The separation models described below are used in the pipelined AVSR system.
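The exact mixing script is not given on this page; the following is a minimal sketch, assuming the interfering utterance is rescaled to reach a target signal-to-interference ratio, of how two test utterances could be overlapped at the -5/0/5/10 dB conditions listed below. The function name mix_at_snr is hypothetical.

```python
# Minimal sketch (not the authors' script) of overlapping two utterances at a
# target signal-to-interference ratio, e.g. the -5/0/5/10 dB conditions below.
import numpy as np

def mix_at_snr(target, interferer, snr_db):  # hypothetical helper
    """Scale `interferer` so that 10*log10(P_target / P_interferer) == snr_db, then add."""
    n = min(len(target), len(interferer))
    target, interferer = target[:n], interferer[:n]
    p_t = np.mean(target ** 2)
    p_i = np.mean(interferer ** 2) + 1e-12
    scale = np.sqrt(p_t / (p_i * 10.0 ** (snr_db / 10.0)))
    return target + scale * interferer

# Example with random signals standing in for two 1-second 16 kHz test utterances.
rng = np.random.default_rng(0)
mixture = mix_at_snr(rng.standard_normal(16000), rng.standard_normal(16000), snr_db=5.0)
```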
For audio-only separation, we used the uPIT-BLSTM model [2] with three BLSTM layers and a dropout rate of 0.5, and adopted the phase-sensitive mask (PSM) as the training target, which was shown to outperform the ideal ratio mask (IRM). 257-dimensional linear spectrograms (10 ms hop, Hann window) were extracted as input features. The model produces two separated outputs for each two-speaker mixture, and we select the one with the higher Si-SNR with respect to the ground-truth target speech as the corresponding separated result.
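As a concrete reading of the selection rule above, the sketch below scores each separated output by scale-invariant SNR (Si-SNR) against the ground-truth target and keeps the higher-scoring one. The helper names si_snr and pick_target_stream are ours, not the authors' code.

```python
# Our reading of the selection rule above: score each separated stream by
# Si-SNR against the ground-truth target and keep the best one.
import numpy as np

def si_snr(est, ref, eps=1e-8):
    """Scale-invariant SNR in dB between an estimated and a reference waveform."""
    est = est - est.mean()
    ref = ref - ref.mean()
    proj = (np.dot(est, ref) / (np.dot(ref, ref) + eps)) * ref  # projection of est onto ref
    noise = est - proj
    return 10.0 * np.log10((np.dot(proj, proj) + eps) / (np.dot(noise, noise) + eps))

def pick_target_stream(streams, reference):  # hypothetical helper
    """Return the separated stream with the higher Si-SNR w.r.t. the reference."""
    return max(streams, key=lambda s: si_snr(s, reference))
```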
For audio-visual separation, the frequency-domain model [3] presented here uses 321-dimensional spectrograms (10 ms hop, Hann window) as audio features and lip embeddings extracted by LipNet from the mouth region-of-interest (ROI) as visual features. The model directly generates the time-frequency (TF) mask of the target speaker.
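For illustration, the sketch below applies a predicted TF mask to the mixture spectrogram and inverts it back to a waveform. The 16 kHz sample rate and 640-point FFT (which yields the 321 frequency bins mentioned above) are assumptions, as is the function name apply_mask.

```python
# Sketch of TF-mask application and waveform reconstruction. 16 kHz audio and a
# 640-point FFT (giving the 321 frequency bins mentioned above) are assumptions.
import numpy as np
from scipy.signal import stft, istft

FS, NFFT, HOP = 16000, 640, 160  # 10 ms hop, Hann window, 321 frequency bins

def apply_mask(mixture, mask):  # hypothetical helper
    """Apply a (321, frames) mask predicted for the target speaker to the mixture."""
    _, _, spec = stft(mixture, fs=FS, window='hann', nperseg=NFFT, noverlap=NFFT - HOP)
    frames = min(spec.shape[1], mask.shape[1])
    masked = spec[:, :frames] * mask[:, :frames]          # element-wise TF masking
    _, separated = istft(masked, fs=FS, window='hann', nperseg=NFFT, noverlap=NFFT - HOP)
    return separated
```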
Separated audio samples of the -5 dB mixtures:
[Samples 1–3: audio players for the mixture input, audio-only separation, audio-visual separation, and ground truth.]
Separated audio samples of the 0 dB mixtures:
[Samples 1–3: audio players for the mixture input, audio-only separation, audio-visual separation, and ground truth.]
Separated audio samples of the 5 dB mixtures:
[Samples 1–3: audio players for the mixture input, audio-only separation, audio-visual separation, and ground truth.]
Separated audio samples of the 10 dB mixtures:
[Samples 1–3: audio players for the mixture input, audio-only separation, audio-visual separation, and ground truth.]
Si-SNR evaluation results of the separation system
[1] Noda K., Yamaguchi Y., Nakadai K., et al. Audio-visual speech recognition using deep learning. Applied Intelligence, 2015, 42(4): 722–737.
[2] Kolbæk M., Yu D., Tan Z. H., et al. Multitalker speech separation with utterance-level permutation invariant training of deep recurrent neural networks. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2017, 25(10): 1901–1913.