Deep Learning-based Non-Intrusive Multi-Objective Speech Assessment Model with Cross-Domain Features

Zezario, Ryandhimas E., Fu, Szu-Wei, Chen, Fei, Fuh, Chiou-Shann, Wang, Hsin-Min, Tsao, Yu

arXiv.org Artificial Intelligence 

Ryandhimas E. Zezario is with the Department of Computer Science and [...]. Fei Chen is with the Department of Electrical and Electronic Engineering, Southern University of Science and Technology of China, Shenzhen, China. Hsin-Min Wang is with the Institute of Information Science, Academia Sinica, Taipei, Taiwan.

Abstract--In this study, we propose a cross-domain multi-objective speech assessment model called MOSA-Net, which [...]. More specifically, MOSA-Net is designed to estimate the speech [...] LCC by 0.021 (0.985 vs 0.964 in seen noise environments) [...]. For example, QIA-SE can improve PESQ by 0.301 ([...] 2.478 in unseen noise environments) over a CNN-based baseline SE model.

Index Terms--non-intrusive speech assessment models, deep learning, multi-objective learning, speech enhancement.

Speech assessment metrics are indicators that quantitatively measure the specific attributes of speech signals. [...] This testing strategy is prohibitive and may not always be feasible. Hence, several objective evaluation metrics have been developed as surrogates for human listening tests [6]-[31]. [...] comprises two stages. The first stage includes a series of signal processing units designed to convert speech waveforms [...]

To attain a higher assessment accuracy, the MBNet adopts the BiasNet architecture to compensate for the biased scores of a certain judge [49]. In addition, the multi-task learning criterion [...]. Meanwhile, different acoustic features are used as input to the assessment model to consider information from different acoustic domains [51], [52].
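The abstract reports its gains in terms of LCC. The surviving text does not expand the abbreviation, so the sketch below assumes that LCC denotes the linear (Pearson) correlation coefficient between predicted and ground-truth assessment scores. The function name `linear_correlation` and the score values are hypothetical and chosen only for illustration; they are not data or code from the paper.

```python
import math


def linear_correlation(predicted, reference):
    """Pearson linear correlation between two equal-length score lists."""
    if len(predicted) != len(reference) or len(predicted) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_r = sum(reference) / n
    # Covariance and variances, all left unnormalised (the 1/n factors cancel).
    cov = sum((p - mean_p) * (r - mean_r) for p, r in zip(predicted, reference))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_r = sum((r - mean_r) ** 2 for r in reference)
    if var_p == 0 or var_r == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_p * var_r)


# Hypothetical scores, for illustration only (not data from the paper).
reference_scores = [1.2, 2.0, 2.7, 3.1, 3.9]
predicted_scores = [1.4, 1.9, 2.5, 3.3, 3.8]
lcc = linear_correlation(predicted_scores, reference_scores)
print(f"LCC = {lcc:.3f}")
```

Under this assumption, a value closer to 1 means that the predicted scores track the reference scores more closely, so a reported difference such as 0.985 vs 0.964 would be read as a tighter linear fit for the first system.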