Recent advances in speech synthesis suggest that limitations such as the lossy nature of the amplitude spectrum with minimum phase approximation and the over-smoothing effect in acoustic modeling can be overcome by using advanced machine learning approaches. In this paper, we build a framework in which we can fairly compare new vocoding and acoustic modeling techniques with conventional approaches by means of a large-scale crowdsourced evaluation. Results on acoustic models showed that generative adversarial networks and an autoregressive (AR) model both outperformed a conventional recurrent network, with the AR model performing best. An evaluation of vocoders using the same AR acoustic model demonstrated that a WaveNet vocoder outperformed classical source-filter-based vocoders. In particular, speech waveforms generated by combining the AR acoustic model with the WaveNet vocoder achieved a speech quality score similar to that of vocoded speech.
Xu, Jiaming (Chinese Academy of Sciences, Institute of Automation) | Shi, Jing (Chinese Academy of Sciences, Institute of Automation) | Liu, Guangcan (Chinese Academy of Sciences, Institute of Automation) | Chen, Xiuyi (Chinese Academy of Sciences, Institute of Automation) | Xu, Bo (Chinese Academy of Sciences, Institute of Automation)
Developing a computational auditory model to solve the cocktail party problem has long bedeviled scientists, especially for single-microphone recordings. Although recent deep learning based frameworks have made significant progress in multi-talker mixed speech separation, most existing methods focus on separating all the speech channels rather than selectively attending to the target speech and ignoring other sounds. They may therefore fail to offer a satisfactory solution in a complex auditory scene where the number of input sounds is usually uncertain and even dynamic. In this work, we employ ideas from auditory selective attention in behavioral and cognitive neuroscience and from recent advances in memory-augmented neural networks. Specifically, a unified Auditory Selection framework with Attention and Memory (dubbed ASAM) is proposed. Our ASAM first accumulates prior knowledge (that is, the acoustic features of one specific speaker) into a life-long memory during the training phase; meanwhile, a speech perceptor is trained to extract the temporal acoustic features and update the memory online when salient speech is given. Then, the learned memory is used to interact with the mixture input to attend to and filter the target frequency out from the mixture stream. Finally, the network is trained to minimize the reconstruction error of the attended speech. We evaluate the proposed approach on the WSJ0 and THCHS-30 datasets, and the experimental results demonstrate that our approach successfully conducts two auditory selection tasks: the top-down task-specific attention (e.g. to follow a conversation with a friend) and the bottom-up stimulus-driven attention (e.g.
In this paper we demonstrate speech synthesis using different electroencephalography (EEG) feature sets recently introduced in . We make use of a recurrent neural network (RNN) regression model to predict acoustic features directly from EEG features. We demonstrate our results using EEG features recorded in parallel with spoken speech as well as using EEG recorded in parallel with listening utterances. We provide EEG based speech synthesis results for four subjects in this paper and our results demonstrate the feasibility of synthesizing speech directly from EEG features.
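As a rough illustration of the regression setup described above, the following minimal sketch runs a recurrent network over a sequence of EEG feature frames and emits one acoustic feature vector per frame. It assumes a plain tanh recurrent cell, and the function name `rnn_regress`, the weight names, and all dimensions are hypothetical; the abstract does not specify the cell type or the feature sizes.

```python
import numpy as np

def rnn_regress(eeg, Wx, Wh, Wo, bh, bo):
    """Map a sequence of EEG frames (T, d_in) to acoustic frames (T, d_out).

    Assumption: a plain tanh recurrent cell with a linear output layer,
    used here only to illustrate frame-wise regression.
    """
    h = np.zeros(Wh.shape[0])                 # initial hidden state
    outputs = []
    for x_t in eeg:                           # one EEG feature frame per step
        h = np.tanh(Wx @ x_t + Wh @ h + bh)   # update the hidden state
        outputs.append(Wo @ h + bo)           # linear regression output
    return np.stack(outputs)

# Hypothetical sizes: 30-dim EEG features, 16 hidden units, 13-dim acoustic features.
rng = np.random.default_rng(0)
d_in, d_hid, d_out, T = 30, 16, 13, 50
Wx = rng.normal(scale=0.1, size=(d_hid, d_in))
Wh = rng.normal(scale=0.1, size=(d_hid, d_hid))
Wo = rng.normal(scale=0.1, size=(d_out, d_hid))
bh, bo = np.zeros(d_hid), np.zeros(d_out)
predicted = rnn_regress(rng.normal(size=(T, d_in)), Wx, Wh, Wo, bh, bo)
```

In training, the weights would be fitted by minimizing a regression loss (for example, mean squared error) between `predicted` and the acoustic features of the parallel speech.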
Feature-mapping with deep neural networks is commonly used for single-channel speech enhancement, in which a feature-mapping network directly transforms the noisy features to the corresponding enhanced ones and is trained to minimize the mean squared error between the enhanced and clean features. In this paper, we propose an adversarial feature-mapping (AFM) method for speech enhancement which advances the feature-mapping approach with adversarial learning. An additional discriminator network is introduced to distinguish the enhanced features from the real clean ones. The two networks are jointly optimized to minimize the feature-mapping loss and simultaneously mini-maximize the discrimination loss. The distribution of the enhanced features is further pushed towards that of the clean features through this adversarial multi-task training. To achieve better performance on the ASR task, senone-aware (SA) AFM is further proposed, in which an acoustic model network is jointly trained with the feature-mapping and discriminator networks to optimize the senone classification loss in addition to the AFM losses. Evaluated on the CHiME-3 dataset, the proposed AFM achieves 16.95% and 5.27% relative word error rate (WER) improvements over the real noisy data and the feature-mapping baseline, respectively, and the SA-AFM achieves a 9.85% relative WER improvement over the multi-conditional acoustic model.
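One way to read the two-part objective described above is sketched below. It assumes the discrimination loss is a binary cross-entropy on a discriminator that outputs the probability that its input is clean, and that the feature mapper minimizes the feature-mapping loss minus a weighted discrimination loss while the discriminator minimizes the discrimination loss. The function names and the weight `lam` are hypothetical and not taken from the paper.

```python
import numpy as np

def mse_loss(enhanced, clean):
    """Feature-mapping loss: mean squared error between enhanced and clean features."""
    return float(np.mean((enhanced - clean) ** 2))

def discrimination_loss(p_clean, p_enhanced, eps=1e-12):
    """Binary cross-entropy of a discriminator that outputs P(input is clean).

    p_clean:    discriminator outputs on real clean features
    p_enhanced: discriminator outputs on enhanced features
    """
    return float(-np.mean(np.log(p_clean + eps))
                 - np.mean(np.log(1.0 - p_enhanced + eps)))

def afm_objectives(enhanced, clean, p_clean, p_enhanced, lam=1.0):
    """Return (mapper objective, discriminator objective), both to be minimized.

    Assumption: the mapper minimizes the MSE while maximizing the
    discrimination loss (hence the minus sign); the discriminator
    minimizes the discrimination loss.
    """
    l_fm = mse_loss(enhanced, clean)
    l_d = discrimination_loss(p_clean, p_enhanced)
    return l_fm - lam * l_d, l_d

# Toy batch: 8 frames of 40-dim features and an undecided discriminator.
rng = np.random.default_rng(0)
clean = rng.normal(size=(8, 40))
enhanced = clean + 0.1 * rng.normal(size=(8, 40))
mapper_obj, disc_obj = afm_objectives(enhanced, clean,
                                      p_clean=np.full(8, 0.5),
                                      p_enhanced=np.full(8, 0.5))
```

When the discriminator cannot tell enhanced from clean features (all outputs 0.5), the discrimination loss equals 2 ln 2, which is the point the mapper pushes towards.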
Recent neural networks such as WaveNet and sampleRNN that learn directly from speech waveform samples have achieved very high-quality synthetic speech in terms of both naturalness and speaker similarity, even in multi-speaker text-to-speech synthesis systems. Such neural networks are being used as an alternative to vocoders and hence are often called neural vocoders. The neural vocoder uses acoustic features as local condition parameters, and these parameters need to be accurately predicted by another acoustic model. However, it is not yet clear how to train this acoustic model, which is problematic because the final quality of synthetic speech is significantly affected by the performance of the acoustic model. Significant degradation occurs especially when predicted acoustic features have mismatched characteristics compared to natural ones. To reduce the mismatch between natural and generated acoustic features, we propose frameworks that incorporate either a conditional generative adversarial network (GAN) or its variant, Wasserstein GAN with gradient penalty (WGAN-GP), into multi-speaker speech synthesis that uses the WaveNet vocoder. We also extend the GAN frameworks and use the discretized-mixture-of-logistics (DML) loss of a well-trained WaveNet, in addition to mean squared error and adversarial losses, as part of the objective functions. Experimental results show that acoustic models trained using the WGAN-GP framework with the back-propagated DML loss achieve the highest subjective evaluation scores in terms of both quality and speaker similarity.
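The gradient penalty that distinguishes WGAN-GP from a plain Wasserstein GAN can be sketched as follows. This is a toy illustration, not the paper's setup: it assumes an unconditional critic of the form D(x) = v · tanh(W x), chosen only because its input gradient has a closed form, and the names `critic`, `critic_input_grad`, and `wgan_gp_critic_loss` are hypothetical. The penalty weight `lam=10.0` is a commonly used default, taken here as an assumption.

```python
import numpy as np

def critic(x, W, v):
    """Toy critic D(x) = v . tanh(W x), applied row-wise to a batch x of shape (n, d)."""
    return np.tanh(x @ W.T) @ v

def critic_input_grad(x, W, v):
    """Analytic gradient of D with respect to each input row, shape (n, d)."""
    h = np.tanh(x @ W.T)                 # hidden activations, shape (n, k)
    return ((1.0 - h ** 2) * v) @ W      # chain rule through tanh

def wgan_gp_critic_loss(real, fake, W, v, lam=10.0, rng=None):
    """Critic loss: E[D(fake)] - E[D(real)] + lam * E[(||grad D(x_hat)|| - 1)^2]."""
    rng = np.random.default_rng(0) if rng is None else rng
    eps = rng.uniform(size=(real.shape[0], 1))   # one mixing weight per sample
    x_hat = eps * real + (1.0 - eps) * fake      # points between real and fake
    grad_norm = np.linalg.norm(critic_input_grad(x_hat, W, v), axis=1)
    penalty = lam * np.mean((grad_norm - 1.0) ** 2)
    wasserstein = np.mean(critic(fake, W, v)) - np.mean(critic(real, W, v))
    return float(wasserstein + penalty), float(penalty)

# Toy data standing in for natural ("real") and generated ("fake") acoustic features.
rng = np.random.default_rng(1)
W, v = rng.normal(size=(3, 4)), rng.normal(size=3)
real, fake = rng.normal(size=(5, 4)), rng.normal(size=(5, 4))
loss, penalty = wgan_gp_critic_loss(real, fake, W, v)
```

In a conditional setting such as the one described above, the critic would additionally receive the conditioning input, and the gradient would normally be obtained by automatic differentiation rather than by hand.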