Collaborating Authors

 Lee, Soo-Young


Multi-speaker Emotional Text-to-speech Synthesizer

arXiv.org Artificial Intelligence

We present a methodology for training a multi-speaker emotional text-to-speech synthesizer that can express 7 different emotions for each of 10 speakers. All silences are removed from the audio samples prior to training, which speeds up learning. Curriculum learning is applied to train the model efficiently: it is first trained on a large single-speaker neutral dataset, then on neutral speech from all speakers, and finally on emotional speech from all speakers. In each stage, training samples of every speaker-emotion pair appear in mini-batches with equal probability. Through this procedure, our model can synthesize speech for all targeted speakers and emotions. Our synthesized audio samples are available on our web page.
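The equal-probability sampling over speaker-emotion pairs can be realized with a two-stage draw: pick a pair uniformly, then a sample uniformly within it. A minimal Python sketch (the dictionary layout, batch size, and toy data are illustrative assumptions, not details from the paper):

```python
import random

def balanced_batch(samples_by_pair, batch_size, rng=random):
    """Draw a mini-batch in which every (speaker, emotion) pair is
    equally likely, regardless of how many samples each pair has.

    samples_by_pair: dict mapping (speaker, emotion) -> list of samples.
    """
    pairs = list(samples_by_pair.keys())
    batch = []
    for _ in range(batch_size):
        pair = rng.choice(pairs)                          # uniform over pairs
        batch.append(rng.choice(samples_by_pair[pair]))   # uniform within pair
    return batch

# Toy usage: 2 speakers x 2 emotions with unbalanced sample counts.
data = {
    ("spk0", "neutral"): [f"n{i}" for i in range(100)],
    ("spk0", "happy"):   [f"h{i}" for i in range(5)],
    ("spk1", "neutral"): [f"N{i}" for i in range(50)],
    ("spk1", "happy"):   [f"H{i}" for i in range(10)],
}
print(balanced_batch(data, batch_size=8))
```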


A Fully Time-domain Neural Model for Subband-based Speech Synthesizer

arXiv.org Artificial Intelligence

This paper introduces a deep neural network model for a subband-based speech synthesizer. The model exploits the short bandwidth of the subband signals to reduce the complexity of the time-domain speech generator. We employ multi-level wavelet analysis/synthesis to decompose and reconstruct the signal into subbands in the time domain. Inspired by WaveNet, a convolutional neural network (CNN) model predicts the subband speech signals entirely in the time domain. Because of the subbands' short bandwidth, a simple network architecture suffices to model their patterns accurately. In ground-truth experiments with teacher forcing, the subband synthesizer significantly outperforms the fullband model in terms of both subjective and objective measures. In addition, by conditioning the model on the phoneme sequence using a pronunciation dictionary, we achieve a fully time-domain neural model for a subband-based text-to-speech (TTS) synthesizer that is nearly end-to-end. The generated speech of the subband TTS shows quality comparable to the fullband one, with a lighter network architecture for each subband.
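The analysis/synthesis step maps naturally onto a discrete wavelet transform library. A minimal sketch using PyWavelets (the `db4` wavelet, the 2-level depth, and the toy tone are illustrative assumptions; the paper's exact filter bank may differ):

```python
import numpy as np
import pywt  # PyWavelets

# Illustrative signal: 1 second of a 440 Hz tone at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 440 * t).astype(np.float32)

# Multi-level wavelet analysis: split the fullband signal into
# short-bandwidth subband coefficient arrays in the time domain.
coeffs = pywt.wavedec(x, wavelet="db4", level=2)
# coeffs = [approx_level2, detail_level2, detail_level1]

# A neural vocoder would predict each subband array; here we simply
# verify that synthesis reconstructs the original signal.
x_rec = pywt.waverec(coeffs, wavelet="db4")
print(np.max(np.abs(x - x_rec[: len(x)])))  # tiny residual: near-perfect reconstruction
```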


Adjusting Pleasure-Arousal-Dominance for Continuous Emotional Text-to-speech Synthesizer

arXiv.org Artificial Intelligence

Emotion is not limited to discrete categories such as happy, sad, angry, fearful, disgusted, and surprised. Instead, each emotion category can be projected onto a set of nearly independent dimensions named pleasure (or valence), arousal, and dominance, known as PAD. The value of each dimension varies from -1 to 1, so that the neutral emotion lies at the center with all-zero values. Training an emotional continuous text-to-speech (TTS) synthesizer on these independent dimensions makes emotional speech synthesis with unlimited emotion categories possible. Our end-to-end neural speech synthesizer is based on the well-known Tacotron. Empirically, we found the optimal network architecture for injecting the 3D PAD values. Moreover, the PAD values are adjusted for the purpose of speech synthesis.
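As one illustration of how a 3-D PAD vector can condition a Tacotron-style model, the PyTorch sketch below projects the PAD values and concatenates them to every encoder frame. The injection point, dimensions, and `tanh` projection are assumptions for the sketch, not the empirically tuned architecture the abstract refers to:

```python
import torch
import torch.nn as nn

class PADConditioner(nn.Module):
    """Illustrative conditioning module: projects a 3-D PAD vector
    (pleasure, arousal, dominance, each in [-1, 1]) and concatenates
    it to every frame of the text-encoder output."""

    def __init__(self, enc_dim=512, pad_dim=3, proj_dim=64):
        super().__init__()
        self.proj = nn.Linear(pad_dim, proj_dim)

    def forward(self, enc_out, pad_vec):
        # enc_out: (batch, time, enc_dim); pad_vec: (batch, 3)
        e = torch.tanh(self.proj(pad_vec))            # (batch, proj_dim)
        e = e.unsqueeze(1).expand(-1, enc_out.size(1), -1)
        return torch.cat([enc_out, e], dim=-1)        # (batch, time, enc_dim + proj_dim)

cond = PADConditioner()
enc = torch.randn(2, 50, 512)                 # dummy encoder output
pad = torch.tensor([[0.0, 0.0, 0.0],          # neutral: all-zero PAD
                    [0.8, 0.6, -0.2]])        # e.g. a happy-like point
print(cond(enc, pad).shape)                   # torch.Size([2, 50, 576])
```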


End-to-end Multimodal Emotion and Gender Recognition with Dynamic Weights of Joint Loss

arXiv.org Machine Learning

Multi-task learning (MTL) is one method for improving the generalizability of multiple tasks. To perform multiple classification tasks with one neural network model, the losses of the individual tasks must be combined. Previous studies have mostly trained models for multiple tasks using a joint loss with static weights, and the choice of weights between tasks has received little attention: they are typically set uniformly or empirically. In this study, we propose a method for constructing a joint loss with dynamic weights, which improves the total performance across tasks rather than the performance of any individual task, and we apply this method to an end-to-end multimodal emotion and gender recognition model using audio and video data. The approach yields appropriate weights for each task's loss by the end of training. In our experiments, emotion and gender recognition with the proposed method achieves a lower joint loss, computed as the negative log-likelihood, than the same model trained with static weights. The proposed model also generalizes better than the compared models. To the best of our knowledge, this work is the first to demonstrate the strength of dynamically weighted joint losses for maximizing total performance in emotion and gender recognition.
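The abstract does not state the exact weighting rule, but one well-known dynamic scheme is homoscedastic-uncertainty weighting (Kendall et al., 2018), in which a learnable log-variance per task sets the weights during training. A minimal PyTorch sketch of that scheme, offered as an analogy rather than the paper's method:

```python
import torch
import torch.nn as nn

class DynamicJointLoss(nn.Module):
    """Uncertainty-based weighting: joint = sum_i exp(-s_i) * L_i + s_i,
    where s_i = log(sigma_i^2) is learned jointly with the model, so the
    task weights evolve during training instead of being fixed by hand."""

    def __init__(self, num_tasks=2):
        super().__init__()
        self.log_vars = nn.Parameter(torch.zeros(num_tasks))

    def forward(self, losses):
        # losses: iterable of scalar task losses, e.g. [emotion_nll, gender_nll]
        total = 0.0
        for s, loss in zip(self.log_vars, losses):
            total = total + torch.exp(-s) * loss + s
        return total

joint = DynamicJointLoss(num_tasks=2)
emotion_loss = torch.tensor(1.2)   # dummy per-task negative log-likelihoods
gender_loss = torch.tensor(0.4)
print(joint([emotion_loss, gender_loss]))
# In training, optimize the model parameters and joint.log_vars together.
```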


Combining ICA and Top-Down Attention for Robust Speech Recognition

Neural Information Processing Systems

We present an algorithm that compensates for the mismatches between the characteristics of real-world problems and the assumptions of the independent component analysis (ICA) algorithm. To provide additional information to the ICA network, we incorporate top-down selective attention. An MLP classifier is added to the separated signal channel, and the error of the classifier is backpropagated to the ICA network. This backpropagation yields an estimate of the expected ICA output signal for the top-down attention. The unmixing matrix is then retrained according to a new cost function that represents the backpropagated error as well as independence.
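A minimal NumPy sketch of one such combined update, blending an Infomax-style independence gradient with an error gradient backpropagated from the classifier. The `tanh` score function, blending weight, and learning rate are assumptions for illustration, not the paper's exact cost function:

```python
import numpy as np

def ica_attention_step(W, x, grad_from_classifier, lr=0.01, lam=0.5):
    """One illustrative update of the unmixing matrix W under a combined
    cost: a natural-gradient independence term plus an error term
    backpropagated from the classifier.

    x: (n,) mixed observation
    grad_from_classifier: dE/du, gradient of the classifier error with
        respect to the separated outputs u = W x (from MLP backprop).
    """
    u = W @ x
    # Independence term: natural-gradient Infomax-style update.
    g = np.tanh(u)
    grad_indep = (np.eye(len(u)) - np.outer(g, u)) @ W
    # Top-down attention term: move u to reduce the classifier error.
    grad_att = -np.outer(grad_from_classifier, x)
    return W + lr * (grad_indep + lam * grad_att)

# Toy usage with random data and a dummy classifier gradient.
rng = np.random.default_rng(0)
W = np.eye(2) + 0.01 * rng.standard_normal((2, 2))
x = rng.standard_normal(2)
W = ica_attention_step(W, x, grad_from_classifier=rng.standard_normal(2))
print(W)
```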


Robust Recognition of Noisy and Superimposed Patterns via Selective Attention

Neural Information Processing Systems

In many classification tasks, recognition accuracy is low because input patterns are corrupted by noise or are spatially or temporally overlapping. We propose an approach to overcoming these limitations based on a model of human selective attention. The model, an early selection filter guided by top-down attentional control, entertains each candidate output class in sequence and adjusts attentional gain coefficients in order to produce a strong response for that class. The chosen class is then the one that obtains the strongest response with the least modulation of attention. We present simulation results on classification of corrupted and superimposed handwritten digit patterns, showing a significant improvement in recognition rates.
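The search over candidate classes can be sketched as per-class optimization of input gain coefficients, scoring each class by its response minus the amount of attentional modulation it required. A PyTorch illustration (the scoring rule, step count, and penalty form are assumptions; the paper's model details may differ):

```python
import torch
import torch.nn as nn

def attend_and_classify(model, x, num_classes, steps=20, lr=0.1):
    """Illustrative early-selection attention: for each candidate class,
    learn input gain coefficients that maximize the network's confidence
    in that class, then score the class by its response penalized by how
    far the gains had to move from 1 (no modulation)."""
    scores = []
    for c in range(num_classes):
        gains = torch.ones_like(x, requires_grad=True)
        opt = torch.optim.SGD([gains], lr=lr)
        for _ in range(steps):
            opt.zero_grad()
            logits = model(gains * x)              # attention-gated input
            loss = nn.functional.cross_entropy(
                logits.unsqueeze(0), torch.tensor([c]))
            loss.backward()
            opt.step()
        with torch.no_grad():
            response = model(gains * x)[c]
            modulation = (gains - 1.0).pow(2).mean()
            scores.append(response - modulation)   # strong response, small change
    return int(torch.stack(scores).argmax())

# Toy usage with a random linear "classifier" on a 784-dim pattern.
model = nn.Linear(784, 10)
x = torch.randn(784)
print(attend_and_classify(model, x, num_classes=10))
```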


Active Noise Canceling Using Analog Neuro-Chip with On-Chip Learning Capability

Neural Information Processing Systems

A modular analog neuro-chip set with on-chip learning capability is developed for active noise canceling. The chip set incorporates the error backpropagation learning rule for practical applications and allows pin-to-pin interconnections for multi-chip boards. The developed neuro-board demonstrated active noise canceling without any digital signal processor. Multi-path fading of the acoustic channels, random noise, and nonlinear distortion of the loudspeaker are compensated by the adaptive learning circuits of the neuro-chips. Experimental results are reported for real-time cancellation of car noise.
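For intuition, the adaptive cancellation loop can be sketched in software with a single linear neuron trained online by the LMS rule, the gradient-descent special case of backpropagation. The signals, acoustic path, and step size below are toy assumptions; the chip itself trains a multi-layer analog network:

```python
import numpy as np

rng = np.random.default_rng(1)

# Classic adaptive noise cancellation: primary = signal + filtered noise;
# a reference noise measurement feeds an adaptive predictor whose output
# is subtracted from the primary signal.
n, taps, mu = 5000, 8, 0.01
signal = np.sin(2 * np.pi * 0.01 * np.arange(n))       # desired signal
noise = rng.standard_normal(n)                          # reference noise
h = np.array([0.6, -0.3, 0.1])                          # unknown acoustic path
primary = signal + np.convolve(noise, h, mode="same")   # corrupted input

w = np.zeros(taps)
out = np.zeros(n)
for t in range(taps, n):
    ref = noise[t - taps:t][::-1]   # recent reference samples
    y = w @ ref                     # predicted noise at the primary mic
    e = primary[t] - y              # canceller output = error signal
    w += mu * e * ref               # online LMS weight update
    out[t] = e

# After adaptation, the output should track the clean signal.
print(np.mean((out[-1000:] - signal[-1000:]) ** 2))
```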