Technology
A Hybrid Neural Net System for State-of-the-Art Continuous Speech Recognition
Zavaliagkos, G., Zhao, Y., Schwartz, R., Makhoul, J.
Untill recently, state-of-the-art, large-vocabulary, continuous speech recognition (CSR) has employed Hidden Markov Modeling (HMM) to model speech sounds. In an attempt to improve over HMM we developed a hybrid system that integrates HMM technology with neural networks. We present the concept of a "Segmental Neural Net" (SNN) for phonetic modeling in CSR. By taking into account all the frames of a phonetic segment simultaneously, the SNN overcomes the well-known conditional-independence limitation of HMMs. In several speaker-independent experiments with the DARPA Resource Management corpus, the hybrid system showed a consistent improvement in performance over the baseline HMM system. 1 INTRODUCTION The current state of the art in continuous speech recognition (CSR) is based on the use of hidden Markov models (HMM) to model phonemes in context.
Transient Signal Detection with Neural Networks: The Search for the Desired Signal
Príncipe, José Carlos, Zahalka, Abir
Matched filtering has been one of the most powerful techniques employed for transient detection. Here we will show that a dynamic neural network outperforms the conventional approach. When the artificial neural network (ANN) is trained with supervised learning schemes there is a need to supply the desired signal for all time, although we are only interested in detecting the transient. In this paper we also show the effects on the detection agreement of different strategies to construct the desired signal. The extension of the Bayes decision rule (011 desired signal), optimal in static classification, performs worse than desired signals constructed by random noise or prediction during the background.
Modeling Consistency in a Speaker Independent Continuous Speech Recognition System
Konig, Yochai, Morgan, Nelson, Wooters, Chuck, Abrash, Victor, Cohen, Michael, Franco, Horacio
We would like to incorporate speaker-dependent consistencies, such as gender, in an otherwise speaker-independent speech recognition system. In this paper we discuss a Gender Dependent Neural Network (GDNN) which can be tuned for each gender, while sharing most of the speaker independent parameters. We use a classification network to help generate gender-dependent phonetic probabilities for a statistical (HMM) recognition system. The gender classification net predicts the gender with high accuracy, 98.3% on a Resource Management test set. However, the integration of the GDNN into our hybrid HMM-neural network recognizer provided an improvement in the recognition score that is not statistically significant on a Resource Management test set.
A Hybrid Linear/Nonlinear Approach to Channel Equalization Problems
Channel equalization problem is an important problem in high-speed communications. The sequences of symbols transmitted are distorted by neighboring symbols. Traditionally, the channel equalization problem is considered as a channel-inversion operation. One problem of this approach is that there is no direct correspondence between error probability and residual error produced by the channel inversion operation. In this paper, the optimal equalizer design is formulated as a classification problem. The optimal classifier can be constructed by Bayes decision rule. In general it is nonlinear. An efficient hybrid linear/nonlinear equalizer approach has been proposed to train the equalizer. The error probability of new linear/nonlinear equalizer has been shown to be better than a linear equalizer in an experimental channel. 1 INTRODUCTION
Analog Cochlear Model for Multiresolution Speech Analysis
Liu, Weimin, Andreou, Andreas G., Jr., Moise H. Goldstein
The tradeoff between time and frequency resolution is viewed as the fundamental difference between conventional spectrographic analysis and cochlear signal processing for broadband, rapid-changing signals. The model's response exhibits a wavelet-like analysis in the scale domain that preserves good temporal resolution; the frequency of each spectral component in a broadband signal can be accurately determined from the interpeak intervals in the instantaneous firing rates of auditory fibers. Such properties of the cochlear model are demonstrated with natural speech and synthetic complex signals. 1 Introduction As a nonparametric tool, spectrogram, or short-term Fourier transform, is widely used in analyzing non-stationary signals, such speech. Usually a window is applied to the running signal and then the Fourier transform is performed. The specific window applied determines the tradeoff between temporal and spectral resolutions of the analysis, as indicated by the uncertainty principle [1].
Physiologically Based Speech Synthesis
Hirayama, Makoto, Vatikiotis-Bateson, Eric, Honda, Kiyoshi, Koike, Yasuharu, Kawato, Mitsuo
This study demonstrates a paradigm for modeling speech production based on neural networks. Using physiological data from speech utterances, a neural network learns the forward dynamics relating motor commands to muscles and the ensuing articulator behavior that allows articulator trajectories to be generated from motor commands constrained by phoneme input strings and global performance parameters. From these movement trajectories, a second neural network generates PARCOR parameters that are then used to synthesize the speech acoustics.