Goto

Collaborating Authors

 Waibel, Alex


Multilingual Adaptation of RNN Based ASR Systems

arXiv.org Artificial Intelligence

In this work, we focus on multilingual systems based on recurrent neural networks (RNNs), trained using the Connectionist Temporal Classification (CTC) loss function. Using a multilingual set of acoustic units poses difficulties. To address this issue, we proposed Language Feature Vectors (LFVs) to train language-adaptive multilingual systems. Language adaptation, in contrast to speaker adaptation, needs to be applied not only at the feature level, but also to deeper layers of the network. In this work, we therefore extended our previous approach by introducing a novel technique which we call "modulation". Based on this method, we modulated the hidden layers of RNNs using LFVs. We evaluated this approach in both full and low-resource conditions, as well as for grapheme- and phone-based systems. Using modulation lowered error rates across all conditions.
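As a rough sketch of what such modulation could look like, the example below scales the hidden activations of a bidirectional LSTM with a gate computed from a language feature vector. The module name, dimensions, and sigmoid gating are illustrative assumptions, not the paper's exact formulation.

```python
# Hypothetical sketch: modulating RNN hidden layers with a Language
# Feature Vector (LFV). The sigmoid gate and all dimensions are
# illustrative assumptions.
import torch
import torch.nn as nn

class LFVModulatedRNN(nn.Module):
    def __init__(self, input_dim, hidden_dim, lfv_dim, num_classes):
        super().__init__()
        self.rnn = nn.LSTM(input_dim, hidden_dim,
                           batch_first=True, bidirectional=True)
        # Projects the LFV to one multiplicative weight per hidden unit.
        self.modulator = nn.Linear(lfv_dim, 2 * hidden_dim)
        self.out = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, feats, lfv):
        # feats: (batch, time, input_dim); lfv: (batch, lfv_dim)
        h, _ = self.rnn(feats)
        gate = torch.sigmoid(self.modulator(lfv)).unsqueeze(1)
        h = h * gate        # modulate every time step of the hidden layer
        return self.out(h)  # per-frame logits, to be trained with CTC

model = LFVModulatedRNN(input_dim=40, hidden_dim=320, lfv_dim=32,
                        num_classes=50)
logits = model(torch.randn(4, 100, 40), torch.randn(4, 32))
```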


Adaptively Growing Hierarchical Mixtures of Experts

Neural Information Processing Systems

We propose a novel approach to automatically growing and pruning Hierarchical Mixtures of Experts. The constructive algorithm proposed here enables large hierarchies consisting of several hundred experts to be trained effectively. We show that HMEs trained by our automatic growing procedure yield better generalization performance than traditional static and balanced hierarchies. Evaluation of the algorithm is performed (1) on vowel classification and (2) within a hybrid version of the JANUS [9] speech recognition system using a subset of the Switchboard large-vocabulary speaker-independent continuous speech recognition database.
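The growing step can be pictured with a toy mixture of experts: train, find the expert that contributes the most error, and split it. The sketch below uses a flat (non-hierarchical) mixture and an accumulated-loss split criterion purely for illustration; the paper's actual growing and pruning rules for tree-structured HMEs may differ.

```python
# Toy sketch of "growing" a mixture of experts by splitting the expert
# with the highest accumulated loss. The flat mixture and the split
# criterion are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

class MixtureOfExperts:
    def __init__(self, in_dim, out_dim, n_experts=2):
        self.experts = [rng.normal(0, 0.1, (in_dim, out_dim))
                        for _ in range(n_experts)]
        self.gate = rng.normal(0, 0.1, (in_dim, n_experts))

    def forward(self, X):
        g = softmax(X @ self.gate)                      # (N, n_experts)
        y = np.stack([X @ W for W in self.experts], 1)  # (N, n_experts, out)
        return (g[..., None] * y).sum(axis=1), g

    def grow(self, expert_losses):
        # Split the worst expert into two slightly perturbed copies
        # and widen the gate with a matching new column.
        k = int(np.argmax(expert_losses))
        W = self.experts[k]
        self.experts[k] = W + rng.normal(0, 0.01, W.shape)
        self.experts.append(W - rng.normal(0, 0.01, W.shape))
        new_col = self.gate[:, k:k + 1] + rng.normal(0, 0.01,
                                                     (self.gate.shape[0], 1))
        self.gate = np.hstack([self.gate, new_col])

moe = MixtureOfExperts(in_dim=8, out_dim=3)
out, gates = moe.forward(rng.normal(size=(16, 8)))
moe.grow(expert_losses=[0.9, 0.4])  # expert 0 is split into two
```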


The Use of Dynamic Writing Information in a Connectionist On-Line Cursive Handwriting Recognition System

Neural Information Processing Systems

This system combines a robust input representation, which preserves the dynamic writing information, with a neural network architecture, a so-called Multi-State Time Delay Neural Network (MS-TDNN), which integrates recognition and segmentation in a single framework. Our preprocessing transforms the original coordinate sequence into a (still temporal) sequence of feature vectors, which combine strictly local features, like curvature or writing direction, with a bitmap-like representation of the coordinate's proximity. The MS-TDNN architecture is well suited for handling temporal sequences as provided by this input representation. Our system is tested on both writer-dependent and writer-independent tasks with vocabulary sizes ranging from 400 up to 20,000 words. For example, on a 20,000-word vocabulary we achieve word recognition rates of up to 88.9% (writer-dependent) and 84.1% (writer-independent) without using any language models.
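To make the "strictly local features" concrete, the sketch below computes per-point writing direction and curvature from an (x, y) pen trajectory. Encoding direction as (cos, sin) of the stroke angle and curvature as the angle change between consecutive segments is a common convention and an assumption here, not necessarily the paper's exact definition.

```python
# Illustrative local features for on-line handwriting: writing direction
# and curvature from a pen trajectory. Feature definitions are assumptions.
import numpy as np

def local_features(coords):
    """coords: (T, 2) array of pen positions sampled over time."""
    d = np.diff(coords, axis=0)            # segment vectors, (T-1, 2)
    angles = np.arctan2(d[:, 1], d[:, 0])  # writing direction per segment
    direction = np.stack([np.cos(angles), np.sin(angles)], axis=1)
    curvature = np.diff(angles)            # angle change between segments
    # Wrap angle differences into [-pi, pi).
    curvature = (curvature + np.pi) % (2 * np.pi) - np.pi
    return direction, curvature

pen = np.cumsum(np.random.default_rng(1).normal(size=(50, 2)), axis=0)
direction, curvature = local_features(pen)
```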


Performance Through Consistency: MS-TDNN's for Large Vocabulary Continuous Speech Recognition

Neural Information Processing Systems

Connectionist speech recognition systems are often handicapped by an inconsistency between training and testing criteria. This problem is addressed by the Multi-State Time Delay Neural Network (MS-TDNN), a hierarchical phoneme and word classifier which uses DTW to modulate its connectivity.
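The DTW component can be illustrated with a standard alignment of frame-level state scores to a word model: at each frame the path either stays in the current state or advances to the next one. This is generic DTW under those assumptions, not the paper's exact recipe.

```python
# Generic DTW sketch: align T frames of per-state scores to S word
# states with stay-or-advance transitions. Assumes T >= S.
import numpy as np

def dtw_align(frame_scores):
    """frame_scores: (T, S) log-score of each word state at each frame."""
    T, S = frame_scores.shape
    acc = np.full((T, S), -np.inf)
    acc[0, 0] = frame_scores[0, 0]
    for t in range(1, T):
        for s in range(S):
            best_prev = acc[t - 1, s]                          # stay
            if s > 0:
                best_prev = max(best_prev, acc[t - 1, s - 1])  # advance
            acc[t, s] = frame_scores[t, s] + best_prev
    path = [S - 1]                                             # backtrack
    for t in range(T - 1, 0, -1):
        s = path[-1]
        if s > 0 and acc[t - 1, s - 1] >= acc[t - 1, s]:
            path.append(s - 1)
        else:
            path.append(s)
    return path[::-1], acc[-1, -1]

scores = np.random.default_rng(2).normal(size=(20, 5))
path, total = dtw_align(scores)
```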


