Goto

Collaborating Authors

Analyzing Hidden Representations in End-to-End Automatic Speech Recognition Systems

Neural Information Processing Systems

Neural networks have become ubiquitous in automatic speech recognition systems. While neural networks are typically used as acoustic models in more complex systems, recent studies have explored end-to-end speech recognition systems based on neural networks, which can be trained to directly predict text from input acoustic features. Although such systems are conceptually elegant and simpler than traditional systems, it is less obvious how to interpret the trained models. In this work, we analyze the speech representations learned by a deep end-to-end model that is based on convolutional and recurrent layers, and trained with a connectionist temporal classification (CTC) loss. We use a pre-trained model to generate frame-level features which are given to a classifier that is trained on frame classification into phones.


Analyzing Hidden Representations in End-to-End Automatic Speech Recognition Systems

Neural Information Processing Systems

Neural networks have become ubiquitous in automatic speech recognition systems. While neural networks are typically used as acoustic models in more complex systems, recent studies have explored end-to-end speech recognition systems based on neural networks, which can be trained to directly predict text from input acoustic features. Although such systems are conceptually elegant and simpler than traditional systems, it is less obvious how to interpret the trained models. In this work, we analyze the speech representations learned by a deep end-to-end model that is based on convolutional and recurrent layers, and trained with a connectionist temporal classification (CTC) loss. We use a pre-trained model to generate frame-level features which are given to a classifier that is trained on frame classification into phones. We evaluate representations from different layers of the deep model and compare their quality for predicting phone labels. Our experiments shed light on important aspects of the end-to-end model such as layer depth, model complexity, and other design choices.


End-to-end speech recognition with neon - Nervana

#artificialintelligence

Thus, given a sequence of frames corresponding to an utterance, the model is required to produce, for each frame, a probability distribution over the alphabet. During the training phase, the softmax outputs are fed into a CTC cost function (more on this shortly) which uses the actual transcripts to (i) score the model's predictions, and (ii) generate an error signal quantifying the accuracy of the model's predictions. The overall goal is to train the model to increase the overall score of its predictions relative to the actual transcripts. Training Empirically, we have found that using stochastic gradient descent with momentum paired with gradient clipping leads to the best performing models. Deeper networks (seven layers or more) also tend to perform better in general.


A Transformer with Interleaved Self-attention and Convolution for Hybrid Acoustic Models

arXiv.org Machine Learning

ABSTRACT Transformer with self-attention has achieved great success in the area of nature language processing. Recently, there have been a few studies on transformer for end-to-end speech recognition, while its application for hybrid acoustic model is still very limited. In this paper, we revisit the transformer-based hybrid acoustic model, and propose a model structure with interleaved self-attention and 1D convolution, which is proven to have faster convergence and higher recognition accuracy. We also study several aspects of the transformer model, including the impact of the positional encoding feature, dropout regularization, as well as training with and without time restriction. We show competitive recognition results on the public Librispeech dataset when compared to the Kaldi baseline at both cross entropy training and sequence training stages. For reproducible research, we release our source code and recipe within the PyKaldi2 toolbox.


Letter-Based Speech Recognition with Gated ConvNets

arXiv.org Artificial Intelligence

In the recent literature, "end-to-end" speech systems often refer to letter-based acoustic models trained in a sequence-to-sequence manner, either via a recurrent model or via a structured output learning approach (such as CTC). In contrast to traditional phone (or senone)-based approaches, these "end-to-end'' approaches alleviate the need of word pronunciation modeling, and do not require a "forced alignment" step at training time. Phone-based approaches remain however state of the art on classical benchmarks. In this paper, we propose a letter-based speech recognition system, leveraging a ConvNet acoustic model. Key ingredients of the ConvNet are Gated Linear Units and high dropout. The ConvNet is trained to map audio sequences to their corresponding letter transcriptions, either via a classical CTC approach, or via a recent variant called ASG. Coupled with a simple decoder at inference time, our system matches the best existing letter-based systems on WSJ (in word error rate), and shows near state of the art performance on LibriSpeech.