End-to-end speech recognition with neon - Nervana
Thus, given a sequence of frames corresponding to an utterance, the model is required to produce, for each frame, a probability distribution over the alphabet. During the training phase, the softmax outputs are fed into a CTC cost function (more on this shortly) which uses the actual transcripts to (i) score the model's predictions, and (ii) generate an error signal quantifying the accuracy of the model's predictions. The overall goal is to train the model to increase the overall score of its predictions relative to the actual transcripts. Empirically, we have found that using stochastic gradient descent with momentum paired with gradient clipping leads to the best performing models. Deeper networks (seven layers or more) also tend to perform better in general.
Feb-5-2017, 19:10:45 GMT
- Technology: