Alexa speech normalization AI reduces errors by up to 81%


Text normalization is a fundamental processing step in most natural language systems. In the case of Amazon's Alexa, "Book me a table at 5:00 p.m." might be transcribed by the assistant's automatic speech recognizer as "five p m" and further reformatted to "5:00PM." Then again, Alexa might convert "5:00PM" to "five thirty p m" for its text-to-speech synthesizer. So how does this work? Currently, Amazon's voice assistant relies on "thousands" of handwritten normalization rules for dates, email addresses, numbers, abbreviations, and other expressions, according to Alexa AI group applied scientist Ming Sun and Alexa Speech machine learning scientist Yuzong Liu.

Consonant-Vowel Sequences as Subword Units for Code-Mixed Languages

AAAI Conferences

In this research work, we develop a state-of-art model for identifying sentiment in Hindi-English code-mixed language. We introduce new phonemic sub-word units for Hindi-English code-mixed text along with a hierarchical deep learning model which uses these sub-word units for predicting sentiment. The results indicate that the model yields a significant increase in accuracy as compared to other models.

Speech Recognition Using Demi-Syllable Neural Prediction Model

Neural Information Processing Systems

The Neural Prediction Model is the speech recognition model based on pattern prediction by multilayer perceptrons. Its effectiveness was confirmed bythe speaker-independent digit recognition experiments. This paper presents an improvement in the model and its application to large vocabulary speech recognition, based on subword units. The improvement involves an introduction of "backward prediction," which further improves the prediction accuracy of the original model with only "forward prediction". Inapplication of the model to speaker-dependent large vocabulary speech recognition, the demi-syllable unit is used as a subword recognition unit.

Building competitive direct acoustics-to-word models for English conversational speech recognition Machine Learning

Direct acoustics-to-word (A2W) models in the end-to-end paradigm have received increasing attention compared to conventional sub-word based automatic speech recognition models using phones, characters, or context-dependent hidden Markov model states. This is because A2W models recognize words from speech without any decoder, pronunciation lexicon, or externally-trained language model, making training and decoding with such models simple. Prior work has shown that A2W models require orders of magnitude more training data in order to perform comparably to conventional models. Our work also showed this accuracy gap when using the English Switchboard-Fisher data set. This paper describes a recipe to train an A2W model that closes this gap and is at-par with state-of-the-art sub-word based models. We achieve a word error rate of 8.8%/13.9% on the Hub5-2000 Switchboard/CallHome test sets without any decoder or language model. We find that model initialization, training data order, and regularization have the most impact on the A2W model performance. Next, we present a joint word-character A2W model that learns to first spell the word and then recognize it. This model provides a rich output to the user instead of simple word hypotheses, making it especially useful in the case of words unseen or rarely-seen during training.

Finding Better Subword Segmentation for Neural Machine Translation Artificial Intelligence

For different language pairs, word-level neural machine translation (NMT) models with a fixed-size vocabulary suffer from the same problem of representing out-of-vocabulary (OOV) words. The common practice usually replaces all these rare or unknown words with a token, which limits the translation performance to some extent. Most of recent work handled such a problem by splitting words into characters or other specially extracted subword units to enable open-vocabulary translation. Byte pair encoding (BPE) is one of the successful attempts that has been shown extremely competitive by providing effective subword segmentation for NMT systems. In this paper, we extend the BPE style segmentation to a general unsupervised framework with three statistical measures: frequency (FRQ), accessor variety (AV) and description length gain (DLG). We test our approach on two translation tasks: German to English and Chinese to English. The experimental results show that AV and DLG enhanced systems outperform the FRQ baseline in the frequency weighted schemes at different significant levels.