"Automatic speech recognition (ASR) is one of the fastest growing and commercially most promising applications of natural language technology. Speech is the most natural communicative medium for humans in many situations, including applications such as giving dictation; querying database or information-retrieval systems; or generally giving commands to a computer or other device, especially in environments where keyboard input is awkward or impossible (for example, because one's hands are required for other tasks)."
– from Linguistic Knowledge and Empirical Methods in Speech Recognition. By Andreas Stolcke. (1997). AI Magazine 18 (4): 25-32.
Microsoft is a leading provider of infrastructure as a service (IaaS) and platform as a service (PaaS) solutions. Microsoft Azure has not only benefitted the company in terms of ROI but has also changed the business dynamics of organizations around the globe. More and more companies are adopting Azure for their cloud and data products. Microsoft Azure AI was launched in 2018 and has become a success in the artificial intelligence services market too. Azure AI is a set of AI services built on Microsoft's breakthrough innovation from decades of world-class research in vision, speech, language processing, and custom machine learning.
SAN FRANCISCO – The ascension of natural language processing through the ranks of artificial intelligence technologies is fairly evident. Its consumerization is demonstrated in a number of audio-related household gadgets, it's found in the most effective text analytics tools, and it's an integral aspect of speech recognition systems. Still, NLP is arguably producing the greatest impact on the enterprise in furthering the self-service movement, particularly in terms of the various implements required for Business Intelligence. According to MicroStrategy VP of Product Marketing Vijay Anand, however, the business value delivered by NLP hinges on more than simply comprehending the intention of the user: "Even with natural language queries and Alexa and all of these natural language tools, the problem of the deficit of these tools as we examine it [is] while it's easy to ask a question, I, for one, certainly believe that most people don't know what the right question is. You need to have that sort of understanding of the business to ask the correct question to get the right answer, first of all."
LAS VEGAS - A new idea surrounding IoT will steer where technology goes in the new decade - instead of standing for the Internet of Things, the acronym should stand for the "intelligence of things", said Consumer Technology Association's (CTA) vice president of research Steve Koenig. "This new IoT bears testimony to the extent that artificial intelligence (AI) is permeating every facet of our commerce and our culture." "Now, commerce is pretty well-understood and we endorse that as we want to advance our economies around the world, but culture is really interesting to me as a researcher, because we're talking about technology's influence on human behaviour," he said. He brought up the example of how fast food giant McDonald's is looking at bringing AI-powered voice assistants to its drive-through restaurants in the United States. "People working in fast food - they've got a tough job.
LG is throwing its resources behind developing a new breed of AI assistants that can be used to control aspects of cars. The Korean tech company said it has partnered with AI company Cerence to make an AI voice assistant capable of controlling various aspects of a car's entertainment system, navigation, calling and more. That AI assistant, once completed, will eventually be integrated into the company's webOS software that, similarly to Apple CarPlay, powers computers inside vehicles. LG is planning on leasing its AI assistant out to auto manufacturers in search of an added dose of technology in their vehicles. The company's decision to enter the ring on developing an in-car voice assistant comes at a time when other major auto manufacturers have also announced their intention to create similar products.
Automatic speech recognition has gradually improved over the years, but the reliable recognition of unconstrained speech is still not within reach. In order to achieve a breakthrough, many research groups are now investigating new methodologies that have the potential to outperform the Hidden Markov Model technology that is at the core of all present commercial systems. In this paper, it is shown that the recently introduced concept of Reservoir Computing might form the basis of such a methodology. In a limited amount of time, a reservoir system that can recognize the elementary sounds of continuous speech has been built. The system already achieves state-of-the-art performance, and there is evidence that the margin for further improvements is still significant.
We propose using correlated bigram LSA for unsupervised LM adaptation for automatic speech recognition. The model is trained using efficient variational EM and smoothed using the proposed fractional Kneser-Ney smoothing, which handles fractional counts. Our approach scales to large training corpora via bootstrapping of bigram LSA from unigram LSA. For LM adaptation, unigram and bigram LSA are integrated into the background N-gram LM via marginal adaptation and linear interpolation respectively. Experimental results show that applying unigram and bigram LSA together yields a 6%--8% relative perplexity reduction and a 0.6% absolute character error rate (CER) reduction compared to applying only unigram LSA on the Mandarin RT04 test set.
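The two quantities this abstract leans on, linear interpolation of language-model probabilities and relative perplexity reduction, can be illustrated with a minimal sketch. This is not the paper's method: the function names, the interpolation weight `lam`, and the toy probabilities are all assumptions made for illustration, and marginal adaptation and fractional Kneser-Ney smoothing are not shown.

```python
import math


def interpolate(p_background, p_adapted, lam):
    """Linearly interpolate two probabilities for the same word and history.

    lam is a hypothetical mixing weight in [0, 1]; if both inputs come from
    normalized distributions, the mixture is normalized as well.
    """
    return lam * p_adapted + (1.0 - lam) * p_background


def perplexity(token_probs):
    """Perplexity of a sequence given the probability assigned to each token."""
    log_sum = sum(math.log(p) for p in token_probs)
    return math.exp(-log_sum / len(token_probs))


def relative_reduction(baseline, improved):
    """Relative reduction of a metric, as a fraction of the baseline value."""
    return (baseline - improved) / baseline


if __name__ == "__main__":
    # Toy per-token probabilities (made up for illustration only).
    background = [0.10, 0.05, 0.20]
    adapted = [0.20, 0.04, 0.30]
    mixed = [interpolate(b, a, lam=0.3) for b, a in zip(background, adapted)]

    ppl_base = perplexity(background)
    ppl_mixed = perplexity(mixed)
    print("background perplexity:", ppl_base)
    print("interpolated perplexity:", ppl_mixed)
    print("relative reduction:", relative_reduction(ppl_base, ppl_mixed))
```

A "6%--8% relative perplexity reduction" in this sense means `relative_reduction` returns a value between 0.06 and 0.08, whereas the CER figure is reported as an absolute difference.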
Real-time automatic speech recognition (ASR) on mobile and embedded devices has been of great interest for many years. We present real-time speech recognition on smartphones or embedded systems by employing recurrent neural network (RNN) based acoustic models, RNN based language models, and beam-search decoding. The acoustic model is end-to-end trained with connectionist temporal classification (CTC) loss. The RNN implementation on embedded devices can suffer from excessive DRAM accesses because the parameter size of a neural network usually exceeds that of the cache memory and the parameters are used only once for each time step. To remedy this problem, we employ a multi-time step parallelization approach that computes multiple output samples at a time with the parameters fetched from the DRAM.
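To make the CTC output convention concrete, here is a minimal sketch of greedy (best-path) CTC decoding. It is deliberately simpler than the beam-search decoding the abstract describes and is not taken from the paper; the function name, the toy alphabet, and the choice of index 0 as the blank symbol are assumptions for illustration.

```python
def ctc_greedy_decode(frame_scores, alphabet, blank=0):
    """Greedy CTC decoding: pick the best label per frame, then collapse.

    frame_scores: one list of scores per time step, one score per label.
    alphabet: label strings indexed like the scores; alphabet[blank] is unused.
    """
    # Best-scoring label index for every frame.
    best_path = [
        max(range(len(scores)), key=scores.__getitem__)
        for scores in frame_scores
    ]

    decoded = []
    previous = None
    for index in best_path:
        # CTC collapse rule: merge consecutive repeats, then drop blanks.
        if index != previous and index != blank:
            decoded.append(alphabet[index])
        previous = index
    return "".join(decoded)


if __name__ == "__main__":
    alphabet = ["-", "a", "c", "t"]  # index 0 is the assumed blank symbol
    # Toy scores whose per-frame best labels are: c c - a - t t
    frames = [
        [0.1, 0.1, 0.7, 0.1],
        [0.1, 0.1, 0.7, 0.1],
        [0.7, 0.1, 0.1, 0.1],
        [0.1, 0.7, 0.1, 0.1],
        [0.7, 0.1, 0.1, 0.1],
        [0.1, 0.1, 0.1, 0.7],
        [0.1, 0.1, 0.1, 0.7],
    ]
    print(ctc_greedy_decode(frames, alphabet))
```

Because repeats are merged before blanks are removed, a genuinely doubled letter survives only when a blank frame separates its two occurrences.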
Generating adversarial examples is a critical step for evaluating and improving the robustness of learning machines. So far, most existing methods only work for classification and are not designed to alter the true performance measure of the problem at hand. We introduce a novel flexible approach named Houdini for generating adversarial examples specifically tailored for the final performance measure of the task considered, be it combinatorial and non-decomposable. We successfully apply Houdini to a range of applications such as speech recognition, pose estimation and semantic segmentation. In all cases, the attacks based on Houdini achieve a higher success rate than those based on the traditional surrogates used to train the models while using a less perceptible adversarial perturbation.
Neural networks have become ubiquitous in automatic speech recognition systems. While neural networks are typically used as acoustic models in more complex systems, recent studies have explored end-to-end speech recognition systems based on neural networks, which can be trained to directly predict text from input acoustic features. Although such systems are conceptually elegant and simpler than traditional systems, it is less obvious how to interpret the trained models. In this work, we analyze the speech representations learned by a deep end-to-end model that is based on convolutional and recurrent layers, and trained with a connectionist temporal classification (CTC) loss. We use a pre-trained model to generate frame-level features which are given to a classifier that is trained on frame classification into phones.
We present a factorized hierarchical variational autoencoder, which learns disentangled and interpretable representations from sequential data without supervision. Specifically, we exploit the multi-scale nature of information in sequential data by formulating it explicitly within a factorized hierarchical graphical model that imposes sequence-dependent priors and sequence-independent priors to different sets of latent variables. The model is evaluated on two speech corpora to demonstrate, qualitatively, its ability to transform speakers or linguistic content by manipulating different sets of latent variables; and quantitatively, its ability to outperform an i-vector baseline for speaker verification and reduce the word error rate by as much as 35% in mismatched train/test scenarios for automatic speech recognition tasks. Papers published at the Neural Information Processing Systems Conference.
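Several of the abstracts above report results as word or character error rates. As a minimal sketch of how word error rate is conventionally computed (word-level edit distance divided by the reference length), the following may help; the function name and the whitespace tokenization are assumptions for illustration, not the evaluation code of any paper listed here.

```python
def word_error_rate(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words.

    Tokenization by whitespace is an assumption; real evaluations usually
    normalize case and punctuation first.
    """
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous[j] = edit distance between the reference prefix processed so far
    # and the first j hypothesis words.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = previous[j] + 1      # reference word missing in hypothesis
            insertion = current[j - 1] + 1  # extra word in hypothesis
            current.append(min(substitution, deletion, insertion))
        previous = current

    return previous[-1] / len(ref_words)


if __name__ == "__main__":
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Applying the same function to sequences of characters instead of words gives a character error rate.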