Speech Recognition

Simplifying Conversational AI, One Interaction At A Time


What if we could speak with our devices, cars, and homes just as easily as we do with our friends? Conversation is the bedrock of human communication, a transformative tool that reveals what's inside our heads and hearts. Voice is our primary means of connecting with others--and, increasingly, it's how we want to engage with the machines around us, too. The art of human conversation can be maddeningly difficult for even very sophisticated machines, but we're on a path to creating solutions that are much closer to what we need. Thanks to advances in speech recognition, artificial intelligence, neural networks, and processing power, we can tap into the capabilities of our machines simply by speaking.

Herbie Unique & Intelligent Voice Assistant


We understand AI, data, and the cloud, and how to build integrated intelligence into applications using the most advanced cloud technologies. Wherever you are in your AI journey, we can help you modernize your business. Advanced business dashboards and analytical reports help you gauge the ROI of a Herbie implementation. They give insight into client conversation widths, locations, and satisfaction levels. Measure the increase in leads with a Herbie deployment and analyze the reduction in operational cost.

Narrow Artificial Intelligence and its scopes


Described briefly, artificial intelligence is a machine's ability to learn and adapt. But artificial intelligence is more than we know or perceive it to be. There are three types of artificial intelligence. Let's look at the first type and broaden our knowledge of narrow artificial intelligence. Artificial intelligence has shown that technology can imitate the human brain and human actions. Narrow artificial intelligence, or narrow AI, is a specific technology that imitates human action to accomplish a narrowly defined task.

The Peggy Smedley Show: Are Voice Assistants Creepy?


Are privacy and security a top concern with voice assistants? Peggy answers, noting that many people find it creepy when they get ads based on something they have talked about around a voice assistant. She explains that while most people enjoy the personalization voice assistants offer, we still need to ask: What are we giving away when we talk to our voice assistants?

Facebook will pay for users' voice recordings after it was caught listening to Messenger chats

Daily Mail - Science & tech

Facebook says it will start paying users for voice data to train its speech recognition software, after it was caught analyzing their speech without permission last year. In a program called 'Pronunciations', participants will be paid a small sum, up to $5, to use the company's market research app Viewpoints to record various words and phrases that the company will then use to train its speech recognition AI. That voice data will be used to improve products like Portal, Facebook's voice-activated smart display, which can be used for video calling among other things. In the program, participants, who must be at least 18 years old, will have to say specific phrases like 'Hey Portal' and also the first names of 10 of their Facebook friends. For each 'set' of prompts, participants will receive 200 points.

Correlated Bigram LSA for Unsupervised Language Model Adaptation

Neural Information Processing Systems

We propose using correlated bigram LSA for unsupervised LM adaptation for automatic speech recognition. The model is trained using efficient variational EM and smoothed using the proposed fractional Kneser-Ney smoothing, which handles fractional counts. Our approach scales to large training corpora via bootstrapping of bigram LSA from unigram LSA. For LM adaptation, unigram and bigram LSA are integrated into the background N-gram LM via marginal adaptation and linear interpolation, respectively. Experimental results show that applying unigram and bigram LSA together yields a 6% to 8% relative perplexity reduction and a 0.6% absolute character error rate (CER) reduction compared to applying only unigram LSA on the Mandarin RT04 test set.
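As a rough illustration of the linear interpolation and relative perplexity reduction mentioned in this abstract, the sketch below mixes per-word probabilities from a background model and an adapted model and compares perplexities. The probabilities, the weight `lam`, and the function names are hypothetical and are not taken from the paper; the LSA models and Kneser-Ney smoothing are not reproduced here.

```python
import math

def interpolate(p_background, p_adapted, lam):
    """Linearly interpolate two probabilities of the same word in the same context."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must be in [0, 1]")
    return lam * p_adapted + (1.0 - lam) * p_background

def perplexity(probs):
    """Perplexity of a word sequence from its per-word probabilities."""
    log_sum = sum(math.log(p) for p in probs)
    return math.exp(-log_sum / len(probs))

# Hypothetical per-word probabilities for a four-word test sentence.
background = [0.10, 0.02, 0.05, 0.20]  # background N-gram LM (assumed values)
adapted = [0.15, 0.08, 0.04, 0.30]     # topic-adapted LM (assumed values)

mixed = [interpolate(b, a, lam=0.5) for b, a in zip(background, adapted)]

ppl_background = perplexity(background)
ppl_mixed = perplexity(mixed)
relative_reduction = (ppl_background - ppl_mixed) / ppl_background
print(f"background={ppl_background:.2f} mixed={ppl_mixed:.2f} "
      f"relative reduction={relative_reduction:.1%}")
```

In practice the interpolation weight would be tuned on held-out data rather than fixed at 0.5.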

Phoneme Recognition with Large Hierarchical Reservoirs

Neural Information Processing Systems

Automatic speech recognition has gradually improved over the years, but the reliable recognition of unconstrained speech is still not within reach. In order to achieve a breakthrough, many research groups are now investigating new methodologies that have the potential to outperform the Hidden Markov Model technology that is at the core of all present commercial systems. In this paper, it is shown that the recently introduced concept of Reservoir Computing might form the basis of such a methodology. In a limited amount of time, a reservoir system that can recognize the elementary sounds of continuous speech has been built. The system already achieves state-of-the-art performance, and there is evidence that the margin for further improvements is still significant.
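For readers unfamiliar with Reservoir Computing, the core idea is a fixed, randomly connected recurrent layer whose states are read out by a simple trained layer. The sketch below shows only the reservoir state update, in the leaky-tanh form commonly used for echo state networks. The sizes, scaling, leak rate, and input frames are assumptions for illustration and are not the hierarchical architecture of the paper; the trained readout is omitted.

```python
import math
import random

def make_reservoir(n_inputs, n_units, seed=0, scale=0.5):
    """Draw fixed random input and recurrent weights (never trained)."""
    rng = random.Random(seed)
    w_in = [[rng.uniform(-scale, scale) for _ in range(n_inputs)]
            for _ in range(n_units)]
    # Dividing by sqrt(n_units) keeps the recurrent drive small (assumed heuristic).
    w_res = [[rng.uniform(-scale, scale) / math.sqrt(n_units) for _ in range(n_units)]
             for _ in range(n_units)]
    return w_in, w_res

def step(state, x, w_in, w_res, leak=0.3):
    """One leaky-integrator update of the reservoir state for input frame x."""
    new_state = []
    for i in range(len(state)):
        drive = sum(w * xi for w, xi in zip(w_in[i], x))
        drive += sum(w * s for w, s in zip(w_res[i], state))
        new_state.append((1.0 - leak) * state[i] + leak * math.tanh(drive))
    return new_state

def run(frames, w_in, w_res):
    """Feed a sequence of feature frames through the reservoir, collecting states."""
    state = [0.0] * len(w_res)
    states = []
    for x in frames:
        state = step(state, x, w_in, w_res)
        states.append(state)
    return states

# Hypothetical 3-dimensional acoustic feature frames.
frames = [[0.1, -0.2, 0.3], [0.0, 0.5, -0.1], [0.4, 0.4, 0.4]]
w_in, w_res = make_reservoir(n_inputs=3, n_units=8)
states = run(frames, w_in, w_res)
print(len(states), len(states[0]))
```

A phoneme recognizer would then fit a linear readout from these states to phoneme labels, which is the only part that requires training.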

Fully Neural Network Based Speech Recognition on Mobile and Embedded Devices

Neural Information Processing Systems

Real-time automatic speech recognition (ASR) on mobile and embedded devices has been of great interest for many years. We present real-time speech recognition on smartphones or embedded systems by employing recurrent neural network (RNN) based acoustic models, RNN based language models, and beam-search decoding. The acoustic model is end-to-end trained with connectionist temporal classification (CTC) loss. The RNN implementation on embedded devices can suffer from excessive DRAM accesses because the parameter size of a neural network usually exceeds that of the cache memory and the parameters are used only once for each time step. To remedy this problem, we employ a multi-time step parallelization approach that computes multiple output samples at a time with the parameters fetched from the DRAM.
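To make the CTC output format concrete: a CTC-trained acoustic model emits, for every frame, a distribution over labels plus a special blank symbol. The sketch below uses best-path decoding, which is simpler than the beam-search decoding named in the abstract and is swapped in here only for illustration: take the most likely symbol per frame, merge consecutive repeats, then drop blanks. The label set and frame probabilities are made up.

```python
BLANK = "_"  # assumed symbol for the CTC blank label

def ctc_best_path(frame_probs, labels):
    """Greedy CTC decoding: argmax per frame, collapse repeats, remove blanks."""
    best = [labels[max(range(len(p)), key=p.__getitem__)] for p in frame_probs]
    out = []
    prev = None
    for sym in best:
        # Emit a symbol only when it differs from the previous frame and is not blank.
        if sym != prev and sym != BLANK:
            out.append(sym)
        prev = sym
    return "".join(out)

labels = [BLANK, "h", "i"]

# Hypothetical per-frame posteriors; the argmax path is h h _ i i.
frames = [
    [0.10, 0.80, 0.10],
    [0.20, 0.70, 0.10],
    [0.90, 0.05, 0.05],
    [0.10, 0.10, 0.80],
    [0.10, 0.10, 0.80],
]
decoded = ctc_best_path(frames, labels)
print(decoded)
```

A blank between two identical symbols keeps them separate, which is how CTC can represent doubled letters.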

Unsupervised Cross-Modal Alignment of Speech and Text Embedding Spaces

Neural Information Processing Systems

Recent research has shown that word embedding spaces learned from text corpora of different languages can be aligned without any parallel data supervision. Inspired by the success in unsupervised cross-lingual word embeddings, in this paper we target learning a cross-modal alignment between the embedding spaces of speech and text learned from corpora of their respective modalities in an unsupervised fashion. The proposed framework learns the individual speech and text embedding spaces, and attempts to align the two spaces via adversarial training, followed by a refinement procedure. We show how our framework could be used to perform the tasks of spoken word classification and translation, and the experimental results on these two tasks demonstrate that the performance of our unsupervised alignment approach is comparable to its supervised counterpart. Our framework is especially useful for developing automatic speech recognition (ASR) and speech-to-text translation systems for low- or zero-resource languages, which have little parallel audio-text data for training modern supervised ASR and speech-to-text translation models, but account for the majority of the languages spoken across the world.
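The abstract does not say what the refinement procedure is. In the cross-lingual word embedding work that inspired it, a common choice is to solve an orthogonal Procrustes problem on word pairs proposed by the adversarial step. The sketch below assumes that choice and simplifies it to two dimensions, where the best rotation has a closed form. The embeddings, the pairing, and the function names are hypothetical, and the adversarial stage is not shown.

```python
import math

def best_rotation(source, target):
    """Angle of the 2-D rotation that best maps source points onto paired target points."""
    dot = sum(sx * tx + sy * ty for (sx, sy), (tx, ty) in zip(source, target))
    cross = sum(sx * ty - sy * tx for (sx, sy), (tx, ty) in zip(source, target))
    return math.atan2(cross, dot)

def rotate(points, theta):
    """Rotate 2-D points counterclockwise by theta radians."""
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y, s * x + c * y) for x, y in points]

def nearest(query, candidates):
    """Index of the candidate closest to query (Euclidean distance)."""
    return min(range(len(candidates)), key=lambda i: math.dist(query, candidates[i]))

# Hypothetical text embeddings, and speech embeddings that live in a rotated space.
text = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.5)]
speech = rotate(text, -0.7)

theta = best_rotation(speech, text)   # assumes speech[i] is paired with text[i]
aligned = rotate(speech, theta)
matches = [nearest(p, text) for p in aligned]
print(round(theta, 3), matches)
```

Once the spaces are aligned, spoken word classification reduces to a nearest-neighbor lookup in the text space, as in the last line above.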

Houdini: Fooling Deep Structured Visual and Speech Recognition Models with Adversarial Examples

Neural Information Processing Systems

Generating adversarial examples is a critical step for evaluating and improving the robustness of learning machines. So far, most existing methods only work for classification and are not designed to alter the true performance measure of the problem at hand. We introduce a novel flexible approach named Houdini for generating adversarial examples specifically tailored for the final performance measure of the task considered, be it combinatorial and non-decomposable. We successfully apply Houdini to a range of applications such as speech recognition, pose estimation, and semantic segmentation. In all cases, the attacks based on Houdini achieve a higher success rate than those based on the traditional surrogates used to train the models, while using a less perceptible adversarial perturbation.