
Multi-task Learning for Voice Trigger Detection (Machine Learning)

We describe the design of a voice trigger detection system for smart speakers. In this study, we address two major challenges. The first is that the detectors are deployed in complex acoustic environments with external noise and loud playback by the device itself. The second is that collecting training examples for a specific keyword or trigger phrase is difficult, resulting in a scarcity of trigger-phrase-specific training data. We describe a two-stage cascaded architecture in which a low-power detector is always running and listening for the trigger phrase. If a detection is made at this stage, the candidate audio segment is re-scored by larger, more complex models to verify that the segment contains the trigger phrase. In this study, we focus our attention on the architecture and design of these second-pass detectors. We start by training a general acoustic model that produces phonetic transcriptions given a large labelled training dataset. Next, we collect a much smaller dataset of examples that are challenging for the baseline system. We then use multi-task learning to train a model to simultaneously produce accurate phonetic transcriptions on the larger dataset *and* discriminate between true and easily confusable examples using the smaller dataset. Our results demonstrate that the proposed model reduces errors by half compared to the baseline in a range of challenging test conditions, *without* requiring extra parameters.
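The multi-task setup described above can be sketched as a shared encoder feeding two heads: a phonetic-transcription head trained on the large dataset and a true-vs-confusable discrimination head trained on the smaller hard-example dataset, with their losses summed. The sketch below is illustrative only; all layer sizes, data, and the task-weighting factor `lam` are assumptions, not details from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(probs, labels):
    # Mean negative log-likelihood of the correct class.
    return -np.mean(np.log(probs[np.arange(len(labels)), labels] + 1e-12))

# Shared encoder and two task heads (hypothetical sizes).
n_feat, n_hidden, n_phones = 40, 64, 50
W_enc = rng.normal(scale=0.1, size=(n_feat, n_hidden))
W_phone = rng.normal(scale=0.1, size=(n_hidden, n_phones))  # transcription head
W_disc = rng.normal(scale=0.1, size=(n_hidden, 2))          # discrimination head

def encode(x):
    return np.tanh(x @ W_enc)  # shared representation used by both tasks

# Large labelled dataset -> phonetic transcription loss.
x_large = rng.normal(size=(128, n_feat))
y_large = rng.integers(0, n_phones, size=128)
loss_transcribe = cross_entropy(softmax(encode(x_large) @ W_phone), y_large)

# Small hard-example dataset -> true-vs-confusable discrimination loss.
x_small = rng.normal(size=(16, n_feat))
y_small = rng.integers(0, 2, size=16)
loss_disc = cross_entropy(softmax(encode(x_small) @ W_disc), y_small)

lam = 1.0  # assumed task-weighting hyperparameter
loss = loss_transcribe + lam * loss_disc  # joint objective to minimize
```

Because both heads share `W_enc`, gradients from the discrimination task shape the same representation used for transcription, which is how the approach avoids adding extra parameters at inference time.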

Smart Home Appliances: Chat with Your Fridge (Artificial Intelligence)

Current home appliances are capable of executing a limited number of voice commands, such as turning devices on or off, adjusting music volume or light conditions. Recent progress in machine reasoning provides an opportunity to develop new types of conversational user interfaces for home appliances. In this paper, we apply a state-of-the-art visual reasoning model and demonstrate that it is feasible to ask a smart fridge about its contents and various properties of the food with a close-to-natural conversation experience. Our visual reasoning model answers user questions about the existence, count, category and freshness of each product by analyzing photos taken by the image sensor inside the smart fridge. Users may chat with their fridge using an off-the-shelf phone messenger while away from home, for example when shopping in the supermarket. We generate a visually realistic synthetic dataset to train a machine reasoning model that achieves 95% answer accuracy on test data. We present the results of initial user tests and discuss how we modify the distribution of generated questions for model training based on human-in-the-loop guidance. We open-source the code for the whole system, including dataset generation, the reasoning model and demonstration scripts.
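The synthetic question generation described above could be sketched as sampling a scene (a fridge inventory) and emitting templated question-answer pairs about existence, count and freshness. This is a minimal illustration, not the authors' pipeline; the product list, templates, and freshness model are all hypothetical.

```python
import random

random.seed(0)

CATEGORIES = ["apple", "milk", "cheese", "lettuce"]  # hypothetical product set

def make_scene():
    """Sample a synthetic fridge inventory: a count and freshness flag per item."""
    return {c: {"count": random.randint(0, 3),
                "fresh": random.random() > 0.3}
            for c in CATEGORIES}

def generate_qa(scene):
    """Yield templated (question, answer) pairs covering existence, count, freshness."""
    for item, props in scene.items():
        yield (f"Is there any {item} in the fridge?",
               "yes" if props["count"] > 0 else "no")
        yield (f"How many {item}s are there?", str(props["count"]))
        if props["count"] > 0:
            yield (f"Is the {item} still fresh?",
                   "yes" if props["fresh"] else "no")

scene = make_scene()
qa_pairs = list(generate_qa(scene))
```

Reweighting which templates are sampled, as the paper does with human-in-the-loop guidance, would amount to changing the sampling distribution over the question types emitted here.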

Joint DNN-Based Multichannel Reduction of Acoustic Echo, Reverberation and Noise (Machine Learning)

We consider the problem of simultaneously reducing acoustic echo, reverberation and noise. In real scenarios, these distortion sources may occur simultaneously, and reducing them implies combining the corresponding distortion-specific filters. As these filters interact with each other, they must be jointly optimized. We propose to model the target and residual signals after linear echo cancellation and dereverberation using a multichannel Gaussian modeling framework, and to jointly represent their spectra by means of a neural network. We develop an iterative block-coordinate ascent algorithm to update all the filters. We evaluate our system on real recordings of acoustic echo, reverberation and noise acquired with a smart speaker in various situations. In terms of overall distortion, the proposed approach outperforms both a cascade of the individual approaches and a joint reduction approach that does not rely on a spectral model of the target and residual signals.

Index Terms: acoustic echo, reverberation, background noise, joint distortion reduction, expectation-maximization, recurrent neural network.

The near-end speaker can be a few meters away from the microphones, and the interactions can be subject to several distortion sources such as background noise, acoustic echo and near-end reverberation. Each of these distortion sources degrades speech quality, intelligibility and listening comfort, and must be reduced. Single- and multichannel filters have been used to reduce each of these distortion sources independently. They can be categorized into short nonlinear filters that vary quickly over time and long linear filters that are time-invariant (or slowly time-varying). Short nonlinear filters are generally used for noise reduction [1]. They are robust to the fluctuations and nonlinearities inherent to real signals. Long linear filters can be required for dereverberation [2] and echo reduction [3].
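The block-coordinate idea above can be illustrated on a toy single-channel simulation: each distortion-specific filter is re-estimated in turn while the other is held fixed, so the two filters adapt to each other's residual. This is a deliberately crude least-squares sketch under assumed signal lengths, delays and filter orders, not the paper's multichannel Gaussian/EM algorithm.

```python
import numpy as np

rng = np.random.default_rng(1)
T = 500

# Hypothetical simulation: near-end target plus echo and late reverberation.
target = rng.normal(size=T)                    # near-end (target) signal
far_end = rng.normal(size=T)                   # loudspeaker reference signal
mic = target + 0.6 * np.roll(far_end, 5) + 0.4 * np.roll(target, 20)

def ls_filter(ref, desired, order=32):
    """Least-squares FIR fit of `desired` from (circularly) delayed copies of `ref`."""
    X = np.stack([np.roll(ref, k) for k in range(order)], axis=1)
    h, *_ = np.linalg.lstsq(X, desired, rcond=None)
    return X @ h

# Block-coordinate updates: alternately re-estimate the echo component
# given the current reverberation estimate, then the reverberation
# component given the current echo estimate.
echo_hat = np.zeros(T)
rev_hat = np.zeros(T)
for _ in range(3):
    echo_hat = ls_filter(far_end, mic - rev_hat)                       # echo block
    rev_hat = ls_filter(np.roll(mic - echo_hat, 20), mic - echo_hat)   # reverb block

est = mic - echo_hat - rev_hat  # enhanced signal estimate
```

Estimating each filter jointly against the other's residual, rather than cascading two independently trained filters, is what the alternating loop above captures.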

LG Pushes Smart Home Appliances to Another Dimension with Deep Learning Technology - Dealerscope


To advance the functionality of today's home appliances to a whole new level, LG Electronics (LG) is set to deliver an unparalleled level of performance and convenience in the home with deep learning technology to be unveiled at CES 2017. LG's deep learning will allow home appliances to better understand their users by gathering and studying customers' lifestyle patterns over time. This learning process never stops, improving over time to provide customers with new solutions to everyday problems. Using multiple sensors and LG's deep learning technology, LG's newest robot vacuum cleaner will recognize objects around the room and react accordingly. By capturing surface images of the room, the intelligent cleaner remembers obstacles and learns to avoid them over time.

SK Telecom launches research project for artificial intelligence


SK Telecom Co., South Korea's top mobile carrier, said Wednesday it has launched a research project on artificial intelligence to further develop its virtual home assistant service. Earlier this month, SK Telecom introduced an advanced voice recognition service that allows users to control home appliances and listen to music using voice commands. The new service is called "NUGU," which means "who" in Korean. SK Telecom said the new service can increase the accuracy of its voice recognition using deep-learning technology. In a statement, SK Telecom said, "Through the application of a cloud-based deep-learning framework, NUGU is designed to evolve by itself."