"Automatic speech recognition (ASR) is one of the fastest growing and commercially most promising applications of natural language technology. Speech is the most natural communicative medium for humans in many situations, including applications such as giving dictation; querying database or information-retrieval systems; or generally giving commands to a computer or other device, especially in environments where keyboard input is awkward or impossible (for example, because one's hands are required for other tasks)."
– Andreas Stolcke, "Linguistic Knowledge and Empirical Methods in Speech Recognition," AI Magazine 18 (4): 25–32, 1997.
Audi is joining the growing list of automakers, including BMW and Toyota, adding Alexa voice control to their vehicles. The company will integrate Amazon's voice assistant into select models in North America and Europe, starting in January 2019 with the newly unveiled E-Tron electric SUV. Audi will load Alexa onto the selected vehicles' infotainment systems, so there's no need to dock your phone -- simply link your car to your Amazon account and then activate the assistant through the onboard voice control system. You'll then be able to ask Alexa to play music, read audiobooks, order groceries, tell you sports scores, and add items to your shopping lists while driving. Ned Curic, VP of Alexa Auto, says you'll also be able to ask Alexa to locate points of interest, as well as to control smart devices from your car.
MIT computer scientists have developed a system that learns to identify objects within an image, based on a spoken description of the image. Given an image and an audio caption, the model will highlight in real-time the relevant regions of the image being described. Unlike current speech-recognition technologies, the model doesn't require manual transcriptions and annotations of the examples it's trained on. Instead, it learns words directly from recorded speech clips and objects in raw images, and associates them with one another. The model can currently recognize only several hundred different words and object types.
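At its core, the approach described above maps a spoken caption and image regions into a shared embedding space and highlights the regions most similar to the caption. Below is a minimal sketch of that matching step, with random vectors standing in for the learned encoders (all shapes and names here are illustrative assumptions, not the MIT model itself):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for learned encoders: one caption embedding and a
# 4x4 grid of image-region embeddings, all in a shared 8-dim space.
caption_vec = rng.normal(size=8)
region_feats = rng.normal(size=(4, 4, 8))

def similarity_map(caption, regions):
    """Cosine similarity between the caption and every image region."""
    c = caption / np.linalg.norm(caption)
    r = regions / np.linalg.norm(regions, axis=-1, keepdims=True)
    return r @ c  # shape (4, 4), one score per region

sim = similarity_map(caption_vec, region_feats)
row, col = np.unravel_index(np.argmax(sim), sim.shape)
print(f"most relevant region: ({row}, {col})")
```

In a trained system the "highlighting in real time" amounts to recomputing this similarity map as each word of the audio caption arrives.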
Amazon's voice-activated assistant Alexa has a significant presence in consumers' homes, thanks to the robust ecosystem of developers and manufacturers who are incorporating Alexa into their devices. Now, Amazon plans to step up the momentum by producing more of its own Alexa-powered devices, according to a report. The Seattle-based tech giant plans to release at least eight new voice-controlled devices this year, according to CNBC, including a number of home gadgets. They include a microwave, an amplifier, a receiver, a subwoofer and an in-car device -- all of which will either have Alexa built in or will be Alexa-enabled. As CNBC notes, Amazon could drive sales of its in-home devices through partnerships with home-installation companies -- a strategy that Sonos has successfully used to bring its own Alexa-enabled speakers into homes.
Google's Now Playing song recognition was clever when it premiered late in 2017, but it had its limits. When it premiered on the Pixel 2, for instance, its on-device database could only recognize a relatively small number of songs. Now, however, that same technology is available in the cloud through Sound Search -- and it's considerably more useful if you're tracking down an obscure title. The system still uses a neural network to develop "fingerprints" identifying each song, and uses a combination of algorithms to both whittle down the list of candidates and study those results for a match. However, the scale and quality of that song matching is now much stronger.
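The two-phase lookup described above, whittling down the candidate list and then studying those results for a match, can be sketched as follows. A random embedding table stands in for the neural fingerprint network; all sizes and names are illustrative assumptions, not Google's implementation:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical database of song "fingerprints": unit vectors as if
# produced by an embedding network (1000 songs, 32-dim embeddings).
db = rng.normal(size=(1000, 32))
db /= np.linalg.norm(db, axis=1, keepdims=True)

def match(query, db, shortlist=20):
    """Two-phase lookup: a cheap prefilter whittles the candidates,
    then exact scoring over the shortlist picks the best match."""
    q = query / np.linalg.norm(query)
    # Phase 1: coarse scores via a single matrix-vector product.
    coarse = db @ q
    candidates = np.argsort(coarse)[-shortlist:]
    # Phase 2: rescore only the shortlist (here the same metric; a
    # real system would apply a finer, more expensive comparison).
    return candidates[np.argmax(db[candidates] @ q)]

# A slightly noisy recording of song 42 should still match song 42.
query = db[42] + 0.05 * rng.normal(size=32)
print(match(query, db))
```

Moving this table to the cloud is what lets Sound Search scale the database far beyond what fits on a phone.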
As I approached San Francisco International Airport, my expectations for BMW's new concept car were as big as the looming Boeing 777F Lufthansa cargo jet waiting for me. I had surrendered my cellphone and everything in my purse but my driver's license to see BMW's iNext vehicle. The car's tour started in Munich a few days earlier; it came to the Bay Area after a stop at New York's JFK airport and was scheduled to continue on to Beijing. After passing a final security check, I climbed up the rickety staircase with fellow media members and entered the cavernous aircraft. We had been told very little about what we were going to see, except that it was not only the "car of the future" but the "idea of the future."
One of the most popular smart speakers and one of the most popular smart bulbs are bundled up for one low price today -- a great starter pack for any smart home. The Echo Plus and Philips Hue bulb combo costs $100, a steep 39% discount from the $165 list price and the cheapest this bundle has ever been. The Echo Plus offers all the smarts Alexa provides to other devices in the Echo lineup, so you'll be able to use your voice to control music, shop, check the weather and news, and more. But the Plus has a bonus: a built-in Zigbee hub, which means you can connect and control any compatible smart device all in one place. Additionally, a seven-microphone array provides far-field voice recognition, so you can easily give commands even from across the room.
Automatic speech recognition can potentially benefit from lip motion patterns, which complement acoustic speech and improve overall recognition performance, particularly in noise. In this paper we propose an audio-visual fusion strategy that goes beyond simple feature concatenation and learns to automatically align the two modalities, leading to enhanced representations that increase recognition accuracy in both clean and noisy conditions. We test our strategy on the TCD-TIMIT and LRS2 datasets, designed for large-vocabulary continuous speech recognition, applying three types of noise at different power ratios. We also show that our method integrates easily with state-of-the-art sequence-to-sequence architectures. Results show relative improvements from 7% up to 30% on TCD-TIMIT over the acoustic modality alone, depending on the acoustic noise level. We anticipate that the fusion strategy can easily generalise to many other multimodal tasks which involve correlated modalities.
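The idea of going beyond feature concatenation by learning an alignment between the modalities can be sketched with scaled dot-product attention, where each audio frame attends over the video (lip) frames before fusion. This is a toy illustration under assumed shapes, not the paper's exact architecture:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy sequences: 10 audio frames and 6 video (lip) frames, each
# projected into a shared 12-dim space by hypothetical encoders.
audio = rng.normal(size=(10, 12))
video = rng.normal(size=(6, 12))

def align_and_fuse(audio, video):
    """Soft-align video frames to each audio frame with scaled
    dot-product attention, then fuse each audio frame with its
    attended video summary."""
    d = audio.shape[1]
    scores = audio @ video.T / np.sqrt(d)          # (10, 6)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)  # row-wise softmax
    attended = weights @ video                     # (10, 12)
    return np.concatenate([audio, attended], axis=1)  # (10, 24)

fused = align_and_fuse(audio, video)
print(fused.shape)
```

Because the attention weights are learned end to end in the real system, the alignment adapts to the noise level: when audio is unreliable, the fused representation can lean more heavily on the visual summary.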
Speaker verification performance is usually high in neutral talking environments but drops sharply in emotional ones. This degradation stems from the mismatch between training in a neutral environment and testing in emotional environments. In this work, a three-stage speaker verification architecture is proposed to enhance performance in emotional environments. It comprises three cascaded stages: gender identification, followed by emotion identification, followed by speaker verification. The proposed framework has been evaluated on two distinct and independent emotional speech datasets: an in-house dataset and the Emotional Prosody Speech and Transcripts dataset. Our results show that speaker verification based on both gender and emotion information outperforms verification based on gender information alone, emotion information alone, or neither. The average speaker verification performance attained with the proposed framework is very close to that attained in subjective assessment by human listeners.
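The cascaded three-stage flow can be sketched as follows, with trivial placeholder classifiers standing in for the paper's actual gender, emotion, and speaker models; each stage narrows which speaker model is consulted:

```python
# Hypothetical placeholder classifiers: a real system would run
# trained models on acoustic features rather than read hints.
def identify_gender(utterance):
    return utterance["gender_hint"]

def identify_emotion(utterance):
    return utterance["emotion_hint"]

def verify_speaker(utterance, claimed_id, models):
    """Route the utterance to the speaker model trained for the
    identified (gender, emotion) pair, then threshold its score."""
    g = identify_gender(utterance)
    e = identify_emotion(utterance)
    score = models[(g, e)](utterance, claimed_id)
    return score > 0.5

# Toy model bank keyed by (gender, emotion).
models = {("f", "angry"): lambda u, sid: 0.9 if sid == u["true_id"] else 0.1}
utt = {"gender_hint": "f", "emotion_hint": "angry", "true_id": "spk7"}
print(verify_speaker(utt, "spk7", models))  # accepted
print(verify_speaker(utt, "spk3", models))  # rejected
```

The design rationale is that each (gender, emotion) pair gets a verification model matched to its acoustic conditions, which is exactly what reduces the neutral-versus-emotional mismatch.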
Machine learning-based Automatic Speech Recognition (ASR) is now widespread in smartphones, home devices, and public facilities. As convenient as this technology can be, a considerable security issue also arises: users' speech content might be exposed to malicious ASR monitoring, causing severe privacy leakage. In this work, we propose HASP, a high-performance security enhancement approach that addresses this issue on mobile devices. Leveraging ASR systems' vulnerability to adversarial examples, HASP casts human-imperceptible adversarial noise onto real-time speech, effectively perturbing malicious ASR monitoring by increasing the Word Error Rate (WER). To improve practical performance on mobile devices, HASP is also optimized to adapt to human speech characteristics, environmental noises, and mobile computation scenarios. Experiments show that HASP achieves effective real-time security enhancement: it induces an average WER of 84.55% against malicious ASR monitoring, and its data processing is 15x to 40x faster than state-of-the-art methods. Moreover, HASP effectively perturbs various ASR systems, demonstrating strong transferability.
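The core mechanism, adding a small, bounded adversarial perturbation that degrades recognition, can be illustrated with an FGSM-style step against a toy linear score. A real attack would need gradients through (or estimated against) the full ASR model; everything below is a simplifying assumption, not HASP's actual method:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy stand-in for an ASR scoring function: a linear score whose
# gradient we know in closed form (it is just the weight vector).
w = rng.normal(size=64)
speech = rng.normal(size=64)

def perturb(speech, w, eps=0.01):
    """FGSM-style step: move each sample by at most eps in the
    direction that most increases the misrecognition score, keeping
    the noise small enough to remain hard to hear."""
    grad = w  # d(w . x)/dx for the linear toy score
    return speech + eps * np.sign(grad)

adv = perturb(speech, w)
print(np.max(np.abs(adv - speech)))  # stays within the eps bound
print(w @ adv - w @ speech)          # positive score shift
```

The per-sample bound is what keeps the noise imperceptible to humans while the accumulated score shift is what drives the monitoring system's WER up.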
But the applications of AI in media are not limited to content personalization. Media teams deal with manual processes for everything, from tagging media to creating multilingual subtitles, and recent advances in AI are automating many of these tasks. Developments in computer vision, speech-to-text, and natural language processing algorithms are changing the face of media creation, distribution, and, most importantly, media consumption. Voice is the most natural way for people to communicate.