Speech encompasses speech understanding/recognition and speech synthesis.
For technology users who have marveled at the ability of Siri or Alexa to recognize their voice, consider this: the National Security Agency has apparently been far ahead of Apple or Amazon. According to a report by The Intercept, the agency has at its disposal voice recognition technology that it employs to identify terrorists, government spies, or anyone else it chooses -- with just a phone call. The disclosure appeared in a recently published article drawing on a trove of documents leaked by former NSA contractor Edward Snowden. The publication wrote that, using recorded audio, the NSA is able to create a "voiceprint," a map of the qualities that mark a voice as singular, and identify the person speaking. The documents also suggest the agency is continuously improving its speech recognition capabilities, the publication noted.
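The "voiceprint" idea can be illustrated with a toy sketch: represent each speaker as a vector of voice qualities, then match an unknown recording against enrolled speakers by cosine similarity. Everything below (the feature values, names, and threshold) is invented for illustration; real systems derive such vectors from acoustic models, not three hand-picked numbers.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def identify(unknown_print, enrolled, threshold=0.9):
    """Return the enrolled speaker whose voiceprint best matches, if any clears the threshold."""
    best_name, best_score = None, threshold
    for name, voiceprint in enrolled.items():
        score = cosine_similarity(unknown_print, voiceprint)
        if score > best_score:
            best_name, best_score = name, score
    return best_name

# Hypothetical enrolled voiceprints (toy 3-dimensional features).
enrolled = {
    "alice": [0.9, 0.1, 0.3],
    "bob":   [0.2, 0.8, 0.5],
}
match = identify([0.85, 0.15, 0.28], enrolled)  # close to alice's print
```

The threshold matters: a recording that resembles no enrolled speaker closely enough returns no match rather than the least-bad guess.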
Lee Kai-Fu has always been bullish about the future of artificial intelligence (AI) in China. He opened his keynote speech at an AI conference at the Massachusetts Institute of Technology in November by predicting that self-driving cars will become a mass phenomenon in the U.S. in 15 to 20 years. But in China, he said, it will take "more like 10 years." "Although there are concerns about whether there is an emerging AI bubble in China, I'd say there isn't one," he told Caixin. Lee is a genuine insider when it comes to assessing the state of AI development in both North America and China. He completed his doctorate in computer-aided speech recognition at Carnegie Mellon University (CMU) in 1988 and went on to work at Apple Inc., Silicon Graphics Inc. and Microsoft Corp., and to head Google Inc.'s China business.
Google is adding a multitude of languages to its speech recognition capabilities. The expansion covers 30 international languages and local dialects, including languages spoken in emerging regions of India and Africa, bringing the total number of supported languages to 119. The update includes eight additional Indian languages and two African languages, Amharic and Swahili. The expanded speech recognition will open up more voice-based opportunities, whether searching the web or typing with one's voice.
Rokid, a Chinese startup that makes an AI voice assistant and smart devices, just raised a Series B extension round led by Temasek Holdings, with participation from Credit Suisse, IDG Capital and CDIB Capital. The size of the round was not disclosed, but a source familiar with the deal told TechCrunch that it is $100 million. The company's previous funding was its Series B round, announced in November 2016. Founder and chief executive officer Mingming Zhu says Rokid raised a Series B extension instead of a C round because the company is still in its early stages; it wants to focus on gathering more resources and bringing in strategic investors like Temasek Holdings before moving on to a Series C. Rokid is based in Hangzhou, China, with research centers in Beijing and San Francisco that develop its proprietary natural language processing, image processing, face recognition and robotics technology.
Google announced on Thursday the launch of AutoML Vision, taking its effort to make AI easier to use one step further. Cloud AutoML is a tool that allows developers with limited machine learning (ML) expertise to train custom image recognition models without having to write any code. Google's AutoML initiative was first announced at the company's I/O conference last year. For now, the service is focused only on image recognition; however, Google plans to expand it to other services across the major fields of AI, i.e. speech, translation, video and natural language recognition. Cloud AutoML allows anybody to train a model just by uploading their images, tagging them and then having Google's AutoML develop a custom ML model.
Four miles on the Vegas Strip, the latest gadgetry, some 4,000 vendors, 170,000 attendees, 7,000 media, three days of sessions, not including the pre-show briefings, backroom meetings and off-site soirees where the secret stuff goes down -- what, if anything, does the world's biggest consumer tech show mean to CIOs? Should a CIO, for goodness' sake, care that the Numi intelligent toilet by Kohler Co. -- a CES 2018 Innovation Awards Honoree -- has a voice-controlled toilet lid lifter and seat warmer, among other more intimate services? Or that the Kohler Konnect Verdera Voice Lighted Mirror is the world's first bathroom mirror with Amazon's AI voice assistant, Alexa? Isaac Sacolick, who's been a CIO at Businessweek and McGraw-Hill Construction and is now president and CIO at New York-based consulting firm StarCIO, believes so. "The Kohler Konnect mirror was probably one of the more interesting voice assistants I looked at," said Sacolick, who, like SearchCIO, monitored the event remotely.
Google today announced the alpha launch of AutoML Vision, a new service that helps developers -- including those with no machine learning (ML) expertise -- build custom image recognition models. While Google plans to expand this custom ML model builder under the AutoML brand to other areas, the service for now only supports computer vision models; you can expect the company to launch similar versions of AutoML for all the standard ML building blocks in its repertoire (think speech, translation, video, natural language recognition, etc.). The basic idea, Google says, is to allow virtually anybody to bring their images, upload them (and import their tags or create them in the app) and then have Google's systems automatically create a custom machine learning model for them. The company says that Disney, for example, has used this system to make the search feature in its online store more robust: it can now find all the products that feature a likeness of Lightning McQueen, not just those where the talking race car was tagged in the text description. The whole process, from importing data to tagging it and training the model, is done through a drag-and-drop interface.
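The workflow described here, training a custom recognizer from user-tagged examples, can be sketched in miniature with a nearest-centroid classifier over toy feature vectors. AutoML's internals are not public, so everything below (function names, features, tags) is an illustrative assumption about the general technique, not Google's actual pipeline.

```python
# Toy sketch: "train" a per-tag model by averaging the feature vectors of the
# tagged examples, then classify a new image by its closest tag centroid.

def train_centroids(labeled_examples):
    """Average the feature vectors of each tag into one centroid per tag."""
    sums, counts = {}, {}
    for features, tag in labeled_examples:
        if tag not in sums:
            sums[tag] = [0.0] * len(features)
            counts[tag] = 0
        sums[tag] = [s + f for s, f in zip(sums[tag], features)]
        counts[tag] += 1
    return {tag: [s / counts[tag] for s in sums[tag]] for tag in sums}

def predict(features, centroids):
    """Assign the tag whose centroid is closest (squared Euclidean distance)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda tag: dist(features, centroids[tag]))

# Hypothetical 2-D features standing in for uploaded, tagged images.
data = [
    ([1.0, 0.1], "car"), ([0.9, 0.2], "car"),
    ([0.1, 1.0], "tree"), ([0.2, 0.9], "tree"),
]
model = train_centroids(data)
label = predict([0.95, 0.15], model)  # near the "car" examples
```

The point of the sketch is the interface, not the algorithm: the user supplies only examples and tags, and model construction is automated.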
Google has taken another stride toward its 'AI-first' ambitions. The tech giant has developed a text-to-speech system with remarkably human-like articulation. The AI system, called "Tacotron 2," can generate computer speech in a human-sounding voice. Google researchers said in a blog post that the new approach does not use complex linguistic and acoustic features as input. Instead, it produces human-like speech from text using neural networks trained only on speech examples and the corresponding text transcripts.
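The "only speech examples and corresponding text transcripts" point can be made concrete: systems in the Tacotron family consume raw characters rather than hand-engineered linguistic features. A minimal sketch of that input side, with an invented toy vocabulary (the real character set and preprocessing differ):

```python
# Toy sketch of the text side of a (text, audio) training pair: raw characters
# mapped to integer IDs are the only linguistic "features" the network sees.

def text_to_ids(text, vocab):
    """Map raw characters to integer IDs, dropping characters not in the vocabulary."""
    return [vocab[c] for c in text.lower() if c in vocab]

# Invented minimal vocabulary: lowercase letters, space, apostrophe.
vocab = {c: i for i, c in enumerate("abcdefghijklmnopqrstuvwxyz '")}
ids = text_to_ids("Hello world", vocab)
```

During training, each such ID sequence is paired with the audio of a speaker reading the sentence; the network learns the mapping end to end, with no pronunciation dictionaries or phoneme annotations supplied.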
When Gang Xu, a 46-year-old Beijing resident, needs to communicate with his Canadian tenant about rent payments or electricity bills, he opens an app called iFlytek Input on his smartphone, taps an icon that looks like a microphone, and begins talking. The software turns his spoken Chinese into English text messages and sends them to the Canadian tenant. In China, over 500 million people use iFlytek Input to overcome communication obstacles like the one Xu faces. Some also use it to send text messages through voice commands while driving, or to communicate with a speaker of another Chinese dialect. The app was developed by iFlytek, a Chinese AI company that applies deep learning in a range of fields such as speech recognition, natural-language processing, machine translation, and data mining (see "50 Smartest Companies 2017").
I was traveling last fall with my boss, and we began to talk about our upcoming conference day around artificial intelligence. We came to the topic of machine learning, and I mentioned, half jokingly, that theoretically all web accessibility barriers could be automatically resolved through the proper application of machine learning. Since that day I've continued to consider the challenge, and have come across a few articles that have reinforced the theory. As our Q1 conference day is soon coming, I've decided to take a few minutes to share my thoughts. For those who don't know, web accessibility is the practice of making content and applications ("web content") accessible to those with a variety of disabilities, "...including blindness and low vision, deafness and hearing loss, learning disabilities, cognitive limitations, limited movement, speech disabilities, [and] photosensitivity" (https://www.w3.org/TR/WCAG20/).