They can still have trouble understanding simple commands to play music or look up directions, though, especially in noisy places. Rather than focusing on cleaning up the audio signal that captures your voice, Israeli startup VocalZoom thinks it might be possible to make all kinds of speech-recognition applications work a lot better by using a tiny, low-power laser that measures the minute vibrations of your skin when you speak. The company, which has raised about $12.5 million in venture funding thus far, is building a sensor with a small laser that it says will initially be built into headsets and helmets; there, it will be used alongside existing microphone-based speech-recognition technologies to reduce overall misunderstandings. VocalZoom founder and CEO Tal Bakish thinks it will first be used in products like motorcycle helmets or headsets worn by warehouse workers--you might use it to ask for directions while riding your Harley, for instance. A Chinese speech-recognition company called iFlytek plans to have a prototype headset ready at the end of August.
The headline story here is that, for the first time, a system has been developed that exceeds human performance on one of the most difficult of all speech recognition tasks: natural conversations held over the telephone. This is known as conversational telephone speech, or CTS. The reference datasets for this task are the Switchboard and Fisher data collections from the 1990s and early 2000s. The oft-repeated but apocryphal claim is that human performance on the task is about a 4% word error rate, yet no one can quite pin down where that 4% number comes from.
A new milestone in speech recognition has been reached by Microsoft, whose software now matches the accuracy of trained human transcribers. The firm's software, used in its Cortana voice assistant, has achieved a 5.1 per cent word error rate, putting it on a par with professionals. One of the big frustrations of voice recognition has been getting machines to accept commands, a process which often involves repetition and exaggerated speech. The development means the company's products should soon accept spoken commands with human-level precision.
Depending on whom you ask, humans miss one to two words out of every 20 they hear. Imagine, then, how much more difficult this is for a computer. Last year, IBM announced a major milestone in conversational speech recognition: a system that achieved a 6.9 percent word error rate. Since then, we have continued to push the boundaries of speech recognition, and today we've reached a new industry record of 5.5 percent. This was measured on a very difficult speech recognition task: recorded conversations between humans discussing day-to-day topics like "buying a car."
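The word error rate figures quoted above (4%, 5.1%, 5.5%, 6.9%) all refer to the same metric: the word-level edit distance between a system's transcript and a reference transcript, divided by the number of words in the reference. Missing one word in 20 corresponds to a 5% WER. The sketch below shows a minimal version of that computation; it is illustrative only, not the scoring code used by IBM or Microsoft, and the function name and example sentences are made up for the demonstration.

```python
def word_error_rate(reference, hypothesis):
    """WER: word-level edit distance (substitutions, insertions,
    deletions) divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Standard dynamic-programming edit distance over word sequences.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# One substituted word out of seven: WER of 1/7, about 14.3%.
ref = "please play some music in the kitchen"
hyp = "please play some music in the chicken"
print(word_error_rate(ref, hyp))
```

Note that because insertions count as errors, WER can exceed 100% on a bad transcript; production scoring pipelines also normalize text (case, punctuation, contractions) before comparing, which this sketch omits.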
In this short paper, we introduce the concept of "auditory perspective taking" and discuss its nature and utility in aural interactions between people. We then describe an integrated range of techniques, motivated by this idea, that we have developed for improving the success of robotic speech presentations for individual human users and listeners in relevant circumstances.