They can still have trouble understanding simple commands to play music or look up directions, though, especially in noisy places. Rather than focusing on cleaning up the audio signal that captures your voice, Israeli startup VocalZoom thinks it might be possible to make all kinds of speech-recognition applications work a lot better by using a tiny, low-power laser that measures the itty-bitty vibrations of your skin when you speak. The company, which has raised about $12.5 million in venture funding thus far, is building a sensor with a small laser that it says will initially be built into headsets and helmets; there, it will be used alongside existing microphone-based speech-recognition technologies to reduce overall misunderstandings. VocalZoom founder and CEO Tal Bakish thinks it will first be used for things like motorcycle helmets or headsets worn by warehouse workers; you might use it to ask for directions while riding your Harley, for instance. A Chinese speech-recognition company called iFlytek plans to have a prototype headset ready at the end of August.
The headline story here is that for the first time a system has been developed that exceeds human performance in one of the most difficult of all human speech recognition tasks: natural conversations held over the telephone. This is known as conversational telephone speech, or CTS. The reference datasets for this task are the Switchboard and Fisher data collections from the 1990s and early 2000s. The apocryphal story here is that human performance on the task is about a 4% error rate. But no one can quite pin down where that 4% number comes from.
A new milestone in human speech recognition has been reached by Microsoft, matching the accuracy of trained human transcribers. The firm's software, used in its Cortana voice assistant, has achieved a 5.1 per cent word error rate, putting it on a par with professionals. One of the big frustrations of voice recognition has been getting machines to accept commands, a process which often involves repetition and exaggerated speech. The development means the company's products will soon accept orders with super-human precision.
Depending on whom you ask, humans miss one to two words out of every 20 they hear. Imagine, though, how difficult it is for a computer. Last year, IBM announced a major milestone in conversational speech recognition: a system that achieved a 6.9 percent word error rate. Since then, we have continued to push the boundaries of speech recognition, and today we've reached a new industry record of 5.5 percent. This was measured on a very difficult speech recognition task: recorded conversations between humans discussing day-to-day topics like "buying a car."
IBM researchers have set a milestone in conversational speech recognition by achieving a new industry record of a 5.5 percent word error rate, surpassing their previous record of 6.9 percent, according to the company's blog post. The record was measured on a difficult speech recognition task: recorded conversations between humans discussing typical everyday topics like "buying a car." This recorded corpus, titled "SWITCHBOARD", has been used for over two decades to benchmark speech recognition systems. To achieve the 5.5 percent record, the researchers focused on extending the company's application of deep learning technologies by combining LSTM (Long Short-Term Memory) and WaveNet language models with three strong acoustic models. The first two acoustic models were six-layer bidirectional LSTMs, one equipped with multiple feature inputs and the other trained with speaker-adversarial multi-task learning.