Speech encompasses speech understanding/recognition and speech synthesis.
David Borish is Chief Creative at PRIMO AI, a New York startup that recommends the highest-performing speech-to-text (STT) and natural language understanding (NLU) services for a particular dataset and geographical region. We discover what the biggest problem with speech-to-text systems is today, and why trying to solve it by hiring data scientists can be prohibitively expensive. We also discuss the advantages of acquiring a technology patent, why David chose to recently enter the voice space, and the approach he takes when selecting his next entrepreneurial challenge. David is a seasoned startup veteran who believes passionately in the future of voice, and our conversation contains many valuable lessons to take away.
An automatic-speech-recognition system -- such as Alexa's -- converts speech into text, and one of its key components is its language model. Given a sequence of words, the language model computes the probability that any given word is the next one. For instance, a language model would predict that a sentence that begins "Toni Morrison won the Nobel" is more likely to conclude "Prize" than "dries". Language models can thus help decide between competing interpretations of the same acoustic information. Conventional language models are n-gram based, meaning that they model the probability of the next word given the past n-1 words.
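The n-gram idea above can be shown with a minimal sketch: count bigrams in a toy corpus and turn the counts into conditional probabilities P(next word | previous word). The corpus, function names, and the bigram (n=2) choice are all illustrative assumptions, not Alexa's actual model.

```python
from collections import Counter, defaultdict

# Toy corpus standing in for the large text collections real systems train on.
CORPUS = [
    "toni morrison won the nobel prize",
    "she won the nobel prize for literature",
    "the paint dries slowly",
]

def train_bigram_model(sentences):
    """Count bigrams and convert them to conditional probabilities P(next | prev)."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return {
        prev: {nxt: c / sum(nxts.values()) for nxt, c in nxts.items()}
        for prev, nxts in counts.items()
    }

def next_word_probability(model, prev, candidate):
    """P(candidate | prev); 0.0 if the bigram was never seen."""
    return model.get(prev, {}).get(candidate, 0.0)

model = train_bigram_model(CORPus := CORPUS)
# After "nobel", "prize" is far more probable than "dries", so the
# recognizer can prefer the "Prize" transcription of ambiguous audio.
print(next_word_probability(model, "nobel", "prize"))  # 1.0
print(next_word_probability(model, "nobel", "dries"))  # 0.0
```

A production system would use larger n, smoothing for unseen n-grams, and log-probabilities, but the ranking between competing transcriptions works the same way.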
Since Siri debuted on the iPhone 4s back in 2011, voice assistants have gone from unworkable gimmick to the basis for smart speaker technology found in one in six American homes. "Before Siri, when I talked about [what I do] there were blank stares," Tom Hebner, head of innovation at Nuance Communications, which develops cutting-edge A.I. voice technology, told Digital Trends. "People would say, 'Do you build those horrible phone systems?' That was one group of people's only interaction with voice technology." According to eMarketer forecasts, almost 100 million smartphone users will be using voice assistants by 2020.
That's what companies like Salesforce are expecting as they invest in technology like Einstein Voice Assistant to help make it even easier for sales staff to track, message, update, and notify their teams about relevant customer-oriented data. And you can be sure that the likes of Microsoft Dynamics, SAP, and other CRM leaders will follow closely with this capability in the coming year as voice technology picks up speed. But what do marketers and sales leaders need to know about this advancement? How will their work be impacted by voice technology and CRM? The short answer: voice is about to shape marketing and customer experience in big ways.
At Microsoft, our mission is to empower every person and organization on the planet to achieve more. The media industry exemplifies this mission. We live in an age where more content is being created and consumed in more ways and on more devices than ever. At IBC 2019, we're delighted to share the latest innovations we've been working on and how they can help transform your media workflows. Read on to learn more, or join our product teams and partners at Hall 1 Booth C27 at the RAI in Amsterdam from September 13th to 17th.
ICYMI: Earlier this summer we broke new ground with RealTalk, a speech synthesis system created by machine learning engineers at Dessa. With their AI-powered text-to-speech system, the team managed to replicate the voice of Joe Rogan, a podcasting legend known for his irreverent takes on consciousness, sports and technology. On top of that, their recreation of Rogan's voice is the most realistic AI voice that's been released to date. If you haven't heard the voice yet, you should: we shared a video on YouTube featuring a medley of their faux Rogan's musings. Since then, the public's response to the work has wowed us.
McDonald's announced it will McBuy the Bay Area voice-recognition startup Apprente for an undisclosed amount. According to McDonald's, Apprente's "sound-to-meaning" technology handles "complex, multilingual, multi-accent and multi-item conversational ordering," and the company believes the technology will help streamline the drive-thru process -- even faster food, you say?? As the earth turns and the centuries change, so does the way people wish to order a Big Mac, and Micky D's has the cash to listen. Back in March, the company bought Dynamic Yield, which customizes drive-thru menus based on factors like weather, time of day, and customer order profiles. A month later, it invested in New Zealand app-designer Plexure, which will help connect customers to its new smart drive-thrus, among other things.
AI-powered synthetic brains will allow humans to operate 500 versions of themselves at once, according to the man behind Amazon's voice assistant. Igor Jablokov believes artificial intelligence will become so advanced we will be unable to distinguish between a real or "synthetic" mind. The CEO of Pryon previously founded Yap, a fully-automated cloud platform for voice recognition, which was snapped up by Amazon and later used in the popular Alexa. The device uses a non-human voice to communicate with users, but Igor warns such technology could change with terrifying consequences. He told the Financial Times: "People will not be able to tell if they are interacting with you or your AI proxy. Right now, you could be doing two interviews at once."
McDonald's has wolfed down Apprente, an AI startup focused on voice recognition. One of America's biggest fast-food chains wants to get its greasy hands on machine learning. Apprente, based in Mountain View, California, was founded in 2017, and has been building speech-powered customer-service chatbots. Now, the team will be rebranded as McD Tech Labs, and will slap their technology into McDonald's Drive Thru service. "The initial focus of the Silicon Valley team will be to enhance technology for use in McDonald's Drive Thru," gushed the McFlurry giant in a statement.
A text-to-speech system, which converts written text into synthesized speech, is what allows Alexa to respond verbally to requests or commands. Through a service called Amazon Polly, text-to-speech is also a technology that Amazon Web Services offers to its customers. Last year, both Alexa and Polly evolved toward neural-network-based text-to-speech systems, which synthesize speech from scratch, rather than the earlier unit-selection method, which strung together tiny snippets of pre-recorded sounds. In user studies, people tend to find speech produced by neural text-to-speech (NTTS) systems more natural-sounding than speech produced by unit selection. But the real advantage of NTTS is its adaptability, something we demonstrated last year in our work on changing the speaking style ("newscaster" versus "neutral") of an NTTS system.
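Polly exposes the neural-versus-unit-selection choice directly through the `Engine` parameter of its `SynthesizeSpeech` API. The sketch below only assembles the request keyword arguments; the helper function and example text are mine, while `Engine`, `VoiceId`, `OutputFormat`, and `Text` are Polly's actual parameter names. The real call (shown in a comment) would additionally require boto3 and AWS credentials.

```python
def build_polly_request(text, neural=True, voice_id="Joanna"):
    """Assemble keyword arguments for Amazon Polly's synthesize_speech call.

    Engine="neural" selects an NTTS voice; "standard" falls back to the
    older concatenative (unit-selection) voices.
    """
    return {
        "Engine": "neural" if neural else "standard",
        "VoiceId": voice_id,
        "OutputFormat": "mp3",
        "Text": text,
    }

params = build_polly_request("Toni Morrison won the Nobel Prize.")
# With boto3 installed and AWS credentials configured, the request would be:
#   import boto3
#   audio = boto3.client("polly").synthesize_speech(**params)["AudioStream"].read()
print(params["Engine"])  # neural
```

For some neural voices, Polly also exposes a "newscaster" speaking style via SSML, which mirrors the style-adaptability point made above.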