Natural Language Processing (NLP), the ability of a software program to understand human language as it is spoken, has seen major breakthroughs thanks to Artificial Intelligence (AI) and improved access to fast processors and cloud computing. With the introduction of more personal assistants, better smartphone functionality, and the use of Big Data to automate even more routine human jobs, NLP adoption is projected to pick up steam in the coming years. SoundHound creates voice-enabled AI and conversational intelligence systems. It offers a Speech-to-Meaning engine as well as Deep Meaning Understanding technology, both of which can be integrated into other services and devices. It also creates music recognition apps and voice search assistants.
Listen to this episode from Tcast on Spotify. Voice recognition software is getting better and better. Once upon a time it was clunky and unreliable, and even the best systems required you to spend far too much time training them while speaking slowly and enunciating like…well…like a computer. At last, though, the systems have improved to the point where it's possible to accurately convey meaning through talk-to-text features without having to clarify every other word. In fact, I know a trucker who does most of his communication using talk-to-text on his phone. It makes a few mistakes here and there, but its accuracy is still pretty impressive considering he's speaking normally while in a large moving vehicle.

Then there are the voice assistants on our phones. Whether you talk to Siri, Alexa, or Cortana (all four of you, you know who you are), that voice recognition starts out needing a little training, but nothing like it used to. And the more you use it to look up local restaurants, find a factoid to settle an argument, or book a hotel room, the more accurate it gets. Now these assistants are even in the homes of many, listening constantly for you to need their help with something, whether that's dimming the lights or spinning up your favorite playlist on Spotify.

The improvements in this software hold a lot of potential. It has already been used for years in business to accommodate employees who may not be able to speak clearly or who have lost the use of their arms. It is also a much more efficient way to record information than the increasingly dated keyboard. Typing is inherently inefficient, creating the possibility of misspellings that need to be corrected lest they convey an unintended meaning. It also requires a keyboard, which adds space, weight, and cost to your computer. As voice recognition software improves, the keyboard can be replaced with a simple microphone, probably the one on your phone. Imagine being able to compose reliable business messages, a book, or notes on a law case and have them all transcribed without having to take the time to proofread them. The time savings would be impressive.

Or consider a more mundane situation: you're sitting at home with a craving for pizza, but you can't quite remember the name of the place you ordered from last month. You throw the question out into the air, and your device reminds you of the name and the price, then asks if you'd like it to order a pizza for you. If you think about it, Alexa and other smart devices are only a step or two away from that level of functionality.

Another use would be in hospitals. Embedded microphones could record conversations with your doctor, highlighting the key points and capturing all of the important information. This would save time and increase efficiency in a number of ways. No longer would nurses and admins have to spend hours on data entry, with all the potential transcription errors that entails. Incidentally, it would also save you from answering the same questions three times every time you go in for a checkup. It also means no one, or at least very few people, would have to come in contact with the Petri dishes known as keyboards in an environment that should be kept as sterile as possible. Lectures and presentations could be recorded and transcribed instantly, making information readily available in real time. The possibilities are enormous.
Yet there are potential problems that arise, namely: who owns all the data being generated and recorded? Is it the place where the recording happens? The place where it is stored? Some other party? At TARTLE, we believe all the data you generate is yours. So if it's your information and your data being recorded, then you deserve to be the primary beneficiary of sharing it, and of deciding whether you want to share that data at all. These are questions that will be addressed sooner or later in the legislative realm, which is why we are encouraging people to sign up at tartle.co and join the TARTLE movement. Together we can help steer that eventual legislation in a direction that benefits not just a few, but every person who works to generate that data in the first place. What's your data worth? www.tartle.co
At the start of the year, Spotify secured a patent for a voice recognition system that could detect the "emotional state," age, and gender of a person and use that information to make personalized listening recommendations. As you might imagine, the possibility that the company was working on a technology like that made a lot of people uncomfortable, including the digital rights non-profit Access Now. At the start of April, the organization sent Spotify a letter calling on it to abandon the tech. After Spotify privately responded to those concerns, Access Now, along with several other groups and a collection of more than 180 musicians, is asking the company to publicly commit to never using, licensing, selling, or monetizing the system it patented. Some of the individuals and bands to sign the letter include Rage Against the Machine guitarist Tom Morello, rapper Talib Kweli, and indie group DIIV.
After years of rumors, confirmations, and vague descriptions, Spotify has finally made its first piece of hardware available to select users. Even though the company revealed the full details on Car Thing earlier this month, it's only a "limited release" right now. I've spent two weeks with Car Thing in my car (obviously), and can tell you one thing: this dedicated Spotify player is really more of a controller for the app on your phone. Spotify first tipped its hand on an in-car music player in 2018, when it offered a few Reddit users the opportunity to try a compact device that reportedly featured voice control and 4G connectivity.
Whether your road trip soundtracks consist of music, news, entertainment, or talk, Spotify's Car Thing has you covered. The new smart player, currently available to select users in the U.S., puts your audio library just a voice command, tap, turn, or swipe away. "Car Thing enables you to play your favorite audio faster, so you're already listening to that hit song or the latest podcast episode before you've even pulled out of the driveway," according to a Spotify blog announcement. "Switching between your favorite audio is effortless, allowing you to shift gears to something else as soon as the mood strikes." You will need a Spotify Premium account to use Car Thing, but setup is simple: plug the device into a 12-volt power outlet, sync it with your smartphone (iOS 14 and Android 8 or above), and connect your phone to the vehicle's stereo.
At this point, we've seen rumors, job listings, blog posts, FCC filings and more rumors about Spotify's in-car music player over the span of a few years. In fact, I was convinced it would never become a thing the public could actually use. When the company first revealed a piece of hardware called "Car Thing" in 2019, Spotify was clear the test was meant "to help us learn more about how people listen to music and podcasts." It also explained that there weren't "any current plans" to make that device available to consumers. Now Spotify is ready for select users to get their hands on a refined version of the voice-controlled in-car player.
In this paper, we ask whether vocal source features (pitch, shimmer, jitter, etc.) can improve the performance of automatic sung speech recognition, arguing that conclusions previously drawn from spoken speech studies may not be valid in the sung speech domain. We first use a parallel singing/speaking corpus (NUS-48E) to illustrate differences in sung vs. spoken voicing characteristics, including pitch range, syllable duration, vibrato, jitter, and shimmer. We then use this analysis to inform speech recognition experiments on the sung speech DSing corpus, using a state-of-the-art acoustic model and augmenting conventional features with various voice source parameters. Experiments are run with three standard, increasingly large training sets: DSing1 (15.1 hours), DSing3 (44.7 hours), and DSing30 (149.1 hours). Pitch combined with degree of voicing produces a significant decrease in WER, from 38.1% to 36.7%, when training with DSing1; the smaller WER decreases observed when training with the larger, more varied DSing3 and DSing30 sets were not statistically significant. Voicing quality characteristics did not improve recognition performance, although analysis suggests that they do contribute to improved discrimination between voiced/unvoiced phoneme pairs.
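To make the feature-augmentation idea concrete, here is a minimal sketch (not the authors' actual pipeline) of how pitch and a degree-of-voicing estimate can be appended to conventional frame-level features, using the open-source librosa library. The file name, hop size, and MFCC count are illustrative assumptions, and librosa's pYIN voicing probability stands in for the paper's degree-of-voicing measure.

    # Sketch: augmenting conventional acoustic features with voice source
    # parameters (pitch + degree of voicing). Assumes librosa and numpy;
    # the file name and hop size are placeholders.
    import librosa
    import numpy as np

    HOP = 160  # 10 ms frames at 16 kHz (assumed frame rate)

    y, sr = librosa.load("sung_speech.wav", sr=16000)

    # Conventional features: 13 MFCCs per frame.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, hop_length=HOP)

    # Voice source features: f0 via pYIN, with the per-frame voicing
    # probability serving as a "degree of voicing" estimate.
    f0, voiced_flag, voiced_prob = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C6"),
        sr=sr, hop_length=HOP)
    f0 = np.nan_to_num(f0)  # pYIN returns NaN for unvoiced frames

    # Align lengths and stack: rows are frames, columns are features.
    n = min(mfcc.shape[1], f0.shape[0])
    augmented = np.vstack(
        [mfcc[:, :n], f0[None, :n], voiced_prob[None, :n]]).T
    print(augmented.shape)  # (frames, 13 + 2)

In a real system the augmented matrix would then feed the acoustic model in place of the plain MFCCs, which is the spirit of the comparison the paper runs across the three DSing training sets.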
We propose a novel pitch estimation technique called DeepF0, which leverages the available annotated data to learn directly from raw audio in a data-driven manner. F0 estimation is important in various speech processing and music information retrieval applications. Existing deep learning models for pitch estimation have relatively limited learning capabilities due to their shallow receptive fields. The proposed model addresses this issue by extending the network's receptive field with dilated convolutional blocks. The dilation factor increases the receptive field exponentially without an exponential increase in model parameters. To make the training process more efficient and faster, DeepF0 is augmented with residual blocks. Our empirical evaluation demonstrates that the proposed model outperforms the baselines in terms of raw pitch accuracy and raw chroma accuracy, even while using 77.4% fewer network parameters. We also show that our model estimates pitch reasonably well even under various levels of accompaniment noise.
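To illustrate the architectural idea, the sketch below shows a dilated 1-D convolutional block with a residual connection in PyTorch, stacked so the dilation doubles at each layer. The layer sizes, kernel widths, and the 360-bin classification head are illustrative assumptions, not the published DeepF0 configuration.

    # Minimal sketch of dilated residual conv blocks for pitch estimation
    # on raw audio. Hyperparameters are illustrative, not the paper's.
    import torch
    import torch.nn as nn

    class DilatedResBlock(nn.Module):
        def __init__(self, channels: int, dilation: int, kernel_size: int = 3):
            super().__init__()
            # "Same" padding so the residual addition lines up in time.
            pad = (kernel_size - 1) // 2 * dilation
            self.conv = nn.Conv1d(channels, channels, kernel_size,
                                  padding=pad, dilation=dilation)
            self.norm = nn.BatchNorm1d(channels)
            self.act = nn.ReLU()

        def forward(self, x):
            # Residual connection keeps gradients flowing through deep stacks.
            return self.act(self.norm(self.conv(x)) + x)

    class TinyPitchNet(nn.Module):
        def __init__(self, channels: int = 64, n_blocks: int = 6,
                     n_bins: int = 360):
            super().__init__()
            self.stem = nn.Conv1d(1, channels, kernel_size=7, padding=3)
            # Dilation doubles each block: the receptive field grows
            # exponentially while parameters grow only linearly.
            self.blocks = nn.Sequential(
                *[DilatedResBlock(channels, dilation=2 ** i)
                  for i in range(n_blocks)])
            self.head = nn.Linear(channels, n_bins)  # pitch bins as classes

        def forward(self, wav):  # wav: (batch, samples) raw audio
            h = self.blocks(self.stem(wav.unsqueeze(1)))
            return self.head(h.mean(dim=-1))  # (batch, n_bins) logits

    logits = TinyPitchNet()(torch.randn(2, 1024))
    print(logits.shape)  # torch.Size([2, 360])

With six blocks and dilations 1, 2, 4, …, 32, the receptive field spans hundreds of samples while each block adds the same small number of weights, which is the trade-off the abstract describes.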
The very mention of Artificial Intelligence reminds most people of movies like The Terminator, but in actuality, AI is already very present in our daily lives, making things much easier for us in a multitude of fields. For example, according to a Harvard Business Review study, companies using AI for sales managed to bring in 50% more leads and reduce their costs by 40% to 60%. AI applications are not necessarily actual robots walking about the office. In most cases, they mean the introduction of software and tools that make conducting business easier, more affordable, and faster by automating as much as possible. Mathematician Alan Turing was the first to really ask the question 'Can machines think?'.