Musicians ask Spotify to publicly abandon controversial speech recognition patent


At the start of the year, Spotify secured a patent for a voice recognition system that could detect the "emotional state," age and gender of a person and use that information to make personalized listening recommendations. As you might imagine, the possibility that the company was working on a technology like that made a lot of people uncomfortable, including digital rights non-profit Access Now. At the start of April, the organization sent Spotify a letter calling on it to abandon the tech. After Spotify privately responded to those concerns, Access Now, along with several other groups and a collection of more than 180 musicians, is asking the company to publicly commit to never using, licensing, selling or monetizing the system it patented. Individuals and bands who signed the letter include Rage Against the Machine guitarist Tom Morello, rapper Talib Kweli and indie group DIIV.

'Hey Spotify, play Up First:' Two weeks with Car Thing


After years of rumors, confirmation and vague descriptions, Spotify has finally made its first piece of hardware available to select users. Even though the company revealed the full details on Car Thing earlier this month, it's only a "limited release" right now. I've spent two weeks with Car Thing in my car (obviously), and can tell you one thing -- this dedicated Spotify player is really more of a controller for the app on your phone. Spotify first tipped its hand on an in-car music player in 2018. It offered a few Reddit users the opportunity to try a compact device that reportedly featured voice control and 4G connectivity.

Spotify launches voice-controlled 'Car Thing'


Whether your road trip soundtracks consist of music, news, entertainment, or talk, Spotify's Car Thing has you covered. The new smart player, currently available to select users in the U.S., puts your audio library just a voice command, tap, turn, or swipe away. "Car Thing enables you to play your favorite audio faster, so you're already listening to that hit song or the latest podcast episode before you've even pulled out of the driveway," according to a Spotify blog announcement. "Switching between your favorite audio is effortless, allowing you to shift gears to something else as soon as the mood strikes." You will need a Spotify Premium account to use Car Thing, but setup is simple: plug the device into a 12-volt power outlet, sync it with your smartphone (iOS 14 and Android 8 or above), and connect your phone to the vehicle's stereo.

Spotify's voice-controlled 'Car Thing' is available for some subscribers


At this point, we've seen rumors, job listings, blog posts, FCC filings and more rumors about Spotify's in-car music player over the span of a few years. In fact, I was convinced it would never become a thing the public could actually use. When the company first revealed a piece of hardware called "Car Thing" in 2019, Spotify was clear the test was meant "to help us learn more about how people listen to music and podcasts." It also explained that there weren't "any current plans" to make that device available to consumers. Now Spotify is ready for select users to get their hands on a refined version of the voice-controlled in-car player.

Spotify rolls out its own hands-free voice assistant on iOS and Android


Spotify users on iOS and Android have another way to quickly play something. The audio streaming service has an in-app voice assistant you can operate hands free, building on the existing voice search function. After saying the "Hey, Spotify" wake word, you can ask the app to fire up a song or playlist or play music from a certain artist. You'll need to grant Spotify permission to access your microphone if you want to use the feature, which you can switch on from the voice interactions section of the menu. As GSM Arena notes, Spotify's privacy policy states that the service only stores recordings and transcriptions of your searches after you say the wake word or tap the voice button.

The Use of Voice Source Features for Sung Speech Recognition Artificial Intelligence

In this paper, we ask whether vocal source features (pitch, shimmer, jitter, etc.) can improve the performance of automatic sung speech recognition, arguing that conclusions previously drawn from spoken speech studies may not be valid in the sung speech domain. We first use a parallel singing/speaking corpus (NUS-48E) to illustrate differences in sung vs. spoken voicing characteristics, including pitch range, syllable duration, vibrato, jitter and shimmer. We then use this analysis to inform speech recognition experiments on the sung speech DSing corpus, using a state-of-the-art acoustic model and augmenting conventional features with various voice source parameters. Experiments are run with three standard (increasingly large) training sets: DSing1 (15.1 hours), DSing3 (44.7 hours) and DSing30 (149.1 hours). Pitch combined with degree of voicing produces a significant decrease in WER from 38.1% to 36.7% when training with DSing1. However, the smaller decreases in WER observed when training with the larger, more varied DSing3 and DSing30 sets were not statistically significant. Voicing quality characteristics did not improve recognition performance, although analysis suggests that they do contribute to improved discrimination between voiced/unvoiced phoneme pairs.
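The abstract reports its results as word error rate (WER). As a minimal sketch of how that metric is conventionally computed (word-level edit distance divided by the number of reference words), the Python below may help; it is not the paper's scoring code, and the function names `word_error_rate` and `relative_reduction` are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction by which `improved` lowers `baseline`."""
    return (baseline - improved) / baseline
```

Applied to the figures quoted above, `relative_reduction(38.1, 36.7)` gives roughly 0.037, so the 1.4-point absolute drop on DSing1 corresponds to a relative WER reduction of about 3.7%.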

DEEPF0: End-To-End Fundamental Frequency Estimation for Music and Speech Signals Artificial Intelligence

We propose a novel pitch estimation technique called DeepF0, which leverages the available annotated data to learn directly from raw audio in a data-driven manner. F0 estimation is important in various speech processing and music information retrieval applications. Existing deep learning models for pitch estimation have relatively limited learning capabilities due to their shallow receptive field. The proposed model addresses this issue by extending the receptive field of the network through the introduction of dilated convolutional blocks. The dilation factor increases the network's receptive field exponentially without increasing the number of model parameters exponentially. To make the training process faster and more efficient, DeepF0 is augmented with residual blocks with residual connections. Our empirical evaluation demonstrates that the proposed model outperforms the baselines in terms of raw pitch accuracy and raw chroma accuracy while using 77.4% fewer network parameters. We also show that our model estimates pitch reasonably well even under various levels of accompaniment noise.
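The receptive-field claim above can be checked with simple arithmetic. The sketch below assumes stride-1 1D convolutions with one shared kernel size and the same channel count in every layer; the kernel size, channel count and dilation schedule are illustrative assumptions, not DeepF0's actual configuration, and both function names are hypothetical.

```python
from typing import Sequence


def receptive_field(kernel_size: int, dilations: Sequence[int]) -> int:
    """Receptive field, in samples, of stacked stride-1 1D convolutions.

    Each layer with dilation d widens the field by (kernel_size - 1) * d.
    """
    return 1 + sum((kernel_size - 1) * d for d in dilations)


def conv_weight_count(kernel_size: int, channels: int, num_layers: int) -> int:
    """Weights in the stack (biases ignored); independent of dilation."""
    return num_layers * kernel_size * channels * channels


# Assumed example: four layers, kernel size 3, 64 channels.
plain = receptive_field(3, [1, 1, 1, 1])    # no dilation
dilated = receptive_field(3, [1, 2, 4, 8])  # dilation doubles per layer
weights = conv_weight_count(3, 64, 4)       # identical for both stacks
```

With these assumed settings, the undilated stack covers 9 samples and the dilated stack covers 31, while both use the same number of weights. Doubling the dilation at each layer makes the receptive field grow exponentially with depth, whereas the parameter count grows only linearly.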

Exploring the implications of AI with Mastercard's AI Garage - ideaXme


Artificial intelligence has become a technological buzzword, often referred to simply as AI without describing the possibly infinite number of practical applications that artificial intelligence can actually provide, or the intricacies involved from industry to industry and region to region. To discuss some of the many applications for artificial intelligence, as well as some of the considerations to be taken into account to create more accurate and less biased machine learning systems, I had the pleasure of speaking with Nitendra Rajput, VP and Head of Mastercard's AI Garage. Nitendra Rajput is the Vice President and Head of Mastercard's AI Garage, setting up the centre to enable it to solve problems across various business verticals globally with machine learning processes, increasing efficiencies across the business as well as mitigating instances of fraud. Nitendra has over 20 years' experience working in the fields of artificial intelligence, machine learning, and mobile interactions, after realising a gap in the market for developing speech recognition systems for vocally-led countries, such as India. Prior to Mastercard's AI Garage, he spent 18 years at IBM Research, working on different aspects of machine learning, human-computer interaction, software engineering and mobile sensing.

Computer-Generated Music for Tabletop Role-Playing Games Machine Learning

In this paper we present Bardo Composer, a system to generate background music for tabletop role-playing games. Bardo Composer uses a speech recognition system to translate player speech into text, which is classified according to a model of emotion. Bardo Composer then uses Stochastic Bi-Objective Beam Search, a variant of Stochastic Beam Search that we introduce in this paper, with a neural model to generate musical pieces conveying the desired emotion. We performed a user study with 116 participants to evaluate whether people are able to correctly identify the emotion conveyed in the pieces generated by the system. In our study we used pieces generated for Call of the Wild, a Dungeons and Dragons campaign available on YouTube. Our results show that human subjects could correctly identify the emotion of the generated music pieces as accurately as they were able to identify the emotion of pieces written by humans.

Jukebox: A Generative Model for Music Machine Learning

We introduce Jukebox, a model that generates music with singing in the raw audio domain. We tackle the long context of raw audio by using a multi-scale VQ-VAE to compress it to discrete codes and modeling those codes with autoregressive Transformers. We show that the combined model at scale can generate high-fidelity and diverse songs with coherence up to multiple minutes. We can condition on artist and genre to steer the musical and vocal style, and on unaligned lyrics to make the singing more controllable. We are releasing thousands of non cherry-picked samples at, along with model weights and code at
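The compression to discrete codes mentioned above rests on vector quantization: each continuous vector is replaced by the index of its nearest codebook entry. The sketch below illustrates only that lookup step, under the assumption of a squared Euclidean distance and a tiny hand-written codebook. It omits the encoder, decoder, multi-scale structure and training entirely, and is not Jukebox's code; `quantize`, `dequantize` and the toy values are hypothetical.

```python
from typing import List, Sequence

Vector = Sequence[float]


def quantize(vectors: Sequence[Vector], codebook: Sequence[Vector]) -> List[int]:
    """Map each vector to the index of its nearest codebook entry."""
    codes = []
    for vec in vectors:
        # Squared Euclidean distance from this vector to every code vector.
        dists = [sum((a - b) ** 2 for a, b in zip(vec, code)) for code in codebook]
        codes.append(dists.index(min(dists)))
    return codes


def dequantize(codes: Sequence[int], codebook: Sequence[Vector]) -> List[Vector]:
    """Replace each discrete code with its codebook vector."""
    return [codebook[i] for i in codes]


# Toy codebook and inputs, made up for illustration only.
codebook = [(0.0, 0.0), (1.0, 1.0), (-1.0, 1.0)]
latents = [(0.1, -0.1), (0.9, 1.2), (-0.8, 0.7)]
codes = quantize(latents, codebook)        # [0, 1, 2]
approx = dequantize(codes, codebook)       # nearest codebook vectors
```

The resulting integer sequence is the kind of discrete representation that a sequence model such as an autoregressive Transformer can then be trained on.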