We've created MuseNet, a deep neural network that can generate 4-minute musical compositions with 10 different instruments, and can combine styles from country to Mozart to the Beatles. MuseNet was not explicitly programmed with our understanding of music, but instead discovered patterns of harmony, rhythm, and style by learning to predict the next token in hundreds of thousands of MIDI files. MuseNet uses the same general-purpose unsupervised technology as GPT-2, a large-scale transformer model trained to predict the next token in a sequence, whether audio or text. Since MuseNet knows many different styles, we can blend generations in novel ways. Here the model is given the first 6 notes of a Chopin Nocturne, but is asked to generate a piece in a pop style with piano, drums, bass, and guitar.
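MuseNet itself is a large transformer, but the core idea of learning music by predicting the next token can be illustrated with a drastically simplified, hypothetical stand-in: a bigram frequency model over note tokens. The note names and training sequence below are invented for illustration; the real model uses learned attention over long MIDI token sequences, not counts.

```python
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Count how often each token follows each context token."""
    table = defaultdict(Counter)
    for cur, nxt in zip(tokens, tokens[1:]):
        table[cur][nxt] += 1
    return table

def predict_next(table, token):
    """Return the most frequently observed continuation of `token`."""
    if token not in table:
        return None
    return table[token].most_common(1)[0][0]

# Toy "MIDI-like" token stream (invented example)
table = train_bigram(["C4", "E4", "G4", "C4", "E4", "G4", "C4", "F4"])
print(predict_next(table, "C4"))  # "E4" follows "C4" most often
```

The transformer replaces these raw counts with a learned, context-sensitive probability distribution, which is what lets it sustain harmony and style over a 4-minute piece rather than just the previous note.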
In this paper, we learn disentangled representations of timbre and pitch for musical instrument sounds. We adapt a framework based on variational autoencoders with Gaussian mixture latent distributions. Specifically, we use two separate encoders to learn distinct latent spaces for timbre and pitch, whose Gaussian mixture components represent instrument identity and pitch, respectively. For reconstruction, latent variables of timbre and pitch are sampled from the corresponding mixture components and concatenated as the input to a decoder. We demonstrate the model's efficacy through latent space visualization, and a quantitative analysis indicates that these spaces are discriminable, even with a limited number of instrument labels for training. The model allows for controllable synthesis of selected instrument sounds by sampling from the latent spaces. To evaluate this, we trained instrument and pitch classifiers on the original labeled data. These classifiers achieve high accuracy when tested on our synthesized sounds, which verifies the model's ability to synthesize sounds with controllable, realistic timbre and pitch. Our model also enables timbre transfer between multiple instruments with a single autoencoder architecture, which we evaluate by measuring the shift in the posterior of instrument classification. Our in-depth evaluation confirms the model's ability to successfully disentangle timbre and pitch.
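The controllable-synthesis step described above — sample a timbre latent from the chosen instrument's mixture component, a pitch latent from the chosen pitch's component, and concatenate them for the decoder — can be sketched as follows. The mixture means, dimensions, and instrument/pitch labels here are invented placeholders; in the actual model they are learned by the two encoders.

```python
import random

# Hypothetical learned mixture means: one component per instrument
# (timbre space) and one per MIDI pitch (pitch space). Dimensions
# are illustrative, not the model's real latent sizes.
TIMBRE_MEANS = {"violin": [0.5, -1.0, 0.3], "flute": [-0.8, 0.2, 1.1]}
PITCH_MEANS = {60: [1.0, 0.0], 67: [-0.4, 0.9]}

def sample_latent(mean, std=0.1):
    """Draw one latent vector from an isotropic Gaussian component."""
    return [random.gauss(m, std) for m in mean]

def decoder_input(instrument, midi_pitch):
    """Concatenate a timbre latent and a pitch latent, as fed to the decoder."""
    z_timbre = sample_latent(TIMBRE_MEANS[instrument])
    z_pitch = sample_latent(PITCH_MEANS[midi_pitch])
    return z_timbre + z_pitch

z = decoder_input("violin", 60)  # 3 timbre dims + 2 pitch dims
```

Because each factor has its own mixture component, swapping only the timbre component (e.g. "violin" → "flute") while keeping the pitch component fixed is exactly the timbre-transfer operation the paper evaluates.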
Automatic speaker verification, like every other biometric system, is vulnerable to spoofing attacks. Using only a few minutes of recorded voice from a genuine client of a speaker verification system, attackers can develop a variety of spoofing attacks that might trick such systems. Detecting these attacks using the audio cues present in the recordings is an important challenge. Most existing spoofing detection systems depend on knowing which spoofing technique was used. With this research, we aim to overcome this limitation by examining robust audio features, both traditional and those learned through an autoencoder, that generalize over different types of replay spoofing. Furthermore, we provide a detailed account of all the steps necessary in setting up state-of-the-art audio feature detection, pre-, and postprocessing, such that the (non-audio-expert) machine learning researcher can implement such systems. Finally, we evaluate the performance of our robust replay spoofing detection system with a wide variety and different combinations of both extracted and machine-learned audio features on the 'out in the wild' ASVspoof 2017 dataset, which contains a variety of new spoofing configurations. Since our focus is on examining which features ensure robustness, we base our system on a traditional Gaussian Mixture Model-Universal Background Model. We then systematically investigate the relative contribution of each feature set. The fused models, based on the known audio features and the machine-learned features respectively, achieve comparable performance, each with an Equal Error Rate (EER) of 12%. The final best-performing model, which obtains an EER of 10.8%, is a hybrid model that contains both known and machine-learned features, revealing the importance of incorporating both types of features when developing a robust spoofing prediction model.
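The Equal Error Rate quoted above is the operating point where the false acceptance rate (spoofed audio accepted as genuine) equals the false rejection rate (genuine audio rejected). A minimal sketch of computing EER from detector scores — using invented example scores, not the paper's data:

```python
def compute_eer(genuine, spoof):
    """Find the threshold where false-acceptance and false-rejection
    rates cross, and return their average at that point."""
    best = None
    for t in sorted(set(genuine + spoof)):
        far = sum(s >= t for s in spoof) / len(spoof)    # spoof accepted
        frr = sum(s < t for s in genuine) / len(genuine)  # genuine rejected
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2)
    return best[1]

# Perfectly separated scores give an EER of 0
print(compute_eer([0.9, 0.8, 0.7], [0.3, 0.2, 0.1]))  # 0.0
```

In a GMM-UBM system, each score would typically be the log-likelihood ratio between a genuine-speech model and the universal background model; a lower EER means features separate genuine from replayed audio more robustly.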
Artificially intelligent systems are slowly taking over tasks previously done by humans, and many processes involving repetitive, simple movements have already been fully automated. In the meantime, humans continue to be superior when it comes to abstract and creative tasks. However, it seems like even when it comes to creativity, we're now being challenged by our own creations. In the last few years, we've seen the emergence of hundreds of "AI artists." These complex algorithms are creating unique (and sometimes eerie) works of art.
New technology devices and apps pop up as abundantly as summer weeds here in Silicon Valley. Chip-enhanced products offer to satisfy almost every need imaginable. Prompts from your smart refrigerator tell you to buy more milk. With a voice command, music plays to facilitate meditation, thanks to your smart -- always on -- helper who listens for your next query from a canister on your kitchen counter; you know, the one with a woman's voice and name. In this glut of offerings, how do you select what is truly useful from what is simply the latest "smart" thing?
Today is the slowest business will ever run. In today's world, it is the small companies that beat the large enterprises with speed and agility, according to Valerie Blatt, Global Vice President at SAP. Companies such as Spotify are forcing incumbents such as Apple to change their music business. Therefore, for the organisations of tomorrow to remain relevant, they need to become faster, better and more intelligent. Last week, I was invited to join SAP Ariba Live in Barcelona.
"I have songwriting credits…even though I don't know how to write a song." 1 The speaker of this statement is not a musician and has no musical training. He helped create an app called Endel, which is self-described as "a cross-platform audio ecosystem." 2 Endel is part of a larger part of the current hot debate over works of art being "created" by computers using programs employing "artificially intelligent" modes of computer learning, or AI for short. "Dmitry Evgrafov, Endel's composer and head of sound design, says all 600 tracks were made'with a click of a button.' There was minimal human involvement outside of chopping up the audio and mastering it for streaming. Endel even hired a third-party company to write the track titles." 3 What makes this notable is that Endel has a record deal with Warner Bros. Music. 4 "Five Endel albums have already been released, and 15 more are coming this year -- all of which will be generated by code. In the future, Endel will be able to make infinite ambient tracks." 5 But what makes this problematic, is that there is serious doubt as to whether the output of Endel is capable of copyright protection at all.
Regardless of which piece of visual media or literature first instilled thoughts of artificial intelligence in our minds, it had quite the effect on us. Despite AI being very much a part of our current society, many of us don't realize it's here. Instead, we're fixated on dystopian futures and malevolent machines. In reality, AI is disrupting dozens of sectors. From healthcare and transportation to fintech and telecommunications -- more than 154,000 AI patents have been filed since 2010 alone.
Analogy is a key approach to automated music generation, distinguished by its ability to generate both natural and creative pieces based on only a few examples. In general, an analogy is made by partially transferring music abstractions, i.e., high-level representations and their relationships, from one piece to another; however, this procedure requires disentangling music representations, which takes little effort for musicians but is non-trivial for computers. Three sub-problems arise: extracting latent representations from the observation, disentangling the representations so that each part has a unique semantic interpretation, and mapping the latent representations back to actual music. An explicitly-constrained conditional variational auto-encoder (EC2-VAE) is proposed as a unified solution to all three sub-problems. In this study, we focus on disentangling the pitch and rhythm representations of 8-beat music clips conditioned on chords. In producing music analogies, this model helps us realize the imaginary situation of "what if" a piece were composed with a different pitch contour, rhythm pattern, or chord progression, by borrowing the representations from other pieces. Finally, we validate the proposed disentanglement method using objective measurements and evaluate the analogy examples through a subjective study.
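The analogy operation described above — keep one piece's pitch representation, borrow another piece's rhythm representation, and decode the combination — can be made concrete with toy stand-ins for the trained EC2-VAE networks. The "encoders" and "decoder" below are hypothetical pass-through functions over (pitch, duration) note lists, purely to show the swap mechanic; the real model works on learned latent vectors from 8-beat clips conditioned on chords.

```python
def encode_pitch(piece):
    """Toy 'pitch encoder': the pitch contour of a (pitch, duration) list."""
    return [pitch for pitch, _ in piece]

def encode_rhythm(piece):
    """Toy 'rhythm encoder': the duration pattern."""
    return [dur for _, dur in piece]

def decode(z_pitch, z_rhythm):
    """Toy 'decoder': recombine a contour and a duration pattern into notes."""
    return list(zip(z_pitch, z_rhythm))

def make_analogy(piece_a, piece_b):
    """'What if' piece A were composed with piece B's rhythm:
    keep A's pitch representation, borrow B's rhythm representation."""
    return decode(encode_pitch(piece_a), encode_rhythm(piece_b))

piece_a = [(60, 1), (64, 1), (67, 2)]        # C-E-G, even rhythm
piece_b = [(70, 0.5), (72, 0.5), (74, 3)]    # different contour and rhythm
print(make_analogy(piece_a, piece_b))        # A's pitches with B's rhythm
```

In the actual EC2-VAE the two latents are concatenated and decoded by a learned network, and the explicit constraint is what guarantees each half carries only pitch or only rhythm information, so the swap changes exactly one musical factor.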
Automated facial recognition poses one of the greatest threats to individual freedom and should be banned from use in public spaces, according to the director of the campaign group Liberty. Martha Spurrier, a human rights lawyer, said the technology had such fundamental problems that, despite police enthusiasm for the equipment, its use on the streets should not be permitted. She said: "I don't think it should ever be used. It is one of, if not the, greatest threats to individual freedom, partly because of the intimacy of the information it takes and hands to the state without your consent, and without even your knowledge, and partly because you don't know what is done with that information." Police in England and Wales have used automated facial recognition (AFR) to scan crowds for suspected criminals in trials in city centres, at music festivals, sports events and elsewhere.