Collaborating Authors

Hawthorne, Curtis

Symbolic Music Generation with Diffusion Models Machine Learning

Score-based generative models and diffusion probabilistic models have been successful at generating high-quality samples in continuous domains such as images and audio. However, due to their Langevin-inspired sampling mechanisms, their application to discrete and sequential data has been limited. In this work, we present a technique for training diffusion models on sequential data by parameterizing the discrete domain in the continuous latent space of a pre-trained variational autoencoder. Our method is non-autoregressive and learns to generate sequences of latent embeddings through the reverse process and offers parallel generation with a constant number of iterative refinement steps. We apply this technique to modeling symbolic music and show strong unconditional generation and post-hoc conditional infilling results compared to autoregressive language models operating over the same continuous embeddings.

Encoding Musical Style with Transformer Autoencoders Machine Learning

A BSTRACT We consider the problem of learning high-level controls over the global structure of sequence generation, particularly in the context of symbolic music generation with complex language models. In this work, we present the Transformer au-toencoder, which aggregates encodings of the input data across time to obtain a global representation of style from a given performance. We show it is possible to combine this global embedding with other temporally distributed embeddings, enabling improved control over the separate aspects of performance style and and melody. Empirically, we demonstrate the effectiveness of our method on a variety of music generation tasks on the MAESTRO dataset and a Y ouTube dataset with 10,000 hours of piano performances, where we achieve improvements in terms of log-likelihood and mean listening scores as compared to relevant baselines. As the number of generative applications increase, it becomes increasingly important to consider how users can interact with such systems, particularly when the generative model functions as a tool in their creative process (Engel et al., 2017a; Gillick et al., 2019) To this end, we consider how one can learn high-level controls over the global structure of a generated sample. We focus on symbolic music generation, where Music Transformer (Huang et al., 2019b) is the current state-of-the-art in generating high-quality samples that span over a minute in length. The challenge in controllable sequence generation is that Transformers (V aswani et al., 2017) and their variants excel as language models or in sequence-to-sequence tasks such as translation, but it is less clear as to how they can: (1) learn and (2) incorporate global conditioning information at inference time.

The Bach Doodle: Approachable music composition with machine learning at scale Machine Learning

To make music composition more approachable, we designed the first AI-powered Google Doodle, the Bach Doodle, where users can create their own melody and have it harmonized by a machine learning model Coconet (Huang et al., 2017) in the style of Bach. For users to input melodies, we designed a simplified sheet-music based interface. To support an interactive experience at scale, we re-implemented Coconet in TensorFlow.js (Smilkov et al., 2019) to run in the browser and reduced its runtime from 40s to 2s by adopting dilated depth-wise separable convolutions and fusing operations. We also reduced the model download size to approximately 400KB through post-training weight quantization. We calibrated a speed test based on partial model evaluation time to determine if the harmonization request should be performed locally or sent to remote TPU servers. In three days, people spent 350 years worth of time playing with the Bach Doodle, and Coconet received more than 55 million queries. Users could choose to rate their compositions and contribute them to a public dataset, which we are releasing with this paper. We hope that the community finds this dataset useful for applications ranging from ethnomusicological studies, to music education, to improving machine learning models.

Enabling Factorized Piano Music Modeling and Generation with the MAESTRO Dataset Machine Learning

Generating musical audio directly with neural networks is notoriously difficult because it requires coherently modeling structure at many different timescales. Fortunately, most music is also highly structured and can be represented as discrete note events played on musical instruments. Herein, we show that by using notes as an intermediate representation, we can train a suite of models capable of transcribing, composing, and synthesizing audio waveforms with coherent musical structure on timescales spanning six orders of magnitude ( 0.1 ms to 100 s), a process we call Wave2Midi2Wave. This large advance in the state of the art is enabled by our release of the new MAESTRO (MIDI and Audio Edited for Synchronous TRacks and Organization) dataset, composed of over 172 hours of virtuosic piano performances captured with fine alignment ( 3 ms) between note labels and audio waveforms. The networks and the dataset together present a promising approach toward creating new expressive and interpretable neural models of music. Since the beginning of the recent wave of deep learning research, there have been many attempts to create generative models of expressive musical audio de novo. These models would ideally generate audio that is both musically and sonically realistic to the point of being indistinguishable to a listener from music composed and performed by humans. However, modeling music has proven extremely difficult due to dependencies across the wide range of timescales that give rise to the characteristics of pitch and timbre (short-term) as well as those of rhythm (medium-term) and song structure (long-term). On the other hand, much of music has a large hierarchy of discrete structure embedded in its generative process: a composer creates songs, sections, and notes, and a performer realizes those notes with discrete events on their instrument, creating sound.

An Improved Relative Self-Attention Mechanism for Transformer with Application to Music Generation Machine Learning

Music relies heavily on self-reference to build structure and meaning. We explore the Transformer architecture (Vaswani et al., 2017) as a generative model for music, as self-attention has shown compelling results on tasks that require long-term structure such as Wikipedia summary generation (Liu et al, 2018). However, timing information is critical for polyphonic music, and Transformer does not explicitly model absolute or relative timing in its structure. To address this challenge, Shaw et al. (2018) introduced relative position representations to self-attention to improve machine translation. However, the formulation was not scalable to longer sequences. We propose an improved formulation which reduces the memory requirements of the relative position computation from $O(l^2d)$ to $O(ld)$, making it possible to train much longer sequences and achieve faster convergence. In experiments on symbolic music we find that relative self-attention substantially improves sample quality for unconditioned generation and is able to generate sequences of lengths longer than those from the training set. When primed with an initial sequence, the model generates continuations that develop the prime coherently and exhibit long-term structure. Relative self-attention can be instrumental in capturing richer relationships within a musical piece.

Learning a Latent Space of Multitrack Measures Machine Learning

Discovering and exploring the underlying structure of multi-instrumental music using learning-based approaches remains an open problem. We extend the recent MusicVAE model to represent multitrack polyphonic measures as vectors in a latent space. Our approach enables several useful operations such as generating plausible measures from scratch, interpolating between measures in a musically meaningful way, and manipulating specific musical attributes. We also introduce chord conditioning, which allows all of these operations to be performed while keeping harmony fixed, and allows chords to be changed while maintaining musical "style". By generating a sequence of measures over a predefined chord progression, our model can produce music with convincing long-term structure. We demonstrate that our latent space model makes it possible to intuitively control and generate musical sequences with rich instrumentation (see for generated audio).

A Hierarchical Latent Vector Model for Learning Long-Term Structure in Music Machine Learning

The Variational Autoencoder (VAE) has proven to be an effective model for producing semantically meaningful latent representations for natural data. However, it has thus far seen limited application to sequential data, and, as we demonstrate, existing recurrent VAE models have difficulty modeling sequences with long-term structure. To address this issue, we propose the use of a hierarchical decoder, which first outputs embeddings for subsequences of the input and then uses these embeddings to generate each subsequence independently. This structure encourages the model to utilize its latent code, thereby avoiding the "posterior collapse" problem which remains an issue for recurrent VAEs. We apply this architecture to modeling sequences of musical notes and find that it exhibits dramatically better sampling, interpolation, and reconstruction performance than a "flat" baseline model. An implementation of our "MusicVAE" is available online at

Onsets and Frames: Dual-Objective Piano Transcription Machine Learning

We consider the problem of transcribing polyphonic piano music with an emphasis on generalizing to unseen instruments. We use deep neural networks and propose a novel approach that predicts onsets and frames using both CNNs and LSTMs. This model predicts pitch onset events and then uses those predictions to condition framewise pitch predictions. During inference, we restrict the predictions from the framewise detector by not allowing a new note to start unless the onset detector also agrees that an onset for that pitch is present in the frame. We focus on improving onsets and offsets together instead of either in isolation as we believe it correlates better with human musical perception. This technique results in over a 100% relative improvement in note with offset score on the MAPS dataset.