Encoding Musical Style with Transformer Autoencoders

arXiv.org Machine Learning

We consider the problem of learning high-level controls over the global structure of sequence generation, particularly in the context of symbolic music generation with complex language models. In this work, we present the Transformer autoencoder, which aggregates encodings of the input data across time to obtain a global representation of style from a given performance. We show it is possible to combine this global embedding with other temporally distributed embeddings, enabling improved control over the separate aspects of performance style and melody. Empirically, we demonstrate the effectiveness of our method on a variety of music generation tasks on the MAESTRO dataset and a YouTube dataset with 10,000 hours of piano performances, where we achieve improvements in terms of log-likelihood and mean listening scores as compared to relevant baselines.

As the number of generative applications increases, it becomes increasingly important to consider how users can interact with such systems, particularly when the generative model functions as a tool in their creative process (Engel et al., 2017a; Gillick et al., 2019). To this end, we consider how one can learn high-level controls over the global structure of a generated sample. We focus on symbolic music generation, where Music Transformer (Huang et al., 2019b) is the current state of the art in generating high-quality samples that span over a minute in length. The challenge in controllable sequence generation is that Transformers (Vaswani et al., 2017) and their variants excel as language models or in sequence-to-sequence tasks such as translation, but it is less clear how they can (1) learn and (2) incorporate global conditioning information at inference time.
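The aggregation idea in this abstract can be illustrated with a small sketch. The abstract says only that encodings are aggregated across time and combined with temporally distributed embeddings; the choice of mean pooling, the concatenation step, and the names `mean_pool` and `combine` are assumptions made here for illustration, not the authors' implementation.

```python
# Sketch under stated assumptions: averaging over time (mean pooling) and
# concatenation are illustrative choices, not confirmed details of the paper.

def mean_pool(encodings):
    """Average a list of equal-length vectors (one per timestep) into one vector."""
    if not encodings:
        raise ValueError("need at least one timestep")
    dim = len(encodings[0])
    totals = [0.0] * dim
    for vec in encodings:
        for i, x in enumerate(vec):
            totals[i] += x
    return [t / len(encodings) for t in totals]


def combine(style, melody_steps):
    """Append the single global style vector to every melody timestep."""
    return [list(step) + list(style) for step in melody_steps]


# Three timesteps of a toy 2-dimensional performance encoding.
performance_encodings = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
style = mean_pool(performance_encodings)

# Two timesteps of a toy 1-dimensional melody embedding.
melody = [[0.5], [0.25]]
conditioned = combine(style, melody)
```

Because the style vector no longer has a time axis, it can be taken from one performance and attached to the melody of another, which is the kind of separate control over style and melody the abstract describes.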


Improving Polyphonic Music Models with Feature-Rich Encoding

arXiv.org Machine Learning

This paper explores sequential modeling of polyphonic music with deep neural networks. While recent breakthroughs have focussed on network architecture, we demonstrate that the representation of the sequence can make an equally significant contribution to the performance of the model as measured by validation set loss. By extracting salient features inherent to the dataset, the model can either be conditioned on these features or trained to predict said features as extra components of the sequences being modeled. We show that training a neural network to predict a seemingly more complex sequence, with extra features included in the series being modeled, can improve overall model performance significantly. We first introduce TonicNet, a GRU-based model trained to predict the chord at a given time-step before predicting the notes of each voice at that time-step, in contrast with the typical approach of predicting only the notes. We then evaluate TonicNet on the canonical JSB Chorales dataset and obtain state-of-the-art results.
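The chord-then-notes ordering described above can be sketched as a sequence-building step. The token layout, the function name `interleave`, and the toy chord labels are hypothetical; the abstract does not specify the actual encoding used by TonicNet.

```python
# Sketch under stated assumptions: token tuples and chord labels are
# illustrative, not the encoding used in the paper.

def interleave(chords, notes_per_step):
    """Build one sequence where each time-step's chord precedes its voice notes."""
    if len(chords) != len(notes_per_step):
        raise ValueError("need one chord per time-step")
    sequence = []
    for chord, notes in zip(chords, notes_per_step):
        sequence.append(("chord", chord))
        for voice, pitch in enumerate(notes):
            sequence.append(("note", voice, pitch))
    return sequence


# Two time-steps of a four-voice chorale, pitches as MIDI note numbers.
chords = ["C", "G"]
notes_per_step = [(72, 67, 64, 48), (71, 67, 62, 43)]
tokens = interleave(chords, notes_per_step)
```

A model trained on `tokens` has to predict the chord token before any note of the same time-step, so the sequence is longer than a notes-only encoding but carries the extra feature explicitly.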


LakhNES: Improving multi-instrumental music generation with cross-domain pre-training

arXiv.org Machine Learning

We are interested in the task of generating multi-instrumental music scores. The Transformer architecture has recently shown great promise for the task of piano score generation; here we adapt it to the multi-instrumental setting. Transformers are complex, high-dimensional language models which are capable of capturing long-term structure in sequence data, but require large amounts of data to fit. Their success on piano score generation is partially explained by the large volumes of symbolic data readily available for that domain. We leverage the recently-introduced NES-MDB dataset of four-instrument scores from an early video game sound synthesis chip (the NES), which we find to be well-suited to training with the Transformer architecture. To further improve the performance of our model, we propose a pre-training technique to leverage the information in a large collection of heterogeneous music, namely the Lakh MIDI dataset. Despite differences between the two corpora, we find that this transfer learning procedure improves both quantitative and qualitative performance for our primary task.


Pop Music Transformer: Generating Music with Rhythm and Harmony

arXiv.org Machine Learning

The task of automatic music composition entails generative modeling of music in symbolic formats such as musical scores. By serializing a score as a sequence of MIDI-like events, recent work has demonstrated that state-of-the-art sequence models with self-attention work well for this task, especially for composing music with long-range coherence. In this paper, we show that sequence models can do even better when we improve the way a musical score is converted into events. The new event set, dubbed "REMI" (REvamped MIDI-derived events), provides sequence models with a metric context for modeling the rhythmic patterns of music, while allowing for local tempo changes. Moreover, it explicitly sets up a harmonic structure and makes chord progression controllable. It also facilitates coordinating different tracks of a musical piece, such as the piano, bass and drums. With this new approach, we build a Pop Music Transformer that composes Pop piano music with a more plausible rhythmic structure than prior art does. The code, data and pre-trained model are publicly available at https://github.com/YatingMusic/remi.
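To make the idea of a metric context concrete, the sketch below converts notes into events that mark bar boundaries and positions within a bar. The 16-step grid, the event names, and the function `to_bar_position_events` are assumptions for illustration; this is not the REMI vocabulary or the code from the linked repository, and it omits tempo and chord events.

```python
# Sketch under stated assumptions: a 16-step bar grid and simplified event
# names chosen for illustration only.

STEPS_PER_BAR = 16  # assumed quantization: sixteen positions per bar


def to_bar_position_events(notes):
    """Convert (onset_step, pitch, duration_steps) notes into bar-aware events."""
    events = []
    current_bar = -1
    for onset, pitch, duration in sorted(notes):
        bar, position = divmod(onset, STEPS_PER_BAR)
        # Emit one Bar event per bar reached, including empty bars in between.
        while current_bar < bar:
            current_bar += 1
            events.append("Bar")
        events.append(f"Position_{position + 1}/{STEPS_PER_BAR}")
        events.append(f"Pitch_{pitch}")
        events.append(f"Duration_{duration}")
    return events


notes = [(0, 60, 4), (4, 64, 4), (16, 67, 8)]
events = to_bar_position_events(notes)
```

Unlike a stream of note-on, note-off, and time-shift events, every note here is anchored to an explicit position inside an explicit bar, so the model does not have to infer the beat grid by summing time shifts.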


Generating Long Sequences with Sparse Transformers

arXiv.org Machine Learning

Transformers are powerful sequence models, but require time and memory that grow quadratically with the sequence length. In this paper we introduce sparse factorizations of the attention matrix which reduce this to $O(n \sqrt{n})$. We also introduce a) a variation on architecture and initialization to train deeper networks, b) the recomputation of attention matrices to save memory, and c) fast attention kernels for training. We call networks with these changes Sparse Transformers, and show they can model sequences tens of thousands of timesteps long using hundreds of layers. We use the same architecture to model images, audio, and text from raw bytes, setting a new state of the art for density modeling of Enwik8, CIFAR-10, and ImageNet-64. We generate unconditional samples that demonstrate global coherence and great diversity, and show it is possible in principle to use self-attention to model sequences of length one million or more.
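The $O(n \sqrt{n})$ cost can be illustrated with a toy connectivity pattern. The sketch below assumes a strided pattern in which each query attends to a local window of recent positions plus every position a fixed stride apart, with the stride set near $\sqrt{n}$; merging both sets into one pattern and the name `strided_pattern` are simplifications made here, not the paper's kernels.

```python
# Sketch under stated assumptions: a causal "local window + strided" pattern
# with stride ~ sqrt(n), merged into one set of keys per query for simplicity.
import math


def strided_pattern(n, stride):
    """For each query position i, return the set of key positions j <= i it attends to."""
    pattern = []
    for i in range(n):
        local = set(range(max(0, i - stride + 1), i + 1))
        strided = {j for j in range(i + 1) if (i - j) % stride == 0}
        pattern.append(local | strided)
    return pattern


n = 64
stride = math.isqrt(n)
pattern = strided_pattern(n, stride)

sparse_pairs = sum(len(keys) for keys in pattern)  # attended (query, key) pairs
dense_pairs = n * (n + 1) // 2                     # full causal attention
```

Each query attends to fewer than `2 * stride` keys, so the total number of pairs grows roughly as $n \cdot 2\sqrt{n}$ rather than $n^2 / 2$, which is where the $O(n \sqrt{n})$ scaling comes from under these assumptions.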