Exploring Musical Roots: Applying Audio Embeddings to Empower Influence Attribution for a Generative Music Model

Barnett, Julia, Garcia, Hugo Flores, Pardo, Bryan

arXiv.org Artificial Intelligence 

With today's generative models, the generation process is opaque: it is never clear to the end user what data influenced and shaped their newly crafted essay from ChatGPT [39], digitized surrealist art from DALL-E 2 [42], or soulful jazz in the style of Rihanna from MusicLM [1]. Moreover, due to the vast amounts of data on which these models were trained, it is usually not even clear when they are "creating" near replicas of existing items from their training data. For users of generative models to be informed and responsible creators, there needs to be a mechanism that provides information about works in the model's training data that were highly influential upon the generated output, or directly copied by the model. This would allow the user both to cite existing work and to learn about the influences on their generated output.

We assume that a model-generated product that is a copy or near-copy of a work in the model's training set indicates the model was influenced by that work. To develop methods that automatically detect the influences upon model-generated products, it is therefore essential to develop good measures of similarity between works. In text, it is straightforward to detect when language models copy strings of text verbatim, given access to the training data, and there is a growing body of work quantifying the degree to which large language models memorize training data [10, 12, 23]. In the image space, the problem is more complex due to the high-resolution multi-pixel outputs of models, but work is being done to detect "approximate memorization" by finding highly similar images from the training data
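To make the text case concrete, here is a minimal sketch of verbatim-copy detection via word-level n-gram overlap with the training corpus. This is an illustrative helper, not the method of any cited work; the function name, the dictionary-of-documents corpus format, and the n-gram length threshold `n=8` are all assumptions for the example.

```python
def verbatim_overlaps(generated, corpus_docs, n=8):
    """Find training documents that share verbatim word n-grams with generated text.

    generated: the model-generated string to audit.
    corpus_docs: dict mapping document id -> document text (hypothetical format).
    n: n-gram length; longer n demands longer verbatim copies (8 is arbitrary).
    Returns a list of (doc_id, shared_ngram_count), most overlapping first.
    """
    gen_tokens = generated.split()
    gen_ngrams = {tuple(gen_tokens[i:i + n])
                  for i in range(len(gen_tokens) - n + 1)}
    hits = []
    for doc_id, doc in corpus_docs.items():
        toks = doc.split()
        doc_ngrams = {tuple(toks[i:i + n])
                      for i in range(len(toks) - n + 1)}
        shared = gen_ngrams & doc_ngrams
        if shared:
            hits.append((doc_id, len(shared)))
    # Rank candidate sources by how much verbatim material they share.
    return sorted(hits, key=lambda x: -x[1])
```

A real system would use suffix arrays or hashed n-gram indexes to scale to large corpora, but the exact-match core is the same; the difficulty the passage notes for images (and, by extension, audio) is that no such exact-match test exists for high-dimensional perceptual outputs, which is why similarity must instead be measured in an embedding space.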