Colton, Simon
ImprovNet: Generating Controllable Musical Improvisations with Iterative Corruption Refinement
Bhandari, Keshav, Chang, Sungkyun, Lu, Tongyu, Enus, Fareza R., Bradshaw, Louis B., Herremans, Dorien, Colton, Simon
Deep learning has enabled remarkable advances in style transfer across various domains, offering new possibilities for creative content generation. However, in the realm of symbolic music, generating controllable and expressive performance-level style transfers for complete musical works remains challenging due to limited datasets, especially for genres such as jazz, and the lack of unified models that can handle multiple music generation tasks. This paper presents ImprovNet, a transformer-based architecture that generates expressive and controllable musical improvisations through a self-supervised corruption-refinement training strategy. ImprovNet unifies multiple capabilities within a single model: it can perform cross-genre and intra-genre improvisations, harmonize melodies with genre-specific styles, and execute short prompt continuation and infilling tasks. The model's iterative generation framework allows users to control the degree of style transfer and structural similarity to the original composition. Objective and subjective evaluations demonstrate ImprovNet's effectiveness in generating musically coherent improvisations while maintaining structural relationships with the original pieces. The model outperforms Anticipatory Music Transformer in short continuation and infilling tasks and successfully achieves recognizable genre conversion, with 79% of participants correctly identifying jazz-style improvisations. Our code and demo page can be found at https://github.com/keshavbhandari/improvnet.
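As a rough illustration of the iterative corruption-refinement idea described above, the following sketch repeatedly corrupts part of a symbolic piece and hands it to a refinement step, with the iteration count and corruption rate standing in for the user's control over the degree of style transfer. All names here are hypothetical and the refinement step is a no-op placeholder; this is not the ImprovNet code.

import random

def corrupt(notes, rate, rng):
    """Randomly perturb a fraction of (pitch, onset, duration) notes -- a stand-in for corruption functions."""
    out = []
    for pitch, onset, dur in notes:
        if rng.random() < rate:
            pitch = max(0, min(127, pitch + rng.choice([-2, -1, 1, 2])))  # small pitch shift
        out.append((pitch, onset, dur))
    return out

def refine(notes):
    """Placeholder for the trained refinement transformer; here it returns the notes unchanged."""
    return notes

def improvise(notes, iterations=5, rate=0.3, seed=0):
    """More iterations and a higher corruption rate give a stronger departure from the original piece."""
    rng = random.Random(seed)
    for _ in range(iterations):
        notes = refine(corrupt(notes, rate, rng))
    return notes

if __name__ == "__main__":
    original = [(60, 0.0, 0.5), (64, 0.5, 0.5), (67, 1.0, 1.0)]  # C major arpeggio
    print(improvise(original, iterations=3, rate=0.5))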
Yin-Yang: Developing Motifs With Long-Term Structure And Controllability
Bhandari, Keshav, Wiggins, Geraint A., Colton, Simon
Transformer models have made great strides in generating symbolically represented music with local coherence. However, controlling the development of motifs in a structured way with global form remains an open research area. One reason for this challenge is the note-by-note autoregressive generation of such models, which lack the ability to correct themselves after deviations from the motif. In addition, their structural performance on datasets with shorter durations has not been studied in the literature. In this study, we propose Yin-Yang, a framework consisting of a phrase generator, phrase refiner, and phrase selector models for the development of motifs into melodies with long-term structure and controllability. The phrase refiner is trained with a novel corruption-refinement strategy which allows it to produce melodic and rhythmic variations of an original motif at generation time, thereby rectifying deviations of the phrase generator. We also introduce a new objective evaluation metric for quantifying how smoothly the motif manifests itself within the piece. Evaluation results show that our model achieves better performance than state-of-the-art transformer models, while having the advantage of being controllable and making the generated musical structure semi-interpretable, paving the way for musical analysis. Our code and demo page can be found at https://github.com/keshavbhandari/yinyang.
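The generator-refiner-selector loop can be pictured with the toy sketch below: a generator proposes candidate phrases, a refiner pulls them back towards the motif, and a selector keeps the candidate closest to the motif contour. Each component is a trivial stand-in for a trained model, so this only illustrates the control flow, not the Yin-Yang implementation.

import random

def phrase_generator(previous, rng):
    """Propose a continuation phrase (stand-in: random walk from the last pitch)."""
    pitch = previous[-1] if previous else 60
    phrase = []
    for _ in range(4):
        pitch = max(48, min(84, pitch + rng.choice([-2, -1, 0, 1, 2])))
        phrase.append(pitch)
    return phrase

def phrase_refiner(phrase, motif):
    """Nudge the phrase back towards the motif (stand-in for the corruption-refinement model)."""
    return [p if abs(p - m) <= 4 else m for p, m in zip(phrase, motif)]

def phrase_selector(candidates, motif):
    """Keep the candidate whose contour is closest to the motif."""
    return min(candidates, key=lambda c: sum(abs(p - m) for p, m in zip(c, motif)))

def develop_motif(motif, n_phrases=4, n_candidates=8, seed=0):
    rng = random.Random(seed)
    melody = list(motif)
    for _ in range(n_phrases):
        candidates = [phrase_refiner(phrase_generator(melody, rng), motif)
                      for _ in range(n_candidates)]
        melody += phrase_selector(candidates, motif)
    return melody

if __name__ == "__main__":
    print(develop_motif([60, 62, 64, 62]))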
Text2midi: Generating Symbolic Music from Captions
Bhandari, Keshav, Roy, Abhinaba, Wang, Kyra, Puri, Geeta, Colton, Simon, Herremans, Dorien
This paper introduces text2midi, an end-to-end model to generate MIDI files from textual descriptions. Leveraging the growing popularity of multimodal generative approaches, text2midi capitalizes on the extensive availability of textual data and the success of large language models (LLMs). Our end-to-end system harnesses the power of LLMs to generate symbolic music in the form of MIDI files. Specifically, we utilize a pretrained LLM encoder to process captions, which then condition an autoregressive transformer decoder to produce MIDI sequences that accurately reflect the provided descriptions. This intuitive and user-friendly method significantly streamlines the music creation process by allowing users to generate music pieces using text prompts. We conduct comprehensive empirical evaluations, incorporating both automated and human studies, which show that our model generates high-quality MIDI files that are indeed controllable by text captions, including music theory terms such as chords, keys, and tempo. We release the code and music samples on our demo page (https://github.com/AMAAI-Lab/Text2midi) for users to interact with text2midi.
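A minimal PyTorch sketch of the kind of architecture described above: caption embeddings from a pretrained text encoder serve as cross-attention memory for an autoregressive decoder over MIDI event tokens. The dimensions, vocabulary size, and module choices are illustrative assumptions, not the released text2midi model.

import torch
import torch.nn as nn

class CaptionToMidi(nn.Module):
    def __init__(self, text_dim=512, midi_vocab=1024, d_model=512, n_layers=4, n_heads=8):
        super().__init__()
        self.proj = nn.Linear(text_dim, d_model)          # map caption embeddings to decoder width
        self.token_emb = nn.Embedding(midi_vocab, d_model)
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)
        self.head = nn.Linear(d_model, midi_vocab)

    def forward(self, caption_emb, midi_tokens):
        # caption_emb: (B, T_text, text_dim) output of a pretrained LLM encoder (assumed given)
        # midi_tokens: (B, T_midi) previously generated MIDI event tokens
        memory = self.proj(caption_emb)
        tgt = self.token_emb(midi_tokens)
        mask = nn.Transformer.generate_square_subsequent_mask(midi_tokens.size(1))
        out = self.decoder(tgt, memory, tgt_mask=mask)    # causal self-attention + cross-attention to text
        return self.head(out)                             # next-token logits over the MIDI vocabulary

if __name__ == "__main__":
    model = CaptionToMidi()
    caption = torch.randn(1, 16, 512)                     # stand-in for encoder output
    tokens = torch.randint(0, 1024, (1, 32))
    print(model(caption, tokens).shape)                   # torch.Size([1, 32, 1024])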
Motifs, Phrases, and Beyond: The Modelling of Structure in Symbolic Music Generation
Bhandari, Keshav, Colton, Simon
Modelling musical structure is vital yet challenging for artificial intelligence systems that generate symbolic music compositions. This literature review dissects the evolution of techniques for incorporating coherent structure, from symbolic approaches to foundational and transformative deep learning methods that harness the power of computation and data across a wide variety of training paradigms. In the later stages, we review an emerging technique which we refer to as "sub-task decomposition" that involves decomposing music generation into separate high-level structural planning and content creation stages. Such systems incorporate some form of musical knowledge or neuro-symbolic methods by extracting melodic skeletons or structural templates to guide the generation. Progress is evident in capturing motifs and repetitions across all three eras reviewed, yet modelling the nuanced development of themes across extended compositions in the style of human composers remains difficult. We outline several key future directions to realize the synergistic benefits of combining approaches from all eras examined.
Exploring XAI for the Arts: Explaining Latent Space in Generative Music
Bryan-Kinns, Nick, Banar, Berker, Ford, Corey, Reed, Courtney N., Zhang, Yixiao, Colton, Simon, Armitage, Jack
Explainable AI has the potential to support more interactive and fluid co-creative AI systems which can creatively collaborate with people. To do this, creative AI models need to be amenable to debugging by offering eXplainable AI (XAI) features which are inspectable, understandable, and modifiable. However, there is currently very little XAI for the arts. In this work, we demonstrate how a latent variable model for music generation can be made more explainable; specifically, we extend MeasureVAE, which generates measures of music. We increase the explainability of the model by: i) using latent space regularisation to force specific dimensions of the latent space to map to meaningful musical attributes; ii) providing a user interface feedback loop that allows people to adjust dimensions of the latent space and observe the results of these changes in real time; iii) providing a visualisation of the musical attributes in the latent space to help people understand and predict the effect of changes to latent space dimensions. We suggest that in doing so we bridge the gap between the latent space and the generated musical outcomes in a meaningful way, making the model and its outputs more explainable and more debuggable. The code repository can be found at: https://github.com/bbanar2/
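Point i) can be illustrated with a small attribute-regularisation term in the spirit of the attribute-regularised VAE literature: a chosen latent dimension is encouraged to order itself the same way as a musical attribute such as note density, which is what makes that dimension meaningful to adjust from the interface. The formulation below is an illustrative assumption, not the exact loss used to extend MeasureVAE.

import torch

def attribute_regularisation(z, attr, dim, gamma=1.0):
    """z: (B, latent_dim) latent codes; attr: (B,) attribute values; dim: index of the regularised dimension."""
    zd = z[:, dim]
    # Pairwise differences: the chosen latent dimension should sort in the same order as the attribute.
    dz = zd.unsqueeze(0) - zd.unsqueeze(1)
    da = attr.unsqueeze(0) - attr.unsqueeze(1)
    return torch.nn.functional.l1_loss(torch.tanh(gamma * dz), torch.sign(da))

if __name__ == "__main__":
    z = torch.randn(8, 16, requires_grad=True)   # toy batch of latent codes
    note_density = torch.rand(8)                 # toy musical attribute per example
    loss = attribute_regularisation(z, note_density, dim=0)
    loss.backward()
    print(float(loss))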
Towards Mode Balancing of Generative Models via Diversity Weights
Berns, Sebastian, Colton, Simon, Guckelsberger, Christian
Large data-driven image models are extensively used to support creative and artistic work. Under the currently predominant distribution-fitting paradigm, a dataset is treated as ground truth to be approximated as closely as possible. Yet, many creative applications demand a diverse range of output, and creators often strive to actively diverge from a given data distribution. We argue that an adjustment of modelling objectives, from pure mode coverage towards mode balancing, is necessary to accommodate the goal of higher output diversity. We present diversity weights, a training scheme that increases a model's output diversity by balancing the modes in the training dataset. First experiments in a controlled setting demonstrate the potential of our method. We discuss connections of our approach to diversity, equity, and inclusion in generative machine learning more generally, and computational creativity specifically. An implementation of our algorithm is available at https://github.com/sebastianberns/diversity-weights
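One simple way to realise mode balancing, assuming modes can be approximated by clusters in some feature space, is to weight each training example inversely to the size of its cluster so that rare modes contribute as much as common ones. The clustering choice and weighting rule below are illustrative and are not the diversity-weights algorithm from the linked repository.

import numpy as np
from sklearn.cluster import KMeans

def diversity_weights(features, n_modes=10):
    """Return per-example sampling weights that balance the estimated modes."""
    labels = KMeans(n_clusters=n_modes, n_init=10, random_state=0).fit_predict(features)
    counts = np.bincount(labels, minlength=n_modes)
    weights = 1.0 / counts[labels]          # examples in small clusters get large weights
    return weights / weights.sum()          # normalise into a sampling distribution

if __name__ == "__main__":
    feats = np.vstack([np.random.randn(900, 8), np.random.randn(100, 8) + 5.0])  # imbalanced modes
    w = diversity_weights(feats, n_modes=2)
    print(w[:900].sum(), w[900:].sum())     # roughly 0.5 / 0.5 after balancing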
Trash to Treasure: Using text-to-image models to inform the design of physical artefacts
Smith, Amy, Schroeder, Hope, Epstein, Ziv, Cook, Michael, Colton, Simon, Lippman, Andrew
Text-to-image generative models have recently exploded in popularity and accessibility. Yet so far, use of these models in creative tasks that bridge the 2D digital world and the creation of physical artefacts has been understudied. We conduct a pilot study to investigate if and how text-to-image models can be used to assist in upstream tasks within the creative process, such as ideation and visualization, prior to a sculpture-making activity. Thirty participants selected sculpture-making materials and generated three images using the Stable Diffusion text-to-image generator, each with text prompts of their choice, with the aim of informing and then creating a physical sculpture. The majority of participants (23/30) reported that the generated images informed their sculptures, and 28/30 reported interest in using text-to-image models to help them in a creative task in the future. We identify several prompt engineering strategies and find that a participant's prompting strategy relates to their stage in the creative process. We discuss how our findings can inform support for users at different stages of the design process and for using text-to-image models for physical artefact design.
Active Divergence with Generative Deep Learning -- A Survey and Taxonomy
Broad, Terence, Berns, Sebastian, Colton, Simon, Grierson, Mick
Generative deep learning systems offer powerful tools for artefact generation, given their ability to model distributions of data and generate high-fidelity results. In the context of computational creativity, however, a major shortcoming is that they are unable to explicitly diverge from the training data in creative ways and are limited to fitting the target data distribution. To address these limitations, there have been a growing number of approaches for optimising, hacking and rewriting these models in order to actively diverge from the training data. We present a taxonomy and comprehensive survey of the state of the art of active divergence techniques, highlighting the potential for computational creativity researchers to advance these methods and use deep generative models in truly creative systems.
Modelling serendipity in a computational context
Corneli, Joseph, Jordanous, Anna, Guckelsberger, Christian, Pease, Alison, Colton, Simon
Building on a survey of previous theories of serendipity and creativity, we advance a sequential model of serendipitous occurrences. We distinguish between serendipity as a service and serendipity in the system itself, clarify the role of invention and discovery, and provide a measure for the serendipity potential of a system. While a system arguably cannot be guaranteed to be serendipitous, it can have a high potential for serendipity. Practitioners can use these theoretical tools to evaluate a computational system's potential for unexpected behaviour that may have a beneficial outcome. In addition to the qualitative features of serendipity potential, the model also includes quantitative ratings that can guide development work. We show how the model is used in three case studies of existing and hypothetical systems, in the context of evolutionary computing, automated programming, and (next-generation) recommender systems. From this analysis, we extract recommendations for practitioners working with computational serendipity, and outline future directions for research.
The Digital Synaptic Neural Substrate: A New Approach to Computational Creativity
Iqbal, Azlan, Guid, Matej, Colton, Simon, Krivec, Jana, Azman, Shazril, Haghighi, Boshra
We introduce a new artificial intelligence (AI) approach called the 'Digital Synaptic Neural Substrate' (DSNS). It uses selected attributes from objects in various domains (e.g. chess problems, classical music, renowned artworks) and recombines them in such a way as to generate new attributes that can then, in principle, be used to create novel objects of creative value to humans relating to any one of the source domains. This allows some of the burden of creative content generation to be passed from humans to machines. The approach was tested in the domain of chess problem composition. We used it to automatically compose numerous sets of chess problems based on attributes extracted and recombined from chess problems and tournament games by humans, renowned paintings, computer-evolved abstract art, photographs of people, and classical music tracks. The quality of these generated chess problems was then assessed automatically using an existing and experimentally-validated computational chess aesthetics model. They were also assessed by human experts in the domain. The results suggest that attributes collected and recombined from chess and other domains using the DSNS approach can indeed be used to automatically generate chess problems of reasonably high aesthetic quality. In particular, a low-quality chess source (i.e. tournament game sequences between weak players), used in combination with actual photographs of people, was able to produce three-move chess problems of quality comparable to or better than those generated using a high-quality chess source (i.e. published compositions by human experts), and more efficiently as well. Why information from a foreign domain can be integrated and functional in this way remains an open question for now. The DSNS approach is, in principle, scalable and applicable to any domain in which objects have attributes that can be represented using real numbers.
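Very loosely, the recombination step can be pictured as blending real-valued attribute vectors extracted from objects in different domains into a new target attribute profile that then guides composition. The averaging-with-jitter rule and the attribute names in the sketch below are purely hypothetical illustrations of that idea, not the DSNS procedure itself.

import random

def recombine(attrs_a, attrs_b, rng):
    """Blend two attribute vectors component-wise, jittering each value slightly."""
    return [(a + b) / 2 + rng.gauss(0, 0.05) for a, b in zip(attrs_a, attrs_b)]

if __name__ == "__main__":
    rng = random.Random(42)
    chess_attrs = [0.62, 0.18, 0.75]   # e.g. sparsity, piece economy, theme density (illustrative)
    photo_attrs = [0.40, 0.55, 0.30]   # e.g. brightness, symmetry, contrast (illustrative)
    target = recombine(chess_attrs, photo_attrs, rng)
    print(target)                      # new attribute profile to aim for when composing a chess problem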