Collaborating Authors


Generating Music using Deep Learning


Deep learning has radically transformed the fields of computer vision and natural language processing, in not just classification but also generative tasks, enabling the creation of unbelievably realistic pictures as well as artificially generated news articles. In this project, we aim to create novel neural network architectures to generate new music, using 20,000 MIDI samples of different genres from the Lakh Piano Dataset, a popular benchmark dataset for recent music generation tasks. This project was a group effort by Isaac Tham and Matthew Kim, senior-year undergraduates at the University of Pennsylvania. Music generation using deep learning techniques has been a topic of interest for the past two decades. Music proves to be a different challenge compared to images, among three main dimensions: Firstly, music is temporal, with a hierarchical structure with dependencies across time. Secondly, music consists of multiple instruments that are interdependent and unfold across time.

State of AI Ethics Report (Volume 6, February 2022) Artificial Intelligence

This report from the Montreal AI Ethics Institute (MAIEI) covers the most salient progress in research and reporting over the second half of 2021 in the field of AI ethics. Particular emphasis is placed on an "Analysis of the AI Ecosystem", "Privacy", "Bias", "Social Media and Problematic Information", "AI Design and Governance", "Laws and Regulations", "Trends", and other areas covered in the "Outside the Boxes" section. The two AI spotlights feature application pieces on "Constructing and Deconstructing Gender with AI-Generated Art" as well as "Will an Artificial Intellichef be Cooking Your Next Meal at a Michelin Star Restaurant?". Given MAIEI's mission to democratize AI, submissions from external collaborators have featured, such as pieces on the "Challenges of AI Development in Vietnam: Funding, Talent and Ethics" and using "Representation and Imagination for Preventing AI Harms". The report is a comprehensive overview of what the key issues in the field of AI ethics were in 2021, what trends are emergent, what gaps exist, and a peek into what to expect from the field of AI ethics in 2022. It is a resource for researchers and practitioners alike in the field to set their research and development agendas to make contributions to the field of AI ethics.

MusIAC: An extensible generative framework for Music Infilling Applications with multi-level Control Artificial Intelligence

We present a novel music generation framework for music infilling, with a user friendly interface. Infilling refers to the task of generating musical sections given the surrounding multi-track music. The proposed transformer-based framework is extensible for new control tokens as the added music control tokens such as tonal tension per bar and track polyphony level in this work. We explore the effects of including several musically meaningful control tokens, and evaluate the results using objective metrics related to pitch and rhythm. Our results demonstrate that adding additional control tokens helps to generate music with stronger stylistic similarities to the original music. It also provides the user with more control to change properties like the music texture and tonal tension in each bar compared to previous research which only provided control for track density. We present the model in a Google Colab notebook to enable interactive generation.

TONet: Tone-Octave Network for Singing Melody Extraction from Polyphonic Music Artificial Intelligence

Singing melody extraction is an important problem in the field of music information retrieval. Existing methods typically rely on frequency-domain representations to estimate the sung frequencies. However, this design does not lead to human-level performance in the perception of melody information for both tone (pitch-class) and octave. In this paper, we propose TONet, a plug-and-play model that improves both tone and octave perceptions by leveraging a novel input representation and a novel network architecture. First, we present an improved input representation, the Tone-CFP, that explicitly groups harmonics via a rearrangement of frequency-bins. Second, we introduce an encoder-decoder architecture that is designed to obtain a salience feature map, a tone feature map, and an octave feature map. Third, we propose a tone-octave fusion mechanism to improve the final salience feature map. Experiments are done to verify the capability of TONet with various baseline backbone models. Our results show that tone-octave fusion with Tone-CFP can significantly improve the singing voice extraction performance across various datasets -- with substantial gains in octave and tone accuracy.

Zero-shot Audio Source Separation through Query-based Learning from Weakly-labeled Data Artificial Intelligence

Deep learning techniques for separating audio into different sound sources face several challenges. Standard architectures require training separate models for different types of audio sources. Although some universal separators employ a single model to target multiple sources, they have difficulty generalizing to unseen sources. In this paper, we propose a three-component pipeline to train a universal audio source separator from a large, but weakly-labeled dataset: AudioSet. First, we propose a transformer-based sound event detection system for processing weakly-labeled training data. Second, we devise a query-based audio separation model that leverages this data for model training. Third, we design a latent embedding processor to encode queries that specify audio targets for separation, allowing for zero-shot generalization. Our approach uses a single model for source separation of multiple sound types, and relies solely on weakly-labeled data for training. In addition, the proposed audio separator can be used in a zero-shot setting, learning to separate types of audio sources that were never seen in training. To evaluate the separation performance, we test our model on MUSDB18, while training on the disjoint AudioSet. We further verify the zero-shot performance by conducting another experiment on audio source types that are held-out from training. The model achieves comparable Source-to-Distortion Ratio (SDR) performance to current supervised models in both cases.

When Creators Meet the Metaverse: A Survey on Computational Arts Artificial Intelligence

The metaverse, enormous virtual-physical cyberspace, has brought unprecedented opportunities for artists to blend every corner of our physical surroundings with digital creativity. This article conducts a comprehensive survey on computational arts, in which seven critical topics are relevant to the metaverse, describing novel artworks in blended virtual-physical realities. The topics first cover the building elements for the metaverse, e.g., virtual scenes and characters, auditory, textual elements. Next, several remarkable types of novel creations in the expanded horizons of metaverse cyberspace have been reflected, such as immersive arts, robotic arts, and other user-centric approaches fuelling contemporary creative outputs. Finally, we propose several research agendas: democratising computational arts, digital privacy, and safety for metaverse artists, ownership recognition for digital artworks, technological challenges, and so on. The survey also serves as introductory material for artists and metaverse technologists to begin creations in the realm of surrealistic cyberspace.

Logical Activation Functions: Logit-space equivalents of Boolean Operators Artificial Intelligence

Neuronal representations within artificial neural networks are commonly understood as logits, representing the log-odds score of presence (versus absence) of features within the stimulus. Under this interpretation, we can derive the probability $P(x_0 \land x_1)$ that a pair of independent features are both present in the stimulus from their logits. By converting the resulting probability back into a logit, we obtain a logit-space equivalent of the AND operation. However, since this function involves taking multiple exponents and logarithms, it is not well suited to be directly used within neural networks. We thus constructed an efficient approximation named $\text{AND}_\text{AIL}$ (the AND operator Approximate for Independent Logits) utilizing only comparison and addition operations, which can be deployed as an activation function in neural networks. Like MaxOut, $\text{AND}_\text{AIL}$ is a generalization of ReLU to two-dimensions. Additionally, we constructed efficient approximations of the logit-space equivalents to the OR and XNOR operators. We deployed these new activation functions, both in isolation and in conjunction, and demonstrated their effectiveness on a variety of tasks including image classification, transfer learning, abstract reasoning, and compositional zero-shot learning.

Real-Time Learning from An Expert in Deep Recommendation Systems with Marginal Distance Probability Distribution Machine Learning

Recommendation systems play an important role in today's digital world. They have found applications in various applications such as music platforms, e.g., Spotify, and movie streaming services, e.g., Netflix. Less research effort has been devoted to physical exercise recommendation systems. Sedentary lifestyles have become the major driver of several diseases as well as healthcare costs. In this paper, we develop a recommendation system for daily exercise activities to users based on their history, profile and similar users. The developed recommendation system uses a deep recurrent neural network with user-profile attention and temporal attention mechanisms. Moreover, exercise recommendation systems are significantly different from streaming recommendation systems in that we are not able to collect click feedback from the participants in exercise recommendation systems. Thus, we propose a real-time, expert-in-the-loop active learning procedure. The active learners calculate the uncertainty of the recommender at each time step for each user and ask an expert for a recommendation when the certainty is low. In this paper, we derive the probability distribution function of marginal distance, and use it to determine when to ask experts for feedback. Our experimental results on a mHealth dataset show improved accuracy after incorporating the real-time active learner with the recommendation system.

Visual Scene Graphs for Audio Source Separation Artificial Intelligence

State-of-the-art approaches for visually-guided audio source separation typically assume sources that have characteristic sounds, such as musical instruments. These approaches often ignore the visual context of these sound sources or avoid modeling object interactions that may be useful to better characterize the sources, especially when the same object class may produce varied sounds from distinct interactions. To address this challenging problem, we propose Audio Visual Scene Graph Segmenter (AVSGS), a novel deep learning model that embeds the visual structure of the scene as a graph and segments this graph into subgraphs, each subgraph being associated with a unique sound obtained by co-segmenting the audio spectrogram. At its core, AVSGS uses a recursive neural network that emits mutually-orthogonal sub-graph embeddings of the visual graph using multi-head attention. These embeddings are used for conditioning an audio encoder-decoder towards source separation. Our pipeline is trained end-to-end via a self-supervised task consisting of separating audio sources using the visual graph from artificially mixed sounds. In this paper, we also introduce an "in the wild'' video dataset for sound source separation that contains multiple non-musical sources, which we call Audio Separation in the Wild (ASIW). This dataset is adapted from the AudioCaps dataset, and provides a challenging, natural, and daily-life setting for source separation. Thorough experiments on the proposed ASIW and the standard MUSIC datasets demonstrate state-of-the-art sound separation performance of our method against recent prior approaches.

MS-SincResNet: Joint learning of 1D and 2D kernels using multi-scale SincNet and ResNet for music genre classification Artificial Intelligence

In this study, we proposed a new end-to-end convolutional neural network, called MS-SincResNet, for music genre classification. MS-SincResNet appends 1D multi-scale SincNet (MS-SincNet) to 2D ResNet as the first convolutional layer in an attempt to jointly learn 1D kernels and 2D kernels during the training stage. First, an input music signal is divided into a number of fixed-duration (3 seconds in this study) music clips, and the raw waveform of each music clip is fed into 1D MS-SincNet filter learning module to obtain three-channel 2D representations. The learned representations carry rich timbral, harmonic, and percussive characteristics comparing with spectrograms, harmonic spectrograms, percussive spectrograms and Mel-spectrograms. ResNet is then used to extract discriminative embeddings from these 2D representations. The spatial pyramid pooling (SPP) module is further used to enhance the feature discriminability, in terms of both time and frequency aspects, to obtain the classification label of each music clip. Finally, the voting strategy is applied to summarize the classification results from all 3-second music clips. In our experimental results, we demonstrate that the proposed MS-SincResNet outperforms the baseline SincNet and many well-known hand-crafted features. Considering individual 2D representation, MS-SincResNet also yields competitive results with the state-of-the-art methods on the GTZAN dataset and the ISMIR2004 dataset. The code is available at