Media
Real-Time Audio-to-Score Alignment of Music Performances Containing Errors and Arbitrary Repeats and Skips
Nakamura, Tomohiko, Nakamura, Eita, Sagayama, Shigeki
This paper discusses real-time alignment of audio signals of music performances to the corresponding score (a.k.a. score following) that can handle tempo changes, errors, and arbitrary repeats and/or skips (repeats/skips) in performances. This type of score following is particularly useful for automatic accompaniment in practices and rehearsals, where errors and repeats/skips are frequent. For scores of practical length, simple extensions of previously proposed algorithms are not applicable in these situations because of their large computational complexity. To cope with this problem, we present two hidden Markov models of monophonic performance with errors and arbitrary repeats/skips, and derive efficient score-following algorithms under the assumption that the prior probability distributions of score positions before and after repeats/skips are independent of each other. We confirmed that the algorithms run in real time on a modern laptop with scores of practical length (around 10000 notes), and that, on clarinet performance data, they tracked the input performance again within 0.7 s on average after repeats/skips. Further improvements and an extension to polyphonic signals are also discussed.
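The efficiency gain from the independence assumption can be illustrated with a single forward-algorithm update. The sketch below is not either of the paper's two models: it assumes one hidden state per note, hypothetical stay/advance/skip probabilities (`p_advance`, `p_skip`), a uniform prior over skip destinations, and a precomputed per-note emission likelihood. Because the destination of a repeat/skip does not depend on where it started, the sum over all source positions collapses to a single scalar, so each update costs O(N) rather than O(N²) for a score of N notes.

```python
import numpy as np

def forward_step(alpha_prev, emission, p_advance=0.3, p_skip=0.01):
    """One forward update of a hypothetical left-to-right score HMM with
    repeats/skips; a simplified illustration, not the paper's model.

    alpha_prev: normalized posterior over N note positions at frame t-1.
    emission:   likelihood of the current audio frame at each position.
    """
    alpha_prev = np.asarray(alpha_prev, dtype=float)
    n = len(alpha_prev)
    p_stay = 1.0 - p_advance - p_skip
    # Local moves: stay on the current note or advance to the next one.
    local = p_stay * alpha_prev
    local[1:] += p_advance * alpha_prev[:-1]
    local[-1] += p_advance * alpha_prev[-1]  # the last note cannot advance
    # Repeats/skips: the destination prior is independent of the source,
    # so summing over all sources reduces to alpha_prev.sum(): O(N) work.
    jump_prior = np.full(n, 1.0 / n)
    jump = p_skip * alpha_prev.sum() * jump_prior
    alpha = (local + jump) * np.asarray(emission, dtype=float)
    return alpha / alpha.sum()
```

In a real-time setting this update would run once per audio frame, with the current score position estimated as `np.argmax(alpha)`; a sudden jump of that estimate corresponds to a detected repeat or skip.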
Multilinear Subspace Clustering
Kernfeld, Eric, Majumder, Nathan, Aeron, Shuchin, Kilmer, Misha
In this paper we present a new model and an algorithm for unsupervised clustering of 2-D data such as images. We assume that the data come from a union of multilinear subspaces (UOMS) model, which is a specific structured case of the much-studied union of subspaces (UOS) model. For segmentation under this model, we develop the Multilinear Subspace Clustering (MSC) algorithm and evaluate its performance on the YaleB and Olivetti image datasets. We show that MSC is highly competitive with existing algorithms employing the UOS model in terms of clustering performance, while offering lower computational complexity.
Index Terms: subspace clustering, multilinear algebra, spectral clustering.
1. Introduction. Most clustering algorithms seek to detect disjoint clouds of data.
A Large Dataset to Train Convolutional Networks for Disparity, Optical Flow, and Scene Flow Estimation
Mayer, Nikolaus, Ilg, Eddy, Häusser, Philip, Fischer, Philipp, Cremers, Daniel, Dosovitskiy, Alexey, Brox, Thomas
Recent work has shown that optical flow estimation can be formulated as a supervised learning task and can be successfully solved with convolutional networks. Training of the so-called FlowNet was enabled by a large synthetically generated dataset. The present paper extends the concept of optical flow estimation via convolutional networks to disparity and scene flow estimation. To this end, we propose three synthetic stereo video datasets with sufficient realism, variation, and size to successfully train large networks. Our datasets are the first large-scale datasets to enable training and evaluating scene flow methods. Besides the datasets, we present a convolutional network for real-time disparity estimation that provides state-of-the-art results. By combining a flow and disparity estimation network and training it jointly, we demonstrate the first scene flow estimation with a convolutional network.
Top-N recommendations from expressive recommender systems
Normalized nonnegative models assign probability distributions to users and random variables to items; see [Stark, 2015]. Rating an item is regarded as sampling the random variable assigned to the item with respect to the distribution assigned to the user who rates it. Models of this kind are highly expressive. For instance, with normalized nonnegative models we can understand users' preferences as mixtures of interpretable user stereotypes, and we can arrange properties of users and items hierarchically. These features would be of little use if the predictive power of normalized nonnegative models were poor. We therefore analyze the performance of normalized nonnegative models for top-N recommendation and observe that it matches the performance of methods such as PureSVD, which was introduced in [Cremonesi et al., 2010]. We conclude that normalized nonnegative models not only provide accurate recommendations but also deliver interpretable representations at no extra cost. We deepen the discussion of normalized nonnegative models by providing further theoretical insights. In particular, we introduce total variation distance as an operational similarity measure, we identify scenarios in which normalized nonnegative models yield unique representations of users and items, we prove that inferring optimal normalized nonnegative models is NP-hard, and finally, we discuss the relationship between normalized nonnegative models and nonnegative matrix factorization.
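The rating model and the similarity measure described above can be sketched on a finite sample space. In this toy version, which is an assumption rather than the construction in [Stark, 2015], a user is a probability vector over latent states, an item is a random variable given as an array mapping each latent state to a rating index, and items are ranked for top-N recommendation by the probability of receiving the highest rating. The function names are hypothetical.

```python
import numpy as np

def rating_distribution(user, item, n_ratings):
    """P(rating = r) when the item's random variable (latent state -> rating
    index) is sampled under the user's distribution over latent states."""
    probs = np.zeros(n_ratings)
    # Accumulate each latent state's probability onto the rating it maps to.
    np.add.at(probs, np.asarray(item), np.asarray(user, dtype=float))
    return probs

def top_n(user, items, n_ratings, n):
    """Indices of the n items most likely to receive the top rating."""
    scores = np.array([rating_distribution(user, it, n_ratings)[-1]
                       for it in items])
    return [int(i) for i in np.argsort(-scores)[:n]]

def total_variation(p, q):
    """Total variation distance between two distributions on the same
    finite set: half the L1 distance, always in [0, 1]."""
    return 0.5 * np.abs(np.asarray(p, float) - np.asarray(q, float)).sum()
```

Total variation distance is "operational" in the sense that it bounds how differently two users can behave on any event, which makes it a natural way to compare user representations.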
Teaching Machines to Read and Comprehend
Hermann, Karl Moritz, Kočiský, Tomáš, Grefenstette, Edward, Espeholt, Lasse, Kay, Will, Suleyman, Mustafa, Blunsom, Phil
Teaching machines to read natural language documents remains an elusive challenge. Machine reading systems can be tested on their ability to answer questions posed on the contents of documents that they have seen, but until now, large-scale training and test datasets for this type of evaluation have been missing. In this work we define a new methodology that resolves this bottleneck and provides large-scale supervised reading comprehension data. This allows us to develop a class of attention-based deep neural networks that learn to read real documents and answer complex questions with minimal prior knowledge of language structure.
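The attention step that lets such a reader focus on the parts of a document relevant to a question can be illustrated with a minimal dot-product attention over precomputed token vectors. This is a simplified stand-in, not the paper's architecture: the document encodings `doc_states` and the question encoding `query` are assumed to come from some encoder, and the scoring function is a plain dot product.

```python
import numpy as np

def attend(doc_states, query):
    """Weight each document token by its match with the query.

    doc_states: (n_tokens, d) array of token encodings (assumed given).
    query:      (d,) question encoding (assumed given).
    Returns the attention weights and the weighted document summary.
    """
    doc_states = np.asarray(doc_states, dtype=float)
    scores = doc_states @ np.asarray(query, dtype=float)
    scores = scores - scores.max()  # subtract max for numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()  # softmax over tokens
    return weights, weights @ doc_states
```

The summary vector would then be combined with the question encoding to score candidate answers; that step is omitted here.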
Ethical Artificial Intelligence
This book-length article combines several peer-reviewed papers and new material to analyze the issues of ethical artificial intelligence (AI). The behavior of future AI systems can be described by mathematical equations, which are adapted to analyze possible unintended AI behaviors and ways that AI designs can avoid them. This article makes the case for utility-maximizing agents and for avoiding infinite sets in agent definitions. It shows how to avoid agent self-delusion using model-based utility functions and how to avoid agents that corrupt their reward generators (sometimes called "perverse instantiation") using utility functions that evaluate outcomes at one point in time from the perspective of humans at a different point in time. It argues that agents can avoid unintended instrumental actions (sometimes called "basic AI drives" or "instrumental goals") by accurately learning human values. This article defines a self-modeling agent framework and shows how it can avoid problems of resource limits, being predicted by other agents, and inconsistency between the agent's utility function and its definition (one version of this problem is sometimes called "motivated value selection"). This article also discusses how future AI will differ from current AI, the politics of AI, and the ultimate use of AI to help understand the nature of the universe and our place in it.
Combinatorial Cascading Bandits
Kveton, Branislav, Wen, Zheng, Ashkan, Azin, Szepesvari, Csaba
We propose combinatorial cascading bandits, a class of partial monitoring problems in which, at each step, a learning agent chooses a tuple of ground items subject to constraints and receives a reward if and only if the weights of all chosen items are one. The weights of the items are binary, stochastic, and drawn independently of each other. The agent observes the index of the first chosen item whose weight is zero. This observation model arises in network routing, for instance, where the learning agent may only observe the first link in the routing path that is down and blocks the path. We propose a UCB-like algorithm for solving our problems, CombCascade, and prove gap-dependent and gap-free upper bounds on its $n$-step regret. Our proofs build on recent work in stochastic combinatorial semi-bandits but also address two novel challenges of our setting: a non-linear reward function and partial observability. We evaluate CombCascade on two real-world problems and show that it performs well even when our modeling assumptions are violated. We also demonstrate that our setting requires a new learning algorithm.
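The feedback model and a UCB-style learner can be sketched for the simplest constraint, choosing exactly K of L items. This is a simplified illustration in the spirit of CombCascade, not the paper's algorithm: the confidence radius, the top-K oracle, the item ordering, and all names are assumptions. Because the reward is a product of independent Bernoulli weights, choosing the K items with the largest upper confidence bounds (UCBs) maximizes the optimistic reward; after each step, only the items up to and including the first zero are observed and updated.

```python
import numpy as np

def first_zero(weights):
    """Index of the first chosen item whose weight is 0, or None if all are 1."""
    for k, w in enumerate(weights):
        if w == 0:
            return k
    return None

def cascading_ucb(probs, K, n_steps, seed=0):
    """Simplified UCB learner for conjunctive cascading feedback (K of L)."""
    probs = np.asarray(probs, dtype=float)
    rng = np.random.default_rng(seed)
    # Initialize by observing every item's weight once.
    means = (rng.random(len(probs)) < probs).astype(float)
    pulls = np.ones(len(probs))
    for t in range(1, n_steps + 1):
        radius = np.sqrt(1.5 * np.log(t + 1) / pulls)  # assumed UCB1-style radius
        ucb = np.minimum(means + radius, 1.0)
        chosen = np.argsort(-ucb)[:K]  # top-K UCBs maximize the UCB product
        weights = (rng.random(K) < probs[chosen]).astype(int)
        k = first_zero(weights)
        n_obs = K if k is None else k + 1  # items after the first zero are unseen
        for item, w in zip(chosen[:n_obs], weights[:n_obs]):
            pulls[item] += 1
            means[item] += (w - means[item]) / pulls[item]  # running mean
    return means, pulls
```

With item probabilities such as `[0.9, 0.95, 0.5, 0.3, 0.92]` and `K=3`, the three most reliable items end up chosen, and therefore observed, far more often than the others.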
Heterogeneous Knowledge Transfer in Video Emotion Recognition, Attribution and Summarization
Xu, Baohan, Fu, Yanwei, Jiang, Yu-Gang, Li, Boyang, Sigal, Leonid
Rapid development of mobile devices has led to an explosive growth of user-generated images and videos, which creates a demand for computational understanding of visual media content. In addition to recognition of objective content, such as objects and scenes, an important dimension of video content analysis is the understanding of emotional or affective content, i.e., estimating the emotional impact of the video on a viewer. Emotional content can strongly resonate with viewers and plays a crucial role in the video-watching experience. Some successes have been achieved with deep-learning architectures for text sentiment analysis at both the sentence and document level [40] and for image sentiment analysis [8]. However, understanding emotions from video remains, to a large extent, an unsolved problem. Analysis of emotional content in video has many real-world applications. Video recommendation services can benefit from matching user interests with the emotions of video content and from predicting interestingness [20], [21], [36], leading to improved user satisfaction. Better understanding of video emotions may enable advertising that is consistent with the main video's mood and help avoid social inappropriateness, such as placing a funny advertisement alongside a funeral video. Video summarization [68] and coding [60] can also benefit from understanding emotions, since an accurate summary should preserve the emotional content conveyed by the original video.
Expressive recommender systems through normalized nonnegative models
We introduce normalized nonnegative models (NNMs) for exploratory data analysis. NNMs are partial convexifications of models from probability theory. We demonstrate their value using item recommendation as an example. We show that NNM-based recommender systems satisfy three criteria that all recommender systems should ideally satisfy: high predictive power, computational tractability, and expressive representations of users and items. Expressive user and item representations are important in practice to succinctly summarize the pool of customers and the pool of items. In NNMs, user representations are expressive because each user's preference can be regarded as a normalized mixture of the preferences of stereotypical users. The interpretability of item and user representations allows us to arrange properties of items (e.g., genres of movies or topics of documents) or users (e.g., personality traits) hierarchically.
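The mixture interpretation of users can be made concrete in a few lines. This sketch assumes that stereotypes are given as probability vectors over a shared finite set of latent states; the function name is hypothetical. A user's representation is the normalized convex combination of stereotype distributions, so the result is again a probability distribution, and the mixture weights can be read directly as the user's affinity to each stereotype.

```python
import numpy as np

def user_from_stereotypes(mixture_weights, stereotypes):
    """Represent a user as a normalized mixture of stereotype distributions.

    mixture_weights: nonnegative weights, one per stereotype.
    stereotypes:     (n_stereotypes, n_states) array, each row sums to 1.
    """
    w = np.asarray(mixture_weights, dtype=float)
    w = w / w.sum()  # normalize so the weights form a convex combination
    return w @ np.asarray(stereotypes, dtype=float)
```

Because the representation is linear in the weights, any quantity that is linear in a user's distribution, such as the probability of giving an item the top rating, is the same mixture of the stereotypes' values.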
Cinematic, Ambient, Inhabitable Narrative Environments: Story Systems in Search of an Artificial Intelligence Engine
Wingate, Steven Nicholas (South Dakota State University)
Cinematic, Ambient, Inhabitable Narrative Environments (CAINEs) are conceptual AI-driven interactive story systems that combine text, audio, and visual imagery and are scalable and adaptable to a wide range of storytelling needs and interactor inputs. Conceived by an artist outside the AI community, they represent an opportunity to use AI in a nontraditional and immersive narrative fashion that relies not on the goal-based arrangement of story elements, but on the accretion and association of those elements in the minds of interactors. This paper describes the initial phase of the project's development.