Information Sharing Under Costly Communication in Joint Exploration

AAAI Conferences

This paper studies distributed cooperative multi-agent exploration methods in settings where the exploration is costly and the overall performance measure is determined by the minimum performance achieved by any of the individual agents. Such an exploration setting is applicable to various multi-agent systems, e.g., in Dynamic Spectrum Access exploration. The goal in such problems is to optimize the process as a whole, considering the tradeoffs between the quality of the solution obtained and the cost associated with the exploration and coordination between the agents. Through the analysis of the two extreme cases where coordination is completely free and when entirely disabled, we manage to extract the solution for the general case where coordination is taken to be costly, modeled as a fee that needs to be paid for each additional coordinated agent. The strategy structure for the general case is shown to be threshold-based, and the thresholds which are analytically derived in this paper can be calculated offline, resulting in a very low online computational load.

Model-Based Action Exploration for Learning Dynamic Motion Skills Artificial Intelligence

Deep reinforcement learning has achieved great strides in solving challenging motion control tasks. Recently, there has been significant work on methods for exploiting the data gathered during training, but there has been less work on how to best generate the data to learn from. For continuous action domains, the most common method for generating exploratory actions involves sampling from a Gaussian distribution centred around the mean action output by a policy. Although these methods can be quite capable, they do not scale well with the dimensionality of the action space, and can be dangerous to apply on hardware. We consider learning a forward dynamics model to predict the result, ($x_{t+1}$), of taking a particular action, ($u$), given a specific observation of the state, ($x_{t}$). With this model we perform internal look-ahead predictions of outcomes and seek actions we believe have a reasonable chance of success. This method alters the exploratory action space, thereby increasing learning speed and enables higher quality solutions to difficult problems, such as robotic locomotion and juggling.


AAAI Conferences

We present a novel, real-time system for exploring harmonic spaces of musical styles, to generate music in collaboration with human performers utilizing gesture devices (such as the Kinect) together with MIDI and OSC instruments / controllers. This corpus-based environment incorporates statistical and evolutionary components for exploring potential flows through harmonic spaces, utilizing power-law (Zipf-based) metrics for fitness evaluation. It supports visual exploration and navigation of harmonic transition probabilities through interactive gesture control. These probabilities are computed from musical corpora (in MIDI format). Herein we utilize the Classical Music Archives 14,000 MIDI corpus, among others. The user interface supports real-time exploration of the balance between predictability and surprise for musical composition and performance, and may be used in a variety of musical contexts and applications.

Exploration-Exploitation Tradeoffs for Experts Algorithms in Reactive Environments

Neural Information Processing Systems

A reactive environment is one that responds to the actions of an agent rather than evolving obliviously. In reactive environments, experts algorithms must balance exploration and exploitation of experts more carefully than in oblivious ones. In addition, a more subtle definition of a learnable value of an expert is required. A general exploration-exploitation experts method is presented along with a proper definition of value. The method is shown to asymptotically perform as well as the best available expert. Several variants are analyzed from the viewpoint of the exploration-exploitation tradeoff, including explore-then-exploit, polynomially vanishing exploration, constant-frequency exploration, and constant-size exploration phases.Complexity and performance bounds are proven.

The Sample Complexity of Online One-Class Collaborative Filtering Machine Learning

We consider the online one-class collaborative filtering (CF) problem that consists of recommending items to users over time in an online fashion based on positive ratings only. This problem arises when users respond only occasionally to a recommendation with a positive rating, and never with a negative one. We study the impact of the probability of a user responding to a recommendation, p_f, on the sample complexity, i.e., the number of ratings required to make `good' recommendations, and ask whether receiving positive and negative ratings, instead of positive ratings only, improves the sample complexity. Both questions arise in the design of recommender systems. We introduce a simple probabilistic user model, and analyze the performance of an online user-based CF algorithm. We prove that after an initial cold start phase, where recommendations are invested in exploring the user's preferences, this algorithm makes---up to a fraction of the recommendations required for updating the user's preferences---perfect recommendations. The number of ratings required for the cold start phase is nearly proportional to 1/p_f, and that for updating the user's preferences is essentially independent of p_f. As a consequence we find that, receiving positive and negative ratings instead of only positive ones improves the number of ratings required for initial exploration by a factor of 1/p_f, which can be significant.