Rochlin, Igor (Bar-Ilan University) | Sarne, David (Bar-Ilan University)

This paper studies distributed cooperative multi-agent exploration in settings where exploration is costly and the overall performance measure is determined by the minimum performance achieved by any of the individual agents. Such an exploration setting applies to various multi-agent systems, e.g., Dynamic Spectrum Access exploration. The goal in such problems is to optimize the process as a whole, considering the tradeoff between the quality of the solution obtained and the cost associated with the exploration and coordination between the agents. Through the analysis of the two extreme cases, where coordination is completely free and where it is entirely disabled, we derive the solution for the general case in which coordination is costly, modeled as a fee that needs to be paid for each additional coordinated agent. The strategy structure for the general case is shown to be threshold-based, and the thresholds, which are derived analytically in this paper, can be calculated offline, resulting in a very low online computational load.
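
As a loose illustration of how such a threshold can be precomputed offline, the sketch below finds a reservation value for sequential costly exploration under an assumed Uniform(0, 1) value distribution and a fixed per-sample cost; the distribution, the cost, and the bisection routine are illustrative assumptions, not the paper's model.

```python
import math

def reservation_value(cost, upper=1.0, tol=1e-9):
    """Threshold r solving E[(X - r)+] = cost for X ~ Uniform(0, upper).

    An agent keeps exploring while its best observed value is below r;
    since r depends only on the cost and the value distribution, it can
    be computed offline, keeping the online computational load low.
    """
    lo, hi = 0.0, upper
    while hi - lo > tol:
        r = (lo + hi) / 2
        expected_gain = (upper - r) ** 2 / (2 * upper)  # E[(X - r)+]
        if expected_gain > cost:
            lo = r  # another sample is still worth its cost: raise r
        else:
            hi = r
    return (lo + hi) / 2

# the uniform case has the closed form r = upper - sqrt(2 * cost * upper)
print(reservation_value(0.02), 1.0 - math.sqrt(2 * 0.02))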

Berseth, Glen, van de Panne, Michiel

Deep reinforcement learning has made great strides in solving challenging motion control tasks. Recently, there has been significant work on methods for exploiting the data gathered during training, but less work on how to best generate the data to learn from. For continuous action domains, the most common method for generating exploratory actions is to sample from a Gaussian distribution centred on the mean action output by a policy. Although such methods can be quite capable, they do not scale well with the dimensionality of the action space and can be dangerous to apply on hardware. We consider learning a forward dynamics model to predict the result, $x_{t+1}$, of taking a particular action, $u$, given a specific observation of the state, $x_{t}$. With this model we perform internal look-ahead predictions of outcomes and seek actions we believe have a reasonable chance of success. This method alters the exploratory action space, thereby increasing learning speed and enabling higher-quality solutions to difficult problems, such as robotic locomotion and juggling.
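
A minimal sketch of the look-ahead idea under assumed interfaces: `policy_mean`, `dynamics_model`, and `score` are hypothetical callables standing in for the learned policy, the learned forward model, and a success measure; none of the names or constants come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def select_action(x_t, policy_mean, dynamics_model, score,
                  n_candidates=16, sigma=0.2):
    """Look-ahead exploratory action selection (illustrative sketch).

    Samples candidate actions around the policy mean, predicts the
    outcome x_{t+1} = f(x_t, u) of each candidate with a learned
    forward dynamics model, and returns the candidate whose predicted
    outcome scores best, steering exploration away from actions the
    model expects to fail.
    """
    mu = policy_mean(x_t)
    candidates = mu + sigma * rng.standard_normal((n_candidates, mu.shape[0]))
    outcomes = [dynamics_model(x_t, u) for u in candidates]
    values = [score(x_next) for x_next in outcomes]
    return candidates[int(np.argmax(values))]

# toy 1-D point mass: state = (position, velocity), prefer the origin
action = select_action(
    np.array([1.0, 0.0]),
    policy_mean=lambda x: np.zeros(1),
    dynamics_model=lambda x, u: np.array([x[0] + x[1], x[1] + u[0]]),
    score=lambda x: -np.linalg.norm(x),
)
print(action)
```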

Heckel, Reinhard, Ramchandran, Kannan

We consider the online one-class collaborative filtering (CF) problem, which consists of recommending items to users over time in an online fashion based on positive ratings only. This problem arises when users respond only occasionally to a recommendation, and then only with a positive rating, never a negative one. We study the impact of the probability $p_f$ of a user responding to a recommendation on the sample complexity, i.e., the number of ratings required to make 'good' recommendations, and ask whether receiving both positive and negative ratings, instead of positive ratings only, improves the sample complexity. Both questions arise in the design of recommender systems. We introduce a simple probabilistic user model and analyze the performance of an online user-based CF algorithm. We prove that after an initial cold-start phase, in which recommendations are invested in exploring the user's preferences, this algorithm makes perfect recommendations, apart from the fraction of recommendations that must be invested in updating the user's preferences. The number of ratings required for the cold-start phase is nearly proportional to $1/p_f$, while the number required for updating the user's preferences is essentially independent of $p_f$. As a consequence, we find that receiving both positive and negative ratings instead of only positive ones improves the number of ratings required for initial exploration by a factor of $1/p_f$, which can be significant.
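
The cold-start-then-exploit structure can be sketched as follows; the two-type user model, the response probability `p_f`, and all constants are toy assumptions chosen only to make the flow concrete, not the paper's algorithm or analysis.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, p_f = 50, 200, 0.3
# hidden ground truth: two user types, each with its own random liked set
types = rng.integers(0, 2, n_users)
likes = np.stack([rng.random(n_items) < 0.5 for _ in range(2)])

ratings = np.zeros((n_users, n_items))  # observed positive ratings only

def respond(u, item):
    """User u gives a positive rating w.p. p_f if the item is liked."""
    if likes[types[u], item] and rng.random() < p_f:
        ratings[u, item] = 1.0

# cold start: explore with random items to collect positive ratings
for _ in range(60):
    for u in range(n_users):
        respond(u, rng.integers(n_items))

def recommend(u):
    """Exploit: an unrated item positively rated by the nearest user."""
    sims = ratings @ ratings[u]  # overlap in positive ratings
    sims[u] = -1
    neighbor = int(np.argmax(sims))
    unseen = np.where((ratings[u] == 0) & (ratings[neighbor] == 1))[0]
    return int(unseen[0]) if unseen.size else int(rng.integers(n_items))

rec = recommend(0)
print(rec, bool(likes[types[0], rec]))
```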

Farias, Daniela D., Megiddo, Nimrod

A reactive environment is one that responds to the actions of an agent rather than evolving obliviously. In reactive environments, experts algorithms must balance exploration and exploitation of experts more carefully than in oblivious ones. In addition, a more subtle definition of a learnable value of an expert is required. A general exploration-exploitation experts method is presented along with a proper definition of value. The method is shown to asymptotically perform as well as the best available expert. Several variants are analyzed from the viewpoint of the exploration-exploitation tradeoff, including explore-then-exploit, polynomially vanishing exploration, constant-frequency exploration, and constant-size exploration phases. Complexity and performance bounds are proven.
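
A hedged sketch of one of the analyzed variants, constant-frequency exploration: with probability `eps` a random expert is tried, otherwise the empirically best expert is followed. The `reward` interface, which is allowed to depend on the choice history to mimic a reactive environment, is an assumption for illustration.

```python
import random

def experts_run(experts, reward, horizon, eps=0.1, seed=0):
    """Constant-frequency exploration over a set of experts (sketch).

    With probability eps an expert is chosen uniformly at random
    (exploration); otherwise the expert with the best empirical
    average reward is followed (exploitation). In a reactive
    environment the reward may depend on the history of past choices.
    """
    rng = random.Random(seed)
    total = [0.0] * len(experts)
    count = [0] * len(experts)
    history = []
    for _ in range(horizon):
        if rng.random() < eps or 0 in count:
            i = rng.randrange(len(experts))              # explore
        else:
            i = max(range(len(experts)),
                    key=lambda j: total[j] / count[j])   # exploit
        r = reward(experts[i], history)  # environment may react to history
        total[i] += r
        count[i] += 1
        history.append(i)
    return history

# toy reactive environment: expert "b" pays more, but its payoff
# degrades if it was chosen in each of the last three rounds
history = experts_run(
    ["a", "b"],
    reward=lambda e, hist: (1.0 if e == "b" else 0.4)
                           - (0.8 if hist[-3:] == [1, 1, 1] else 0.0),
    horizon=500,
)
print(history[-10:])
```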

Noda, Itsuki (National Institute of Advanced Industrial Science and Technology)

In this article, I show the robustness of the optimal exploration ratio against the number of agents (the agent population) in multiagent learning (MAL) under nonstationary environments. The agent population affects the efficiency of learning because each agent's learning introduces noise for the others. From this point of view, the exploration ratio should be small to make MAL effective. In nonstationary environments, on the other hand, each agent needs to explore with sufficient probability to catch up with changes in the environment, which means the exploration ratio needs to be significantly large. I investigate the relation between the population and the efficiency of exploration based on a theorem relating the exploration ratio to a lower bound on the learning error. Finally, it is shown that the agent population does not affect the optimal exploration ratio under a certain condition. This consequence is confirmed by several experiments using population games with various reward functions.
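
The kind of experiment described can be sketched as a toy population game: `n_agents` epsilon-greedy learners face a drifting congestion payoff, and a rough learning-error proxy is averaged over rounds so the exploration ratio can be swept for different population sizes. The game, the drift, and the error measure are all illustrative assumptions, not the paper's setup.

```python
import numpy as np

def population_error(n_agents, eps, rounds=2000, seed=0):
    """Mean learning error of eps-greedy agents in a drifting
    two-action population game (toy experiment, not the paper's model).
    """
    rng = np.random.default_rng(seed)
    q = np.zeros((n_agents, 2))  # per-agent action-value estimates
    idx = np.arange(n_agents)
    err = 0.0
    for t in range(rounds):
        explore = rng.random(n_agents) < eps
        acts = np.where(explore, rng.integers(0, 2, n_agents),
                        q.argmax(axis=1))
        frac = np.array([1 - acts.mean(), acts.mean()])  # share per action
        payoff = 1.0 - frac                     # crowded actions pay less
        payoff[0] += 0.3 * np.sin(2 * np.pi * t / 500)   # slow drift
        r = payoff[acts] + 0.05 * rng.standard_normal(n_agents)
        q[idx, acts] += 0.1 * (r - q[idx, acts])  # constant-step update
        err += np.abs(q.max(axis=1) - payoff.max()).mean()
    return err / rounds

# sweep the exploration ratio for two population sizes
for n in (10, 100):
    print(n, [round(population_error(n, e), 3) for e in (0.02, 0.1, 0.3)])
```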