Reinforcement Learning


Machine Theory of Mind

arXiv.org Artificial Intelligence

Theory of mind (ToM; Premack & Woodruff, 1978) broadly refers to humans' ability to represent the mental states of others, including their desires, beliefs, and intentions. We propose to train a machine to build such models too. We design a Theory of Mind neural network -- a ToMnet -- which uses meta-learning to build models of the agents it encounters, from observations of their behaviour alone. Through this process, it acquires a strong prior model for agents' behaviour, as well as the ability to bootstrap to richer predictions about agents' characteristics and mental states using only a small number of behavioural observations. We apply the ToMnet to agents behaving in simple gridworld environments, showing that it learns to model random, algorithmic, and deep reinforcement learning agents from varied populations, and that it passes classic ToM tasks such as the "Sally-Anne" test (Wimmer & Perner, 1983; Baron-Cohen et al., 1985) of recognising that others can hold false beliefs about the world. We argue that this system -- which autonomously learns how to model other agents in its world -- is an important step forward for developing multi-agent AI systems, for building intermediating technology for machine-human interaction, and for advancing the progress on interpretable AI.


Learning to Gather without Communication

arXiv.org Machine Learning

A standard belief about emergent collective behavior is that it arises from simple individual rules. Most of the mathematical research on such collective behavior starts from imperative individual rules, such as "always go to the center". But how could an (optimal) individual rule emerge during a short period within the group lifetime, especially if communication is not available? We argue that such rules can actually emerge in a group in a short span of time via collective (multi-agent) reinforcement learning, i.e., learning via rewards and punishments. We consider the gathering problem: several agents (social animals, swarming robots...) must gather around the same position, which is not determined in advance. They must do so without communicating their planned decisions, just by looking at the positions of other agents. We present the first experimental evidence that a gathering behavior can be learned without communication in a partially observable environment. The learned behavior has the same properties as a self-stabilizing distributed algorithm, as processes can gather from any initial state (and thus tolerate any transient failure). Besides, we show that it is possible to tolerate the abrupt loss of up to 90% of agents without significant impact on the behavior.


Ordered Preference Elicitation Strategies for Supporting Multi-Objective Decision Making

arXiv.org Machine Learning

In multi-objective decision planning and learning, much attention is paid to producing optimal solution sets that contain an optimal policy for every possible user preference profile. We argue that the step that follows, i.e., determining which policy to execute by maximising the user's intrinsic utility function over this (possibly infinite) set, is under-studied. This paper aims to fill this gap. We build on previous work on Gaussian processes and pairwise comparisons for preference modelling, extend it to the multi-objective decision support scenario, and propose new ordered preference elicitation strategies based on ranking and clustering. Our main contribution is an in-depth evaluation of these strategies using computer and human-based experiments. We show that our proposed elicitation strategies outperform the currently used pairwise methods, and find that users prefer ranking most. Our experiments further show that utilising monotonicity information in GPs, by using a linear prior mean at the start and virtual comparisons to the nadir and ideal points, increases performance. We demonstrate our decision support framework in a real-world study on traffic regulation, conducted with the city of Amsterdam.


Clipped Action Policy Gradient

arXiv.org Machine Learning

Many continuous control tasks have bounded action spaces and clip out-of-bound actions before execution. Policy gradient methods often optimize policies as if actions were not clipped. We propose clipped action policy gradient (CAPG) as an alternative policy gradient estimator that exploits the knowledge of actions being clipped to reduce the variance in estimation. We prove that CAPG is unbiased and achieves lower variance than the original estimator that ignores action bounds. Experimental results demonstrate that CAPG generally outperforms the original estimator, indicating its promise as a better policy gradient estimator for continuous control tasks.
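The abstract above does not spell out the estimator, so the following is only a rough sketch of the idea it describes, under the assumption of a one-dimensional Gaussian policy: when an action is clipped to a bound, score the probability mass beyond that bound instead of the density at the raw sample. The function names (`normal_log_pdf`, `normal_cdf`, `clipped_log_prob`) and the bounds `low` and `high` are illustrative and not taken from the paper.

```python
import math


def normal_log_pdf(x, mean, std):
    # Log-density of a Gaussian with the given mean and standard deviation.
    z = (x - mean) / std
    return -0.5 * z * z - math.log(std) - 0.5 * math.log(2.0 * math.pi)


def normal_cdf(x, mean, std):
    # Cumulative distribution function of the same Gaussian.
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


def clipped_log_prob(action, mean, std, low, high):
    """Log-probability of the executed action after clipping to [low, high].

    Assumption for this sketch: every raw sample at or below `low` is executed
    as `low`, so it is scored by the mass P(a <= low); likewise for `high`.
    Inside the bounds the usual log-density is used.
    """
    if action <= low:
        return math.log(normal_cdf(low, mean, std))
    if action >= high:
        return math.log(1.0 - normal_cdf(high, mean, std))
    return normal_log_pdf(action, mean, std)


# Two raw samples that clip to the same bound get the same score.
same_score = clipped_log_prob(-5.0, 0.0, 1.0, -1.0, 1.0) == clipped_log_prob(-2.0, 0.0, 1.0, -1.0, 1.0)
```

In a policy gradient method one would differentiate this quantity with respect to the policy parameters in place of the unclipped log-density. Because the score no longer depends on how far outside the bounds the raw sample fell, this is one intuition for the variance reduction the abstract claims. A practical version would also guard against `log(0)` when the tail mass underflows.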


Convergent Actor-Critic Algorithms Under Off-Policy Training and Function Approximation

arXiv.org Artificial Intelligence

We present the first class of policy-gradient algorithms that work with both state-value and policy function-approximation, and are guaranteed to converge under off-policy training. Our solution targets problems in reinforcement learning where the action representation adds to the curse of dimensionality; that is, with continuous or large action sets, thus making it infeasible to estimate state-action value functions (Q functions). Using state-value functions helps to lift the curse and, as a result, naturally turns our policy-gradient solution into a classical Actor-Critic architecture whose Actor uses the state-value function for the update. Our algorithms, Gradient Actor-Critic and Emphatic Actor-Critic, are derived based on the exact gradient of the averaged state-value function objective and thus are guaranteed to converge to its optimal solution, while maintaining all the desirable properties of classical Actor-Critic methods with no additional hyper-parameters. To our knowledge, this is the first time that convergent off-policy learning methods have been extended to classical Actor-Critic methods with function approximation.


Rover Descent: Learning to optimize by learning to navigate on prototypical loss surfaces

arXiv.org Machine Learning

Learning to optimize - the idea that we can learn from data algorithms that optimize a numerical criterion - has recently been at the heart of a growing number of research efforts. One of the most challenging issues within this approach is to learn a policy that is able to optimize over classes of functions that are fairly different from the ones that it was trained on. We propose a novel way of framing learning to optimize as a problem of learning a good navigation policy on a partially observable loss surface. To this end, we develop Rover Descent, a solution that allows us to learn a fairly broad optimization policy from training on a small set of prototypical two-dimensional surfaces that encompasses the classically hard cases such as valleys, plateaus, cliffs and saddles and by using strictly zero-order information. We show that, without having access to gradient or curvature information, we achieve state-of-the-art convergence speed on optimization problems not presented at training time such as the Rosenbrock function and other hard cases in two dimensions. We extend our framework to optimize over high dimensional landscapes, while still handling only two-dimensional local landscape information and show good preliminary results.


Predict Responsibly: Increasing Fairness by Learning To Defer

arXiv.org Machine Learning

In many high-stakes ML applications, there are multiple decision-makers involved, both automated and human. The interaction between these agents often goes unaddressed in algorithmic development. In this work, we explore a simple version of this interaction with a two-stage framework containing an automated model and an external decision-maker. The model can choose to say IDK, and pass the decision downstream, as explored in rejection learning. We extend this concept by proposing learning to defer, which generalizes the rejection learning framework by considering the effect of the other agents in the decision-making process. We propose a learning algorithm which accounts for potential biases held by external decision-makers in a system. Experiments on real-world datasets demonstrate that learning to defer can make a system not only more accurate but also less biased. Even when operated by highly biased users, we show that deferring models can still greatly improve the fairness of the entire system.


What is Artificial General Intelligence? – Towards Data Science

#artificialintelligence

Artificial Intelligence is a branch of Computer Science (or Science) which deals with the creation of intelligent systems. Intelligent systems are those systems which possess intelligence just like humans. The idea of artificial intelligence is not new: it has been mentioned in manuscripts of Ancient Greece and Egypt. The Greeks believed in the god Hephaestus, also known as the God of Blacksmiths. According to Greek mythology, Hephaestus made intelligent weapons for all the gods. In their view, the goal of artificial intelligence is to be helpful for people in achieving a certain goal, to operate automatically, and to be programmed in advance to react in different ways depending on the situation. The term Artificial Intelligence has also become popular in the field of entertainment, where we can see lots of movies based on the concept of superintelligence.


Reinforcement learning woes, robot doggos, Amazon's homegrown AI chips, and more

#artificialintelligence

Here's a brief roundup of some interesting news from the AI world from the past two weeks, beyond what we've already reported. TL;DR: Deep RL sucks – A Google engineer has published a long, detailed blog post explaining the current frustrations in deep reinforcement learning, and why it doesn't live up to the hype. Reinforcement learning makes good headlines. Teaching agents to play games like Go well enough to beat human experts like Ke Jie fuels the man versus machine narrative. But a closer look at deep reinforcement learning, a method of machine learning used to train computers to complete a specific task, shows the practice is riddled with problems.


Introduction to Various Reinforcement Learning Algorithms. Part I (Q-Learning, SARSA, DQN, DDPG)

#artificialintelligence

Typically, an RL setup is composed of two components, an agent and an environment. The environment refers to the object that the agent is acting on (e.g. the game itself in an Atari game), while the agent represents the RL algorithm. The environment starts by sending a state to the agent, which then takes an action in response to that state based on its knowledge. After that, the environment sends the next state and a reward back to the agent. The agent then updates its knowledge, using the reward returned by the environment to evaluate its last action.
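The interaction loop described above can be sketched in a few lines of Python. Everything here is an assumption made for illustration: `CorridorEnv` is a hypothetical five-cell toy environment, and `QLearningAgent` uses a plain tabular Q-learning update (one of the algorithms named in the article title) with arbitrary values for `alpha`, `gamma`, and `epsilon`.

```python
import random


class CorridorEnv:
    """Hypothetical toy environment: start in cell 0, reach the last cell."""

    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # Action 1 moves right, action 0 moves left; walls clamp the position.
        move = 1 if action == 1 else -1
        self.state = min(max(self.state + move, 0), self.length - 1)
        done = self.state == self.length - 1
        reward = 1.0 if done else 0.0
        return self.state, reward, done


class QLearningAgent:
    """Tabular Q-learning agent; its 'knowledge' is the table self.q."""

    def __init__(self, n_states, n_actions, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
        self.q = [[0.0] * n_actions for _ in range(n_states)]
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
        self.rng = random.Random(seed)

    def act(self, state):
        # Explore with probability epsilon, otherwise act greedily (random tie-break).
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.q[state]))
        best = max(self.q[state])
        return self.rng.choice([a for a, v in enumerate(self.q[state]) if v == best])

    def update(self, state, action, reward, next_state, done):
        # Use the reward to evaluate the last action: move Q toward the TD target.
        target = reward if done else reward + self.gamma * max(self.q[next_state])
        self.q[state][action] += self.alpha * (target - self.q[state][action])


def train(episodes=200, max_steps=100):
    env = CorridorEnv()
    agent = QLearningAgent(env.length, 2)
    for _ in range(episodes):
        state = env.reset()                      # environment sends the first state
        for _ in range(max_steps):
            action = agent.act(state)            # agent responds with an action
            next_state, reward, done = env.step(action)  # next state and reward come back
            agent.update(state, action, reward, next_state, done)
            state = next_state
            if done:
                break
    return agent
```

Calling `train()` returns the trained agent. In this toy corridor its table should end up preferring the "move right" action in every non-terminal cell, since that is the only route to the reward.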