Littman, Michael L.

Equivalence Between Wasserstein and Value-Aware Model-based Reinforcement Learning

Machine Learning

Learning a generative model is a key component of model-based reinforcement learning. Though learning a good model in the tabular setting is a simple task, learning a useful model in the approximate setting is challenging. Recently, Farahmand et al. (2017) proposed a value-aware model learning (VAML) objective that captures the structure of the value function during model learning. Using tools from Lipschitz continuity, we show that minimizing the VAML objective is in fact equivalent to minimizing the Wasserstein metric.
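
For intuition about the Wasserstein metric referred to above, here is a minimal sketch for the special case of one-dimensional empirical samples of equal size, where the 1-Wasserstein distance reduces to the mean absolute difference between sorted samples. The function names and the toy next-state samples are hypothetical and are not taken from the paper; the sketch also checks that the gap in expected value under a 1-Lipschitz function never exceeds that distance, which is the kind of relationship a value-aware comparison relies on.

```python
def wasserstein_1d(xs, ys):
    """1-Wasserstein distance between two equal-size empirical samples in 1-D."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample counts")
    # In 1-D the optimal coupling matches the i-th smallest points of each sample.
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


def expected_gap(f, xs, ys):
    """Absolute difference between the sample means of f under the two samples."""
    return abs(sum(map(f, xs)) / len(xs) - sum(map(f, ys)) / len(ys))


# Hypothetical next-state samples from the true dynamics and from a learned model.
true_next = [0.0, 1.0, 2.0]
model_next = [0.5, 1.5, 2.5]

w = wasserstein_1d(true_next, model_next)
# The identity function is 1-Lipschitz, so its expected gap cannot exceed w.
gap = expected_gap(lambda s: s, true_next, model_next)
```

Here the identity function happens to attain the distance exactly; other 1-Lipschitz functions give gaps that are no larger.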

Lipschitz Continuity in Model-based Reinforcement Learning

Artificial Intelligence

Model-based reinforcement-learning methods learn transition and reward models and use them to guide behavior. We analyze the impact of learning models that are Lipschitz continuous---the distance between function values for two inputs is bounded by a linear function of the distance between the inputs. Our first result shows a tight bound on model errors for multi-step predictions with Lipschitz continuous models. We go on to prove an error bound for the value-function estimate arising from such models and show that the estimated value function is itself Lipschitz continuous. We conclude with empirical results that demonstrate significant benefits to enforcing Lipschitz continuity of neural net models during reinforcement learning.
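
The definition of Lipschitz continuity given above can be illustrated with a short sketch that estimates a Lipschitz constant from sample points. The helper name and the test functions are hypothetical; the estimate is only a lower bound on the true constant, since it inspects a finite set of pairs.

```python
from itertools import combinations


def empirical_lipschitz(f, points):
    """Largest observed |f(x) - f(y)| / |x - y| over all pairs of distinct points."""
    return max(
        abs(f(x) - f(y)) / abs(x - y)
        for x, y in combinations(points, 2)
        if x != y
    )


points = [-2.0, -1.0, 0.0, 0.5, 3.0]

# A linear map with slope 3 has Lipschitz constant 3.
k_linear = empirical_lipschitz(lambda x: 3.0 * x + 1.0, points)
# The absolute-value function has Lipschitz constant 1.
k_abs = empirical_lipschitz(abs, points)
```

A model whose estimated constant is small changes its predictions slowly as the input moves, which is the property the abstract analyzes.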

Bandit-Based Solar Panel Control

AAAI Conferences

Solar panels sustainably harvest energy from the sun. To improve performance, panels are often equipped with a tracking mechanism that computes the sun’s position in the sky throughout the day. Based on the tracker’s estimate of the sun’s location, a controller orients the panel to minimize the angle of incidence between solar radiant energy and the photovoltaic cells on the surface of the panel, increasing total energy harvested. Prior work has developed efficient tracking algorithms that accurately compute the sun’s location to facilitate solar tracking and control. However, always pointing a panel directly at the sun does not account for diffuse irradiance in the sky, reflected irradiance from the ground and surrounding surfaces, power required to reorient the panel, shading effects from neighboring panels and foliage, or changing weather conditions (such as clouds), all of which are contributing factors to the total energy harvested by a fleet of solar panels. In this work, we show that a bandit-based approach can increase the total energy harvested by solar panels by learning to dynamically account for such other factors. Our contribution is threefold: (1) the development of a test bed based on typical solar and irradiance models for experimenting with solar panel control using a variety of learning methods, (2) simulated validation that bandit algorithms can effectively learn to control solar panels, and (3) the design and construction of an intelligent solar panel prototype that learns to angle itself using bandit algorithms.
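
As a rough illustration of the bandit framing described above, the sketch below treats each candidate tilt angle as an arm and uses epsilon-greedy selection. Everything in it is an assumption made for illustration: the cosine-of-incidence reward is a toy stand-in for the irradiance models mentioned in the abstract, and the function names, angle grid, and parameters do not come from the paper.

```python
import math
import random


def run_epsilon_greedy(angles, reward_fn, steps=500, epsilon=0.1, seed=0):
    """Pick among discrete panel angles, tracking a running mean reward per angle."""
    rng = random.Random(seed)
    counts = [0] * len(angles)
    means = [0.0] * len(angles)
    for t in range(steps):
        if t < len(angles):
            arm = t  # try every angle once before exploiting
        elif rng.random() < epsilon:
            arm = rng.randrange(len(angles))  # explore
        else:
            arm = max(range(len(angles)), key=lambda i: means[i])  # exploit
        reward = reward_fn(angles[arm], rng)
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # incremental mean
    best = max(range(len(angles)), key=lambda i: means[i])
    return angles[best], means


def toy_power(angle_deg, rng, sun_deg=35.0, noise=0.0):
    """Hypothetical reward: cosine of the incidence angle plus optional Gaussian noise."""
    incidence = math.radians(angle_deg - sun_deg)
    return max(0.0, math.cos(incidence)) + rng.gauss(0.0, noise)


angles = [0, 15, 30, 45, 60]
best_angle, estimates = run_epsilon_greedy(angles, toy_power)
```

Because the learner only sees the reward signal, the same loop would apply unchanged if the reward also reflected diffuse irradiance, shading, or movement cost.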

Modeling Latent Attention Within Neural Networks

Artificial Intelligence

Deep neural networks are able to solve tasks across a variety of domains and modalities of data. Despite many empirical successes, we lack the ability to clearly understand and interpret the learned internal mechanisms that contribute to such effective behaviors or, more critically, failure modes. In this work, we present a general method for visualizing an arbitrary neural network's inner mechanisms and their power and limitations. Our dataset-centric method produces visualizations of how a trained network attends to components of its inputs. The computed "attention masks" support improved interpretability by highlighting which input attributes are critical in determining output. We demonstrate the effectiveness of our framework on a variety of deep neural network architectures in domains from computer vision, natural language processing, and reinforcement learning. The primary contribution of our approach is an interpretable visualization of attention that provides unique insights into the network's underlying decision-making process irrespective of the data modality.

Summable Reparameterizations of Wasserstein Critics in the One-Dimensional Setting

Machine Learning

Generative adversarial networks (GANs) are an exciting alternative to algorithms for solving density estimation problems---using data to assess how likely samples are to be drawn from the same distribution. Instead of explicitly computing these probabilities, GANs learn a generator that can match the given probabilistic source. This paper looks particularly at this matching capability in the context of problems with one-dimensional outputs. We identify a class of function decompositions with properties that make them well suited to the critic role in a leading approach to GANs known as Wasserstein GANs. We show that Taylor and Fourier series decompositions belong to our class, provide examples of these critics outperforming standard GAN approaches, and suggest how they can be scaled to higher dimensional problems in the future.

Advantages and Limitations of using Successor Features for Transfer in Reinforcement Learning

Machine Learning

One question central to Reinforcement Learning is how to learn a feature representation that supports algorithm scaling and re-use of learned information from different tasks. Successor Features approach this problem by learning a feature representation that satisfies a temporal constraint. We present an implementation of an approach that decouples the feature representation from the reward function, making it suitable for transferring knowledge between domains. We then assess the advantages and limitations of using Successor Features for transfer.
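
To make the decoupling of features from rewards concrete, here is a minimal sketch under stated assumptions: a deterministic three-state chain, one-hot state features, a fixed policy, and rewards that are linear in the features. These choices and all names are hypothetical and are not the implementation assessed in the paper. Successor features are computed once and then reused to evaluate two different reward weight vectors.

```python
def successor_features(features, next_state, gamma=0.9, sweeps=200):
    """Iterate psi(s) = phi(s) + gamma * psi(next_state[s]) for a fixed deterministic policy."""
    dim = len(features[0])
    psi = [[0.0] * dim for _ in features]
    for _ in range(sweeps):
        psi = [
            [features[s][k] + gamma * psi[next_state[s]][k] for k in range(dim)]
            for s in range(len(features))
        ]
    return psi


def value(psi_s, w):
    """Value of a state when the reward is the dot product of features and weights w."""
    return sum(p * wk for p, wk in zip(psi_s, w))


# Chain 0 -> 1 -> 2 -> 2 (state 2 loops on itself), with one-hot features.
features = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
next_state = [1, 2, 2]

psi = successor_features(features, next_state)
v_task_a = value(psi[0], [0.0, 0.0, 1.0])  # task A: reward only in state 2
v_task_b = value(psi[0], [0.0, 1.0, 0.0])  # task B: reward only in state 1
```

Switching tasks only changes the weight vector; the successor features themselves are not recomputed.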

Planning with Abstract Markov Decision Processes

AAAI Conferences

Robots acting in human-scale environments must plan under uncertainty in large state–action spaces and face constantly changing reward functions as requirements and goals change. Planning under uncertainty in large state–action spaces requires hierarchical abstraction for efficient computation. We introduce a new hierarchical planning framework called Abstract Markov Decision Processes (AMDPs) that can plan in a fraction of the time needed for complex decision making in ordinary MDPs. AMDPs provide abstract states, actions, and transition dynamics in multiple layers above a base-level “flat” MDP. AMDPs decompose problems into a series of subtasks with both local reward and local transition functions used to create policies for subtasks. The resulting hierarchical planning method is independently optimal at each level of abstraction, and is recursively optimal when the local reward and transition functions are correct. We present empirical results showing significantly improved planning speed, while maintaining solution quality, in the Taxi domain and in a mobile-manipulation robotics problem. Furthermore, our approach allows specification of a decision-making model for a mobile-manipulation problem on a Turtlebot, spanning from low-level control actions operating on continuous variables all the way up through high-level object manipulation tasks.

An Alternative Softmax Operator for Reinforcement Learning

Artificial Intelligence

A softmax operator applied to a set of values acts somewhat like the maximization function and somewhat like an average. In sequential decision making, softmax is often used in settings where it is necessary to maximize utility but also to hedge against problems that arise from putting all of one's weight behind a single maximum utility decision. The Boltzmann softmax operator is the most commonly used softmax operator in this setting, but we show that this operator is prone to misbehavior. In this work, we study a differentiable softmax operator that, among other properties, is a non-expansion, which ensures convergent behavior in learning and planning. We introduce a variant of the SARSA algorithm that, by utilizing the new operator, computes a Boltzmann policy with a state-dependent temperature parameter. We show that the algorithm is convergent and that it performs favorably in practice.
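
The contrast between the two kinds of operator can be sketched as follows. The abstract does not give a formula for the alternative operator, so this sketch assumes the log-mean-exp form (known in the literature as mellowmax); the function names and parameter values are hypothetical. Both operators interpolate between the mean and the maximum of their inputs.

```python
import math


def boltzmann_softmax(values, beta):
    """Boltzmann-weighted average: sum_i x_i * exp(beta * x_i) / sum_i exp(beta * x_i)."""
    m = max(values)  # subtract the max for numerical stability
    weights = [math.exp(beta * (v - m)) for v in values]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def log_mean_exp(values, omega):
    """Assumed alternative operator: log(mean(exp(omega * x))) / omega, for omega > 0."""
    m = max(values)  # subtract the max for numerical stability
    mean_exp = sum(math.exp(omega * (v - m)) for v in values) / len(values)
    return m + math.log(mean_exp) / omega


q_values = [1.0, 2.0, 3.0]
near_mean = log_mean_exp(q_values, 0.01)   # small omega: close to the average
near_max = log_mean_exp(q_values, 100.0)   # large omega: close to the maximum
boltz_mean = boltzmann_softmax(q_values, 0.0)  # beta = 0 gives the plain average
```

The non-expansion property means that perturbing every input by at most some amount changes the output by at most that amount, which can be spot-checked numerically for `log_mean_exp`.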

Towards Behavior-Aware Model Learning from Human-Generated Trajectories

AAAI Conferences

Inverse reinforcement learning algorithms recover an unknown reward function for a Markov decision process, based on observations of user behaviors that optimize this reward function. Here we consider the complementary problem of learning the unknown transition dynamics of an MDP based on such observations. We describe the behavior-aware modeling (BAM) algorithm, which learns models of transition dynamics from user-generated state-action trajectories. BAM makes assumptions about how users select their actions that are similar to those used in inverse reinforcement learning, and searches for a model that maximizes the probability of the observed actions. The BAM algorithm is based on policy gradient algorithms, essentially reversing the roles of the policy and transition distribution in those algorithms. As a result, BAM is highly flexible, and can be applied to continuous state spaces using a wide variety of model representations. In this preliminary work, we discuss why the model learning problem is interesting, describe algorithms to solve this problem, and discuss directions for future work.

Reinforcement Learning as a Framework for Ethical Decision Making

AAAI Conferences

Emerging AI systems will be making more and more decisions that impact the lives of humans in a significant way. It is essential, then, that these AI systems make decisions that take into account the desires, goals, and preferences of other people, while simultaneously learning about what those preferences are. In this work, we argue that the reinforcement-learning framework achieves the appropriate generality required to theorize about an idealized ethical artificial agent, and offers the proper foundations for grounding specific questions about ethical learning and decision making that can promote further scientific investigation. We define an idealized formalism for an ethical learner, and conduct experiments on two toy ethical dilemmas, demonstrating the soundness and flexibility of our approach. Lastly, we identify several critical challenges for future advancement in the area that can leverage our proposed framework.