
Getting Industrial About The Hybrid Computing And AI Revolution


For oil and gas companies looking at drilling wells in a new field, the issue becomes one of return vs. cost. The goal is simple enough: install the fewest wells that will draw the most oil or gas from the underground reservoirs for the longest amount of time. The more wells installed, the higher the cost and the larger the environmental impact. Finding the right well placements, however, quickly becomes a highly complex math problem: too few wells, sited in the wrong places, leave a lot of resources in the ground.
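
The combinatorial flavor of that math problem can be made concrete with a toy sketch. All numbers and the crude "interference" production model below are invented for illustration; real planning relies on full reservoir simulation:

```python
from itertools import combinations

# Toy model (illustrative only): each candidate site has an estimated
# production value, and nearby wells interfere with each other.
production = {0: 10.0, 1: 8.0, 2: 9.5, 3: 4.0, 4: 7.0}
positions = {0: 0.0, 1: 1.0, 2: 5.0, 3: 6.0, 4: 9.0}
WELL_COST = 5.0            # fixed cost per installed well
INTERFERENCE_RADIUS = 2.0  # wells closer than this tap the same oil

def net_value(sites):
    """Total production minus drilling cost, with a crude interference penalty."""
    total = sum(production[s] for s in sites) - WELL_COST * len(sites)
    for a, b in combinations(sites, 2):
        if abs(positions[a] - positions[b]) < INTERFERENCE_RADIUS:
            total -= min(production[a], production[b]) / 2  # overlap penalty
    return total

# Exhaustive search is only feasible for tiny instances; the number of
# candidate layouts grows as 2^n, which is why real fields need
# simulation plus optimization rather than enumeration.
best = max((s for k in range(1, 6) for s in combinations(range(5), k)),
           key=net_value)
```

Even in this five-site toy, the best layout skips the second-richest site because it sits too close to the richest one.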

A Hybrid AI Approach to Optimizing Oil Field Planning


What's the best way to arrange wells in an oil or gas field? It's a simple enough question, but the answer can be very complex. Now a Caltech/JPL spinoff is developing a new approach that blends traditional HPC simulation with deep reinforcement learning running on GPUs to optimize energy extraction. The well placement game is a familiar one to oil and gas companies: for years, they have been using simulators running atop HPC systems to model underground reservoirs.
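
One way to picture the hybrid setup (the class names and the toy "simulator" below are our illustration, not the company's system): the HPC reservoir simulator plays the role of the RL environment, and the learning agent proposes well placements as actions:

```python
import random

class ToyReservoirEnv:
    """Stand-in for an HPC reservoir simulator: the agent places one well
    per step on a 1-D grid and receives the (hidden) oil recovered."""
    def __init__(self, n_cells=10, seed=0):
        self.rng = random.Random(seed)
        self.n_cells = n_cells
        self.reset()

    def reset(self):
        # Hidden oil distribution the agent must discover by drilling.
        self.oil = [self.rng.uniform(0, 1) for _ in range(self.n_cells)]
        self.placed = set()
        return tuple(sorted(self.placed))

    def step(self, cell):
        reward = 0.0 if cell in self.placed else self.oil[cell]
        self.placed.add(cell)
        done = len(self.placed) >= 3          # budget of three wells
        return tuple(sorted(self.placed)), reward, done

# Epsilon-greedy placement over many simulated episodes -- a crude
# stand-in for the deep reinforcement learning loop described above.
env = ToyReservoirEnv()
value = [0.0] * env.n_cells
counts = [0] * env.n_cells
for episode in range(200):
    env.reset()
    done = False
    while not done:
        if env.rng.random() < 0.1:
            a = env.rng.randrange(env.n_cells)       # explore
        else:
            a = max(range(env.n_cells), key=lambda c: value[c])  # exploit
        _, r, done = env.step(a)
        counts[a] += 1
        value[a] += (r - value[a]) / counts[a]       # running average
```

In the real system, each `step` call would be an expensive HPC simulation run, which is exactly why sample-efficient RL matters here.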

A Max-Min Entropy Framework for Reinforcement Learning Artificial Intelligence

In this paper, we propose a max-min entropy framework for reinforcement learning (RL) to overcome the limitation of the maximum entropy RL framework in model-free sample-based learning. Whereas the maximum entropy RL framework guides learning for policies to reach states with high entropy in the future, the proposed max-min entropy framework aims to learn to visit states with low entropy and maximize the entropy of these low-entropy states to promote exploration. For general Markov decision processes (MDPs), an efficient algorithm is constructed under the proposed max-min entropy framework based on disentanglement of exploration and exploitation. Numerical results show that the proposed algorithm yields drastic performance improvement over the current state-of-the-art RL algorithms.
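
For reference, the maximum entropy objective that this framework modifies augments the environment reward with a policy-entropy bonus. A minimal sketch (the temperature `alpha` and the toy policy are our illustrative choices):

```python
import math

def entropy(probs):
    """Shannon entropy of a categorical policy distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Standard max-entropy RL augments the reward with alpha * H(pi(.|s)),
# steering the agent toward states where the policy stays stochastic.
alpha = 0.2
policy_at_state = [0.7, 0.2, 0.1]
env_reward = 1.0
augmented_reward = env_reward + alpha * entropy(policy_at_state)

# The max-min framework instead targets *low*-entropy states: rather than
# chasing states that already have high entropy, it seeks out the states
# where entropy is currently lowest and raises it there, disentangling
# exploration from exploitation.
```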

MADE: Exploration via Maximizing Deviation from Explored Regions Artificial Intelligence

In online reinforcement learning (RL), efficient exploration remains particularly challenging in high-dimensional environments with sparse rewards. In low-dimensional environments, where tabular parameterization is possible, count-based upper confidence bound (UCB) exploration methods achieve minimax near-optimal rates. However, it remains unclear how to efficiently implement UCB in realistic RL tasks that involve non-linear function approximation. To address this, we propose a new exploration approach via maximizing the deviation of the occupancy of the next policy from the explored regions. We add this term as an adaptive regularizer to the standard RL objective to balance exploration and exploitation. We pair the new objective with a provably convergent algorithm, giving rise to a new intrinsic reward that adjusts existing bonuses. The proposed intrinsic reward is easy to implement and combine with other existing RL algorithms to conduct exploration. As a proof of concept, we evaluate the new intrinsic reward on tabular examples across a variety of model-based and model-free algorithms, showing improvements over count-only exploration strategies. When tested on navigation and locomotion tasks from MiniGrid and DeepMind Control Suite benchmarks, our approach significantly improves sample efficiency over state-of-the-art methods. Our code is available at
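
In the tabular case the abstract alludes to, the classic count-based bonus is simply 1/sqrt(N(s, a)); a deviation-based adjustment reweights it by how little of the explored region a state occupies. The exact weighting below is our simplification for illustration, not the paper's formula:

```python
import math
from collections import Counter

visits = Counter()       # N(s, a) visit counts
occupancy = Counter()    # empirical state occupancy under past policies

def count_bonus(s, a):
    """Classic count-based UCB-style exploration bonus: 1 / sqrt(N(s, a) + 1)."""
    return 1.0 / math.sqrt(visits[(s, a)] + 1)

def deviation_weighted_bonus(s, a, total):
    """Illustrative adjustment: amplify the bonus for states that make up
    little of the explored region (low empirical occupancy)."""
    rho = occupancy[s] / max(total, 1)
    return count_bonus(s, a) / math.sqrt(rho + 1e-8)

# Simulate experience concentrated almost entirely on state 0.
for _ in range(100):
    visits[(0, 0)] += 1
    occupancy[0] += 1
visits[(1, 0)] += 1
occupancy[1] += 1
total = sum(occupancy.values())
```

The rarely visited state 1 receives a much larger adjusted bonus than the saturated state 0, which is the qualitative behavior the regularizer is after.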

Randomized Exploration for Reinforcement Learning with General Value Function Approximation Machine Learning

We propose a model-free reinforcement learning algorithm inspired by the popular randomized least squares value iteration (RLSVI) algorithm (Osband et al., 2016b; Russo, 2019; Zanette et al., 2020a) as well as the optimism principle (Brafman & Tennenholtz, 2001; Jaksch et al., 2010; Jin et al., 2018; 2020; Wang et al., 2020). Unlike existing upper-confidence-bound (UCB) based approaches, which are often computationally intractable, our algorithm drives exploration by simply perturbing the training data with judiciously chosen i.i.d. noise. The resulting strategy is efficient in both the statistical and the computational sense, and can be easily plugged into common RL algorithms, including UCB-VI.
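
The perturbation idea can be sketched in a few lines: fit a linear value estimate to regression targets that have been perturbed with i.i.d. Gaussian noise, and let the spread across refits drive exploration. The feature map, noise scale, and data here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data: features of (state, action) pairs and noisy returns.
Phi = rng.normal(size=(50, 4))            # feature matrix
true_w = np.array([1.0, -2.0, 0.5, 0.0])  # "true" value weights
y = Phi @ true_w + 0.1 * rng.normal(size=50)

def perturbed_lstsq(Phi, y, sigma=0.5):
    """RLSVI-style randomization: perturb the regression targets with
    i.i.d. Gaussian noise before the least-squares value fit."""
    noisy_y = y + sigma * rng.normal(size=y.shape)
    w, *_ = np.linalg.lstsq(Phi, noisy_y, rcond=None)
    return w

# Each refit yields a plausible value function; acting greedily w.r.t. a
# fresh sample gives randomized, Thompson-sampling-like exploration.
samples = np.stack([perturbed_lstsq(Phi, y) for _ in range(100)])
```

The ensemble of weight samples concentrates around the true weights while retaining enough dispersion to explore, with no confidence-set computation required.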

Efficient Hierarchical Exploration with Stable Subgoal Representation Learning Artificial Intelligence

Goal-conditioned hierarchical reinforcement learning (HRL) serves as a successful approach to solving complex and temporally extended tasks. Recently, its success has been extended to more general settings by concurrently learning hierarchical policies and subgoal representations. However, online subgoal representation learning exacerbates the non-stationary issue of HRL and introduces challenges for exploration in high-level policy learning. In this paper, we propose a state-specific regularization that stabilizes subgoal embeddings in well-explored areas while allowing representation updates in less explored state regions. Benefiting from this stable representation, we design measures of novelty and potential for subgoals, and develop an efficient hierarchical exploration strategy that actively seeks out new promising subgoals and states. Experimental results show that our method significantly outperforms state-of-the-art baselines in continuous control tasks with sparse rewards and further demonstrate the stability and efficiency of the subgoal representation learning of this work, which promotes superior policy learning.
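
The state-specific regularization can be pictured as an anchor loss whose weight grows with how well explored a state is. The count-based weight below is our illustrative choice, not the paper's exact regularizer:

```python
import numpy as np

rng = np.random.default_rng(1)

visit_counts = np.array([500.0, 3.0])  # state 0 well explored, state 1 not
phi_old = rng.normal(size=(2, 4))      # previous subgoal embeddings
phi_new = phi_old + 0.5                # same proposed drift for both states

def stability_penalty(phi_new, phi_old, counts, lam=0.01):
    """Per-state anchor: well-explored states pay more for moving their
    embedding, so subgoals stay stable where the high-level policy relies
    on them, while novel state regions remain free to update."""
    weights = lam * counts                            # heavier anchor when explored
    drift = np.sum((phi_new - phi_old) ** 2, axis=1)  # embedding movement
    return weights * drift

penalty = stability_penalty(phi_new, phi_old, visit_counts)
```

Identical embedding drift is penalized far more at the well-explored state than at the novel one, which is the stability/plasticity trade-off the abstract describes.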

Return-based Scaling: Yet Another Normalisation Trick for Deep RL Artificial Intelligence

Scaling issues are mundane yet irritating for practitioners of reinforcement learning. Error scales vary across domains, tasks, and stages of learning; sometimes by many orders of magnitude. This can be detrimental to learning speed and stability, create interference between learning tasks, and necessitate substantial tuning. We revisit this topic for agents based on temporal-difference learning, sketch out some desiderata and investigate scenarios where simple fixes fall short. The mechanism we propose requires neither tuning, clipping, nor adaptation. We validate its effectiveness and robustness on the suite of Atari games. Our scaling method turns out to be particularly helpful at mitigating interference, when training a shared neural network on multiple targets that differ in reward scale or discounting.
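
A common way to realize this kind of scale invariance (our sketch of the general idea, not the paper's exact statistic) is to divide TD errors by the square root of a running second moment of observed returns:

```python
class ReturnScaler:
    """Tracks a running second moment of observed returns and rescales
    TD errors by its square root, making update magnitudes roughly
    invariant to the reward scale of the task."""
    def __init__(self, eps=1e-8):
        self.second_moment = 0.0
        self.count = 0
        self.eps = eps

    def update(self, g):
        # Incremental running average of g^2 -- no tuning or clipping.
        self.count += 1
        self.second_moment += (g * g - self.second_moment) / self.count

    def scale(self, td_error):
        return td_error / ((self.second_moment + self.eps) ** 0.5)

scaler = ReturnScaler()
for g in [100.0, -100.0, 50.0]:  # returns on a large-reward-scale task
    scaler.update(g)
scaled = scaler.scale(30.0)      # TD error shrunk to order one
```

With a shared network trained on multiple targets, giving each target its own scaler keeps one large-scale task from drowning out the others.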

Generative Actor-Critic: An Off-policy Algorithm Using the Push-forward Model Artificial Intelligence

Model-free deep reinforcement learning has achieved great success in many domains, such as video games, recommendation systems and robotic control tasks. In continuous control tasks, the widely used policies with Gaussian distributions result in ineffective exploration of environments and limit the performance of algorithms in many cases. In this paper, we propose a density-free off-policy algorithm, Generative Actor-Critic (GAC), which uses the push-forward model to increase the expressiveness of policies and includes an entropy-like technique, the MMD-entropy regularizer, to balance exploration and exploitation. Additionally, we devise an adaptive mechanism to automatically scale this regularizer, which further improves the stability and robustness of GAC. Experimental results show that push-forward policies possess desirable features, such as multi-modality, which can noticeably improve the exploration efficiency and asymptotic performance of algorithms.
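
A push-forward policy needs no density: actions are produced by pushing noise through a nonlinear map, and sample-based quantities such as MMD substitute for the intractable entropy. A numpy sketch (the toy map and kernel bandwidth are our illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def push_forward_policy(state, n_samples=64):
    """Density-free policy: push Gaussian noise through a (toy) nonlinear
    map conditioned on the state; only sampling is ever required."""
    eps = rng.normal(size=(n_samples, 2))
    mix = np.array([[1.0, 0.5], [-0.5, 1.0]])
    return np.tanh(state + eps @ mix)    # bounded continuous actions

def mmd_rbf(x, y, bandwidth=1.0):
    """Sample-based maximum mean discrepancy with an RBF kernel: an
    MMD-style regularizer can stand in for an entropy term, since it is
    computable from action samples alone."""
    def k(a, b):
        d = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-d / (2 * bandwidth ** 2)).mean()
    return k(x, x) + k(y, y) - 2 * k(x, y)

actions = push_forward_policy(np.array([0.3, -0.1]))
uniform_ref = rng.uniform(-1, 1, size=(64, 2))
spread_score = mmd_rbf(actions, uniform_ref)  # low when actions spread out
```

Because the map is arbitrary, the induced action distribution can be multi-modal, which a single Gaussian policy cannot represent.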

On Lottery Tickets and Minimal Task Representations in Deep Reinforcement Learning Artificial Intelligence

The lottery ticket hypothesis questions the role of overparameterization in supervised deep learning. But how is the performance of winning lottery tickets affected by the distributional shift inherent to reinforcement learning problems? In this work, we address this question by comparing sparse agents that have to address the non-stationarity of the exploration-exploitation problem with supervised agents trained to imitate an expert. We show that feed-forward networks trained via reinforcement learning and imitation learning can be pruned to the same level of sparsity, suggesting that the distributional shift has a limited impact on the size of winning tickets. Using a set of carefully designed baseline conditions, we find that the majority of the lottery ticket effect in both learning paradigms can be attributed to the identified mask rather than the weight initialization. The input layer mask selectively prunes entire input dimensions that turn out to be irrelevant for the task at hand. At a moderate level of sparsity the mask identified by iterative magnitude pruning yields minimal task-relevant representations, i.e., an interpretable inductive bias. Finally, we propose a simple initialization rescaling which promotes the robust identification of sparse task representations in low-dimensional control tasks.
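
Iterative magnitude pruning, the procedure behind winning tickets, can be sketched in a few lines (the retraining step between rounds is elided, as the comment notes):

```python
import numpy as np

rng = np.random.default_rng(0)

def magnitude_prune(weights, mask, frac=0.2):
    """One round of iterative magnitude pruning: zero out the smallest
    `frac` of the currently surviving weights by updating the mask."""
    alive = weights[mask]
    threshold = np.quantile(np.abs(alive), frac)
    return mask & (np.abs(weights) > threshold)

w_init = rng.normal(size=100)       # the remembered "lottery" initialization
mask = np.ones(100, dtype=bool)
for _ in range(3):                  # three prune rounds (sketch)
    # (in a real run, the network is retrained from w_init * mask here)
    mask = magnitude_prune(w_init, mask)

sparsity = 1.0 - mask.mean()        # roughly 1 - 0.8**3 of weights removed
```

The winning-ticket question is then whether training `w_init * mask` from scratch matches the dense network; the paper's finding is that most of the effect lives in `mask`, not in `w_init`.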

Learning swimming escape patterns under energy constraints Artificial Intelligence

Aquatic organisms involved in predator-prey interactions perform impressive feats of fluid manipulation to enhance their chances of survival [1-8]. Since early studies where prey fish were reported to rapidly accelerate from rest by bending into a C-shape and unfurling their tail [9-12], impulsive locomotion patterns have been the subject of intense investigation. Studying escape strategies of prey fish has led to the discovery of sensing mechanisms [13-15], dedicated neural circuits [16-19], and biomechanical principles [20, 21]. From the perspective of hydrodynamics, several studies have sought to understand the C-start escape response and how it imparts momentum to the surrounding fluid [22-27]. However, experiments and observations indicate that swimming escapes can take a variety of forms. For example, after the initial burst from rest, many fish are seen coasting instead of swimming continuously [11, 28, 29]. Furthermore, theoretical [30-32] as well as experimental [33] studies have suggested that intermittent swimming styles, termed burst-coast swimming, can be more efficient than continuous swimming when maximizing distance given a fixed amount of energy. This raises the question of when and why different swimming escape patterns are employed in nature, and which biophysical cost functions they optimize. Given a cost function, reverse engineering methodologies have been employed to identify links to resulting swimming patterns e.g.