Undirected Networks
Structured World Belief for Reinforcement Learning in POMDP
Singh, Gautam, Peri, Skand, Kim, Junghyun, Kim, Hyunseok, Ahn, Sungjin
Object-centric world models provide structured representation of the scene and can be an important backbone in reinforcement learning and planning. However, existing approaches suffer in partially-observable environments due to the lack of belief states. In this paper, we propose Structured World Belief, a model for learning and inference of object-centric belief states. Inferred by Sequential Monte Carlo (SMC), our belief states provide multiple object-centric scene hypotheses. To synergize the benefits of SMC particles with object representations, we also propose a new object-centric dynamics model that considers the inductive bias of object permanence. This enables tracking of object states even when they are invisible for a long time. To further facilitate object tracking in this regime, we allow our model to attend flexibly to any spatial location in the image which was restricted in previous models. In experiments, we show that object-centric belief provides a more accurate and robust performance for filtering and generation. Furthermore, we show the efficacy of structured world belief in improving the performance of reinforcement learning, planning and supervised reasoning.
Pre-trained Language Models as Prior Knowledge for Playing Text-based Games
Singh, Ishika, Singh, Gargi, Modi, Ashutosh
Recently, text world games have been proposed to enable artificial agents to understand and reason about real-world scenarios. These text-based games are challenging for artificial agents, as it requires understanding and interaction using natural language in a partially observable environment. In this paper, we improve the semantic understanding of the agent by proposing a simple RL with LM framework where we use transformer-based language models with Deep RL models. We perform a detailed study of our framework to demonstrate how our model outperforms all existing agents on the popular game, Zork1, to achieve a score of 44.7, which is 1.6 higher than the state-of-the-art model. Our proposed approach also performs comparably to the state-of-the-art models on the other set of text games.
FEBR: Expert-Based Recommendation Framework for beneficial and personalized content
Lechiakh, Mohamed, Maurer, Alexandre
So far, most research on recommender systems focused on maintaining long-term user engagement and satisfaction, by promoting relevant and personalized content. However, it is still very challenging to evaluate the quality and the reliability of this content. In this paper, we propose FEBR (Expert-Based Recommendation Framework), an apprenticeship learning framework to assess the quality of the recommended content on online platforms. The framework exploits the demonstrated trajectories of an expert (assumed to be reliable) in a recommendation evaluation environment, to recover an unknown utility function. This function is used to learn an optimal policy describing the expert's behavior, which is then used in the framework to provide high-quality and personalized recommendations. We evaluate the performance of our solution through a user interest simulation environment (using RecSim). We simulate interactions under the aforementioned expert policy for videos recommendation, and compare its efficiency with standard recommendation methods. The results show that our approach provides a significant gain in terms of content quality, evaluated by experts and watched by users, while maintaining almost the same watch time as the baseline approaches.
Policy Optimization in Adversarial MDPs: Improved Exploration via Dilated Bonuses
Luo, Haipeng, Wei, Chen-Yu, Lee, Chung-Wei
Policy optimization methods are among the most widely-used methods in reinforcement learning. Its empirical success has been demonstrated in various domains such as computer games [Schulman et al., 2017] and robotics [Levine and Koltun, 2013]. However, due to its local-search nature, global optimality guarantees of policy optimization often rely on unrealistic assumptions to ensure global exploration (see e.g., [Abbasi-Yadkori et al., 2019, Agarwal et al., 2020b, Neu and Olkhovskaya, 2020, Wei et al., 2021]), making it theoretically less appealing compared to other methods. Motivated by this issue, a line of recent works [Cai et al., 2020, Shani et al., 2020, Agarwal et al., 2020a, Zanette et al., 2021] equip policy optimization with global exploration by adding exploration bonuses to the update, and prove favorable guarantees even without making extra exploratory assumptions. Moreover, they all demonstrate some robustness aspect of policy optimization (such as being able to handle adversarial losses or a certain degree of model misspecification). Despite these important progresses, however, many limitations still exist, including worse regret rates comparing to the best value-based or model-based approaches [Shani et al., 2020, Agarwal et al., 2020a, Zanette et al., 2021], or requiring full-information feedback on the entire loss function (as opposed to the more realistic bandit feedback) [Cai et al., 2020]. To address these issues, in this work, we propose a new type of exploration bonuses called dilated bonuses, which satisfies a certain dilated Bellman equation and provably leads to improved exploration compared to existing works (Section 3). We apply this general idea to advance the state-of-the-art of policy optimization for learning finite-horizon episodic MDPs with adversarial losses and bandit feedback. More specifically, our main results are: - First, in the tabular setting, addressing the main open question left in [Shani et al., 2020], we improve their Õ(T
Model Uncertainty and Correctability for Directed Graphical Models
Birmpa, Panagiota, Feng, Jinchao, Katsoulakis, Markos A., Rey-Bellet, Luc
Probabilistic graphical models are a fundamental tool in probabilistic modeling, machine learning and artificial intelligence. They allow us to integrate in a natural way expert knowledge, physical modeling, heterogeneous and correlated data and quantities of interest. For exactly this reason, multiple sources of model uncertainty are inherent within the modular structure of the graphical model. In this paper we develop information-theoretic, robust uncertainty quantification methods and non-parametric stress tests for directed graphical models to assess the effect and the propagation through the graph of multi-sourced model uncertainties to quantities of interest. These methods allow us to rank the different sources of uncertainty and correct the graphical model by targeting its most impactful components with respect to the quantities of interest. Thus, from a machine learning perspective, we provide a mathematically rigorous approach to correctability that guarantees a systematic selection for improvement of components of a graphical model while controlling potential new errors created in the process in other parts of the model. We demonstrate our methods in two physico-chemical examples, namely quantum scale-informed chemical kinetics and materials screening to improve the efficiency of fuel cells.
Vision-Based Autonomous Car Racing Using Deep Imitative Reinforcement Learning
Cai, Peide, Wang, Hengli, Huang, Huaiyang, Liu, Yuxuan, Liu, Ming
Autonomous car racing is a challenging task in the robotic control area. Traditional modular methods require accurate mapping, localization and planning, which makes them computationally inefficient and sensitive to environmental changes. Recently, deep-learning-based end-to-end systems have shown promising results for autonomous driving/racing. However, they are commonly implemented by supervised imitation learning (IL), which suffers from the distribution mismatch problem, or by reinforcement learning (RL), which requires a huge amount of risky interaction data. In this work, we present a general deep imitative reinforcement learning approach (DIRL), which successfully achieves agile autonomous racing using visual inputs. The driving knowledge is acquired from both IL and model-based RL, where the agent can learn from human teachers as well as perform self-improvement by safely interacting with an offline world model. We validate our algorithm both in a high-fidelity driving simulation and on a real-world 1/20-scale RC-car with limited onboard computation. The evaluation results demonstrate that our method outperforms previous IL and RL methods in terms of sample efficiency and task performance. Demonstration videos are available at https://caipeide.github.io/autorace-dirl/
Implicit Communication as Minimum Entropy Coupling
Sokota, Samuel, de Witt, Christian Schroeder, Igl, Maximilian, Zintgraf, Luisa, Torr, Philip, Whiteson, Shimon, Foerster, Jakob
In many common-payoff games, achieving good performance requires players to develop protocols for communicating their private information implicitly -- i.e., using actions that have non-communicative effects on the environment. Multi-agent reinforcement learning practitioners typically approach this problem using independent learning methods in the hope that agents will learn implicit communication as a byproduct of expected return maximization. Unfortunately, independent learning methods are incapable of doing this in many settings. In this work, we isolate the implicit communication problem by identifying a class of partially observable common-payoff games, which we call implicit referential games, whose difficulty can be attributed to implicit communication. Next, we introduce a principled method based on minimum entropy coupling that leverages the structure of implicit referential games, yielding a new perspective on implicit communication. Lastly, we show that this method can discover performant implicit communication protocols in settings with very large spaces of messages.
Efficient Bayesian Sampling Using Normalizing Flows to Assist Markov Chain Monte Carlo Methods
Gabrié, Marylou, Rotskoff, Grant M., Vanden-Eijnden, Eric
Markov Chain Monte Carlo (MCMC) algorithms (Liu, Since no data set from the target posterior distribution 2008) are nowadays the methods of choice to sample complex is available beforehand, the flow is typically posterior distributions. MCMC methods generate a trained using the reverse Kullback-Leibler (KL) sequence of configurations over which the time average of divergence that only requires samples from a base any suitable observable converges towards its ensemble average distribution. This strategy may perform poorly over some target distribution, here the posterior. This when the posterior is complicated and hard to is achieved by proposing new samples from a proposal density sample with an untrained normalizing flow. Here that is easy to sample, then accepting or rejecting them we explore a distinct training strategy, using the using a criterion that guarantees that the transition kernel of direct KL divergence as loss, in which samples the chain is in detailed balance with respect to the posterior from the posterior are generated by (i) assisting density: a popular choice is Metropolis-Hastings criterion.
High-level Decisions from a Safe Maneuver Catalog with Reinforcement Learning for Safe and Cooperative Automated Merging
Kamran, Danial, Ren, Yu, Lauer, Martin
Reinforcement learning (RL) has recently been used for solving challenging decision-making problems in the context of automated driving. However, one of the main drawbacks of the presented RL-based policies is the lack of safety guarantees, since they strive to reduce the expected number of collisions but still tolerate them. In this paper, we propose an efficient RL-based decision-making pipeline for safe and cooperative automated driving in merging scenarios. The RL agent is able to predict the current situation and provide high-level decisions, specifying the operation mode of the low level planner which is responsible for safety. In order to learn a more generic policy, we propose a scalable RL architecture for the merging scenario that is not sensitive to changes in the environment configurations. According to our experiments, the proposed RL agent can efficiently identify cooperative drivers from their vehicle state history and generate interactive maneuvers, resulting in faster and more comfortable automated driving. At the same time, thanks to the safety constraints inside the planner, all of the maneuvers are collision free and safe.
Towards quantifying information flows: relative entropy in deep neural networks and the renormalization group
Erdmenger, Johanna, Grosvenor, Kevin T., Jefferson, Ro
We investigate the analogy between the renormalization group (RG) and deep neural networks, wherein subsequent layers of neurons are analogous to successive steps along the RG. In particular, we quantify the flow of information by explicitly computing the relative entropy or Kullback-Leibler divergence in both the one- and two-dimensional Ising models under decimation RG, as well as in a feedforward neural network as a function of depth. We observe qualitatively identical behavior characterized by the monotonic increase to a parameter-dependent asymptotic value. On the quantum field theory side, the monotonic increase confirms the connection between the relative entropy and the c-theorem. For the neural networks, the asymptotic behavior may have implications for various information maximization methods in machine learning, as well as for disentangling compactness and generalizability. Furthermore, while both the two-dimensional Ising model and the random neural networks we consider exhibit non-trivial critical points, the relative entropy appears insensitive to the phase structure of either system. In this sense, more refined probes are required in order to fully elucidate the flow of information in these models.