Ishfaq, Haque
Langevin Soft Actor-Critic: Efficient Exploration through Uncertainty-Driven Critic Learning
Ishfaq, Haque, Wang, Guangyuan, Islam, Sami Nur, Precup, Doina
Existing actor-critic algorithms, which are popular for continuous control reinforcement learning (RL) tasks, suffer from poor sample efficiency due to the lack of a principled exploration mechanism. Motivated by the success of Thompson sampling for efficient exploration in RL, we propose a novel model-free RL algorithm, Langevin Soft Actor-Critic (LSAC), which prioritizes enhancing critic learning through uncertainty estimation over policy optimization. LSAC employs three key innovations: approximate Thompson sampling through distributional Langevin Monte Carlo (LMC) based $Q$ updates, parallel tempering for exploring multiple modes of the posterior of the $Q$ function, and diffusion-synthesized state-action samples regularized with $Q$ action gradients. Our extensive experiments demonstrate that LSAC outperforms or matches the performance of mainstream model-free RL algorithms for continuous control tasks. Notably, LSAC marks the first successful application of LMC-based Thompson sampling in continuous control tasks with continuous action spaces.
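The core critic update in LSAC can be illustrated with a single Langevin Monte Carlo step: instead of a plain gradient step on the critic loss, Gaussian noise scaled by the step size and an inverse temperature is injected, so repeated updates approximately sample critic parameters from their posterior and acting greedily against the sampled critic implements approximate Thompson sampling. The sketch below is a minimal illustration of such a step for one critic chain, not the authors' implementation; the names `critic`, `critic_loss`, and `inverse_temp` are ours, and the distributional critic, parallel tempering across chains, and diffusion-synthesized samples are omitted.

```python
# Minimal sketch (not the paper's code): one Langevin Monte Carlo (LMC)
# update of a critic network, i.e. gradient descent on the critic loss
# plus injected Gaussian noise for approximate posterior sampling.
import math
import torch

def lmc_critic_step(critic, critic_loss, step_size=1e-3, inverse_temp=1e4):
    """theta <- theta - eta * grad(loss) + sqrt(2 * eta / beta) * N(0, I)."""
    critic.zero_grad()
    critic_loss.backward()
    noise_scale = math.sqrt(2.0 * step_size / inverse_temp)
    with torch.no_grad():
        for param in critic.parameters():
            if param.grad is None:
                continue
            param -= step_size * param.grad                 # descent term
            param += noise_scale * torch.randn_like(param)  # Langevin noise
```

Running several such chains at different temperatures and occasionally swapping their states is one way to realize the parallel tempering component mentioned above.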
Offline Multitask Representation Learning for Reinforcement Learning
Ishfaq, Haque, Nguyen-Tang, Thanh, Feng, Songtao, Arora, Raman, Wang, Mengdi, Yin, Ming, Precup, Doina
We study offline multitask representation learning in reinforcement learning (RL), where a learner is provided with an offline dataset from different tasks that share a common representation and is asked to learn the shared representation. We theoretically investigate offline multitask low-rank RL, and propose a new algorithm called MORL for offline multitask representation learning. Furthermore, we examine downstream RL in reward-free, offline and online scenarios, where a new task is introduced to the agent that shares the same representation as the upstream offline tasks. Our theoretical results demonstrate the benefits of using the learned representation from the upstream offline task instead of directly learning the representation of the low-rank model.
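For context, the low-rank model referenced above factorizes the transition dynamics through a feature map that is shared across the upstream tasks; in standard notation (ours, not necessarily the paper's), task $t$ at step $h$ satisfies
\[
  P^{(t)}_h(s' \mid s, a) \;=\; \big\langle \phi_h(s, a),\, \mu^{(t)}_h(s') \big\rangle,
\]
where the representation $\phi_h : \mathcal{S} \times \mathcal{A} \to \mathbb{R}^d$ is common to all tasks while the factors $\mu^{(t)}_h$ are task specific. The learner's goal is to recover an approximation of $\phi$ from the offline datasets and reuse it in the downstream task.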
Provable and Practical: Efficient Exploration in Reinforcement Learning via Langevin Monte Carlo
Ishfaq, Haque, Lan, Qingfeng, Xu, Pan, Mahmood, A. Rupam, Precup, Doina, Anandkumar, Anima, Azizzadenesheli, Kamyar
We present a scalable and effective exploration strategy based on Thompson sampling for reinforcement learning (RL). One of the key shortcomings of existing Thompson sampling algorithms is the need to perform a Gaussian approximation of the posterior distribution, which is not a good surrogate in most practical settings. We instead directly sample the $Q$ function from its posterior distribution by using Langevin Monte Carlo, an efficient type of Markov chain Monte Carlo (MCMC) method. Our method only needs to perform noisy gradient descent updates to learn the exact posterior distribution of the $Q$ function, which makes our approach easy to deploy in deep RL. We provide a rigorous theoretical analysis for the proposed method and demonstrate that, in the linear Markov decision process (linear MDP) setting, it has a regret bound of $\tilde{O}(d^{3/2}H^{5/2}\sqrt{T})$, where $d$ is the dimension of the feature mapping, $H$ is the planning horizon, and $T$ is the total number of steps. We apply this approach to deep RL by using the Adam optimizer to perform gradient updates. Our approach achieves better or similar results compared with state-of-the-art deep RL algorithms on several challenging exploration tasks from the Atari57 suite.
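Written out, the noisy gradient descent update mentioned above is the standard Langevin Monte Carlo iteration (generic notation, which may differ from the paper's):
\[
  w_{k+1} \;=\; w_k \;-\; \eta_k \nabla L(w_k) \;+\; \sqrt{2 \eta_k \beta^{-1}}\, \epsilon_k,
  \qquad \epsilon_k \sim \mathcal{N}(0, I),
\]
where $L$ is the (regularized) least-squares loss of the $Q$ estimate, $\eta_k$ is the step size, and $\beta$ is an inverse temperature. As $k$ grows, the iterates are approximately distributed according to the Gibbs posterior $\propto \exp(-\beta L(w))$, so acting greedily with respect to the sampled $Q$ function yields approximate Thompson sampling without an explicit Gaussian approximation.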
Randomized Exploration for Reinforcement Learning with General Value Function Approximation
Ishfaq, Haque, Cui, Qiwen, Nguyen, Viet, Ayoub, Alex, Yang, Zhuoran, Wang, Zhaoran, Precup, Doina, Yang, Lin F.
We propose a model-free reinforcement learning algorithm inspired by the popular randomized least squares value iteration (RLSVI) algorithm (Osband et al., 2016b; Russo, 2019; Zanette et al., 2020a) as well as the optimism principle (Brafman & Tennenholtz, 2001; Jaksch et al., 2010; Jin et al., 2018; 2020; Wang et al., 2020). Unlike existing upper-confidence-bound (UCB) based approaches, which are often computationally intractable, our algorithm drives exploration by simply perturbing the training data with judiciously chosen i.i.d. noise. The resulting exploration strategy is efficient in both a statistical and a computational sense, and can be easily plugged into common RL algorithms, including UCB-VI.
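The perturbation idea can be sketched in a few lines: i.i.d. Gaussian noise is added to the value-regression targets before each fit, so every fitted $Q$ is a random draw of a plausible value function, and drawing several such fits and acting optimistically over them injects the optimism mentioned above. The snippet below is an illustrative sketch of this mechanism for the linear case, not the paper's algorithm; the function names, the ridge regression form, and the aggregation rule are our own choices.

```python
# Illustrative sketch: RLSVI-style exploration via perturbed regression targets.
import numpy as np

def perturbed_q_fit(features, rewards, next_values, gamma=0.99, sigma=1.0, reg=1.0):
    """Ridge regression of Q on Bellman targets corrupted with i.i.d. Gaussian noise."""
    targets = rewards + gamma * next_values
    noisy_targets = targets + sigma * np.random.randn(len(targets))  # i.i.d. perturbation
    d = features.shape[1]
    w = np.linalg.solve(features.T @ features + reg * np.eye(d),
                        features.T @ noisy_targets)
    return w  # Q(s, a) ~= phi(s, a) @ w

def optimistic_q(features, rewards, next_values, num_samples=10, **kwargs):
    """Draw several perturbed fits and act greedily w.r.t. their maximum."""
    ws = [perturbed_q_fit(features, rewards, next_values, **kwargs)
          for _ in range(num_samples)]
    return lambda phi: max(float(phi @ w) for w in ws)
```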
TVAE: Triplet-Based Variational Autoencoder using Metric Learning
Ishfaq, Haque, Hoogi, Assaf, Rubin, Daniel
Deep metric learning has been shown to be highly effective at learning semantic representations and encoding information that can be used to measure data similarity, by relying on the learned embedding. At the same time, the variational autoencoder (VAE) has been widely used for approximate inference and has shown good performance for directed probabilistic models. However, a traditional VAE does not exploit data label or feature information. Similarly, traditional representation learning approaches fail to represent many salient aspects of the data. In this project, we propose a novel integrated framework to learn the latent embedding of a VAE by incorporating deep metric learning. The features are learned by optimizing a triplet loss on the mean vectors of the VAE in conjunction with the standard evidence lower bound (ELBO) of the VAE. This approach, which we call the Triplet-Based Variational Autoencoder (TVAE), allows us to capture more fine-grained information in the latent embedding. Our model is tested on the MNIST dataset and achieves a triplet accuracy of 95.60%, while the traditional VAE (Kingma & Welling, 2013) achieves a triplet accuracy of 75.08%.
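The combined objective behind TVAE can be made concrete: the usual VAE terms (reconstruction loss plus KL divergence) are summed with a triplet margin loss computed on the encoder mean vectors of anchor, positive, and negative examples. The code below is a minimal sketch under these assumptions; the encoder/decoder interfaces and the `triplet_weight` coefficient are hypothetical, not taken from the paper.

```python
# Minimal sketch of a TVAE-style objective: standard ELBO terms plus a
# triplet margin loss on the encoder means (hypothetical interface).
import torch
import torch.nn.functional as F

def tvae_loss(encoder, decoder, anchor, positive, negative,
              margin=1.0, triplet_weight=1.0):
    mu_a, logvar_a = encoder(anchor)
    mu_p, _ = encoder(positive)
    mu_n, _ = encoder(negative)

    # Reparameterized sample and reconstruction for the anchor (ELBO part).
    z = mu_a + torch.exp(0.5 * logvar_a) * torch.randn_like(mu_a)
    recon = decoder(z)
    recon_loss = F.binary_cross_entropy(recon, anchor, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar_a - mu_a.pow(2) - logvar_a.exp())

    # Triplet margin loss on the mean vectors pulls same-class codes together.
    triplet = F.triplet_margin_loss(mu_a, mu_p, mu_n, margin=margin)

    return recon_loss + kl + triplet_weight * triplet
```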