Undirected Networks
Almost Optimal Model-Free Reinforcement Learning via Reference-Advantage Decomposition
Zhang, Zihan, Zhou, Yuan, Ji, Xiangyang
We study the reinforcement learning problem in the setting of finite-horizon episodic Markov Decision Processes (MDPs) with $S$ states, $A$ actions, and episode length $H$. We propose a model-free algorithm UCB-Advantage and prove that it achieves $\tilde{O}(\sqrt{H^2SAT})$ regret where $T = KH$ and $K$ is the number of episodes to play. Our regret bound improves upon the results of [Jin et al., 2018] and matches the best known model-based algorithms as well as the information theoretic lower bound up to logarithmic factors. We also show that UCB-Advantage achieves low local switching cost and applies to concurrent reinforcement learning, improving upon the recent results of [Bai et al., 2019].
Beyond Domain APIs: Task-oriented Conversational Modeling with Unstructured Knowledge Access
Kim, Seokhwan, Eric, Mihail, Gopalakrishnan, Karthik, Hedayatnia, Behnam, Liu, Yang, Hakkani-Tur, Dilek
Most prior work on task-oriented dialogue systems are restricted to a limited coverage of domain APIs, while users oftentimes have domain related requests that are not covered by the APIs. In this paper, we propose to expand coverage of task-oriented dialogue systems by incorporating external unstructured knowledge sources. We define three sub-tasks: knowledge-seeking turn detection, knowledge selection, and knowledge-grounded response generation, which can be modeled individually or jointly. We introduce an augmented version of MultiWOZ 2.1, which includes new out-of-API-coverage turns and responses grounded on external knowledge sources. We present baselines for each sub-task using both conventional and neural approaches. Our experimental results demonstrate the need for further research in this direction to enable more informative conversational systems.
State Action Separable Reinforcement Learning
Zhang, Ziyao, Ma, Liang, Leung, Kin K., Poularakis, Konstantinos, Srivatsa, Mudhakar
Reinforcement Learning (RL) based methods have seen their paramount successes in solving serial decision-making and control problems in recent years. For conventional RL formulations, Markov Decision Process (MDP) and state-action-value function are the basis for the problem modeling and policy evaluation. However, several challenging issues still remain. Among most cited issues, the enormity of state/action space is an important factor that causes inefficiency in accurately approximating the state-action-value function. We observe that although actions directly define the agents' behaviors, for many problems the next state after a state transition matters more than the action taken, in determining the return of such a state transition. In this regard, we propose a new learning paradigm, State Action Separable Reinforcement Learning (sasRL), wherein the action space is decoupled from the value function learning process for higher efficiency. Then, a light-weight transition model is learned to assist the agent to determine the action that triggers the associated state transition. In addition, our convergence analysis reveals that under certain conditions, the convergence time of sasRL is $O(T^{1/k})$, where $T$ is the convergence time for updating the value function in the MDP-based formulation and $k$ is a weighting factor. Experiments on several gaming scenarios show that sasRL outperforms state-of-the-art MDP-based RL algorithms by up to $75\%$.
Inference from Stationary Time Sequences via Learned Factor Graphs
Shlezinger, Nir, Farsad, Nariman, Eldar, Yonina C., Goldsmith, Andrea J.
The design of methods for inference from time sequences has traditionally relied on statistical models that describe the relation between a latent desired sequence and the observed one. A broad family of model-based algorithms have been derived to carry out inference at controllable complexity using recursive computations over the factor graph representing the underlying distribution. An alternative model-agnostic approach utilizes machine learning (ML) methods. Here we propose a framework that combines model-based inference algorithms and data-driven ML tools for stationary time sequences. In the proposed approach, neural networks are developed to separately learn specific components of a factor graph describing the distribution of the time sequence, rather than the complete inference task. By exploiting stationary properties of this distribution, the resulting approach can be applied to sequences of varying temporal duration. Additionally, this approach facilitates the use of compact neural networks which can be trained with small training sets, or alternatively, can be used to improve upon existing deep inference systems. We present an inference algorithm based on learned stationary factor graphs, referred to as StaSPNet, which learns to implement the sum product scheme from labeled data, and can be applied to sequences of different lengths. Our experimental results demonstrate the ability of the proposed StaSPNet to learn to carry out accurate inference from small training sets for sleep stage detection using the Sleep-EDF dataset, as well as for symbol detection in digital communications with unknown channels.
Advanced AI: Deep Reinforcement Learning in Python
Online Courses Udemy Advanced AI: Deep Reinforcement Learning in Python, The Complete Guide to Mastering Artificial Intelligence using Deep Learning and Neural Networks Created by Lazy Programmer Team, Lazy Programmer Inc. English [Auto-generated], Indonesian [Auto-generated], 5 more Students also bought Deep Learning: Convolutional Neural Networks in Python Deep Learning: Recurrent Neural Networks in Python Unsupervised Machine Learning Hidden Markov Models in Python Bayesian Machine Learning in Python: A/B Testing Data Science: Supervised Machine Learning in Python Preview this course GET COUPON CODE Description This course is all about the application of deep learning and neural networks to reinforcement learning. If you've taken my first reinforcement learning class, then you know that reinforcement learning is on the bleeding edge of what we can do with AI. Specifically, the combination of deep learning with reinforcement learning has led to AlphaGo beating a world champion in the strategy game Go, it has led to self-driving cars, and it has led to machines that can play video games at a superhuman level. Reinforcement learning has been around since the 70s but none of this has been possible until now. The world is changing at a very fast pace.
DASC: Towards A Road Damage-Aware Social-Media-Driven Car Sensing Framework for Disaster Response Applications
Rashid, Md Tahmid, Daniel, null, Zhang, null, Wang, Dong
While vehicular sensor networks (VSNs) have earned the stature of a mobile sensing paradigm utilizing sensors built into cars, they have limited sensing scopes since car drivers only opportunistically discover new events. Conversely, social sensing is emerging as a new sensing paradigm where measurements about the physical world are collected from humans. In contrast to VSNs, social sensing is more pervasive, but one of its key limitations lies in its inconsistent reliability stemming from the data contributed by unreliable human sensors. In this paper, we present DASC, a road Damage-Aware Social-media-driven Car sensing framework that exploits the collective power of social sensing and VSNs for reliable disaster response applications. However, integrating VSNs with social sensing introduces a new set of challenges: i) How to leverage noisy and unreliable social signals to route the vehicles to accurate regions of interest? ii) How to tackle the inconsistent availability (e.g., churns) caused by car drivers being rational actors? iii) How to efficiently guide the cars to the event locations with little prior knowledge of the road damage caused by the disaster, while also handling the dynamics of the physical world and social media? The DASC framework addresses the above challenges by establishing a novel hybrid social-car sensing system that employs techniques from game theory, feedback control, and Markov Decision Process (MDP). In particular, DASC distills signals emitted from social media and discovers the road damages to effectively drive cars to target areas for verifying emergency events. We implement and evaluate DASC in a reputed vehicle simulator that can emulate real-world disaster response scenarios. The results of a real-world application demonstrate the superiority of DASC over current VSNs-based solutions in detection accuracy and efficiency.
Hidden Markov models are recurrent neural networks: A disease progression modeling application
Baucum, Matt, Khojandi, Anahita, Papamarkou, Theodore
Hidden Markov models (HMMs) are commonly used for sequential data modeling when the true state of the system is not fully known. We formulate a special case of recurrent neural networks (RNNs), which we name hidden Markov recurrent neural networks (HMRNNs), and prove that each HMRNN has the same likelihood function as a corresponding discrete-observation HMM. We experimentally validate this theoretical result on synthetic datasets by showing that parameter estimates from HMRNNs are numerically close to those obtained from HMMs via the Baum-Welch algorithm. We demonstrate our method's utility in a case study on Alzheimer's disease progression, in which we augment HMRNNs with other predictive neural networks. The augmented HMRNN yields parameter estimates that offer a novel clinical interpretation and fit the patient data better than HMM parameter estimates from the Baum-Welch algorithm.
Visual Transfer for Reinforcement Learning via Wasserstein Domain Confusion
We introduce Wasserstein Adversarial Proximal Policy Optimization (WAPPO), a novel algorithm for visual transfer in Reinforcement Learning that explicitly learns to align the distributions of extracted features between a source and target task. WAPPO approximates and minimizes the Wasserstein-1 distance between the distributions of features from source and target domains via a novel Wasserstein Confusion objective. WAPPO outperforms the prior state-of-the-art in visual transfer and successfully transfers policies across Visual Cartpole and two instantiations of 16 OpenAI Procgen environments.
Kernel Taylor-Based Value Function Approximation for Continuous-State Markov Decision Processes
Xu, Junhong, Yin, Kai, Liu, Lantao
We propose a principled kernel-based policy iteration algorithm to solve the continuous-state Markov Decision Processes (MDPs). In contrast to most decision-theoretic planning frameworks, which assume fully known state transition models, we design a method that eliminates such a strong assumption, which is oftentimes extremely difficult to engineer in reality. To achieve this, we first apply the second-order Taylor expansion of the value function. The Bellman optimality equation is then approximated by a partial differential equation, which only relies on the first and second moments of the transition model. By combining the kernel representation of value function, we then design an efficient policy iteration algorithm whose policy evaluation step can be represented as a linear system of equations characterized by a finite set of supporting states. We have validated the proposed method through extensive simulations in both simplified and realistic planning scenarios, and the experiments show that our proposed approach leads to a much superior performance over several baseline methods.
Maximizing Cumulative User Engagement in Sequential Recommendation: An Online Optimization Perspective
Zhao, Yifei, Zhou, Yu-Hang, Ou, Mingdong, Xu, Huan, Li, Nan
To maximize cumulative user engagement (e.g. cumulative clicks) in sequential recommendation, it is often needed to tradeoff two potentially conflicting objectives, that is, pursuing higher immediate user engagement (e.g., click-through rate) and encouraging user browsing (i.e., more items exposured). Existing works often study these two tasks separately, thus tend to result in sub-optimal results. In this paper, we study this problem from an online optimization perspective, and propose a flexible and practical framework to explicitly tradeoff longer user browsing length and high immediate user engagement. Specifically, by considering items as actions, user's requests as states and user leaving as an absorbing state, we formulate each user's behavior as a personalized Markov decision process (MDP), and the problem of maximizing cumulative user engagement is reduced to a stochastic shortest path (SSP) problem. Meanwhile, with immediate user engagement and quit probability estimation, it is shown that the SSP problem can be efficiently solved via dynamic programming. Experiments on real-world datasets demonstrate the effectiveness of the proposed approach. Moreover, this approach is deployed at a large E-commerce platform, achieved over 7% improvement of cumulative clicks.