Markov Models
How RL Agents Behave When Their Actions Are Modified
Langlois, Eric D., Everitt, Tom
Reinforcement learning in complex environments may require supervision to prevent the agent from attempting dangerous actions. As a result of supervisor intervention, the executed action may differ from the action specified by the policy. How does this affect learning? We present the Modified-Action Markov Decision Process, an extension of the MDP model that allows actions to differ from the policy. We analyze the asymptotic behaviours of common reinforcement learning algorithms in this setting and show that they adapt in different ways: some completely ignore modifications while others go to various lengths in trying to avoid action modifications that decrease reward. By choosing the right algorithm, developers can prevent their agents from learning to circumvent interruptions or constraints, and better control agent responses to other kinds of action modification, like self-damage.
Jira: a Kurdish Speech Recognition System Designing and Building Speech Corpus and Pronunciation Lexicon
Veisi, Hadi, Hosseini, Hawre, Mohammadamini, Mohammad, Fathy, Wirya, Mahmudi, Aso
In this paper, we introduce the first large vocabulary speech recognition system (LVSR) for the Central Kurdish language, named Jira. The Kurdish language is an Indo-European language spoken by more than 30 million people in several countries, but due to the lack of speech and text resources, there is no speech recognition system for this language. To fill this gap, we introduce the first speech corpus and pronunciation lexicon for the Kurdish language. Regarding speech corpus, we designed a sentence collection in which the ratio of di-phones in the collection resembles the real data of the Central Kurdish language. The designed sentences are uttered by 576 speakers in a controlled environment with noise-free microphones (called AsoSoft Speech-Office) and in Telegram social network environment using mobile phones (denoted as AsoSoft Speech-Crowdsourcing), resulted in 43.68 hours of speech. Besides, a test set including 11 different document topics is designed and recorded in two corresponding speech conditions (i.e., Office and Crowdsourcing). Furthermore, a 60K pronunciation lexicon is prepared in this research in which we faced several challenges and proposed solutions for them. The Kurdish language has several dialects and sub-dialects that results in many lexical variations. Our methods for script standardization of lexical variations and automatic pronunciation of the lexicon tokens are presented in detail. To setup the recognition engine, we used the Kaldi toolkit. A statistical tri-gram language model that is extracted from the AsoSoft text corpus is used in the system. Several standard recipes including HMM-based models (i.e., mono, tri1, tr2, tri2, tri3), SGMM, and DNN methods are used to generate the acoustic model. These methods are trained with AsoSoft Speech-Office and AsoSoft Speech-Crowdsourcing and a combination of them. The best performance achieved by the SGMM acoustic model which results in 13.9% of the average word error rate (on different document topics) and 4.9% for the general topic.
Nearly Minimax Optimal Regret for Learning Infinite-horizon Average-reward MDPs with Linear Function Approximation
Wu, Yue, Zhou, Dongruo, Gu, Quanquan
We study reinforcement learning in an infinite-horizon average-reward setting with linear function approximation, where the transition probability function of the underlying Markov Decision Process (MDP) admits a linear form over a feature mapping of the current state, action, and next state. We propose a new algorithm UCRL2-VTR, which can be seen as an extension of the UCRL2 algorithm with linear function approximation. We show that UCRL2-VTR with Bernstein-type bonus can achieve a regret of $\tilde{O}(d\sqrt{DT})$, where $d$ is the dimension of the feature mapping, $T$ is the horizon, and $\sqrt{D}$ is the diameter of the MDP. We also prove a matching lower bound $\tilde{\Omega}(d\sqrt{DT})$, which suggests that the proposed UCRL2-VTR is minimax optimal up to logarithmic factors. To the best of our knowledge, our algorithm is the first nearly minimax optimal RL algorithm with function approximation in the infinite-horizon average-reward setting.
Sparse Attention Guided Dynamic Value Estimation for Single-Task Multi-Scene Reinforcement Learning
Training deep reinforcement learning agents on environments with multiple levels / scenes from the same task, has become essential for many applications aiming to achieve generalization and domain transfer from simulation to the real world. While such a strategy is helpful with generalization, the use of multiple scenes significantly increases the variance of samples collected for policy gradient computations. Current methods, effectively continue to view this collection of scenes as a single Markov decision process (MDP), and thus learn a scene-generic value function V(s). However, we argue that the sample variance for a multi-scene environment is best minimized by treating each scene as a distinct MDP, and then learning a joint value function V(s,M) dependent on both state s and MDP M. We further demonstrate that the true joint value function for a multi-scene environment, follows a multi-modal distribution which is not captured by traditional CNN / LSTM based critic networks. To this end, we propose a dynamic value estimation (DVE) technique, which approximates the true joint value function through a sparse attention mechanism over multiple value function hypothesis / modes. The resulting agent not only shows significant improvements in the final reward score across a range of OpenAI ProcGen environments, but also exhibits enhanced navigation efficiency and provides an implicit mechanism for unsupervised state-space skill decomposition.
Reinforcement Learning for IoT Security: A Comprehensive Survey
Uprety, Aashma, Rawat, Danda B.
The number of connected smart devices has been increasing exponentially for different Internet-of-Things (IoT) applications. Security has been a long run challenge in the IoT systems which has many attack vectors, security flaws and vulnerabilities. Securing billions of B connected devices in IoT is a must task to realize the full potential of IoT applications. Recently, researchers have proposed many security solutions for IoT. Machine learning has been proposed as one of the emerging solutions for IoT security and Reinforcement learning is gaining more popularity for securing IoT systems. Reinforcement learning, unlike other machine learning techniques, can learn the environment by having minimum information about the parameters to be learned. It solves the optimization problem by interacting with the environment adapting the parameters on the fly. In this paper, we present an comprehensive survey of different types of cyber-attacks against different IoT systems and then we present reinforcement learning and deep reinforcement learning based security solutions to combat those different types of attacks in different IoT systems. Furthermore, we present the Reinforcement learning for securing CPS systems (i.e., IoT with feedback and control) such as smart grid and smart transportation system. The recent important attacks and countermeasures using reinforcement learning B in IoT are also summarized in the form of tables. With this paper, readers can have a more thorough understanding of IoT security attacks and countermeasures using Reinforcement Learning, as well as research trends in this area.
State-Visitation Fairness in Average-Reward MDPs
Ghalme, Ganesh, Nair, Vineet, Patil, Vishakha, Zhou, Yilun
Fairness has emerged as an important concern in automated decision-making in recent years, especially when these decisions affect human welfare. In this work, we study fairness in temporally extended decision-making settings, specifically those formulated as Markov Decision Processes (MDPs). Our proposed notion of fairness ensures that each state's long-term visitation frequency is more than a specified fraction. In an average-reward MDP (AMDP) setting, we formulate the problem as a bilinear saddle point program and, for a generative model, solve it using a Stochastic Mirror Descent (SMD) based algorithm. The proposed solution guarantees a simultaneous approximation on the expected average-reward and the long-term state-visitation frequency. We validate our theoretical results with experiments on synthetic data.
Online Apprenticeship Learning
Shani, Lior, Zahavy, Tom, Mannor, Shie
In Apprenticeship Learning (AL), we are given a Markov Decision Process (MDP) without access to the cost function. Instead, we observe trajectories sampled by an expert that acts according to some policy. The goal is to find a policy that matches the expert's performance on some predefined set of cost functions. We introduce an online variant of AL (Online Apprenticeship Learning; OAL), where the agent is expected to perform comparably to the expert while interacting with the environment. We show that the OAL problem can be effectively solved by combining two mirror descent based no-regret algorithms: one for policy optimization and another for learning the worst case cost. To this end, we derive a convergent algorithm with $O(\sqrt{K})$ regret, where $K$ is the number of interactions with the MDP, and an additional linear error term that depends on the amount of expert trajectories available. Importantly, our algorithm avoids the need to solve an MDP at each iteration, making it more practical compared to prior AL methods. Finally, we implement a deep variant of our algorithm which shares some similarities to GAIL \cite{ho2016generative}, but where the discriminator is replaced with the costs learned by the OAL problem. Our simulations demonstrate our theoretically grounded approach outperforms the baselines.
Tightening the Dependence on Horizon in the Sample Complexity of Q-Learning
Li, Gen, Cai, Changxiao, Chen, Yuxin, Gu, Yuantao, Wei, Yuting, Chi, Yuejie
Q-learning, which seeks to learn the optimal Q-function of a Markov decision process (MDP) in a model-free fashion, lies at the heart of reinforcement learning. When it comes to the synchronous setting (such that independent samples for all state-action pairs are drawn from a generative model in each iteration), substantial progress has been made recently towards understanding the sample efficiency of Q-learning. To yield an entrywise $\varepsilon$-accurate estimate of the optimal Q-function, state-of-the-art theory requires at least an order of $\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^5\varepsilon^{2}}$ samples for a $\gamma$-discounted infinite-horizon MDP with state space $\mathcal{S}$ and action space $\mathcal{A}$. In this work, we sharpen the sample complexity of synchronous Q-learning to an order of $\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^4\varepsilon^2}$ (up to some logarithmic factor) for any $0<\varepsilon <1$, leading to an order-wise improvement in terms of the effective horizon $\frac{1}{1-\gamma}$. Analogous results are derived for finite-horizon MDPs as well. Our finding unveils the effectiveness of vanilla Q-learning, which matches that of speedy Q-learning without requiring extra computation and storage. A key ingredient of our analysis lies in the establishment of novel error decompositions and recursions, which might shed light on how to analyze finite-sample performance of other Q-learning variants.
Markov model with machine learning integration for fraud detection in health insurance
Gupta, Rohan Yashraj, Mudigonda, Satya Sai, Baruah, Pallav Kumar, Kandala, Phani Krishna
Fraud has led to a huge addition of expenses in health insurance sector in India. The work is aimed to provide methods applied to health insurance fraud detection. The work presents two approaches - a markov model and an improved markov model using gradient boosting method in health insurance claims. The dataset 382,587 claims of which 38,082 claims are fraudulent. The markov based model gave the accuracy of 94.07% with F1-score at 0.6683. However, the improved markov model performed much better in comparison with the accuracy of 97.10% and F1-score of 0.8546. It was observed that the improved markov model gave much lower false positives compared to markov model.
Causal Inference for Time series Analysis: Problems, Methods and Evaluation
Moraffah, Raha, Sheth, Paras, Karami, Mansooreh, Bhattacharya, Anchit, Wang, Qianru, Tahir, Anique, Raglin, Adrienne, Liu, Huan
Time series data is a collection of chronological observations which is generated by several domains such as medical and financial fields. Over the years, different tasks such as classification, forecasting, and clustering have been proposed to analyze this type of data. Time series data has been also used to study the effect of interventions over time. Moreover, in many fields of science, learning the causal structure of dynamic systems and time series data is considered an interesting task which plays an important role in scientific discoveries. Estimating the effect of an intervention and identifying the causal relations from the data can be performed via causal inference. Existing surveys on time series discuss traditional tasks such as classification and forecasting or explain the details of the approaches proposed to solve a specific task. In this paper, we focus on two causal inference tasks, i.e., treatment effect estimation and causal discovery for time series data, and provide a comprehensive review of the approaches in each task. Furthermore, we curate a list of commonly used evaluation metrics and datasets for each task and provide in-depth insight. These metrics and datasets can serve as benchmarks for research in the field.