Markov Models
HyperSPNs: Compact and Expressive Probabilistic Circuits
Shih, Andy, Sadigh, Dorsa, Ermon, Stefano
Probabilistic circuits (PCs) are a family of generative models which allows for the computation of exact likelihoods and marginals of its probability distributions. PCs are both expressive and tractable, and serve as popular choices for discrete density estimation tasks. However, large PCs are susceptible to overfitting, and only a few regularization strategies (e.g., dropout, weight-decay) have been explored. We propose HyperSPNs: a new paradigm of generating the mixture weights of large PCs using a small-scale neural network. Our framework can be viewed as a soft weight-sharing strategy, which combines the greater expressiveness of large models with the better generalization and memory-footprint properties of small models. We show the merits of our regularization strategy on two state-of-the-art PC families introduced in recent literature -- RAT-SPNs and EiNETs -- and demonstrate generalization improvements in both models on a suite of density estimation benchmarks in both discrete and continuous domains.
Safe Exploration for Constrained Reinforcement Learning with Provable Guarantees
Bura, Archana, HasanzadeZonuzy, Aria, Kalathil, Dileep, Shakkottai, Srinivas, Chamberland, Jean-Francois
We consider the problem of learning an episodic safe control policy that minimizes an objective function, while satisfying necessary safety constraints -- both during learning and deployment. We formulate this safety constrained reinforcement learning (RL) problem using the framework of a finite-horizon Constrained Markov Decision Process (CMDP) with an unknown transition probability function. Here, we model the safety requirements as constraints on the expected cumulative costs that must be satisfied during all episodes of learning. We propose a model-based safe RL algorithm that we call the Optimistic-Pessimistic Safe Reinforcement Learning (OPSRL) algorithm, and show that it achieves an $\tilde{\mathcal{O}}(S^{2}\sqrt{A H^{7}K}/ (\bar{C} - \bar{C}_{b}))$ cumulative regret without violating the safety constraints during learning, where $S$ is the number of states, $A$ is the number of actions, $H$ is the horizon length, $K$ is the number of learning episodes, and $(\bar{C} - \bar{C}_{b})$ is the safety gap, i.e., the difference between the constraint value and the cost of a known safe baseline policy. The scaling as $\tilde{\mathcal{O}}(\sqrt{K})$ is the same as the traditional approach where constraints may be violated during learning, which means that our algorithm suffers no additional regret in spite of providing a safety guarantee. Our key idea is to use an optimistic exploration approach with pessimistic constraint enforcement for learning the policy. This approach simultaneously incentivizes the exploration of unknown states while imposing a penalty for visiting states that are likely to cause violation of safety constraints. We validate our algorithm by evaluating its performance on benchmark problems against conventional approaches.
Double Fuzzy Probabilistic Interval Linguistic Term Set and a Dynamic Fuzzy Decision Making Model based on Markov Process with tts Application in Multiple Criteria Group Decision Making
The probabilistic linguistic term has been proposed to deal with probability distributions in provided linguistic evaluations. However, because it has some fundamental defects, it is often difficult for decision-makers to get reasonable information of linguistic evaluations for group decision making. In addition, weight information plays a significant role in dynamic information fusion and decision making process. However, there are few research methods to determine the dynamic attribute weight with time. In this paper, I propose the concept of double fuzzy probability interval linguistic term set (DFPILTS). Firstly, fuzzy semantic integration, DFPILTS definition, its preference relationship, some basic algorithms and aggregation operators are defined. Then, a fuzzy linguistic Markov matrix with its network is developed. Then, a weight determination method based on distance measure and information entropy to reducing the inconsistency of DFPILPR and obtain collective priority vector based on group consensus is developed. Finally, an aggregation-based approach is developed, and an optimal investment case from a financial risk is used to illustrate the application of DFPILTS and decision method in multi-criteria decision making.
VisualEnv: visual Gym environments with Blender
Scorsoglio, Andrea, Furfaro, Roberto
In this paper VisualEnv, a new tool for creating visual environment for reinforcement learning is introduced. It is the product of an integration of an open-source modelling and rendering software, Blender, and a python module used to generate environment model for simulation, OpenAI Gym. VisualEnv allows the user to create custom environments with photorealistic rendering capabilities and full integration with python. The framework is described and tested on a series of example problems that showcase its features for training reinforcement learning agents.
Steady-State Planning in Expected Reward Multichain MDPs
Atia, George K. | Beckus, Andre (Air Force Research Laboratory) | Alkhouri, Ismail | Velasquez, Alvaro
The planning domain has experienced increased interest in the formal synthesis of decision-making policies. This formal synthesis typically entails finding a policy which satisfies formal specifications in the form of some well-defined logic. While many such logics have been proposed with varying degrees of expressiveness and complexity in their capacity to capture desirable agent behavior, their value is limited when deriving decision-making policies which satisfy certain types of asymptotic behavior in general system models. In particular, we are interested in specifying constraints on the steady-state behavior of an agent, which captures the proportion of time an agent spends in each state as it interacts for an indefinite period of time with its environment. This is sometimes called the average or expected behavior of the agent and the associated planning problem is faced with significant challenges unless strong restrictions are imposed on the underlying model in terms of the connectivity of its graph structure. In this paper, we explore this steady-state planning problem that consists of deriving a decision-making policy for an agent such that constraints on its steady-state behavior are satisfied. A linear programming solution for the general case of multichain Markov Decision Processes (MDPs) is proposed and we prove that optimal solutions to the proposed programs yield stationary policies with rigorous guarantees of behavior.
DeepCQ+: Robust and Scalable Routing with Multi-Agent Deep Reinforcement Learning for Highly Dynamic Networks
Kaviani, Saeed, Ryu, Bo, Ahmed, Ejaz, Larson, Kevin, Le, Anh, Yahja, Alex, Kim, Jae H.
Highly dynamic mobile ad-hoc networks (MANETs) remain as one of the most challenging environments to develop and deploy robust, efficient, and scalable routing protocols. In this paper, we present DeepCQ+ routing protocol which, in a novel manner integrates emerging multi-agent deep reinforcement learning (MADRL) techniques into existing Q-learning-based routing protocols and their variants and achieves persistently higher performance across a wide range of topology and mobility configurations. While keeping the overall protocol structure of the Q-learning-based routing protocols, DeepCQ+ replaces statically configured parameterized thresholds and hand-written rules with carefully designed MADRL agents such that no configuration of such parameters is required a priori. Extensive simulation shows that DeepCQ+ yields significantly increased end-to-end throughput with lower overhead and no apparent degradation of end-to-end delays (hop counts) compared to its Q-learning based counterparts. Qualitatively, and perhaps more significantly, DeepCQ+ maintains remarkably similar performance gains under many scenarios that it was not trained for in terms of network sizes, mobility conditions, and traffic dynamics. To the best of our knowledge, this is the first successful application of the MADRL framework for the MANET routing problem that demonstrates a high degree of scalability and robustness even under environments that are outside the trained range of scenarios. This implies that our MARL-based DeepCQ+ design solution significantly improves the performance of Q-learning based CQ+ baseline approach for comparison and increases its practicality and explainability because the real-world MANET environment will likely vary outside the trained range of MANET scenarios. Additional techniques to further increase the gains in performance and scalability are discussed.
Improving Zero-shot Generalization in Offline Reinforcement Learning using Generalized Similarity Functions
Mazoure, Bogdan, Kostrikov, Ilya, Nachum, Ofir, Tompson, Jonathan
Reinforcement learning (RL) agents are widely used for solving complex sequential decision making tasks, but still exhibit difficulty in generalizing to scenarios not seen during training. While prior online approaches demonstrated that using additional signals beyond the reward function can lead to better generalization capabilities in RL agents, i.e. using self-supervised learning (SSL), they struggle in the offline RL setting, i.e. learning from a static dataset. We show that performance of online algorithms for generalization in RL can be hindered in the offline setting due to poor estimation of similarity between observations. We propose a new theoretically-motivated framework called Generalized Similarity Functions (GSF), which uses contrastive learning to train an offline RL agent to aggregate observations based on the similarity of their expected future behavior, where we quantify this similarity using \emph{generalized value functions}. We show that GSF is general enough to recover existing SSL objectives while also improving zero-shot generalization performance on a complex offline RL benchmark, offline Procgen.
Discrete Markov chains
The discrete case, generally known as a Markov chain, is discussed on this page. The Markov approach can be applied to the random behaviour of systems that vary discretely or continuously with respect to time and space. This discrete or continuous random variable is known as a stochastic process. Not all stochastic processes can be modelled using the basic Markov approach although there are techniques available for modelling some additional stochastic processes using extensions of this basic method. In order for the basic Markov approach to be applicable, the behaviour of the system must be characterized by a lack of memory, that is, the future states of a system are independent of all past states' except the immediately preceding one.
Deep Q-Learning based Reinforcement Learning Approach for Network Intrusion Detection
Alavizadeh, Hooman, Jang-Jaccard, Julian, Alavizadeh, Hootan
The rise of the new generation of cyber threats demands more sophisticated and intelligent cyber defense solutions equipped with autonomous agents capable of learning to make decisions without the knowledge of human experts. Several reinforcement learning methods (e.g., Markov) for automated network intrusion tasks have been proposed in recent years. In this paper, we introduce a new generation of network intrusion detection methods that combines a Q-learning-based reinforcement learning with a deep-feed forward neural network method for network intrusion detection. Our proposed Deep Q-Learning (DQL) model provides an ongoing auto-learning capability for a network environment that can detect different types of network intrusions using an automated trial-error approach and continuously enhance its detection capabilities. We provide the details of fine-tuning different hyperparameters involved in the DQL model for more effective self-learning. According to our extensive experimental results based on the NSL-KDD dataset, we confirm that the lower discount factor which is set as 0.001 under 250 episodes of training yields the best performance results. Our experimental results also show that our proposed DQL is highly effective in detecting different intrusion classes and outperforms other similar machine learning approaches.
Learning Long-Term Reward Redistribution via Randomized Return Decomposition
Ren, Zhizhou, Guo, Ruihan, Zhou, Yuan, Peng, Jian
Many practical applications of reinforcement learning require agents to learn from sparse and delayed rewards. It challenges the ability of agents to attribute their actions to future outcomes. In this paper, we consider the problem formulation of episodic reinforcement learning with trajectory feedback. It refers to an extreme delay of reward signals, in which the agent can only obtain one reward signal at the end of each trajectory. A popular paradigm for this problem setting is learning with a designed auxiliary dense reward function, namely proxy reward, instead of sparse environmental signals. Based on this framework, this paper proposes a novel reward redistribution algorithm, randomized return decomposition (RRD), to learn a proxy reward function for episodic reinforcement learning. We establish a surrogate problem by Monte-Carlo sampling that scales up least-squares-based reward redistribution to long-horizon problems. We analyze our surrogate loss function by connection with existing methods in the literature, which illustrates the algorithmic properties of our approach. In experiments, we extensively evaluate our proposed method on a variety of benchmark tasks with episodic rewards and demonstrate substantial improvement over baseline algorithms. Scaling reinforcement learning (RL) algorithms to practical applications has become the focus of numerous recent studies, including resource management (Mao et al., 2016), industrial control (Hein et al., 2017), drug discovery (Popova et al., 2018), and recommendation systems (Chen et al., 2018). One of the challenges in these real-world problems is the sparse and delayed environmental rewards. For example, in the molecular structure design problem, the target molecule property can only be evaluated after completing the whole sequence of modification operations (Zhou et al., 2019b). The sparsity of environmental feedback would complicate the attribution of rewards on agent actions and therefore can hinder the efficiency of learning (Rahmandad et al., 2009).