Undirected Networks
High efficiency rl agent
Liu, Jingbin, Gu, Xinyang, Zhang, Dexiang, Liu, Shuai
Nowadays, model-free algorithms achieve state-of-the-art performance on many RL problems, but their low data efficiency limits their use. We combine model-based RL, the soft actor-critic (SAC) framework, and curiosity to propose an agent called RMC, a promising way to achieve good performance while maintaining data efficiency. We surpass the performance of SAC and achieve state-of-the-art performance in both efficiency and stability. RMC can also solve POMDP problems and generalizes well from MDPs to POMDPs.
MAT: Multi-Fingered Adaptive Tactile Grasping via Deep Reinforcement Learning
Wu, Bohan, Akinola, Iretiayo, Varley, Jacob, Allen, Peter
Vision-based grasping systems typically adopt an open-loop execution of a planned grasp. This policy can fail due to many reasons, including ubiquitous calibration error. Recovery from a failed grasp is further complicated by visual occlusion, as the hand is usually occluding the vision sensor as it attempts another open-loop regrasp. This work presents MAT, a tactile closed-loop method capable of realizing grasps provided by a coarse initial positioning of the hand above an object. Our algorithm is a deep reinforcement learning (RL) policy optimized through the clipped surrogate objective within a maximum entropy RL framework to balance exploitation and exploration. The method utilizes tactile and proprioceptive information to act through both fine finger motions and larger regrasp movements to execute stable grasps. A novel curriculum of action motion magnitude makes learning more tractable and helps turn common failure cases into successes. Careful selection of features that exhibit small sim-to-real gaps enables this tactile grasping policy, trained purely in simulation, to transfer well to real world environments without the need for additional learning. Experimentally, this methodology improves over a vision-only grasp success rate substantially on a multi-fingered robot hand. When this methodology is used to realize grasps from coarse initial positions provided by a vision-only planner, the system is made dramatically more robust to calibration errors in the camera-robot transform.
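The clipped surrogate objective mentioned in the abstract above can be illustrated for a single sample. This is a minimal sketch of the standard clipped term only, not the authors' implementation: the function name `clipped_surrogate`, the default `epsilon = 0.2`, and the omission of the maximum-entropy bonus are all assumptions made for illustration.

```python
def clipped_surrogate(ratio: float, advantage: float, epsilon: float = 0.2) -> float:
    """Per-sample clipped surrogate term (to be maximized).

    ratio     : new-policy probability / old-policy probability for the action
    advantage : advantage estimate for that action
    epsilon   : clipping range (0.2 is an assumed, commonly used value)
    """
    # Restrict the probability ratio to [1 - epsilon, 1 + epsilon].
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    # Taking the minimum removes the incentive to move the ratio
    # outside the clipping range in the direction that helps the objective.
    return min(ratio * advantage, clipped_ratio * advantage)
```

With a positive advantage, a ratio above `1 + epsilon` earns no extra credit; with a negative advantage, a ratio below `1 - epsilon` is still penalized at the clipped value.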
Neural Belief Reasoner
This paper proposes a new generative model called neural belief reasoner (NBR). It differs from previous models in that it specifies a belief function rather than a probability distribution. Its implementation consists of neural networks, fuzzy-set operations and belief-function operations, and query-answering, sample-generation and training algorithms are presented. This paper studies NBR in two tasks. The first is a synthetic unsupervised-learning task, which demonstrates NBR's ability to perform multi-hop reasoning, reasoning with uncertainty and reasoning about conflicting information. The second is supervised learning: a robust MNIST classifier. Without any adversarial training, this classifier exceeds the state of the art in adversarial robustness as measured by the L2 metric, and at the same time maintains 99% accuracy on natural images. A proof is presented that, as capacity increases, NBR classifiers can asymptotically approach the best possible robustness.
Deep Reinforcement Learning for Control of Probabilistic Boolean Networks
Papagiannis, Georgios, Moschoyiannis, Sotiris
Probabilistic Boolean Networks (PBNs) were introduced as a computational model for studying gene interactions in Gene Regulatory Networks (GRNs). Control of PBNs, and hence GRNs, is the process of making strategic interventions to a network in order to drive it from a particular state towards some other, potentially more desirable state. This is of significant importance to systems biology, as successful control could be used to obtain potential gene treatments by making therapeutic interventions. Recent advancements in Deep Reinforcement Learning have enabled systems to develop policies merely by interacting with the environment, without complete knowledge of the underlying Markov Decision Process (MDP). In this paper we implement a Deep Q Network with Double Q Learning that interacts directly with the environment, that is, a Probabilistic Boolean Network. Our approach develops a control policy by sampling experiences obtained from the environment using Prioritized Experience Replay, and the resulting policy successfully drives a PBN from any state towards the desired one. This novel approach sets the foundations for overcoming the inability to scale to larger PBNs and opens up the spectrum in which to consider control of GRNs without the need for a computational model, i.e. by direct interventions to the GRN.
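The Double Q Learning target referred to above can be sketched for one transition. This is a minimal illustration under assumptions, not the paper's code: the function name `double_q_target`, the list-based Q-values, and the default discount `gamma = 0.99` are hypothetical, and Prioritized Experience Replay is not shown.

```python
from typing import Sequence


def double_q_target(reward: float,
                    done: bool,
                    q_online_next: Sequence[float],
                    q_target_next: Sequence[float],
                    gamma: float = 0.99) -> float:
    """Double Q-learning target for a single transition.

    q_online_next : online-network Q-values for every action in the next state
    q_target_next : target-network Q-values for every action in the next state
    """
    if done:
        # No bootstrapping from a terminal state.
        return reward
    # The online network selects the greedy action...
    best_action = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
    # ...and the target network evaluates it, which reduces overestimation.
    return reward + gamma * q_target_next[best_action]
```

Separating action selection from action evaluation is the point of the technique: a single network that both picks and scores the maximum tends to overestimate values.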
Static force field representation of environments based on agents' nonlinear motions
Campo, Damian, Betancourt, Alejandro, Marcenaro, Lucio, Regazzoni, Carlo
This paper presents a methodology that aims at the incremental representation of areas inside environments in terms of attractive forces. A parametric representation of the velocity fields ruling the dynamics of moving agents is proposed. It is assumed that attractive spots in the environment are responsible for modifying the motion of agents. A switching model is used to describe near and far velocity fields, which in turn are used to learn attractive characteristics of environments. The effect of such areas is considered radial over the whole scene. Based on the estimation of attractive areas, a map that describes their effects in terms of their locations, ranges of action and intensities is derived online. Information about static attractive areas is added dynamically into a set of filters that describes possible interactions between moving agents and an environment. The proposed approach is first evaluated on synthetic data; subsequently, the method is applied to real trajectories of pedestrians moving in an indoor environment.
Keywords: Kalman filtering; Interactive force models; Trajectory analysis; Representation of environments; Situation awareness
1 Introduction
Analysis of trajectories performed by moving entities in environments is an important topic for different fields such as video surveillance [1], crowd/vehicle analysis [2, 3] and, in general, for monitoring systems in which the dynamics of agents can lead to a better understanding of patterns and situations of interest [4, 5]. Abnormality detection is one of the most explored applications that involves analysis of trajectories. In such an approach, by characterizing agents' motions, it is possible to learn and identify normal/abnormal situations in a certain environment.
In general, approaches for abnormality detection are based on a set of observations that define the regular behaviors in a scene. Afterwards, abnormalities are defined as behaviors that do not match with patterns previously learned as normal, i.e., behaviors that have not been observed before [6].
Combining Learned Representations for Combinatorial Optimization
Patel, Saavan, Salahuddin, Sayeef
We propose a new approach to combine Restricted Boltzmann Machines (RBMs) that can be used to solve combinatorial optimization problems. This allows synthesis of larger models from smaller RBMs that have been pretrained, thus effectively bypassing the problem of learning in large RBMs, and creating a system able to model a large, complex multi-modal space. We validate this approach by using learned representations to create "invertible boolean logic", where we can use Markov chain Monte Carlo (MCMC) approaches to find the solution to large-scale boolean satisfiability problems and show viability towards other combinatorial optimization problems. Using this method, we are able to solve 64-bit addition-based problems, as well as factorize 16-bit numbers. We find that these combined representations can provide a more accurate result for the same sample size as compared to a fully trained model.
The Ising problem has long been known to be in the class of NP-hard problems, with no known exact polynomial-time solution. Because of this, a large class of combinatorial optimization problems can be reformulated as Ising problems and solved by finding the ground state of that system (Barahona, 1982; Kirkpatrick et al., 1983; Lucas, 2014). The Boltzmann Machine (Ackley et al., 1987) was originally introduced as a constraint satisfaction network based on the Ising model, where the weights would encode some global constraints, and stochastic units were used to escape local minima. The original Boltzmann Machine found favor as a method to solve various combinatorial optimization problems (Korst & Aarts, 1989). However, learning was very slow with this model due to the difficulties with sampling and convergence, as well as the inability to exactly calculate the partition function.
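To make "finding the ground state" concrete, the sketch below searches every spin configuration of a tiny Ising system by brute force. It is an illustration only and is not the RBM/MCMC method described above: the sign convention E(s) = -Σ_{i<j} J_ij s_i s_j - Σ_i h_i s_i, the upper-triangular coupling matrix, and the function names are assumptions, and exhaustive search is exponential in the number of spins.

```python
from itertools import product


def ising_energy(spins, J, h):
    """Energy of a spin configuration; each spin is -1 or +1.

    J is read as an upper-triangular coupling matrix (only J[i][j] with i < j is used),
    h is the list of local fields.
    """
    n = len(spins)
    pair_term = sum(J[i][j] * spins[i] * spins[j]
                    for i in range(n) for j in range(i + 1, n))
    field_term = sum(h[i] * spins[i] for i in range(n))
    return -pair_term - field_term


def ground_state(J, h):
    """Return a lowest-energy configuration by checking all 2**n of them."""
    n = len(h)
    return min(product((-1, 1), repeat=n),
               key=lambda spins: ising_energy(spins, J, h))


# Two ferromagnetically coupled spins with a small field on the first one.
J_example = [[0.0, 1.0],
             [0.0, 0.0]]
h_example = [0.5, 0.0]
best = ground_state(J_example, h_example)
```

The exponential cost of this search is what motivates sampling-based approaches for larger systems.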
An Efficient Algorithm for Multiple-Pursuer-Multiple-Evader Pursuit/Evasion Game
We present a method for pursuit/evasion that is highly efficient and scales to large teams of aircraft. The underlying algorithm is an efficient algorithm for solving Markov Decision Processes (MDPs) that supports fully continuous state spaces. We demonstrate the algorithm in a team pursuit/evasion setting in a 3D environment using a pseudo-6DOF model and study performance by varying team sizes. We show that as the number of aircraft in the simulation grows, computational performance remains efficient and is suitable for real-time systems. We also define probability-to-win and survivability metrics that describe the teams' performance over multiple trials, and show that the algorithm performs consistently. We provide numerical results showing control inputs for a typical 1v1 encounter and provide videos for 1v1, 2v2, 3v3, 4v4, and 10v10 contests to demonstrate the ability of the algorithm to adapt seamlessly to complex environments.
Off-Policy Evaluation in Partially Observable Environments
Tennenholtz, Guy, Mannor, Shie, Shalit, Uri
This work studies the problem of batch off-policy evaluation for Reinforcement Learning in partially observable environments. Off-policy evaluation under partial observability is inherently prone to bias, with risk of arbitrarily large errors. We define the problem of off-policy evaluation for Partially Observable Markov Decision Processes (POMDPs) and establish what we believe is the first off-policy evaluation result for POMDPs. In addition, we formulate a model in which observed and unobserved variables are decoupled into two dynamic processes, called a Decoupled POMDP. We show how off-policy evaluation can be performed under this new model, mitigating estimation errors inherent to the procedure we provided for general POMDPs. We demonstrate the pitfalls of off-policy evaluation in POMDPs using a well-known off-policy method, importance sampling, and compare with our result on synthetic medical data.
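For reference, the ordinary trajectory-wise importance sampling estimator named above can be sketched as follows. This is the textbook estimator, not the paper's proposed method; the data layout (trajectories as lists of `(observation, action, reward)` tuples, policies as nested dictionaries of action probabilities) and all names are assumptions chosen for illustration.

```python
def importance_sampling_estimate(trajectories, pi_e, pi_b, gamma=1.0):
    """Estimate the value of evaluation policy pi_e from data collected under pi_b.

    trajectories : list of trajectories, each a list of (observation, action, reward)
    pi_e, pi_b   : dicts mapping observation -> {action: probability}
    gamma        : discount factor
    """
    total = 0.0
    for trajectory in trajectories:
        weight = 1.0          # product of per-step probability ratios
        discounted_return = 0.0
        for t, (obs, action, reward) in enumerate(trajectory):
            weight *= pi_e[obs][action] / pi_b[obs][action]
            discounted_return += (gamma ** t) * reward
        total += weight * discounted_return
    return total / len(trajectories)


# Hypothetical one-step data: the behavior policy is uniform over two actions,
# the evaluation policy always picks action "a".
pi_b = {"o": {"a": 0.5, "b": 0.5}}
pi_e = {"o": {"a": 1.0, "b": 0.0}}
data = [[("o", "a", 1.0)], [("o", "b", 0.0)]]
estimate = importance_sampling_estimate(data, pi_e, pi_b)
```

The ratios here are computed from observations alone. If the behavior policy actually depended on variables that are not recorded, these ratios would be misspecified, which is the kind of setting the abstract warns about.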
Quantile Markov Decision Process
Li, Xiaocheng, Zhong, Huaiyang, Brandeau, Margaret L.
In this paper, we consider the problem of optimizing the quantiles of the cumulative rewards of Markov Decision Processes (MDPs), to which we refer as Quantile Markov Decision Processes (QMDPs). Traditionally, the goal of a Markov Decision Process (MDP) is to maximize expected cumulative reward over a defined horizon (possibly infinite). In many applications, however, a decision maker may be interested in optimizing a specific quantile of the cumulative reward instead of its expectation. We provide analytical results characterizing the optimal QMDP solution and present algorithms for solving the QMDP. We illustrate the model with two experiments: a grid game and an HIV optimal treatment experiment.
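A small worked example shows why a quantile objective can rank policies differently from an expectation objective. The two return distributions below are invented for illustration and do not come from the paper; the quantile is taken as the smallest value x with P(X <= x) >= tau, which is one common convention and is assumed here.

```python
def expectation(dist):
    """Expected value of a discrete distribution given as (value, probability) pairs."""
    return sum(prob * value for value, prob in dist)


def quantile(dist, tau):
    """Smallest value x such that P(X <= x) >= tau, for 0 < tau <= 1."""
    cumulative = 0.0
    for value, prob in sorted(dist):
        cumulative += prob
        if cumulative >= tau:
            return value
    # Guard against rounding when probabilities sum to slightly less than 1.
    return max(value for value, _ in dist)


# Hypothetical cumulative-reward distributions of two policies.
safe_policy = [(10.0, 1.0)]                 # always returns 10
risky_policy = [(0.0, 0.8), (100.0, 0.2)]   # usually 0, occasionally 100

prefers_risky_by_mean = expectation(risky_policy) > expectation(safe_policy)
prefers_safe_by_median = quantile(safe_policy, 0.5) > quantile(risky_policy, 0.5)
```

Under the expectation the risky policy looks better, while under the 0.5-quantile the safe policy does, so the choice of objective changes which policy is optimal.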
Part-of-Speech Tagging
Rule-Based: A dictionary is constructed with possible tags for each word. Rules are either hand-crafted, learned or both. An example rule might say, "If an ambiguous/unknown word X is preceded by a determiner and followed by a noun, tag it as an adjective."
Statistical: A text corpus is used to derive useful probabilities. Given a sequence of words, the most probable sequence of tags is selected.
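The example rule quoted above can be turned into a toy rule-based tagger. This is a minimal sketch under assumptions: the tiny `LEXICON`, the tag names, and the `UNK` fallback are invented for illustration, and a real tagger would use a much larger dictionary and many more rules.

```python
# Hypothetical dictionary of possible tags per word.
LEXICON = {
    "the": {"DET"},
    "a": {"DET"},
    "dog": {"NOUN"},
    "cat": {"NOUN"},
    "light": {"NOUN", "ADJ", "VERB"},  # ambiguous on purpose
}


def tag(words):
    """Tag a sentence using the dictionary plus one contextual rule."""
    # Pass 1: assign a tag only when the dictionary gives exactly one option.
    tags = []
    for word in words:
        candidates = LEXICON.get(word.lower(), set())
        tags.append(next(iter(candidates)) if len(candidates) == 1 else None)

    # Pass 2: an ambiguous/unknown word between a determiner and a noun
    # is tagged as an adjective.
    for i in range(1, len(words) - 1):
        if tags[i] is None and tags[i - 1] == "DET" and tags[i + 1] == "NOUN":
            tags[i] = "ADJ"

    # Anything still unresolved is marked as unknown.
    return [t if t is not None else "UNK" for t in tags]


unknown_word_example = tag(["The", "zorky", "dog"])
ambiguous_word_example = tag(["the", "light", "cat"])
```

Both the unknown word "zorky" and the ambiguous word "light" are resolved to an adjective by the same rule, because each sits between a determiner and a noun.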