Undirected Networks
A Model for Auto-Programming for General Purposes
The Universal Turing Machine (TM) is a model for VonNeumann computers --- general-purpose computers. A human brain can inside-skull-automatically learn a universal TM so that he acts as a general-purpose computer and writes a computer program for any practical purposes. It is unknown whether a machine can accomplish the same. This theoretical work shows how the Developmental Network (DN) can accomplish this. Unlike a traditional TM, the TM learned by DN is a super TM --- Grounded, Emergent, Natural, Incremental, Skulled, Attentive, Motivated, and Abstractive (GENISAMA). A DN is free of any central controller (e.g., Master Map, convolution, or error back-propagation). Its learning from a teacher TM is one transition observation at a time, immediate, and error-free until all its neurons have been initialized by early observed teacher transitions. From that point on, the DN is no longer error-free but is always optimal at every time instance in the sense of maximal likelihood, conditioned on its limited computational resources and the learning experience. This letter also extends the Church-Turing thesis to automatic programming for general purposes and sketchily proved it.
Policy Transfer with Strategy Optimization
Yu, Wenhao, Liu, C. Karen, Turk, Greg
Computer simulation provides an automatic and safe way for training robotic control policies to achieve complex tasks such as locomotion. However, a policy trained in simulation usually does not transfer directly to the real hardware due to the differences between the two environments. Transfer learning using domain randomization is a promising approach, but it usually assumes that the target environment is close to the distribution of the training environments, thus relying heavily on accurate system identification. In this paper, we present a different approach that leverages domain randomization for transferring control policies to unknown environments. The key idea that, instead of learning a single policy in the simulation, we simultaneously learn a family of policies that exhibit different behaviors. When tested in the target environment, we directly search for the best policy in the family based on the task performance, without the need to identify the dynamic parameters. We evaluate our method on five simulated robotic control problems with different discrepancies in the training and testing environment and demonstrate that our method can overcome larger modeling errors compared to training a robust policy or an adaptive policy. Recent developments in Deep Reinforcement Learning (DRL) have shown the potential to learn complex robotic controllers in an automatic way with minimal human intervention. However, due to the high sample complexity of DRL algorithms, directly training control policies on the hardware still remains largely impractical for agile tasks such as locomotion. A promising direction to address this issue is to use the idea of transfer learning which learns a model in a source environment and transfers it to a target environment of interest. In the context of learning robotic control policies, we can consider the real world the target environment and the computer simulation the source environment.
Optimal Hierarchical Learning Path Design with Reinforcement Learning
Li, Xiao, Xu, Hanchen, Zhang, Jinming, Chang, Hua-hua
E-learning systems are capable of providing more adaptive and efficient learning experiences for students than the traditional classroom setting. A key component of such systems is the learning strategy, the algorithm that designs the learning paths for students based on information such as the students' current progresses, their skills, learning materials, and etc. In this paper, we address the problem of finding the optimal learning strategy for an E-learning system. To this end, we first develop a model for students' hierarchical skills in the E-learning system. Based on the hierarchical skill model and the classical cognitive diagnosis model, we further develop a framework to model various proficiency levels of hierarchical skills. The optimal learning strategy on top of the hierarchical structure is found by applying a model-free reinforcement learning method, which does not require information on students' learning transition process. The effectiveness of the proposed framework is demonstrated via numerical experiments.
Is multiagent deep reinforcement learning the answer or the question? A brief survey
Hernandez-Leal, Pablo, Kartal, Bilal, Taylor, Matthew E.
Deep reinforcement learning (DRL) has achieved outstanding results in recent years. This has led to a dramatic increase in the number of applications and methods. Recent works have explored learning beyond single-agent scenarios and have considered multiagent scenarios. Initial results report successes in complex multiagent domains, although there are several challenges to be addressed. In this context, first, this article provides a clear overview of current multiagent deep reinforcement learning (MDRL) literature. Second, it provides guidelines to complement this emerging area by (i) showcasing examples on how methods and algorithms from DRL and multiagent learning (MAL) have helped solve problems in MDRL and (ii) providing general lessons learned from these works. We expect this article will help unify and motivate future research to take advantage of the abundant literature that exists in both areas (DRL and MAL) in a joint effort to promote fruitful research in the multiagent community.
The Viterbi process, decay-convexity and parallelized maximum a-posteriori estimation
The Viterbi process is the limiting maximum a-posteriori estimate of the unobserved path in a hidden Markov model as the length of the time horizon grows. The existence of such a process suggests that approximate estimation using optimization algorithms which process data segments in parallel may be accurate. For models on state-space $\mathbb{R}^{d}$ satisfying a new "decay-convexity" condition, we approach the existence of the Viterbi process through fixed points of ordinary differential equations in a certain infinite dimensional Hilbert space. Quantitative bounds on the distance to the Viterbi process show that approximate estimation via parallelization can indeed be accurate and scaleable to high-dimensional problems because the rate of convergence to the Viterbi process does not necessarily depend on $d$.
Policy Design for Active Sequential Hypothesis Testing using Deep Learning
Kartik, Dhruva, Sabir, Ekraam, Mitra, Urbashi, Natarajan, Prem
Information theory has been very successful in obtaining performance limits for various problems such as communication, compression and hypothesis testing. Likewise, stochastic control theory provides a characterization of optimal policies for Partially Observable Markov Decision Processes (POMDPs) using dynamic programming. However, finding optimal policies for these problems is computationally hard in general and thus, heuristic solutions are employed in practice. Deep learning can be used as a tool for designing better heuristics in such problems. In this paper, the problem of active sequential hypothesis testing is considered. The goal is to design a policy that can reliably infer the true hypothesis using as few samples as possible by adaptively selecting appropriate queries. This problem can be modeled as a POMDP and bounds on its value function exist in literature. However, optimal policies have not been identified and various heuristics are used. In this paper, two new heuristics are proposed: one based on deep reinforcement learning and another based on a KL-divergence zero-sum game. These heuristics are compared with state-of-the-art solutions and it is demonstrated using numerical experiments that the proposed heuristics can achieve significantly better performance than existing methods in some scenarios.
S-RL Toolbox: Environments, Datasets and Evaluation Metrics for State Representation Learning
Raffin, Antonin, Hill, Ashley, Traorรฉ, Renรฉ, Lesort, Timothรฉe, Dรญaz-Rodrรญguez, Natalia, Filliat, David
State representation learning aims at learning compact representations from raw observations in robotics and control applications. Approaches used for this objective are auto-encoders, learning forward models, inverse dynamics or learning using generic priors on the state characteristics. However, the diversity in applications and methods makes the field lack standard evaluation datasets, metrics and tasks. This paper provides a set of environments, data generators, robotic control tasks, metrics and tools to facilitate iterative state representation learning and evaluation in reinforcement learning settings.
Applications of PageRank to Function Comparison and Malware Classification
Slawinski, Michael A., Wortman, Andy
We classify .NET files as either benign or malicious by examining certain directed graphs extracted from the files via decompilation. Each graph is viewed probabilistically as a Markov chain where each node heuristically represents the possible state of the running file, and by computing the PageRank vector (Perron vector with transport) we can assign a probability measure over the nodes of the given graph. We train a random forest with features derived from computing Lebesgue antiderivatives of functions defined over the vertex sets of the graphs listed above against the PageRank measure. The model was trained on 2.5 million samples of .NET and has an accuracy of 98.3\% on test data. The median time needed for decompilation and scoring was 24ms.
Deep Neural Network Compression for Aircraft Collision Avoidance Systems
Julian, Kyle D., Kochenderfer, Mykel J., Owen, Michael P.
The resulting collision avoidance strategy can be represented as a numeric table. This methodology has been used in the development of the Airborne Collision Avoidance System X (ACAS X) family of collision avoidance systems for manned and unmanned aircraft, but the high dimensionality of the state space leads to very large tables. To improve storage efficiency, a deep neural network is used to approximate the table. With the use of an asymmetric loss function and a gradient descent algorithm, the parameters for this network can be trained to provide accurate estimates of table values while preserving the relative preferences of the possible advisories for each state. By training multiple networks to represent subtables, the network also decreases the required runtime for computing the collision avoidance advisory. Simulation studies show that the network improves the safety and efficiency of the collision avoidance system. Because only the network parameters need to be stored, the required storage space is reduced by a factor of 1000, enabling the collision avoidance system to operate using current avionics systems.
Discovering General-Purpose Active Learning Strategies
Konyushkova, Ksenia, Sznitman, Raphael, Fua, Pascal
We propose a general-purpose approach to discovering active learning (AL) strategies from data. These strategies are transferable from one domain to another and can be used in conjunction with many machine learning models. To this end, we formalize the annotation process as a Markov decision process, design universal state and action spaces and introduce a new reward function that precisely model the AL objective of minimizing the annotation cost We seek to find an optimal (non-myopic) AL strategy using reinforcement learning. We evaluate the learned strategies on multiple unrelated domains and show that they consistently outperform state-of-the-art baselines. Modern supervised machine learning (ML) methods require large annotated datasets for training purposes and the cost of producing them can easily become prohibitive. Active learning (AL) mitigates the problem by selecting intelligently and adaptively a subset of the data to be annotated. To do so, AL typically relies on informativeness measures that identify unlabelled datapoints whose labels are most likely to help to improve the performance of the trained model. As a result, good performance is achieved using far fewer annotations than by randomly labelling data. Most AL selection strategies are hand-designed either on the basis of researcher's expertise and intuition or by approximating theoretical criteria (Settles, 2012).