Markov Models
A Learning-Based Two-Stage Spectrum Sharing Strategy with Multiple Primary Transmit Power Levels
Zhang, Rui, Cheng, Peng, Chen, Zhuo, Li, Yonghui, Vucetic, Branka
Multi-parameter cognition in a cognitive radio network (CRN) provides a more thorough understanding of the radio environments, and could potentially lead to far more intelligent and efficient spectrum usage for a secondary user. In this paper, we investigate the multi-parameter cognition problem for a CRN where the primary transmitter (PT) radiates multiple transmit power levels, and propose a learning-based two-stage spectrum sharing strategy. We first propose a data-driven/machine learning based multi-level spectrum sensing scheme, including the spectrum learning (Stage I) and prediction (the first part in Stage II). This fully blind sensing scheme does not require any prior knowledge of the PT power characteristics. Then, based on a novel normalized power level alignment metric, we propose two prediction-transmission structures, namely periodic and non-periodic, for spectrum access (the second part in Stage II), which enable the secondary transmitter (ST) to closely follow the PT power level variation. The periodic structure features a fixed prediction interval, while the non-periodic one dynamically determines the interval with a proposed reinforcement learning algorithm to further improve the alignment metric. Finally, we extend the prediction-transmission structure to an online scenario, where the number of PT power levels might change as a consequence of PT adapting to the environment fluctuation or quality of service variation. The simulation results demonstrate the effectiveness of the proposed strategy in various scenarios.
Arena: a toolkit for Multi-Agent Reinforcement Learning
Wang, Qing, Xiong, Jiechao, Han, Lei, Fang, Meng, Sun, Xinghai, Zheng, Zhuobin, Sun, Peng, Zhang, Zhengyou
We introduce Arena, a toolkit for multi-agent reinforcement learning (MARL) research. MARL usually requires customizing observations, rewards and actions for each agent, changing cooperative-competitive agent interaction, and playing with or against a third-party agent. We provide a novel modular design, called Interface, for manipulating such routines in essentially two ways: 1) Different interfaces can be concatenated and combined, which extends the OpenAI Gym Wrappers concept to MARL scenarios. 2) During MARL training or testing, interfaces can be embedded in either wrapped OpenAI Gym compatible Environments or raw environment compatible Agents. We offer off-the-shelf interfaces for several popular MARL platforms, including StarCraft II, Pommerman, ViZDoom, and Soccer. The interfaces effectively support self-play RL and cooperative-competitive hybrid MARL. Arena can also be conveniently extended to your own favorite MARL platform.
Dual Proxy Gaussian Process Stack: Integrating Benthic ${\delta}^{18}{\rm{O}}$ and Radiocarbon Proxies for Inferring Ages on Ocean Sediment Cores
Lee, Taehee, Lisiecki, Lorraine E., Rand, Devin, Gebbie, Geoffrey, Lawrence, Charles E.
Ages in ocean sediment cores are often inferred using either benthic ${\delta}^{18}{\rm{O}}$ or planktonic ${}^{14}{\rm{C}}$ of foraminiferal calcite. Existing probabilistic dating methods infer ages using two distinct approaches: ages are either inferred directly using radionuclides, e.g. Bacon [Blaauw and Christen (2011)]; or indirectly based on the alignment of records, e.g. HMM-Match [Lin et al. (2014)]. In this paper, we introduce a novel algorithm for integrating these two approaches by constructing Dual Proxy Gaussian Process (DPGP) stacks, which represent a probabilistic model of benthic ${\delta}^{18}{\rm{O}}$ change (and its timing) based on a set of cores. While a previous stack construction algorithm, HMM-Match, uses a discrete age inference model based on hidden Markov models (HMMs) [Durbin et al. (1998)] and requires enough records to sufficiently cover all its ages, DPGP stacks with time-varying variances are constructed with continuous ages obtained by particle smoothing [Doucet et al. (2001); Klaas et al. (2006)] and Markov chain Monte Carlo (MCMC) [Peters (2008)] algorithms, and can be derived from a small number of records by applying Gaussian process regression [Rasmussen and Williams (2005)]. As an example of the stacking method, we construct a local stack from 6 cores in the deep northeastern Atlantic Ocean and compare it to a deterministically constructed ${\delta}^{18}{\rm{O}}$ stack of 58 cores from the deep North Atlantic [Lisiecki and Stern (2016)]. We also provide two examples of how dual proxy alignment ages can be inferred by aligning additional cores to the stack.
Learning Multimorbidity Patterns from Electronic Health Records Using Non-negative Matrix Factorisation
Hassaine, Abdelaali, Canoy, Dexter, Solares, Jose Roberto Ayala, Zhu, Yajie, Rao, Shishir, Li, Yikuan, Rahimi, Kazem, Salimi-Khorshidi, Gholamreza
Multimorbidity, or the presence of several medical conditions in the same individual, has been increasing in the population both in absolute and relative terms. However, multimorbidity remains poorly understood, and the evidence from existing research to describe its burden, determinants and consequences has been limited. Many of these studies are cross-sectional and do not explicitly account for the evolution of multimorbidity patterns over time. Some studies were based on small datasets, used arbitrary or narrow age ranges, or lacked appropriate clinical validation. In this study, we applied Non-negative Matrix Factorisation (NMF) in a novel way to one of the largest electronic health records (EHR) databases in the world (with 4 million patients), for simultaneously modelling disease clusters and their role in one's multimorbidity over time. Furthermore, we demonstrated how the temporal characteristics that our model associates with each disease cluster can help mine disease trajectories/networks and generate new hypotheses for the formation of multimorbidity clusters as a function of time/ageing. Our results suggest that our method's ability to learn the underlying dynamics of diseases can provide the field with a novel data-driven, exploratory way of learning the patterns of multimorbidity and their interactions over time.
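As a rough illustration of the factorisation step only, the sketch below runs plain NMF with the standard Lee-Seung multiplicative updates on a toy non-negative matrix. It is not the temporal variant described in the abstract: the matrix layout (rows as patients, columns as diseases), the function names `nmf` and `frobenius_error`, and all parameter values are assumptions made for this example.

```python
import random
from typing import List, Tuple

Matrix = List[List[float]]


def transpose(a: Matrix) -> Matrix:
    return [list(col) for col in zip(*a)]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


def frobenius_error(v: Matrix, w: Matrix, h: Matrix) -> float:
    """Squared Frobenius norm of V - WH."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))


def nmf(v: Matrix, rank: int, iters: int = 200, seed: int = 0,
        eps: float = 1e-9) -> Tuple[Matrix, Matrix]:
    """Approximate V (n x m) as W (n x rank) times H (rank x m), all non-negative."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    # Strictly positive random initialisation (an arbitrary choice for this sketch).
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(rank)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(rank)]
             for i in range(n)]
    return w, h


# Toy data: three hypothetical patients, four hypothetical diseases.
toy = [[1.0, 1.0, 0.0, 0.0],
       [1.0, 1.0, 0.0, 0.0],
       [0.0, 0.0, 1.0, 1.0]]
loadings, clusters = nmf(toy, rank=2)
```

In this reading, each row of `clusters` would play the role of a disease cluster and each row of `loadings` gives one patient's weights on those clusters. A practical analysis would use an optimised library implementation rather than this list-based version.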
Delegative Reinforcement Learning: learning to avoid traps with a little help
Most known regret bounds for reinforcement learning are either episodic or assume an environment without traps. We derive a regret bound without making either assumption, by allowing the algorithm to occasionally delegate an action to an external advisor. We thus arrive at a setting of active one-shot model-based reinforcement learning that we call DRL (delegative reinforcement learning). The algorithm we construct in order to demonstrate the regret bound is a variant of Posterior Sampling Reinforcement Learning supplemented by a subroutine that decides which actions should be delegated. The algorithm is not anytime, since the parameters must be adjusted according to the target time discount. Currently, our analysis is limited to Markov decision processes with finite numbers of hypotheses, states and actions.
Conditional Markov Chain Search for the Generalised Travelling Salesman Problem for Warehouse Order Picking
Nalivajevs, Olegs, Karapetyan, Daniel
The Generalised Travelling Salesman Problem (GTSP) is a well-known problem that, among other applications, arises in warehouse order picking, where each stock item is distributed between several locations -- a typical approach in large modern warehouses. However, the instances commonly used in the literature have a completely different structure, and the methods are designed with those instances in mind. In this paper, we give a new pseudo-random instance generator that reflects warehouse order picking and publish new benchmark testbeds. We also use the Conditional Markov Chain Search framework to automatically generate new GTSP metaheuristics trained specifically for warehouse order picking. Finally, we report the computational results of our metaheuristics to enable further competition between solvers.
Empowering A* Search Algorithms with Neural Networks for Personalized Route Recommendation
Wang, Jingyuan, Wu, Ning, Zhao, Wayne Xin, Peng, Fanzhang, Lin, Xin
Personalized Route Recommendation (PRR) aims to generate user-specific route suggestions in response to users' route queries. Early studies cast the PRR task as a pathfinding problem on graphs, and adapt search algorithms by integrating heuristic strategies. Although these methods are effective to some extent, they require setting the cost functions with heuristics. In addition, it is difficult to utilize useful context information in the search procedure. To address these issues, we propose using neural networks to automatically learn the cost functions of a classic heuristic algorithm, namely the A* algorithm, for the PRR task. Our model consists of two components. First, we employ attention-based Recurrent Neural Networks (RNN) to model the cost from the source to the candidate location by incorporating useful context information. Instead of learning a single cost value, the RNN component is able to learn a time-varying vectorized representation for the moving state of a user. Second, we propose to use a value network for estimating the cost from a candidate location to the destination. For capturing structural characteristics, the value network is built on top of improved graph attention networks by incorporating the moving state of a user and other context information. The two components are integrated in a principled way for deriving a more accurate cost of a candidate location. Extensive experimental results on three real-world datasets have shown the effectiveness and robustness of the proposed model.
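For readers unfamiliar with the cost decomposition the abstract refers to, the following minimal sketch shows classic A* with its two cost terms: the accumulated cost g from the source to a candidate, and an estimated cost h from the candidate to the destination. Here g is summed from fixed edge weights and h is a plain callable, which is where a learned estimator could be plugged in. The function name `a_star`, the toy graph, and the lookup-table heuristic are all assumptions for illustration and do not reproduce the paper's neural components.

```python
import heapq
import itertools
from typing import Callable, Dict, Hashable, List, Optional, Tuple

Graph = Dict[Hashable, Dict[Hashable, float]]


def a_star(graph: Graph, source: Hashable, target: Hashable,
           heuristic: Callable[[Hashable], float]
           ) -> Optional[Tuple[List[Hashable], float]]:
    """Return (path, cost) from source to target, or None if unreachable.

    Candidates are expanded in order of f = g + h, where g is the cost so far
    and h = heuristic(node) estimates the remaining cost to the target.
    """
    tie = itertools.count()  # tie-breaker so the heap never compares paths
    frontier = [(heuristic(source), next(tie), 0.0, source, [source])]
    best_g = {source: 0.0}
    while frontier:
        _, _, g, node, path = heapq.heappop(frontier)
        if node == target:
            return path, g
        if g > best_g.get(node, float("inf")):
            continue  # stale entry: a cheaper route to this node was found
        for nxt, weight in graph.get(node, {}).items():
            new_g = g + weight
            if new_g < best_g.get(nxt, float("inf")):
                best_g[nxt] = new_g
                heapq.heappush(
                    frontier,
                    (new_g + heuristic(nxt), next(tie), new_g, nxt, path + [nxt]),
                )
    return None


# Hypothetical road graph; edge weights stand in for the cost from the source.
roads: Graph = {
    "a": {"b": 1.0, "c": 4.0},
    "b": {"c": 1.0, "d": 5.0},
    "c": {"d": 1.0},
    "d": {},
}
# Hypothetical estimate of remaining cost; a trained model could replace this.
remaining = {"a": 3.0, "b": 2.0, "c": 1.0, "d": 0.0}
route = a_star(roads, "a", "d", lambda n: remaining[n])
```

With an admissible heuristic such as the table above, A* returns a minimum-cost route. A learned heuristic carries no such guarantee unless admissibility is enforced separately.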
Entropic Regularization of Markov Decision Processes
An optimal feedback controller for a given Markov decision process (MDP) can in principle be synthesized by value or policy iteration. However, if the system dynamics and the reward function are unknown, a learning agent must discover an optimal controller via direct interaction with the environment. Such interactive data gathering commonly leads to divergence towards dangerous or uninformative regions of the state space unless additional regularization measures are taken. Prior works proposed bounding the information loss measured by the Kullback-Leibler (KL) divergence at every policy improvement step to eliminate instability in the learning dynamics. In this paper, we consider a broader family of $f$-divergences, and more concretely $\alpha$-divergences, which inherit the beneficial property of providing the policy improvement step in closed form while at the same time yielding a corresponding dual objective for policy evaluation. Such an entropic proximal policy optimization view gives a unified perspective on compatible actor-critic architectures. In particular, common least-squares value function estimation coupled with advantage-weighted maximum likelihood policy improvement is shown to correspond to the Pearson $\chi^2$-divergence penalty. Other actor-critic pairs arise for various choices of the penalty-generating function $f$. On a concrete instantiation of our framework with the $\alpha$-divergence, we carry out asymptotic analysis of the solutions for different values of $\alpha$ and demonstrate the effects of the divergence function choice on standard reinforcement learning problems.
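To make the phrase "policy improvement step in closed form" concrete, here is the familiar KL-penalized case in a simplified per-state form. The symbols $A(s,a)$ (an advantage estimate), $\eta > 0$ (the penalty weight) and $\pi_{\text{old}}$ (the previous policy) are notation assumed for this sketch. The paper's own formulation may differ, for example by constraining state-action distributions rather than per-state policies.

```latex
\begin{aligned}
\pi_{\text{new}}(\cdot \mid s)
  &= \arg\max_{\pi} \; \sum_{a} \pi(a \mid s)\, A(s,a)
     \;-\; \eta \, \mathrm{KL}\!\big(\pi(\cdot \mid s) \,\big\|\, \pi_{\text{old}}(\cdot \mid s)\big)
     \quad \text{s.t.} \quad \sum_{a} \pi(a \mid s) = 1, \\[4pt]
\Longrightarrow \quad
\pi_{\text{new}}(a \mid s)
  &= \frac{\pi_{\text{old}}(a \mid s)\, \exp\!\big(A(s,a)/\eta\big)}
          {\sum_{a'} \pi_{\text{old}}(a' \mid s)\, \exp\!\big(A(s,a')/\eta\big)} .
\end{aligned}
```

The second line follows from setting the derivative of the Lagrangian with respect to $\pi(a \mid s)$ to zero. Replacing the KL term with another $f$-divergence changes the exponential weighting into a different function of the advantage.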
Learning End-to-End Goal-Oriented Dialog with Maximal User Task Success and Minimal Human Agent Use
Rajendran, Janarthanan, Ganhotra, Jatin, Polymenakos, Lazaros
Neural end-to-end goal-oriented dialog systems have shown promise in reducing the workload of human agents for customer service, as well as reducing wait time for users. However, their inability to handle new user behavior at deployment has limited their usage in the real world. In this work, we propose an end-to-end trainable method for neural goal-oriented dialog systems which handles new user behaviors at deployment by transferring the dialog to a human agent intelligently. The proposed method has three goals: 1) maximize the user's task success by transferring to human agents, 2) minimize the load on the human agents by transferring to them only when it is essential and 3) learn online from the human agent's responses to reduce the human agents' load further. We evaluate our proposed method on a modified-bAbI dialog task that simulates the scenario of new user behaviors occurring at test time. Experimental results show that our proposed method is effective in achieving the desired goals.
Adversarial Security Attacks and Perturbations on Machine Learning and Deep Learning Methods
Cybersecurity also benefits from ML and DL methods for various types of applications. These methods, however, are susceptible to security attacks. Adversaries can exploit the training and testing data of the learning models or can explore the workings of those models to launch advanced future attacks. The topic of adversarial security attacks and perturbations within the ML and DL domains is a recent area of exploration, and security researchers and practitioners have expressed great interest in it. The literature covers different adversarial security attacks and perturbations on ML and DL methods, each with its own presentation style and merits. However, the research community currently needs a comprehensive review that consolidates knowledge on this increasingly focused and growing topic of research. In this review paper, we specifically aim to target new researchers in the cybersecurity domain who may seek to acquire some basic knowledge on the machine learning and deep learning models and algorithms, as well as some of the relevant adversarial security attacks and perturbations.