Undirected Networks
Multi-Objective Controller Synthesis with Uncertain Human Preferences
Chen, Shenghui, Boggess, Kayla, Parker, David, Feng, Lu
Multi-objective controller synthesis concerns the problem of computing an optimal controller subject to multiple (possibly conflicting) objective properties. The relative importance of objectives is often specified by human decision-makers. However, there is inherent uncertainty in human preferences (e.g., due to different preference elicitation methods). In this paper, we formalize the notion of uncertain human preferences and present a novel approach that accounts for uncertain human preferences in the multi-objective controller synthesis for Markov decision processes (MDPs). Our approach is based on mixed-integer linear programming (MILP) and synthesizes a sound, optimally permissive multi-strategy with respect to a multi-objective property and an uncertain set of human preferences. Experimental results on a range of large case studies show that our MILP-based approach is feasible and scalable to synthesize sound, optimally permissive multi-strategies with varying MDP model sizes and uncertainty levels of human preferences. Evaluation via an online user study also demonstrates the quality and benefits of synthesized (multi-)strategies.
Fast constraint satisfaction problem and learning-based algorithm for solving Minesweeper
Sinha, Yash Pratyush, Malviya, Pranshu, Nayak, Rupaj Kumar
Minesweeper is a popular spatial-based decision-making game that works with incomplete information. As an exemplary NP-complete problem, it is a major area of research employing various artificial intelligence paradigms. The present work models this game as Constraint Satisfaction Problem (CSP) and Markov Decision Process (MDP). We propose a new method named as dependents from the independent set using deterministic solution search (DSScsp) for the faster enumeration of all solutions of a CSP based Minesweeper game and improve the results by introducing heuristics. Using MDP, we implement machine learning methods on these heuristics. We train the classification model on sparse data with results from CSP formulation. We also propose a new rewarding method for applying a modified deep Q-learning for better accuracy and versatile learning in the Minesweeper game. The overall results have been analyzed for different kinds of Minesweeper games and their accuracies have been recorded. Results from these experiments show that the proposed method of MDP based classification model and deep Q-learning overall is the best methods in terms of accuracy for games with given mine densities.
From Distributed Machine Learning to Federated Learning: A Survey
Liu, Ji, Huang, Jizhou, Zhou, Yang, Li, Xuhong, Ji, Shilei, Xiong, Haoyi, Dou, Dejing
In recent years, data and computing resources are typically distributed in the devices of end users, various regions or organizations. Because of laws or regulations, the distributed data and computing resources cannot be directly shared among different regions or organizations for machine learning tasks. Federated learning emerges as an efficient approach to exploit distributed data and computing resources, so as to collaboratively train machine learning models, while obeying the laws and regulations and ensuring data security and data privacy. In this paper, we provide a comprehensive survey of existing works for federated learning. We propose a functional architecture of federated learning systems and a taxonomy of related techniques. Furthermore, we present the distributed training, data communication, and security of FL systems. Finally, we analyze their limitations and propose future research directions.
A perspective on the history of Artificial Intelligence (AI)
Artificial Intelligence (AI) history consists of original work and research by not only mathematicians and computer scientists, but studies by psychologists, physicists, and economists have also been much used. The timeline consists of the pre-1950 era of statistical methods to present AlphaZero in 2017 and more. The most significant push in the development of technology was during the 2nd world war where both the allied forces and their enemies worked hard to develop technology which can help them get superiority over others. The timeline started in 1943, work by McCulloch and Pitts on Artificial Neuron gets the recognition of first work on AI. After work done by and McCulloch, Donald Hebb demonstrated rule for modifying connection strings between neurons -- this is called Hebbian learning.
Non-asymptotic Performances of Robust Markov Decision Processes
Markov Decision Processes (MDPs) play key mathematical models in Reinforcement Learning (RL). Despite its success in empirical performances [Haarnoja et al., 2018, Mnih et al., 2015, 2016, Silver et al., 2016], there are also many works providing insightful and solid theoretical understandings towards RL. The difficulty of solving an MDP mainly is due to the reward and transition probability, whose exact information is usually unknown to observers. To deal with the situations, one common approach resorts to offline methods, where the agent only has access to a given explorable dataset generated by given strategies. Many practical deep RL algorithms employ the offline method and achieve state-of-art success [Mnih et al., 2015, Lillicrap et al., 2015, Fujimoto et al., 2019]. In addition to empirical success, there are flourishing works on offline RL from a theoretical perspective. Some prior works [Chen and Jiang, 2019, Agarwal et al., 2020, Duan et al., 2021] have provided solid results on model-free offline methods, while some other works [Sidford et al., 2018, Xie et al., 2019, Yin and Wang, 2020, Yin et al., 2020] consider model-based approaches. However, Mannor et al. [2004] showed that model-based approaches can be quite sensitive to estimation errors by directly estimating the transition probability from an offline dataset.
Reinforcement Learning with Expert Trajectory For Quantitative Trading
Chen, Sihang, Luo, Weiqi, Yu, Chao
In recent years, quantitative investment methods combined with artificial intelligence have attracted more and more attention from investors and researchers. Existing related methods based on the supervised learning are not very suitable for learning problems with long-term goals and delayed rewards in real futures trading. In this paper, therefore, we model the price prediction problem as a Markov decision process (MDP), and optimize it by reinforcement learning with expert trajectory. In the proposed method, we employ more than 100 short-term alpha factors instead of price, volume and several technical factors in used existing methods to describe the states of MDP. Furthermore, unlike DQN (deep Q-learning) and BC (behavior cloning) in related methods, we introduce expert experience in training stage, and consider both the expert-environment interaction and the agent-environment interaction to design the temporal difference error so that the agents are more adaptable for inevitable noise in financial data. Experimental results evaluated on share price index futures in China, including IF (CSI 300) and IC (CSI 500), show that the advantages of the proposed method compared with three typical technical analysis and two deep leaning based methods.
Learning to Detect an Odd Restless Markov Arm with a Trembling Hand
Karthik, P. N., Sundaresan, Rajesh
This paper studies the problem of finding an anomalous arm in a multi-armed bandit when (a) each arm is a finite-state Markov process, and (b) the arms are restless. Here, anomaly means that the transition probability matrix (TPM) of one of the arms (the odd arm) is different from the common TPM of each of the non-odd arms. The TPMs are unknown to a decision entity that wishes to find the index of the odd arm as quickly as possible, subject to an upper bound on the error probability. We derive a problem instance-specific asymptotic lower bound on the expected time required to find the odd arm index, where the asymptotics is as the error probability vanishes. Further, we devise a policy based on the principle of certainty equivalence, and demonstrate that under a continuous selection assumption and a certain regularity assumption on the TPMs, the policy achieves the lower bound arbitrarily closely. Thus, while the lower bound is shown for all problem instances, the upper bound is shown only for those problem instances satisfying the continuous selection and the regularity assumptions. Our achievability analysis is based on resolving the identifiability problem in the context of a certain lifted countable-state controlled Markov process. Index Terms Odd arm identification, restless multi-armed bandits, controlled Markov process, certainty equivalence, identifiability, anomaly detection, anomaly, anomalous process. Consider a multi-armed bandit in which each arm is a time-homogeneous and ergodic discrete-time Markov process taking values in a common finite state space.
Reward prediction for representation learning and reward shaping
Hlynsson, Hlynur Davรญรฐ, Wiskott, Laurenz
One of the fundamental challenges in reinforcement learning (RL) is the one of data efficiency: modern algorithms require a very large number of training samples, especially compared to humans, for solving environments with high-dimensional observations. The severity of this problem is increased when the reward signal is sparse. In this work, we propose learning a state representation in a self-supervised manner for reward prediction. The reward predictor learns to estimate either a raw or a smoothed version of the true reward signal in environment with a single, terminating, goal state. We augment the training of out-of-the-box RL agents by shaping the reward using our reward predictor during policy learning. Using our representation for preprocessing high-dimensional observations, as well as using the predictor for reward shaping, is shown to significantly enhance Actor Critic using Kronecker-factored Trust Region and Proximal Policy Optimization in single-goal environments with visual inputs.
Decomposed Soft Actor-Critic Method for Cooperative Multi-Agent Reinforcement Learning
Pu, Yuan, Wang, Shaochen, Yang, Rui, Yao, Xin, Li, Bin
Deep reinforcement learning methods have shown great performance on many challenging cooperative multi-agent tasks. Two main promising research directions are multi-agent value function decomposition and multi-agent policy gradients. In this paper, we propose a new decomposed multi-agent soft actor-critic (mSAC) method, which effectively combines the advantages of the aforementioned two methods. The main modules include decomposed Q network architecture, discrete probabilistic policy and counterfactual advantage function (optinal). Theoretically, mSAC supports efficient off-policy learning and addresses credit assignment problem partially in both discrete and continuous action spaces. Tested on StarCraft II micromanagement cooperative multiagent benchmark, we empirically investigate the performance of mSAC against its variants and analyze the effects of the different components. Experimental results demonstrate that mSAC significantly outperforms policy-based approach COMA, and achieves competitive results with SOTA value-based approach Qmix on most tasks in terms of asymptotic perfomance metric. In addition, mSAC achieves pretty good results on large action space tasks, such as 2c vs 64zg and MMM2.