AITopics

1912.04009

Country: North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)

Genre:

Research Report (0.50)
Instructional Material (0.46)

Industry: Banking & Finance > Trading (0.48)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.34)

Brown, Daniel S., Niekum, Scott

Deep Bayesian Reward Learning from Preferences

arXiv.org Artificial IntelligenceDec-9-2019

Bayesian inverse reinforcement learning (IRL) methods are ideal for safe imitation learning, as they allow a learning agent to reason about reward uncertainty and the safety of a learned policy. However, Bayesian IRL is computationally intractable for high-dimensional problems because each sample from the posterior requires solving an entire Markov Decision Process (MDP). While there exist non-Bayesian deep IRL methods, these methods typically infer point estimates of reward functions, precluding rigorous safety and uncertainty analysis. We propose Bayesian Reward Extrapolation (B-REX), a highly efficient, preference-based Bayesian reward learning algorithm that scales to high-dimensional, visual control tasks. Our approach uses successor feature representations and preferences over demonstrations to efficiently generate samples from the posterior distribution over the demonstrator's reward function without requiring an MDP solver. Using samples from the posterior, we demonstrate how to calculate high-confidence bounds on policy performance in the imitation learning setting, in which the ground-truth reward function is unknown. We evaluate our proposed approach on the task of learning to play Atari games via imitation learning from pixel inputs, with no access to the game score. We demonstrate that B-REX learns imitation policies that are competitive with a state-of-the-art deep imitation learning method that only learns a point estimate of the reward function. Furthermore, we demonstrate that samples from the posterior generated via B-REX can be used to compute high-confidence performance bounds for a variety of evaluation policies. We show that high-confidence performance bounds are useful for accurately ranking different evaluation policies when the reward function is unknown. We also demonstrate that high-confidence performance bounds may be useful for detecting reward hacking.

demonstration, imitation, reward function, (13 more...)

1912.04472

Country:

North America > United States > Texas > Travis County > Austin (0.04)
North America > Canada (0.04)

Genre: Research Report (0.82)

Industry:

Education (0.93)
Leisure & Entertainment > Games > Computer Games (0.55)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.88)

Maeda, Guilherme, Koc, Okan, Morimoto, Jun

Phase Portraits as Movement Primitives for Fast Humanoid Robot Control

arXiv.org Artificial IntelligenceDec-7-2019

Currently, usual approaches for fast robot control are largely reliant on solving online optimal control problems. Such methods are known to be computationally intensive and sensitive to model accuracy. On the other hand, animals plan complex motor actions not only fast but seemingly with little effort even on unseen tasks. This natural sense of time and coordination motivates us to approach robot control from a motor skill learning perspective to design fast and computationally light controllers that can be learned autonomously by the robot under mild modeling assumptions. This article introduces Phase Portrait Movement Primitives (PPMP), a primitive that predicts dynamics on a low dimensional phase space which in turn is used to govern the high dimensional kinematics of the task. The stark difference with other primitive formulations is a built-in mechanism for phase prediction in the form of coupled oscillators that replaces model-based state estimators such as Kalman filters. The policy is trained by optimizing the parameters of the oscillators whose output is connected to a kinematic distribution in the form of a phase portrait. The drastic reduction in dimensionality allows us to efficiently train and execute PPMPs on a real human-sized, dual-arm humanoid upper body on a task involving 20 degrees-of-freedom. We demonstrate PPMPs in interactions requiring fast reactions times while generating anticipative pose adaptation in both discrete and cyclic tasks.

oscillator, robot, trajectory, (15 more...)

1912.03535

Country:

Asia > Japan > Honshū > Kansai > Kyoto Prefecture > Kyoto (0.04)
North America > United States > New York (0.04)
Europe > Germany > Baden-Württemberg > Tübingen Region > Tübingen (0.04)

Genre: Research Report (1.00)

Industry: Transportation (0.46)

Technology:

Information Technology > Artificial Intelligence > Robots > Humanoid Robots (0.50)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.46)

Kamalzadeh, Hossein, Ahmadi, Abbas, Mansour, Saeed

Clustering Time-Series by a Novel Slope-Based Similarity Measure Considering Particle Swarm Optimization

arXiv.org Machine LearningDec-5-2019

Recently there has been an increase in the studies on time - series data mining specifically time - series clustering due to the vast existe nce of time - series in various domains. The large volume of data in the form of time - series make s it necessary to employ various techniques such as clustering to understand the data and to extract information and hidden patterns. In the field of clustering specifically, time - series clustering, the most important aspects are the similarity measure used and the algorithm employed to conduct the clustering. In this paper, a new similarity measure for time - series clustering is developed based on a combination of a simple representation of time - series, slope of each segment of time - series, Euclidean distance and the so - called dynamic time warping. It is proved in this paper that the proposed distance measure is metric and thus indexing can be applied. For the task of clustering, the Particle Swarm Optimization algorithm is employed. The proposed similarity measure is compared to three existing measures in terms of various criteria used for the evaluation of clustering algorithms. The results indicate that the propo sed similarity measure outperforms the rest in almost every dataset used in this paper.

algorithm, distance measure, sin 1, (14 more...)

1912.02405

Country:

North America > United States > Texas > Dallas County > Dallas (0.04)
North America > United States > New York > New York County > New York City (0.04)
Asia > Middle East > Iran > Tehran Province > Tehran (0.04)
(2 more...)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Evolutionary Systems (1.00)

Gaon, Maor, Brafman, Ronen I.

Reinforcement Learning with Non-Markovian Rewards

arXiv.org Artificial IntelligenceDec-5-2019

The standard RL world model is that of a Markov Decision Process (MDP). A basic premise of MDPs is that the rewards depend on the last state and action only. Yet, many real-world rewards are non-Markovian. For example, a reward for bringing coffee only if requested earlier and not yet served, is non-Markovian if the state only records current requests and deliveries. Past work considered the problem of modeling and solving MDPs with non-Markovian rewards (NMR), but we know of no principled approaches for RL with NMR. Here, we address the problem of policy learning from experience with such rewards. We describe and evaluate empirically four combinations of the classical RL algorithm Q-learning and R-max with automata learning algorithms to obtain new RL algorithms for domains with NMR. We also prove that some of these variants converge to an optimal policy in the limit.

algorithm, automata, automaton, (16 more...)

1912.02552

Country:

North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Asia > Middle East > Israel > Southern District > Beer-Sheva (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.34)

Kamalzadeh, Hossein, Sobhan, Saeid Nassim, Boskabadi, Azam, Hatami, Mohsen, Gharehyakheh, Amin

Modeling and Prediction of Iran's Steel Consumption Based on Economic Activity Using Support Vector Machines

arXiv.org Machine LearningDec-4-2019

The steel industry has great impacts on the economy and the environment of both developed and underdeveloped countries. The importance of this industry and these impacts have led many researchers to investigate the relationship between a country's steel consumption and its economic activity resulting in the so-called intensity of use model. This paper investigates the validity of the intensity of use model for the case of Iran's steel consumption and extends this hypothesis by using the indexes of economic activity to model the steel consumption. We use the proposed model to train support vector machines and predict the future values for Iran's steel consumption. The paper provides detailed correlation tests for the factors used in the model to check for their relationships with the steel consumption. The results indicate that Iran's steel consumption is strongly correlated with its economic activity following the same pattern as the economy has been in the last four decades.

artificial intelligence, banking & finance, steel consumption, (18 more...)

1912.02373

Country:

Asia > Middle East > Iran (1.00)
North America > United States > Texas (0.28)
Europe > United Kingdom (0.14)
North America > United States > Florida > Alachua County > Gainesville (0.14)

Genre: Research Report > Experimental Study (0.48)

Industry:

Materials > Metals & Mining > Steel (1.00)
Energy > Oil & Gas (1.00)
Banking & Finance > Economy (1.00)
Transportation (0.96)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Support Vector Machines (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.46)

Russel, Reazul Hasan, Behzadian, Bahram, Petrik, Marek

Optimizing Norm-Bounded Weighted Ambiguity Sets for Robust MDPs

arXiv.org Artificial IntelligenceDec-4-2019

Optimal policies in Markov decision processes (MDPs) are very sensitive to model misspecification. This raises serious concerns about deploying them in high-stake domains. Robust MDPs (RMDP) provide a promising framework to mitigate vulnerabilities by computing policies with worst-case guarantees in reinforcement learning. The solution quality of an RMDP depends on the ambiguity set, which is a quantification of model uncertainties. In this paper, we propose a new approach for optimizing the shape of the ambiguity sets for RMDPs. Our method departs from the conventional idea of constructing a norm-bounded uniform and symmetric ambiguity set. We instead argue that the structure of a near-optimal ambiguity set is problem specific. Our proposed method computes a weight parameter from the value functions, and these weights then drive the shape of the ambiguity sets. Our theoretical analysis demonstrates the rationale of the proposed idea. We apply our method to several different problem domains, and the empirical results further furnish the practical promise of weighted near-optimal ambiguity sets.

ambiguity, artificial intelligence, machine learning, (13 more...)

1912.02696

Country:

North America > United States > New Hampshire (0.04)
North America > United States > Massachusetts > Middlesex County > Belmont (0.04)
North America > Canada (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.71)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.35)

Linzner, Dominik, Koeppl, Heinz

A Variational Perturbative Approach to Planning in Graph-based Markov Decision Processes

arXiv.org Machine LearningDec-4-2019

Coordinating multiple interacting agents to achieve a common goal is a difficult task with huge applicability. This problem remains hard to solve, even when limiting interactions to be mediated via a static interaction-graph. We present a novel approximate solution method for multi-agent Markov decision problems on graphs, based on variational perturbation theory. We adopt the strategy of planning via inference, which has been explored in various prior works. We employ a non-trivial extension of a novel high-order variational method that allows for approximate inference in large networks and has been shown to surpass the accuracy of existing variational methods. To compare our method to two state-of-the-art methods for multi-agent planning on graphs, we apply the method different standard GMDP problems. We show that in cases, where the goal is encoded as a non-local cost function, our method performs well, while state-of-the-art methods approach the performance of random guess. In a final experiment, we demonstrate that our method brings significant improvement for synchronization tasks.

gmdp, reward function, sabbadin, (14 more...)

1912.01849

Country: Europe > Germany > Hesse > Darmstadt Region > Darmstadt (0.04)

Genre: Research Report > Promising Solution (0.54)

Industry: Health & Medicine (0.47)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.83)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents > Agent Societies (0.66)

arXiv.org Artificial IntelligenceDec-3-2019

Optimal Farsighted Agents Tend to Seek Power

Turner, Alexander Matt

Some researchers have speculated that capable reinforcement learning (RL) agents pursuing misspecified objectives are often incentivized to seek resources and power in pursuit of those objectives. An agent seeking power is incentivized to behave in undesirable ways, including rationally preventing deactivation and correction. Others have voiced skepticism: humans seem idiosyncratic in their urges to power, which need not be present in the agents we design. We formalize a notion of power within the context of finite deterministic Markov decision processes (MDPs). We prove that, with respect to a wide class of reward function distributions, optimal policies tend to seek power over the environment.

artificial intelligence, machine learning, reinforcement learning, (17 more...)

1912.01683

Country:

North America > United States > Oregon (0.04)
Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents > Agent Societies (0.40)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.34)

arXiv.org Machine LearningDec-3-2019

Continuous Online Learning and New Insights to Online Imitation Learning

Lee, Jonathan, Cheng, Ching-An, Goldberg, Ken, Boots, Byron

Online learning is a powerful tool for analyzing iterative algorithms. However, the classic adversarial setup sometimes fails to capture certain regularity in online problems in practice. Motivated by this, we establish a new setup, called Continuous Online Learning (COL), where the gradient of online loss function changes continuously across rounds with respect to the learner's decisions. We show that COL covers and more appropriately describes many interesting applications, from general equilibrium problems (EPs) to optimization in episodic MDPs. Using this new setup, we revisit the difficulty of achieving sublinear dynamic regret. We prove that there is a fundamental equivalence between achieving sublinear dynamic regret in COL and solving certain EPs, and we present a reduction from dynamic regret to both static regret and convergence rate of the associated EP. At the end, we specialize these new insights into online imitation learning and show improved understanding of its learning stability.

algorithm, dynamic regret, sublinear dynamic regret, (15 more...)

1912.01261

Country:

North America > United States > California > Alameda County > Berkeley (0.04)
North America > Canada (0.04)

Genre: Research Report (0.50)

Industry: Education > Educational Setting > Online (0.85)

Technology:

Information Technology > Enterprise Applications > Human Resources > Learning Management (0.85)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.34)