AITopics

#artificialintelligenceFeb-26-2021, 04:55:33 GMT

How to Kickstart an AI Venture Without Proprietary Data

A few years ago, I learned about the billions of dollars banks lose to credit card fraud on an annual basis. Better detection or prediction of fraud would be incredibly valuable. And so I considered the possibility of convincing a bank to share their transactional data in the hope of building a better fraud detection algorithm. The catch, unsurprisingly, was that no major bank is willing to share such data. They feel they're better off hiring a team of data scientists to work on the problem internally. My startup idea died a quick death.

ai system, available data, entrepreneur, (16 more...)

#artificialintelligence

Country: North America > Canada (0.05)

Industry:

Health & Medicine (1.00)
Banking & Finance (1.00)
Law Enforcement & Public Safety > Fraud (0.69)
Information Technology > Security & Privacy (0.67)

Technology:

Information Technology > Communications > Social Media > Crowdsourcing (0.50)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.33)
Information Technology > Artificial Intelligence > Representation & Reasoning > Expert Systems (0.32)

arXiv.org Artificial IntelligenceFeb-26-2021

Revisiting Peng's Q($\lambda$) for Modern Reinforcement Learning

Kozuno, Tadashi, Tang, Yunhao, Rowland, Mark, Munos, Rémi, Kapturowski, Steven, Dabney, Will, Valko, Michal, Abel, David

Off-policy multi-step reinforcement learning algorithms consist of conservative and non-conservative algorithms: the former actively cut traces, whereas the latter do not. Recently, Munos et al. (2016) proved the convergence of conservative algorithms to an optimal Q-function. In contrast, non-conservative algorithms are thought to be unsafe and have a limited or no theoretical guarantee. Nonetheless, recent studies have shown that non-conservative algorithms empirically outperform conservative ones. Motivated by the empirical results and the lack of theory, we carry out theoretical analyses of Peng's Q($\lambda$), a representative example of non-conservative algorithms. We prove that it also converges to an optimal policy provided that the behavior policy slowly tracks a greedy policy in a way similar to conservative policy iteration. Such a result has been conjectured to be true but has not been proven. We also experiment with Peng's Q($\lambda$) in complex continuous control tasks, confirming that Peng's Q($\lambda$) often outperforms conservative algorithms despite its simplicity. These results indicate that Peng's Q($\lambda$), which was thought to be unsafe, is a theoretically-sound and practically effective algorithm.

algorithm, peng, revisiting peng, (17 more...)

2103.00107

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
North America > United States > California > Monterey County > Pacific Grove (0.04)
(4 more...)

Genre: Research Report (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

arXiv.org Artificial IntelligenceFeb-26-2021

On the Importance of Hyperparameter Optimization for Model-based Reinforcement Learning

Zhang, Baohe, Rajan, Raghu, Pineda, Luis, Lambert, Nathan, Biedenkapp, André, Chua, Kurtland, Hutter, Frank, Calandra, Roberto

Model-based Reinforcement Learning (MBRL) is a promising framework for learning control in a data-efficient manner. MBRL algorithms can be fairly complex due to the separate dynamics modeling and the subsequent planning algorithm, and as a result, they often possess tens of hyperparameters and architectural choices. For this reason, MBRL typically requires significant human expertise before it can be applied to new problems and domains. To alleviate this problem, we propose to use automatic hyperparameter optimization (HPO). We demonstrate that this problem can be tackled effectively with automated HPO, which we demonstrate to yield significantly improved performance compared to human experts. In addition, we show that tuning of several MBRL hyperparameters dynamically, i.e. during the training itself, further improves the performance compared to using static hyperparameters which are kept fixed for the whole training. Finally, our experiments provide valuable insights into the effects of several hyperparameters, such as plan horizon or learning rate and their influence on the stability of training and resulting rewards.

configuration, hyperparameter, log 10, (15 more...)

2102.13651

Country:

Europe > Germany > Baden-Württemberg > Freiburg (0.04)
North America > United States > California > San Diego County > San Diego (0.04)

Genre: Research Report > Experimental Study (0.48)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

arXiv.org Machine LearningFeb-25-2021

Provably Breaking the Quadratic Error Compounding Barrier in Imitation Learning, Optimally

Rajaraman, Nived, Han, Yanjun, Yang, Lin F., Ramchandran, Kannan, Jiao, Jiantao

We study the statistical limits of Imitation Learning (IL) in episodic Markov Decision Processes (MDPs) with a state space $\mathcal{S}$. We focus on the known-transition setting where the learner is provided a dataset of $N$ length-$H$ trajectories from a deterministic expert policy and knows the MDP transition. We establish an upper bound $O(|\mathcal{S}|H^{3/2}/N)$ for the suboptimality using the Mimic-MD algorithm in Rajaraman et al (2020) which we prove to be computationally efficient. In contrast, we show the minimax suboptimality grows as $\Omega( H^{3/2}/N)$ when $|\mathcal{S}|\geq 3$ while the unknown-transition setting suffers from a larger sharp rate $\Theta(|\mathcal{S}|H^2/N)$ (Rajaraman et al (2020)). The lower bound is established by proving a two-way reduction between IL and the value estimation problem of the unknown expert policy under any given reward function, as well as building connections with linear functional estimation with subsampled observations. We further show that under the additional assumption that the expert is optimal for the true reward function, there exists an efficient algorithm, which we term as Mimic-Mixture, that provably achieves suboptimality $O(1/N)$ for arbitrary 3-state MDPs with rewards only at the terminal layer. In contrast, no algorithm can achieve suboptimality $O(\sqrt{H}/N)$ with high probability if the expert is not constrained to be optimal. Our work formally establishes the benefit of the expert optimal assumption in the known transition setting, while Rajaraman et al (2020) showed it does not help when transitions are unknown.

algorithm, probability, trajectory, (15 more...)

arXiv.org Machine Learning

2102.12948

Country:

North America > United States > California > Los Angeles County > Los Angeles (0.14)
North America > United States > California > Alameda County > Berkeley (0.04)
North America > United States > New York > New York County > New York City (0.04)
(4 more...)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.48)

Zhuang, Vincent, Sui, Yanan

No-Regret Reinforcement Learning with Heavy-Tailed Rewards

arXiv.org Machine LearningFeb-25-2021

Reinforcement learning algorithms typically assume rewards to be sampled from light-tailed distributions, such as Gaussian or bounded. However, a wide variety of real-world systems generate rewards that follow heavy-tailed distributions. We consider such scenarios in the setting of undiscounted reinforcement learning. By constructing a lower bound, we show that the difficulty of learning heavy-tailed rewards asymptotically dominates the difficulty of learning transition probabilities. Leveraging techniques from robust mean estimation, we propose Heavy-UCRL2 and Heavy-Q-Learning, and show that they achieve near-optimal regret bounds in this setting. Our algorithms also naturally generalize to deep reinforcement learning applications; we instantiate Heavy-DQN as an example of this. We demonstrate that all of our algorithms outperform baselines on both synthetic MDPs and standard RL benchmarks.

algorithm, heavy-q-learning, probability, (11 more...)

arXiv.org Machine Learning

2102.12769

Country:

North America > United States > California > San Diego County > San Diego (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report (0.82)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Fan, Jiameng, Li, Wenchao

Robust Deep Reinforcement Learning via Multi-View Information Bottleneck

Deep reinforcement learning (DRL) agents are often sensitive to visual changes that were unseen in their training environments. To address this problem, we introduce a robust representation learning approach for RL. We introduce an auxiliary objective based on the multi-view information bottleneck (MIB) principle which encourages learning representations that are both predictive of the future and less sensitive to task-irrelevant distractions. This enables us to train high-performance policies that are robust to visual distractions and can generalize to unseen environments. We demonstrate that our approach can achieve SOTA performance on challenging visual control tasks, even when the background is replaced with natural videos. In addition, we show that our approach outperforms well-established baselines on generalization to unseen environments using the large-scale Procgen benchmark.

information, mutual information, representation, (14 more...)

2102.13268

Country:

North America > United States > Massachusetts > Suffolk County > Boston (0.04)
Europe > France > Hauts-de-France > Nord > Lille (0.04)

Genre: Research Report (0.81)

Industry: Education (0.34)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Off-Policy Imitation Learning from Observations

Zhu, Zhuangdi, Lin, Kaixiang, Dai, Bo, Zhou, Jiayu

Learning from Observations (LfO) is a practical reinforcement learning scenario from which many applications can benefit through the reuse of incomplete resources. Compared to conventional imitation learning (IL), LfO is more challenging because of the lack of expert action guidance. In both conventional IL and LfO, distribution matching is at the heart of their foundation. Traditional distribution matching approaches are sample-costly which depend on on-policy transitions for policy learning. Towards sample-efficiency, some off-policy solutions have been proposed, which, however, either lack comprehensive theoretical justifications or depend on the guidance of expert actions. In this work, we propose a sample-efficient LfO approach which enables off-policy optimization in a principled manner. To further accelerate the learning procedure, we regulate the policy update with an inverse action model, which assists distribution matching from the perspective of mode-covering. Extensive empirical results on challenging locomotion tasks indicate that our approach is comparable with state-of-the-art in terms of both sample-efficiency and asymptotic performance.

interaction step, learning, objective, (14 more...)

2102.13185

Country:

North America > United States > Michigan (0.04)
North America > United States > Illinois > Cook County > Chicago (0.04)
North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
(2 more...)

Genre: Research Report (0.82)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Hejna, Donald J. III, Abbeel, Pieter, Pinto, Lerrel

Task-Agnostic Morphology Evolution

Deep reinforcement learning primarily focuses on learning behavior, usually overlooking the fact that an agent's function is largely determined by form. So, how should one go about finding a morphology fit for solving tasks in a given environment? Current approaches that co-adapt morphology and behavior use a specific task's reward as a signal for morphology optimization. However, this often requires expensive policy optimization and results in task-dependent morphologies that are not built to generalize. In this work, we propose a new approach, Task-Agnostic Morphology Evolution (TAME), to alleviate both of these issues. Without any task or reward specification, TAME evolves morphologies by only applying randomly sampled action primitives on a population of agents. This is accomplished using an information-theoretic objective that efficiently ranks agents by their ability to reach diverse states in the environment and the causality of their actions. Finally, we empirically demonstrate that across 2D, 3D, and manipulation environments TAME can evolve morphologies that match the multi-task performance of those learned with task supervised algorithms. Our code and videos can be found at https://sites.google.com/view/task-agnostic-evolution.

action primitive, agent, morphology, (15 more...)

2102.131

Country: North America > United States > New York (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Evolutionary Systems (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Topin, Nicholay, Milani, Stephanie, Fang, Fei, Veloso, Manuela

Iterative Bounding MDPs: Learning Interpretable Policies via Non-Interpretable Methods

Current work in explainable reinforcement learning generally produces policies in the form of a decision tree over the state space. Such policies can be used for formal safety verification, agent behavior prediction, and manual inspection of important features. However, existing approaches fit a decision tree after training or use a custom learning procedure which is not compatible with new learning techniques, such as those which use neural networks. To address this limitation, we propose a novel Markov Decision Process (MDP) type for learning decision tree policies: Iterative Bounding MDPs (IBMDPs). An IBMDP is constructed around a base MDP so each IBMDP policy is guaranteed to correspond to a decision tree policy for the base MDP when using a method-agnostic masking procedure. Because of this decision tree equivalence, any function approximator can be used during training, including a neural network, while yielding a decision tree policy for the base MDP. We present the required masking procedure as well as a modified value update step which allows IBMDPs to be solved using existing algorithms. We apply this procedure to produce IBMDP variants of recent reinforcement learning methods. We empirically show the benefits of our approach by solving IBMDPs to produce decision tree policies for the base MDPs.

agent, base mdp, mdp, (15 more...)

2102.13045

Country:

North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.14)
North America > United States > California > Alameda County > Berkeley (0.04)

Genre: Research Report (0.82)

Industry:

Government > Military (0.68)
Government > Regional Government > North America Government > United States Government (0.67)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.66)