
Collaborating Authors

 Farajtabar, Mehrdad


Learning to Incentivize Other Learning Agents

arXiv.org Machine Learning

The challenge of developing powerful and general Reinforcement Learning (RL) agents has received increasing attention in recent years. Much of this effort has focused on the single-agent setting, in which an agent maximizes a predefined extrinsic reward function. However, a long-term question inevitably arises: how will such independent agents cooperate when they are continually learning and acting in a shared multi-agent environment? Observing that humans often provide incentives to influence others' behavior, we propose to equip each RL agent in a multi-agent environment with the ability to give rewards directly to other agents, using a learned incentive function. Each agent learns its own incentive function by explicitly accounting for its impact on the learning of recipients and, through them, the impact on its own extrinsic objective. We demonstrate in experiments that such agents significantly outperform standard RL and opponent-shaping agents in challenging general-sum Markov games, often by finding a near-optimal division of labor. Our work points toward more opportunities and challenges along the path to ensuring the common good in a multi-agent future.
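
As a rough, hypothetical illustration of this idea (not the paper's algorithm or environment), the sketch below has a "giver" agent learn a scalar incentive parameter by differentiating through the recipient's one-step policy update, so the incentive is shaped by its effect on the recipient's learning and, through it, on the giver's own extrinsic return. All quantities in the toy setup are assumptions.

# Toy, hypothetical illustration of incentive shaping: the "giver" agent
# differentiates through the recipient's one-step policy update to learn
# an incentive parameter eta. A schematic sketch, not the paper's method.
import jax
import jax.numpy as jnp

LR_RECIPIENT = 0.5    # recipient's inner learning rate
LR_GIVER = 0.1        # giver's learning rate on the incentive parameter
INCENTIVE_COST = 0.1  # cost the giver pays per unit of incentive handed out

def p_cooperate(theta):
    """Recipient's probability of taking the 'cooperate' action."""
    return jax.nn.sigmoid(theta)

def recipient_objective(theta, eta):
    """Recipient maximizes extrinsic reward plus received incentive.
    In this toy, cooperating costs the recipient 1 but earns incentive eta."""
    return p_cooperate(theta) * (-1.0 + eta)

def recipient_update(theta, eta):
    """One inner step of gradient ascent by the recipient."""
    return theta + LR_RECIPIENT * jax.grad(recipient_objective)(theta, eta)

def giver_objective(eta, theta):
    """Giver's extrinsic reward (2 per cooperation) minus the incentive cost,
    evaluated *after* the recipient's learning step."""
    p = p_cooperate(recipient_update(theta, eta))
    return p * 2.0 - INCENTIVE_COST * p * eta

theta, eta = jnp.array(0.0), jnp.array(0.0)
for step in range(200):
    # Giver updates its incentive by differentiating through the
    # recipient's learning step (the core idea of incentive shaping).
    eta = eta + LR_GIVER * jax.grad(giver_objective)(eta, theta)
    # Recipient then actually performs its learning step.
    theta = recipient_update(theta, eta)

print(float(p_cooperate(theta)), float(eta))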


Linear Mode Connectivity in Multitask and Continual Learning

arXiv.org Artificial Intelligence

Continual (sequential) training and multitask (simultaneous) training often attempt to achieve the same overall objective: to find a solution that performs well on all considered tasks. The main difference is in the training regimes, where continual learning can only have access to one task at a time, which for neural networks typically leads to catastrophic forgetting. That is, the solution found for a subsequent task no longer performs well on the previous ones. However, the relationship between the different minima that the two training regimes arrive at is not well understood. Is there a local structure that could explain the difference in performance achieved by the two different schemes? Motivated by recent work showing that different minima of the same task are typically connected by very simple curves of low error, we investigate whether multitask and continual solutions are similarly connected. We empirically find that such connectivity can indeed be reliably achieved and, more interestingly, it can be done by a linear path, conditioned on having the same initialization for both. We thoroughly analyze this observation and discuss its significance for the continual learning process. Furthermore, we exploit this finding to propose an effective algorithm that constrains the sequentially learned minima to behave as the multitask solution. One major consequence of learning multiple tasks in a continual learning (CL) setting -- where tasks are learned sequentially, and the model can only have access to one task at a time -- is catastrophic forgetting (McCloskey & Cohen, 1989).
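
A minimal sketch of the connectivity check discussed above, assuming two parameter vectors (say, a multitask solution and a continual solution trained from the same initialization) and some task loss; the names and the toy quadratic loss are placeholders.

# Hypothetical sketch: evaluate the loss along the straight line between two
# solutions in parameter space, as in a linear mode connectivity check.
import numpy as np

def loss_along_linear_path(theta_multitask, theta_continual, loss_fn, num_points=11):
    """Interpolate linearly between two parameter vectors and record the loss.
    A flat, low-loss profile suggests the two minima are linearly connected."""
    losses = []
    for alpha in np.linspace(0.0, 1.0, num_points):
        theta = (1.0 - alpha) * theta_multitask + alpha * theta_continual
        losses.append(loss_fn(theta))
    return np.array(losses)

# Toy usage with a quadratic "loss" standing in for a task's loss surface.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    theta_a = rng.normal(size=10)                   # e.g., multitask minimum
    theta_b = theta_a + 0.1 * rng.normal(size=10)   # e.g., continual minimum
    quad = lambda th: float(np.sum((th - theta_a) ** 2))
    print(loss_along_linear_path(theta_a, theta_b, quad))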


The Effectiveness of Memory Replay in Large Scale Continual Learning

arXiv.org Artificial Intelligence

We study continual learning in the large-scale setting, where tasks in the input sequence are not limited to classification and the outputs can be of high dimension. Among multiple state-of-the-art methods, we find that vanilla experience replay (ER) remains very competitive in terms of both performance and scalability, despite its simplicity. However, degraded performance is observed for ER with a small memory budget. A further visualization of the feature space reveals that the intermediate representation undergoes a distributional drift. While existing methods usually replay only the input-output pairs, we hypothesize that their regularization effect is inadequate for complex deep models and diverse tasks when the replay buffer is small. Following this observation, we propose to replay the activations of intermediate layers in addition to the input-output pairs. Considering that saving raw activation maps can dramatically increase memory and compute cost, we propose the Compressed Activation Replay technique, where compressed representations of layer activations are saved to the replay buffer. We show that this approach achieves a superior regularization effect while adding negligible memory overhead to the replay method. Experiments on both the large-scale Taskonomy benchmark, with a diverse set of tasks, and standard common datasets (Split-CIFAR and Split-miniImageNet) demonstrate the effectiveness of the proposed method. Humans naturally learn concepts and tasks in a sequential order without degrading performance on the previous ones.
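
A schematic sketch (not the paper's exact implementation) of replaying compressed intermediate activations alongside input-output pairs; adaptive average pooling is assumed as the compression, and the model and feature-layer hooks are hypothetical.

# Schematic sketch: experience replay that additionally stores a pooled,
# low-dimensional copy of an intermediate activation and regularizes the
# current model to reproduce it.
import torch
import torch.nn.functional as F

def compress(activation, size=4):
    """Compress a feature map (N, C, H, W) by adaptive average pooling."""
    return F.adaptive_avg_pool2d(activation, size)

def replay_losses(model, buffer_batch, feature_layer):
    """buffer_batch holds (x, y, compressed_act) triples saved when x was first seen.
    feature_layer(x) returns the intermediate activation of the current model."""
    x, y, old_act = buffer_batch
    logits = model(x)
    task_loss = F.cross_entropy(logits, y)                      # usual input-output replay
    act_loss = F.mse_loss(compress(feature_layer(x)), old_act)  # activation replay
    return task_loss, act_loss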


A maximum-entropy approach to off-policy evaluation in average-reward MDPs

arXiv.org Artificial Intelligence

This work focuses on off-policy evaluation (OPE) with function approximation in infinite-horizon undiscounted Markov decision processes (MDPs). For MDPs that are ergodic and linear (i.e. where rewards and dynamics are linear in some known features), we provide the first finite-sample OPE error bound, extending existing results beyond the episodic and discounted cases. In a more general setting, when the feature dynamics are approximately linear and for arbitrary rewards, we propose a new approach for estimating stationary distributions with function approximation. We formulate this problem as finding the maximum-entropy distribution subject to matching feature expectations under empirical dynamics. We show that this results in an exponential-family distribution whose sufficient statistics are the features, paralleling maximum-entropy approaches in supervised learning. We demonstrate the effectiveness of the proposed OPE approaches in multiple environments.
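
One way to write the constrained problem described above is the following schematic reconstruction, assuming the constraint matches the feature expectation of the current state with that of the next state under the empirical dynamics; the notation is not necessarily the paper's.

\[
\max_{d \in \Delta(\mathcal{S})} \; H(d)
\quad \text{s.t.} \quad
\mathbb{E}_{s \sim d}\big[\phi(s)\big]
\;=\;
\mathbb{E}_{s \sim d,\; s' \sim \widehat{P}^{\pi}(\cdot \mid s)}\big[\phi(s')\big],
\]

and Lagrangian duality gives a solution of the exponential-family form

\[
d_{\theta}(s) \;\propto\; \exp\!\big(\theta^{\top} \phi(s)\big),
\]

whose sufficient statistics are the features \(\phi(s)\), as stated in the abstract.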


Understanding the Role of Training Regimes in Continual Learning

arXiv.org Machine Learning

Catastrophic forgetting affects the training of neural networks, limiting their ability to learn multiple tasks sequentially. From the perspective of the well-established plasticity-stability dilemma, neural networks tend to be overly plastic: they lack the stability necessary to preserve previous knowledge, so as learning progresses they tend to forget previously seen tasks. This phenomenon, so named in the continual learning literature, has attracted much attention lately, and several families of approaches have been proposed with different degrees of success. However, there has been limited prior work extensively analyzing the impact that different training regimes -- learning rate, batch size, regularization method -- can have on forgetting. In this work, we depart from the typical approach of altering the learning algorithm to improve stability. Instead, we hypothesize that the geometrical properties of the local minima found for each task play an important role in the overall degree of forgetting. In particular, we study the effect of dropout, learning rate decay, and batch size on forming training regimes that widen the tasks' local minima and, consequently, help the network avoid forgetting catastrophically. Our study provides practical insights to improve stability via simple yet effective techniques that outperform alternative baselines.
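
An illustrative (and deliberately simple) sketch of such a "stable" training regime: dropout in the model, a learning rate that decays across tasks, and a small batch size. The specific values and helper names are placeholders, not the paper's prescribed settings.

# Illustrative sketch of a "stable" continual-learning training regime:
# dropout in the model, a learning rate that decays across tasks, and a
# small batch size. Hyperparameter values are hypothetical placeholders.
import torch
import torch.nn as nn

def make_model(dropout_p=0.25):
    return nn.Sequential(
        nn.Flatten(),
        nn.Linear(784, 256), nn.ReLU(), nn.Dropout(dropout_p),
        nn.Linear(256, 10),
    )

def train_sequentially(task_loaders, init_lr=0.1, lr_decay_per_task=0.6, epochs=1):
    model = make_model()
    loss_fn = nn.CrossEntropyLoss()
    for task_id, loader in enumerate(task_loaders):
        lr = init_lr * (lr_decay_per_task ** task_id)  # decay LR for later tasks
        opt = torch.optim.SGD(model.parameters(), lr=lr)
        model.train()
        for _ in range(epochs):
            for x, y in loader:  # loaders assumed to use a small batch size (e.g. 16)
                opt.zero_grad()
                loss_fn(model(x), y).backward()
                opt.step()
    return model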


Orthogonal Gradient Descent for Continual Learning

arXiv.org Machine Learning

Neural networks are achieving state-of-the-art and sometimes superhuman performance on learning tasks across a variety of domains. Whenever these problems require learning in a continual or sequential manner, however, neural networks suffer from the problem of catastrophic forgetting; they forget how to solve previous tasks after being trained on a new task, despite having the essential capacity to solve both tasks if they were trained on both simultaneously. In this paper, we propose to address this issue from a parameter space perspective and study an approach to restrict the direction of the gradient updates to avoid forgetting previously-learned data. We present the Orthogonal Gradient Descent (OGD) method, which accomplishes this goal by projecting the gradients from new tasks onto a subspace in which the neural network output on previous tasks does not change and the projected gradient is still in a useful direction for learning the new task. Our approach utilizes the high capacity of a neural network more efficiently and does not require storing the previously learned data, which might raise privacy concerns. Experiments on common benchmarks reveal the effectiveness of the proposed OGD method. One critical component of intelligence is the ability to learn continuously, when new information is constantly available but previously presented information is unavailable to retrieve. Despite their ubiquity in the real world, these problems have posed a longstanding challenge to artificial intelligence (Thrun and Mitchell, 1995; Hassabis et al., 2017). A typical neural network training procedure over a sequence of different tasks usually results in degraded performance on previously trained tasks if the model cannot revisit the data of previous tasks. This phenomenon is called catastrophic forgetting (McCloskey and Cohen, 1989; Ratcliff, 1990; French, 1999). Ideally, an intelligent agent should be able to learn consecutive tasks without degrading its performance on those already learned. With the deep learning renaissance (Krizhevsky et al., 2012; Hinton et al., 2006; Simonyan and Zisserman, 2014) this problem has been revived (Srivastava et al., 2013; Goodfellow et al., 2013), with many follow-up studies (Parisi et al., 2019). One probable reason for this phenomenon is that neural networks are usually trained by Stochastic Gradient Descent (SGD) -- or its variants -- where the optimizers produce gradients that are oblivious to past knowledge.
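
As a rough illustration of the projection idea (a sketch, not OGD's full algorithm), the snippet below makes a new-task gradient orthogonal to a set of stored directions, which in OGD would be gradients of the network's outputs on previous tasks; the flat-vector representation and function names are assumptions.

# Schematic sketch of Orthogonal Gradient Descent's projection step:
# project the new-task gradient onto the orthogonal complement of a set of
# stored directions (gradients of model outputs on previous tasks).
import numpy as np

def orthonormal_basis(directions, eps=1e-10):
    """Gram-Schmidt over stored gradient directions (each a flat vector)."""
    basis = []
    for v in directions:
        w = v.copy()
        for b in basis:
            w -= (w @ b) * b
        norm = np.linalg.norm(w)
        if norm > eps:
            basis.append(w / norm)
    return basis

def project_gradient(grad, basis):
    """Remove from `grad` its components along the stored directions."""
    g = grad.copy()
    for b in basis:
        g -= (g @ b) * b
    return g

# Toy usage: the projected gradient is orthogonal to every stored direction.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    stored = [rng.normal(size=5) for _ in range(2)]   # from previous tasks
    basis = orthonormal_basis(stored)
    g_new = rng.normal(size=5)                        # gradient on the new task
    g_proj = project_gradient(g_new, basis)
    print([float(g_proj @ b) for b in basis])         # ~[0.0, 0.0]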


Improved Knowledge Distillation via Teacher Assistant: Bridging the Gap Between Student and Teacher

arXiv.org Machine Learning

Despite the fact that deep neural networks are powerful models that achieve appealing results on many tasks, they are too gigantic to be deployed on edge devices like smartphones or embedded sensor nodes. There have been efforts to compress these networks, and a popular method is knowledge distillation, where a large (a.k.a. teacher) pre-trained network is used to train a smaller (a.k.a. student) network. However, in this paper, we show that the student network's performance degrades when the gap between student and teacher is large. Given a fixed student network, one cannot employ an arbitrarily large teacher; in other words, a teacher can effectively transfer its knowledge to students up to a certain size, not smaller. To alleviate this shortcoming, we introduce multi-step knowledge distillation, which employs an intermediate-sized network (a.k.a. teacher assistant) to bridge the gap between the student and the teacher. We study the effect of teacher assistant size and extend the framework to multi-step distillation. Moreover, empirical and theoretical analyses are conducted to examine the teacher assistant knowledge distillation framework. Extensive experiments on the CIFAR-10 and CIFAR-100 datasets with plain CNN and ResNet architectures substantiate the effectiveness of our proposed approach.
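
A hedged sketch of how such multi-step distillation could look in code: a standard soft-label distillation loss applied first from the teacher to the assistant and then from the assistant to the student. The loss form is the common temperature-scaled KD objective; the training helper and hyperparameters are illustrative assumptions, not the paper's exact setup.

# Sketch of multi-step ("teacher assistant") knowledge distillation: the same
# soft-label distillation loss is applied twice, teacher -> assistant and
# then assistant -> student.
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Soft-target KL term (scaled by T^2) plus the usual hard-label loss."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

def distill(teacher, student, loader, epochs=1, lr=0.1):
    """Train `student` to match `teacher` on the data in `loader`."""
    opt = torch.optim.SGD(student.parameters(), lr=lr)
    teacher.eval()
    for _ in range(epochs):
        for x, y in loader:
            with torch.no_grad():
                t_logits = teacher(x)
            opt.zero_grad()
            kd_loss(student(x), t_logits, y).backward()
            opt.step()
    return student

# Multi-step chain: large teacher -> intermediate assistant -> small student.
# assistant = distill(teacher, assistant, loader)
# student   = distill(assistant, student, loader)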


More Robust Doubly Robust Off-policy Evaluation

arXiv.org Artificial Intelligence

We study the problem of off-policy evaluation (OPE) in reinforcement learning (RL), where the goal is to estimate the performance of a policy from the data generated by another policy (or policies). In particular, we focus on the doubly robust (DR) estimators that consist of an importance sampling (IS) component and a performance model, and utilize the low (or zero) bias of IS and the low variance of the model at the same time. Although the accuracy of the model has a huge impact on the overall performance of DR, most of the work on using the DR estimators in OPE has focused on improving the IS part, and not much on how to learn the model. In this paper, we propose alternative DR estimators, called more robust doubly robust (MRDR), that learn the model parameters by minimizing the variance of the DR estimator. We first present a formulation for learning the DR model in RL. We then derive formulas for the variance of the DR estimator in both contextual bandits and RL, such that their gradients w.r.t. the model parameters can be estimated from the samples, and propose methods to efficiently minimize the variance. We prove that the MRDR estimators are strongly consistent and asymptotically optimal. Finally, we evaluate MRDR in bandit and RL benchmark problems, and compare its performance with existing methods.
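
As a schematic illustration (restricted to the contextual-bandit case and not the paper's full derivation), the snippet below computes the DR value estimate and its empirical variance; MRDR would choose the reward model to minimize this variance. All variable names and the toy data are hypothetical.

# Schematic numpy sketch for the contextual-bandit case: compute the doubly
# robust (DR) value estimate and its empirical variance. MRDR would fit the
# reward model q_hat by minimizing this variance rather than a regression loss.
import numpy as np

def dr_terms(rho, r, q_logged, q_target):
    """Per-sample DR terms.
    rho:      importance weights pi_e(a|x) / pi_b(a|x) for the logged actions
    r:        observed rewards
    q_logged: model estimate q_hat(x, a) for the logged action
    q_target: model estimate of E_{a ~ pi_e}[q_hat(x, a)]
    """
    return q_target + rho * (r - q_logged)

def dr_estimate_and_variance(rho, r, q_logged, q_target):
    terms = dr_terms(rho, r, q_logged, q_target)
    return terms.mean(), terms.var(ddof=1)

# Toy usage with random logged data (all quantities hypothetical).
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 1000
    rho = rng.lognormal(sigma=0.3, size=n)
    r = rng.binomial(1, 0.5, size=n).astype(float)
    q_logged = np.full(n, 0.5)
    q_target = np.full(n, 0.5)
    print(dr_estimate_and_variance(rho, r, q_logged, q_target))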


Representation Learning over Dynamic Graphs

arXiv.org Machine Learning

How can we effectively encode evolving information over dynamic graphs into low-dimensional representations? In this paper, we propose DyRep, an inductive deep representation learning framework that learns a set of functions to efficiently produce low-dimensional node embeddings that evolve over time. The learned embeddings drive the dynamics of two key processes, namely communication and association, between nodes in dynamic graphs. These processes exhibit complex nonlinear dynamics that evolve at different time scales and subsequently contribute to the update of node embeddings. We employ a time-scale dependent multivariate point process model to capture these dynamics. We devise an efficient unsupervised learning procedure and demonstrate that our approach significantly outperforms representative baselines on two real-world datasets for the problems of dynamic link prediction and event time prediction.
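
The sketch below loosely illustrates the two ingredients the abstract mentions: a conditional intensity for an event between two nodes that depends on their embeddings, and an embedding update triggered by an observed event. The functional forms are simplified stand-ins, not DyRep's exact parameterization.

# Loose schematic: an event intensity between two nodes driven by their
# embeddings, and an embedding update after an observed event. Simplified
# stand-ins for the paper's time-scale dependent point-process model.
import numpy as np

def softplus(x):
    return np.log1p(np.exp(x))

def intensity(z_u, z_v, w, psi):
    """Conditional intensity of an event between nodes u and v,
    a smooth positive function of their current embeddings."""
    return psi * softplus(w @ np.concatenate([z_u, z_v]) / psi)

def update_embedding(z_v, z_u, W_self, W_neigh, dt, tau=1.0):
    """After an event (u -> v) with elapsed time dt, node v's embedding is
    refreshed from its own decayed state and the other endpoint."""
    drift = np.exp(-dt / tau) * z_v                 # temporal decay of old state
    return np.tanh(W_self @ drift + W_neigh @ z_u)  # nonlinear recurrent update

# Toy usage with random parameters.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d = 8
    z_u, z_v = rng.normal(size=d), rng.normal(size=d)
    w, psi = rng.normal(size=2 * d), 1.0
    W_self, W_neigh = rng.normal(size=(d, d)) * 0.1, rng.normal(size=(d, d)) * 0.1
    print(float(intensity(z_u, z_v, w, psi)))
    z_v = update_embedding(z_v, z_u, W_self, W_neigh, dt=0.5)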


Learning Conditional Generative Models for Temporal Point Processes

AAAI Conferences

The ability to look into the future is a challenging but alluring task. People are willing to estimate the occurrence probability of events they are interested in so that they can take preemptive action. For example, after reviewing the admission history of patients, doctors may give early warnings to patients who are at high risk of certain diseases. When having access to the working experience of job seekers, headhunters can evaluate one's future career path and recommend a suitable position at the proper time. In these cases, historical observations always provide us with important guidance for predicting future events -- not only the order of events but also the time span between them contains useful information. Our learning method is based on the following two facts: on one hand, MLE loss (or KL divergence) requires strict matching between two probability distributions and is an unbiased estimator of the parameters, which is sensitive to sample noise and outliers; on the other hand, unlike the MLE loss, which does not consider how close two samples are but only their relative probability, the Wasserstein distance is sensitive to the underlying geometric structure of the samples but has biased gradients (Bellemare et al. 2017). To take advantage of the strengths of these two methods and mitigate bias exposure in long-term prediction, our method incorporates the Wasserstein distance besides MLE -- both the KL divergence and the Wasserstein distance between generated and real samples are minimized jointly.
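
A toy sketch of the joint objective described above, combining a likelihood (KL) term with a Wasserstein-style term; it assumes equal-length, one-dimensional event-time sequences (for which the empirical 1-Wasserstein distance reduces to a mean absolute difference of sorted times) and a constant-rate likelihood as a stand-in for a learned intensity. All names are hypothetical.

# Toy sketch of the joint objective: a maximum-likelihood (KL) term on event
# times plus a Wasserstein-style distance between real and generated sequences.
import numpy as np

def wasserstein_1d(times_real, times_gen):
    """Empirical W1 distance between two equal-length sets of event times."""
    return np.mean(np.abs(np.sort(times_real) - np.sort(times_gen)))

def exp_nll(inter_arrival_times, rate):
    """Negative log-likelihood of inter-arrival times under a constant-rate
    (exponential) point process -- a placeholder for a learned intensity."""
    t = np.asarray(inter_arrival_times)
    return float(np.sum(rate * t - np.log(rate)))

def joint_objective(times_real, times_gen, rate, lam=1.0):
    """MLE term plus lam * Wasserstein term, minimized jointly."""
    gaps = np.diff(np.concatenate([[0.0], np.sort(times_real)]))
    return exp_nll(gaps, rate) + lam * wasserstein_1d(times_real, times_gen)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    real = np.cumsum(rng.exponential(1.0, size=20))
    gen = np.cumsum(rng.exponential(1.2, size=20))
    print(joint_objective(real, gen, rate=1.0))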