Transfer Learning
Blissful Ignorance: Anti-Transfer Learning for Task Invariance
Guizzo, Eric, Weyde, Tillman, Tarroni, Giacomo
We introduce the novel concept of anti-transfer learning for neural networks. While standard transfer learning assumes that the representations learned in one task will be useful for another task, anti-transfer learning avoids learning representations that have been learned for a different task, which is not relevant and potentially misleading for the new task and should be ignored. Examples of such tasks are style vs content recognition or pitch vs timbre from audio. By penalizing similarity between the second network and the previously learned features, coincidental correlations between the target and the unrelated task can be avoided, yielding more reliable representations and better performance on the target task. We implemented anti-transfer learning with different similarity metrics and aggregation functions. We evaluate the approach in the audio domain with different tasks and setups, using four datasets in total. The results show that anti-transfer learning consistently improves accuracy in all test cases, proving that it can push the network to learn more representative features for the task at hand.
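The penalty described in the abstract above can be pictured with a minimal sketch. Everything specific here is an assumption for illustration, not the authors' implementation: the squared cosine similarity as the similarity metric, the weight `beta`, and the function names are all hypothetical, and the feature vectors stand in for activations of the new network and of a frozen, previously trained network.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_u * norm_v)


def anti_transfer_loss(task_loss, new_features, frozen_features, beta=1.0):
    """Task loss plus a penalty on similarity to previously learned features.

    Assumption: the penalty is the squared cosine similarity, so positive
    and negative correlation are penalized alike. `beta` is a hypothetical
    weighting hyperparameter.
    """
    penalty = cosine_similarity(new_features, frozen_features) ** 2
    return task_loss + beta * penalty


# Orthogonal features add no penalty; parallel features add the full beta.
print(anti_transfer_loss(0.5, [1.0, 0.0], [0.0, 1.0]))
print(anti_transfer_loss(0.5, [1.0, 2.0], [2.0, 4.0]))
```

In a real training loop the penalty would be computed on layer activations and differentiated with respect to the new network's weights only; the frozen network is never updated.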
Multi-step Estimation for Gradient-based Meta-learning
Kim, Jin-Hwa, Park, Junyoung, Choi, Yongseok
Gradient-based meta-learning approaches have been successful in few-shot learning, transfer learning, and a wide range of other domains. Despite their efficacy and simplicity, the burden of calculating the Hessian matrix, with its large memory footprint, is a critical challenge in large-scale applications. To tackle this issue, we propose a simple and straightforward method to reduce the cost by reusing the same gradient in a window of inner steps. We describe the dynamics of the multi-step estimation in the Lagrangian formalism and discuss how to reduce the evaluation of second-order derivatives by estimating the dynamics. To validate our method, we experiment on meta-transfer learning and few-shot learning tasks for multiple settings. The experiment on meta-transfer emphasizes the applicability of training meta-networks, where other approximations are limited. For few-shot learning, we evaluate time and memory complexities compared with popular baselines. We show that our method significantly reduces training time and memory usage while maintaining competitive accuracy, or even outperforming the baselines in some cases.
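The idea of reusing one gradient across a window of inner steps can be sketched on a toy problem. This is only an illustration of the inner-loop update under assumed names (`inner_loop`, `window`, `quadratic_grad`); it does not show the meta-gradient, the Hessian savings, or the Lagrangian analysis from the paper.

```python
def inner_loop(theta, grad_fn, lr=0.1, steps=6, window=3):
    """Gradient descent that recomputes the gradient once per window.

    Within a window of `window` consecutive steps the same gradient is
    applied repeatedly, so only steps / window gradient evaluations
    are needed when `steps` is a multiple of `window`.
    """
    grad_evals = 0
    grad = None
    for step in range(steps):
        if step % window == 0:  # start of a new window: refresh the gradient
            grad = grad_fn(theta)
            grad_evals += 1
        theta = [t - lr * g for t, g in zip(theta, grad)]
    return theta, grad_evals


def quadratic_grad(theta):
    """Gradient of the toy loss sum(t ** 2), used only for illustration."""
    return [2.0 * t for t in theta]


theta, evals = inner_loop([1.0], quadratic_grad, lr=0.1, steps=6, window=3)
print(theta, evals)  # two gradient evaluations instead of six
```

Setting `window=1` recovers ordinary gradient descent with one gradient evaluation per step.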
Continuous Transfer Learning with Label-informed Distribution Alignment
Transfer learning has been successfully applied across many high-impact applications. However, most existing work focuses on the static transfer learning setting, and very little is devoted to modeling a time-evolving target domain, such as online reviews for movies. To bridge this gap, in this paper, we study a novel continuous transfer learning setting with a time-evolving target domain. One major challenge associated with continuous transfer learning is the potential occurrence of negative transfer as the target domain evolves over time. To address this challenge, we propose a novel label-informed C-divergence between the source and target domains in order to measure the shift of data distributions as well as to identify potential negative transfer. We then derive the error bound for the target domain using the empirical estimate of our proposed C-divergence. Furthermore, we propose a generic adversarial Variational Auto-encoder framework named TransLATE by minimizing the classification error and C-divergence of the target domain between consecutive time stamps in a latent feature space. In addition, we define a transfer signature for characterizing the negative transfer based on C-divergence, which indicates that larger C-divergence implies a higher probability of negative transfer in real scenarios. Extensive experiments on synthetic and real data sets demonstrate the effectiveness of our TransLATE framework.
Multitask Learning with Single Gradient Step Update for Task Balancing
Multitask learning is a methodology to boost generalization performance and also reduce computational intensity and memory usage. However, learning multiple tasks simultaneously can be more difficult than learning a single task because it can cause imbalance among tasks. To address the imbalance problem, we propose an algorithm to balance between tasks at the gradient level by applying gradient-based meta-learning to multitask learning. The proposed method trains shared layers and task-specific layers separately so that the two layers with different roles in a multitask network can be fitted to their own purposes. In particular, the shared layer that contains informative knowledge shared among tasks is trained by employing single gradient step update and inner/outer loop training to mitigate the imbalance problem at the gradient level. We apply the proposed method to various multitask computer vision problems and achieve state-of-the-art performance.
A Survey on Transfer Learning in Natural Language Processing
Alyafeai, Zaid, AlShaibani, Maged Saeed, Ahmad, Irfan
Deep learning models usually require a huge amount of data. However, these large datasets are not always attainable. This is common in many challenging NLP tasks. Consider Neural Machine Translation, for instance, where curating such large datasets may not be possible, especially for low-resource languages. Another limitation of deep learning models is the demand for huge computing resources. These obstacles motivate research to question the possibility of knowledge transfer using large trained models. The demand for transfer learning is increasing as many large models are emerging. In this survey, we feature the recent transfer learning advances in the field of NLP. We also provide a taxonomy for categorizing different transfer learning approaches from the literature.
Meta-learning with Stochastic Linear Bandits
Cella, Leonardo, Lazaric, Alessandro, Pontil, Massimiliano
We investigate meta-learning procedures in the setting of stochastic linear bandit tasks. The goal is to select a learning algorithm which works well on average over a class of bandit tasks that are sampled from a task-distribution. Inspired by recent work on learning-to-learn linear regression, we consider a class of bandit algorithms that implement a regularized version of the well-known OFUL algorithm, where the regularization is a squared Euclidean distance to a bias vector. We first study the benefit of the biased OFUL algorithm in terms of regret minimization. We then propose two strategies to estimate the bias within the learning-to-learn setting. We show, both theoretically and experimentally, that when the number of tasks grows and the variance of the task-distribution is small, our strategies have a significant advantage over learning the tasks in isolation.
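Regularizing toward a bias vector changes only the least-squares estimate at the core of such an algorithm. As a minimal sketch, assuming the objective sum_t (x_t . theta - y_t)^2 + lam * ||theta - h||^2, the minimizer is (X^T X + lam I)^{-1} (X^T y + lam h). The function names and the parameter `lam` below are hypothetical, and the confidence-set and action-selection parts of OFUL are not shown.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def biased_ridge(features, rewards, bias, lam=1.0):
    """Least squares regularized toward `bias` instead of toward zero."""
    d = len(bias)
    gram = [[lam if i == j else 0.0 for j in range(d)] for i in range(d)]
    rhs = [lam * h for h in bias]
    for x, y in zip(features, rewards):
        for i in range(d):
            rhs[i] += x[i] * y
            for j in range(d):
                gram[i][j] += x[i] * x[j]
    return solve_linear_system(gram, rhs)


# With no observations the estimate is the bias vector itself.
print(biased_ridge([], [], bias=[0.5, -1.0]))
```

With a zero bias vector this reduces to ordinary ridge regression, which is why a bias close to the true task parameters can help when few observations are available.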
Information-theoretic analysis for transfer learning
Wu, Xuetong, Manton, Jonathan H., Aickelin, Uwe, Zhu, Jingge
Transfer learning, or domain adaptation, is concerned with machine learning problems in which training and testing data come from possibly different distributions (denoted as $\mu$ and $\mu'$, respectively). In this work, we give an information-theoretic analysis on the generalization error and the excess risk of transfer learning algorithms, following a line of work initiated by Russo and Zhou. Our results suggest, perhaps as expected, that the Kullback-Leibler (KL) divergence $D(\mu\|\mu')$ plays an important role in characterizing the generalization error in the settings of domain adaptation. Specifically, we provide generalization error upper bounds for general transfer learning algorithms and extend the results to a specific empirical risk minimization (ERM) algorithm where data from both distributions are available in the training phase. We further apply the method to iterative, noisy gradient descent algorithms, and obtain upper bounds which can be easily calculated, only using parameters from the learning algorithms. A few illustrative examples are provided to demonstrate the usefulness of the results. In particular, our bound is tighter in specific classification problems than the bound derived using Rademacher complexity.
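For readers unfamiliar with the quantity, the KL divergence between two discrete distributions can be computed as below. This is a generic sketch of the standard definition, sum_i p_i log(p_i / q_i) in nats, not the paper's bound; the function name and the example distributions are made up for illustration.

```python
import math


def kl_divergence(p, q):
    """KL divergence D(p || q) in nats for discrete distributions.

    Terms with p_i = 0 contribute nothing; if q_i = 0 where p_i > 0,
    the divergence is infinite.
    """
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0.0:
            continue
        if q_i == 0.0:
            return math.inf
        total += p_i * math.log(p_i / q_i)
    return total


train = [0.5, 0.5]    # hypothetical training distribution
test = [0.25, 0.75]   # hypothetical testing distribution
print(kl_divergence(train, test))
print(kl_divergence(test, train))
```

The two printed values differ because the KL divergence is not symmetric, so the order of the training and testing distributions matters.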
A network-based transfer learning approach to improve sales forecasting of new products
Karb, Tristan, Kühl, Niklas, Hirt, Robin, Glivici-Cotruta, Varvara
Data-driven methods -- such as machine learning and time series forecasting -- are widely used for sales forecasting in the food retail domain. However, for newly introduced products, insufficient training data is available to train accurate models. In this case, human expert systems are implemented to improve prediction performance. Human experts rely on their implicit and explicit domain knowledge and transfer knowledge about historical sales of similar products to forecast new product sales. By applying the concept of Transfer Learning, we propose an analytical approach to transfer knowledge between listed stock products and new products. A network-based Transfer Learning approach for deep neural networks is designed to investigate the efficiency of Transfer Learning in the domain of food sales forecasting. Furthermore, we examine how knowledge can be shared across different products and how to identify the products most suitable for transfer. To test the proposed approach, we conduct a comprehensive case study for a newly introduced product, based on data from an Austrian food retailing company. The experimental results show that the prediction accuracy of deep neural networks for food sales forecasting can be effectively increased using the proposed approach.
How Much Off-The-Shelf Knowledge Is Transferable From Natural Images To Pathology Images?
Li, Xingyu, Plataniotis, Konstantinos N.
Deep learning has achieved great success in natural image classification. To overcome data scarcity in computational pathology, recent studies exploit transfer learning to reuse knowledge gained from natural images in pathology image analysis, aiming to build effective pathology image diagnosis models. Since transferability of knowledge heavily depends on the similarity of the original and target tasks, significant differences in image content and statistics between pathology images and natural images raise the questions: how much knowledge is transferable? Is the transferred information equally contributed by pre-trained layers? To answer these questions, this paper proposes a framework to quantify knowledge gain by a particular layer, conducts an empirical investigation in pathology image centered transfer learning, and reports some interesting observations. In particular, compared to the performance baseline obtained by a random-weight model, though the transferability of off-the-shelf representations from deep layers heavily depends on specific pathology image sets, the general representation generated by early layers does convey transferred knowledge in various image classification applications. The observations in this study encourage further investigation of specific metrics and tools to quantify the effectiveness and feasibility of transfer learning in the future.
Towards Knowledgeable Supervised Lifelong Learning Systems
Benavides-Prado, Diana (The University of Auckland) | Koh, Yun Sing | Riddle, Patricia
Learning a sequence of tasks is a long-standing challenge in machine learning. This setting applies to learning systems that observe examples of a range of tasks at different points in time. A learning system should become more knowledgeable as more related tasks are learned. Although the problem of learning sequentially was acknowledged for the first time decades ago, the research in this area has been rather limited. Research in transfer learning, multitask learning, metalearning and deep learning has studied some challenges of these kinds of systems. Recent research in lifelong machine learning and continual learning has revived interest in this problem. We propose Proficiente, a full framework for long-term learning systems. Proficiente relies on knowledge transferred between hypotheses learned with Support Vector Machines. The first component of the framework is focused on transferring forward selectively from a set of existing hypotheses or functions representing knowledge acquired during previous tasks to a new target task. A second component of Proficiente is focused on transferring backward, a novel ability of long-term learning systems that aim to exploit knowledge derived from recent tasks to encourage refinement of existing knowledge. We propose a method that transfers selectively from a task learned recently to existing hypotheses representing previous tasks. The method encourages retention of existing knowledge whilst refining. We analyse the theoretical properties of the proposed framework. Proficiente is accompanied by an agnostic metric that can be used to determine if a long-term learning system is becoming more knowledgeable. We evaluate Proficiente on both synthetic and real-world datasets, and demonstrate scenarios where knowledgeable supervised learning systems can be achieved by means of transfer.