Cross-Modal Multi-Tasking for Speech-to-Text Translation via Hard Parameter Sharing
Yan, Brian, Chang, Xuankai, Anastasopoulos, Antonios, Fujita, Yuya, Watanabe, Shinji
Recent works in end-to-end speech-to-text translation (ST) have proposed multi-tasking methods with soft parameter sharing, which leverage machine translation (MT) data via secondary encoders that map text inputs to an eventual cross-modal representation. In this work, we instead propose an ST/MT multi-tasking framework with hard parameter sharing, in which all model parameters are shared cross-modally. Our method reduces the speech-text modality gap via a pre-processing stage which converts speech and text inputs into two discrete token sequences of similar length, which allows models to process both modalities indiscriminately using a single joint vocabulary. With experiments on MuST-C, we demonstrate that our multi-tasking framework improves attentional encoder-decoder, Connectionist Temporal Classification (CTC), transducer, and joint CTC/attention models by an average of +0.5 BLEU without any external MT data. Further, we show that this framework incorporates external MT data, yielding +0.8 BLEU, and also improves transfer learning from pre-trained textual models, yielding +1.8 BLEU.
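As a rough illustration of the joint-vocabulary idea in this abstract, the sketch below maps discrete speech units and text subwords into one shared ID space, so that a single model could consume either modality. The function names, the modality prefixes, and the toy inventories are all assumptions made for illustration; the abstract does not specify how the authors build their vocabulary.

```python
# Hypothetical sketch: build_joint_vocab and encode are illustrative names,
# not functions from the paper or from any library.

def build_joint_vocab(speech_units, text_tokens):
    """Map speech units and text subwords into one shared ID space."""
    vocab = {"<pad>": 0, "<unk>": 1}
    # Prefix each symbol with its modality so that a speech unit and a
    # text token with the same surface form never collide.
    for unit in speech_units:
        vocab.setdefault(f"<speech:{unit}>", len(vocab))
    for token in text_tokens:
        vocab.setdefault(f"<text:{token}>", len(vocab))
    return vocab


def encode(symbols, modality, vocab):
    """Convert a sequence from either modality into joint-vocabulary IDs."""
    unk = vocab["<unk>"]
    return [vocab.get(f"<{modality}:{s}>", unk) for s in symbols]


# Toy inventories (assumed): cluster indices for speech, subwords for text.
vocab = build_joint_vocab(speech_units=range(4),
                          text_tokens=["▁hel", "lo", "▁world"])

speech_ids = encode([2, 2, 0, 3], "speech", vocab)
text_ids = encode(["▁hel", "lo", "▁world"], "text", vocab)
```

Both `speech_ids` and `text_ids` are plain lists of integers drawn from the same table, which is what lets one set of embedding and model parameters serve both inputs.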
Multi-task learning in Machine Learning
In most machine learning contexts, we solve a single task at a time: whatever the task is, the problem is framed as using data to optimize a single metric. However, this approach eventually hits a performance ceiling, often because of the size of the dataset or the limited ability of the model to learn meaningful representations from it. Multi-task learning, by contrast, is an approach in which we learn multiple tasks simultaneously, optimizing multiple loss functions at once. Rather than training an independent model for each task, we train a single model to perform all of the tasks. In the process, the model uses all of the available data across the different tasks to learn generalized representations that are useful in multiple contexts.
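The idea of optimizing several loss functions with one model can be sketched in a few lines. The toy example below is an assumption-laden illustration rather than any particular paper's method: a single shared weight `w` feeds two task-specific heads (`head_a`, `head_b`), and plain gradient descent minimizes the sum of the two per-task mean squared errors. The data, names, and learning rate are all made up for the example.

```python
# Toy data (assumed): both tasks see the same inputs but predict different targets.
xs = [-1.0, -0.5, 0.0, 0.5, 1.0]
ys_a = [2.0 * x for x in xs]    # task A targets
ys_b = [-3.0 * x for x in xs]   # task B targets

# Hard parameter sharing: `w` is shared by both tasks, while
# `head_a` and `head_b` are task-specific output weights.
w, head_a, head_b = 1.0, 0.5, 0.5
lr = 0.05


def joint_loss(w, head_a, head_b):
    """Sum of the two per-task mean squared errors."""
    n = len(xs)
    loss_a = sum((head_a * w * x - y) ** 2 for x, y in zip(xs, ys_a)) / n
    loss_b = sum((head_b * w * x - y) ** 2 for x, y in zip(xs, ys_b)) / n
    return loss_a + loss_b


for _ in range(2000):
    n = len(xs)
    grad_w = grad_a = grad_b = 0.0
    for x, y_a, y_b in zip(xs, ys_a, ys_b):
        h = w * x                      # shared representation
        err_a = head_a * h - y_a
        err_b = head_b * h - y_b
        # The shared weight receives gradient from both task losses.
        grad_w += 2 * (err_a * head_a + err_b * head_b) * x / n
        # Each head receives gradient only from its own task loss.
        grad_a += 2 * err_a * h / n
        grad_b += 2 * err_b * h / n
    w -= lr * grad_w
    head_a -= lr * grad_a
    head_b -= lr * grad_b

final_loss = joint_loss(w, head_a, head_b)
```

The point of the sketch is the gradient flow: the shared parameter is updated by every task, while each head is updated only by its own loss.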
Sharp Bias-variance Tradeoffs of Hard Parameter Sharing in High-dimensional Linear Regression
Zhang, Hongyang R., Yang, Fan, Wu, Sen, Su, Weijie J., Ré, Christopher
Hard parameter sharing for multi-task learning is widely used in empirical research despite the fact that its generalization properties have not been well established in many cases. This paper studies its generalization properties in a fundamental setting: How does hard parameter sharing work given multiple linear regression tasks? We develop new techniques and establish a number of new results in the high-dimensional setting, where the sample size and feature dimension increase at a fixed ratio. First, we show a sharp bias-variance decomposition of hard parameter sharing, given multiple tasks with the same features. Second, we characterize the asymptotic bias-variance limit for two tasks, even when they have arbitrarily different sample size ratios and covariate shifts. We also demonstrate that these limiting estimates for the empirical loss are incredibly accurate in moderate dimensions. Finally, we explain an intriguing phenomenon where increasing one task's sample size helps another task initially by reducing variance but hurts eventually due to increasing bias. This suggests progressively adding data for optimizing hard parameter sharing, and we validate its efficiency in text classification tasks.
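For orientation, one common way to write hard parameter sharing for $t$ linear regression tasks is as a shared feature layer followed by task-specific output layers. The notation below ($X^{(i)}$, $Y^{(i)}$, $B$, $A_i$, $n_i$, $p$, $r$) is assumed for illustration, since the abstract itself does not fix any symbols:

```latex
\[
  f(A, B) \;=\; \sum_{i=1}^{t} \bigl\| X^{(i)} B A_i - Y^{(i)} \bigr\|_2^2 ,
  \qquad
  X^{(i)} \in \mathbb{R}^{n_i \times p},\;
  B \in \mathbb{R}^{p \times r},\;
  A_i \in \mathbb{R}^{r},\;
  Y^{(i)} \in \mathbb{R}^{n_i}.
\]
```

Here $B$ is shared by every task and each $A_i$ belongs to task $i$ alone. The high-dimensional setting the abstract mentions corresponds to the sample sizes $n_i$ and the feature dimension $p$ growing at a fixed ratio.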
A Brief Review of Deep Multi-task Learning and Auxiliary Task Learning
Vafaeikia, Partoo, Namdar, Khashayar, Khalvati, Farzad
Multi-task learning (MTL) is broadly used across various applications of machine learning and has several advantages over single-task learning. Since layers are shared between different tasks and features are not repeatedly calculated for each task, the amount of memory used is reduced and the inference speed is improved. In addition, if tasks share complementary information, they act as regularizers for each other, which improves the prediction performance of each task [1]. This has been demonstrated in various areas such as detection and classification [2], computer vision [3, 4], depth estimation [5], natural language processing [6-8] and drug discovery [9]. The goal of this review paper is to provide an overview of various deep multi-task learning (dMTL) solutions and possible improvements in performance through efficient auxiliary task selection.
Learning what to share between loosely related tasks
Ruder, Sebastian, Bingel, Joachim, Augenstein, Isabelle, Søgaard, Anders
Multi-task learning is motivated by the observation that humans bring to bear what they know about related problems when solving new ones. Similarly, deep neural networks can profit from related tasks by sharing parameters with other networks. However, humans do not consciously decide to transfer knowledge between tasks. In Natural Language Processing (NLP), it is hard to predict if sharing will lead to improvements, particularly if tasks are only loosely related. To overcome this, we introduce Sluice Networks, a general framework for multi-task learning where trainable parameters control the amount of sharing. Our framework generalizes previous proposals in enabling sharing of all combinations of subspaces, layers, and skip connections. We perform experiments on three task pairs, and across seven different domains, using data from OntoNotes 5.0, and achieve up to 15% average error reductions over common approaches to multi-task learning. We show that a) label entropy is predictive of gains in sluice networks, confirming findings for hard parameter sharing and b) while sluice networks easily fit noise, they are robust across domains in practice.
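The notion of parameters that control the amount of sharing can be illustrated with a deliberately simplified mixing unit. The sketch below is not the Sluice Network architecture itself: it only shows a 2x2 matrix `alpha` recombining the hidden vectors of two task networks at one layer, with fixed values standing in for weights that would normally be trained. The function name and the example values are assumptions.

```python
def mix_layers(h_a, h_b, alpha):
    """Linearly recombine two task-specific hidden vectors.

    alpha is a 2x2 matrix of (normally trainable) sharing weights:
    row 0 builds the new task-A vector, row 1 the new task-B vector.
    """
    new_a = [alpha[0][0] * a + alpha[0][1] * b for a, b in zip(h_a, h_b)]
    new_b = [alpha[1][0] * a + alpha[1][1] * b for a, b in zip(h_a, h_b)]
    return new_a, new_b


h_a = [1.0, 2.0]   # hidden vector from the task-A network
h_b = [3.0, 4.0]   # hidden vector from the task-B network

# Identity weights: each task keeps its own representation (no sharing).
no_sharing = mix_layers(h_a, h_b, [[1.0, 0.0], [0.0, 1.0]])

# Uniform weights: both tasks receive the same averaged representation.
full_sharing = mix_layers(h_a, h_b, [[0.5, 0.5], [0.5, 0.5]])
```

Values of `alpha` between these two extremes give partial sharing. According to the abstract, the full framework learns such weights across subspaces, layers, and skip connections rather than at a single layer as in this sketch.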