excess risk
Understanding the Gains from Repeated Self-Distillation
Pareek, Divyansh, Du, Simon S., Oh, Sewoong
Self-distillation is a special type of knowledge distillation in which the student model has the same architecture as the teacher model. Despite using the same architecture and the same training data, self-distillation has been empirically observed to improve performance, especially when applied repeatedly. For such a process, there is a fundamental question of interest: how much gain is possible by applying multiple steps of self-distillation? To investigate this relative gain, we propose studying the simple but canonical task of linear regression. Our analysis shows that the excess risk achieved by multi-step self-distillation can significantly improve upon a single step of self-distillation, reducing the excess risk by a factor as large as $d$, where $d$ is the input dimension. Empirical results on regression tasks from the UCI repository show a reduction in the learnt model's risk (MSE) by up to 47%.
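The repeated self-distillation procedure described above can be sketched for linear regression: each student has the same model class (ridge regression here, as an illustrative assumption) and is refit on the previous model's predictions in place of the labels. All names and hyperparameters below are illustrative, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 100
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.5 * rng.normal(size=n)

def ridge_fit(X, y, lam=1.0):
    """Closed-form ridge regression: (X'X + lam I)^{-1} X'y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Repeated self-distillation: same architecture, same inputs,
# but each step trains on the previous student's predictions.
labels = y
for step in range(3):
    w = ridge_fit(X, labels)
    labels = X @ w  # next student's training targets

print(w)  # final student's weights
```

In this linear setting each distillation step amounts to composing the ridge "hat" operator with itself, which is what makes the multi-step excess risk analytically tractable.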
The Power and Limitation of Pretraining-Finetuning for Linear Regression under Covariate Shift
Wu, Jingfeng, Zou, Difan, Braverman, Vladimir, Gu, Quanquan, Kakade, Sham M.
In transfer learning (Pan and Yang, 2009; Sugiyama and Kawanabe, 2012), an algorithm is provided with abundant data from a source domain and scarce or no data from a target domain, and aims to train a model that generalizes well on the target domain. A simple yet effective approach is to pretrain a model with the rich source data and then finetune the model with the available target data via, e.g., stochastic gradient descent (SGD) (see, e.g., Yosinski et al. (2014)). Despite its wide applicability in practice, the power and limitation of the pretraining-finetuning-based transfer learning framework are not fully understood in theory. The focus of this work is to consider this issue in a specific transfer learning setup known as covariate shift (Pan and Yang, 2009; Sugiyama and Kawanabe, 2012), where the source and target distributions differ in their marginal distributions over the input, but coincide in their conditional distribution of the output given the input. Regarding the theory of learning with covariate shift, there exists a rich set of results (Ben-David et al., 2010; Germain et al., 2013; Mansour et al., 2009; Mohri and Muñoz Medina, 2012; Cortes and
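The pretraining-finetuning pipeline under covariate shift can be sketched in the linear regression setting the abstract studies: source and target share the same conditional $y \mid x$ (same true weight vector) but differ in their input distributions; we pretrain by least squares on abundant source data, then finetune with a few SGD passes on scarce target data. This is a minimal illustrative sketch, not the paper's exact setup; distribution scales, learning rate, and sample sizes are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 5
w_star = rng.normal(size=d)  # shared conditional: y = <w_star, x> + noise

# Covariate shift: same w_star, different marginal input distributions.
n_src, n_tgt = 500, 20
X_src = rng.normal(size=(n_src, d))                              # isotropic source inputs
X_tgt = rng.normal(size=(n_tgt, d)) * np.linspace(0.2, 2.0, d)   # rescaled target inputs
y_src = X_src @ w_star + 0.1 * rng.normal(size=n_src)
y_tgt = X_tgt @ w_star + 0.1 * rng.normal(size=n_tgt)

# Pretrain: ordinary least squares on the abundant source data.
w = np.linalg.lstsq(X_src, y_src, rcond=None)[0]

# Finetune: a few SGD passes over the scarce target data.
lr = 0.01
for _ in range(5):
    for i in range(n_tgt):
        grad = (X_tgt[i] @ w - y_tgt[i]) * X_tgt[i]  # squared-loss gradient
        w -= lr * grad

mse_tgt = float(np.mean((X_tgt @ w - y_tgt) ** 2))
print(mse_tgt)  # target-domain MSE after pretrain + finetune
```

The key structural point the sketch mirrors is that pretraining already recovers a good estimate of the shared weights, so finetuning only needs to correct for the mismatch between source and target input geometry.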