




Neural Information Processing Systems

We study linear regression under covariate shift, where the marginal distribution over the input covariates differs in the source and the target domains, while the conditional distribution of the output given the input covariates is similar across the two domains.





The Power and Limitation of Pretraining-Finetuning for Linear Regression under Covariate Shift

Jingfeng Wu

Neural Information Processing Systems

In addition, we show that finetuning, even with only a small amount of target data, could drastically reduce the amount of source data required by pretraining.



Understanding the Gains from Repeated Self-Distillation

Pareek, Divyansh, Du, Simon S., Oh, Sewoong

arXiv.org Machine Learning

Self-Distillation is a special type of knowledge distillation where the student model has the same architecture as the teacher model. Despite using the same architecture and the same training data, self-distillation has been empirically observed to improve performance, especially when applied repeatedly. For such a process, there is a fundamental question of interest: How much gain is possible by applying multiple steps of self-distillation? To investigate this relative gain, we propose studying the simple but canonical task of linear regression. Our analysis shows that the excess risk achieved by multi-step self-distillation can significantly improve upon a single step of self-distillation, reducing the excess risk by a factor as large as $d$, where $d$ is the input dimension. Empirical results on regression tasks from the UCI repository show a reduction in the learnt model's risk (MSE) by up to 47%.
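To make the repeated self-distillation procedure concrete, the following is a minimal sketch in the linear regression setting the abstract describes: each student has the same model class and training inputs as the teacher, and is refit on the teacher's predictions. The helper names (ridge_fit, self_distill), the ridge regularization strength, and the step counts are illustrative assumptions, not the paper's exact setup.

import numpy as np

def ridge_fit(X, y, lam):
    # Closed-form ridge regression: w = (X^T X + lam * I)^{-1} X^T y
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def self_distill(X, y, lam, steps):
    # Repeated self-distillation: at each step the student (same model
    # class, same training inputs) is refit on the previous model's predictions.
    targets = y
    for _ in range(steps):
        w = ridge_fit(X, targets, lam)
        targets = X @ w  # teacher predictions become the next round's labels
    return w

# Toy usage: noisy d-dimensional linear regression.
rng = np.random.default_rng(0)
n, d = 200, 10
X = rng.normal(size=(n, d))
w_star = rng.normal(size=d)
y = X @ w_star + 0.5 * rng.normal(size=n)

for k in (1, 2, 5):
    w_k = self_distill(X, y, lam=1.0, steps=k)
    print(f"{k}-step self-distillation, parameter error: {np.sum((w_k - w_star) ** 2):.4f}")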


The Power and Limitation of Pretraining-Finetuning for Linear Regression under Covariate Shift

Wu, Jingfeng, Zou, Difan, Braverman, Vladimir, Gu, Quanquan, Kakade, Sham M.

arXiv.org Artificial Intelligence

In transfer learning (Pan and Yang, 2009; Sugiyama and Kawanabe, 2012), an algorithm is provided with abundant data from a source domain and scarce or no data from a target domain, and aims to train a model that generalizes well on the target domain. A simple yet effective approach is to pretrain a model on the rich source data and then finetune it on the available target data via, e.g., stochastic gradient descent (SGD) (see, e.g., Yosinski et al. (2014)). Despite its wide applicability in practice, the power and limitation of the pretraining-finetuning framework for transfer learning are not fully understood in theory. The focus of this work is to study this issue in a specific transfer learning setup known as covariate shift (Pan and Yang, 2009; Sugiyama and Kawanabe, 2012), where the source and target distributions differ in their marginal distributions over the input but coincide in the conditional distribution of the output given the input. Regarding the theory of learning with covariate shift, there exists a rich set of results (Ben-David et al., 2010; Germain et al., 2013; Mansour et al., 2009; Mohri and Muñoz Medina, 2012; Cortes and
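As a concrete illustration of the pretraining-finetuning pipeline described above, the following is a minimal sketch for linear regression under covariate shift: SGD pretraining on abundant source data, followed by SGD finetuning from the pretrained weights on scarce target data. The covariance scaling, step sizes, sample sizes, and the helper name sgd_linear_regression are illustrative assumptions rather than the paper's actual construction.

import numpy as np

def sgd_linear_regression(X, y, w_init, lr, epochs=1):
    # Plain SGD on the squared loss, starting from w_init.
    w = w_init.copy()
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            w -= lr * (x_i @ w - y_i) * x_i
    return w

rng = np.random.default_rng(1)
d, n_source, n_target = 20, 5000, 50
w_star = rng.normal(size=d)

# Covariate shift: both domains share the regressor w_star, but their marginal
# input distributions differ; the target inputs have an anisotropic covariance.
target_scales = np.linspace(0.1, 2.0, d)
X_source = rng.normal(size=(n_source, d))
X_target = rng.normal(size=(n_target, d)) * target_scales
y_source = X_source @ w_star + 0.1 * rng.normal(size=n_source)
y_target = X_target @ w_star + 0.1 * rng.normal(size=n_target)

# Pretrain on abundant source data, then finetune from the pretrained weights
# on the scarce target data.
w_pretrained = sgd_linear_regression(X_source, y_source, np.zeros(d), lr=1e-2)
w_finetuned = sgd_linear_regression(X_target, y_target, w_pretrained, lr=1e-2, epochs=5)

# Compare excess risk on the target distribution (lower is better).
X_eval = rng.normal(size=(10000, d)) * target_scales
for name, w in [("pretrain only", w_pretrained), ("pretrain + finetune", w_finetuned)]:
    print(name, np.mean((X_eval @ (w - w_star)) ** 2))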