A More Unified Theory of Transfer Learning

Hanneke, Steve, Kpotufe, Samory

arXiv.org Artificial Intelligence 

Domain Adaptation or Transfer Learning refers generally to the problem of harnessing data from a source distribution P to improve prediction performance w.r.t. a target distribution Q for which some or no data is available. This problem has been researched over the last few decades, with a recent resurgence of interest driven by modern applications that are often characterized by a scarcity of perfect target data. A fundamental question in the theory of domain adaptation (and variant problems on distribution shifts) is how to measure the relatedness between source P and target Q distributions. Importantly, desired measures of relatedness should not only tightly capture the predictive information P has on Q, but must also be practically useful: that is, either the measure can be estimated from data to facilitate algorithmic design, or, more generally, it should somehow admit adaptive procedures, i.e., procedures whose performance adapts to the a priori unknown level of relatedness between P and Q. Many notions have been proposed over the last few decades, starting with the seminal works of Mansour et al. [2009] and Ben-David et al. [2010] on refinements of total variation for domain adaptation in classification, to more recent proposals for domain adaptation in regression, e.g., Wasserstein distances Redko et al. [2017], Shen et al. [2018], or measures relating covariance structures across P and Q as in Mousavi Kalan et al. [2020], Zhang et al. [2022b], Ge et al. [2023]. These various notions of relatedness appear hard to compare at first glance, leading to a disparate theory of domain adaptation at present, with no unified set of principles.
Interestingly, as we show, upon a closer look at the existing literature--whether in classification or regression--it turns out that many seemingly distinct measures of relatedness proposed in domain adaptation actually implicitly bound the same fundamental quantities: we refer to these quantities as weak and strong moduli of transfer, and they roughly measure how fast the Q-risk of predictors decreases as their P-risk decreases. These moduli always yield rates of transfer as tight as, or tighter than, those of many existing notions, while also admitting adaptive procedures in general settings, as shown via a reduction to the existence of certain confidence sets for the prediction problem at hand. These reductions, while of a theoretical nature, yield insights on general adaptive transfer approaches that are less tied to specific measures of relatedness between source P and target Q.
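To make the role of such a modulus concrete, one schematic formalization is sketched below. The notation (hypothesis class H, excess risks E_P, E_Q) is illustrative only; the paper's precise definitions of the weak and strong moduli may differ in detail.

```latex
% Excess risks of a predictor h under source P and target Q:
%   E_P(h) = R_P(h) - inf_{h' in H} R_P(h'),  and E_Q(h) analogously.
% A modulus of transfer records the worst target excess risk compatible
% with a given source excess risk (schematic; illustrative notation):
\delta(\varepsilon) \;=\; \sup\bigl\{\, \mathcal{E}_Q(h) \;:\; h \in \mathcal{H},\ \mathcal{E}_P(h) \le \varepsilon \,\bigr\}.
% Any predictor \hat h then satisfies E_Q(\hat h) <= \delta(E_P(\hat h)),
% so the faster \delta decays near 0, the more predictive information
% the source P carries about the target Q.
```

Under this reading, a proposed measure of relatedness "implicitly bounds the modulus" whenever it yields an upper bound on delta, which is why rates derived directly from the modulus can only be as tight or tighter.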