Towards the Fundamental Limits of Knowledge Transfer over Finite Domains

Qingyue Zhao, Banghua Zhu

arXiv.org Machine Learning 

It has become common sense that transferring as much of a teacher's intrinsic information as possible can expedite a student's learning progress, especially in machine learning, where versatile and powerful teacher models abound. Learning with such assistance has been coined knowledge distillation (KD) (Hinton et al., 2015; Lopez-Paz et al., 2015), a famous paradigm of knowledge transfer with remarkable empirical effectiveness in classification tasks across various downstream applications (Gou et al., 2021; Wang and Yoon, 2021; Gu et al., 2023b). The term distillation implies a belief that the inscrutable teacher(s) may possess useful yet complicated structural information, which we should be able to compress and inject into a compact model, i.e., the student (Breiman and Shang, 1996; Buciluǎ et al., 2006; Li et al., 2014; Ba and Caruana, 2014; Allen-Zhu and Li, 2020). This belief has guided the community towards a line of knowledge transfer methods that exploit teacher training details or snapshots, such as the original training set, the intermediate activations, the last-layer logits (for a probabilistic classifier), first- or second-order derivative or statistical information, and even task-specific knowledge (Hinton et al., 2015; Furlanello et al., 2018; Cho and Hariharan, 2019; Zhao et al., 2022; Romero et al., 2014; Zagoruyko and Komodakis, 2016).
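As a concrete illustration of the logit-based transfer mentioned above, the following is a minimal PyTorch sketch of the classic soft-target distillation loss of Hinton et al. (2015), which blends hard-label cross-entropy with a temperature-softened KL term against the teacher's logits. The function name, temperature `T`, and mixing weight `alpha` are illustrative assumptions, not details taken from this paper.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Soft-target KD loss in the style of Hinton et al. (2015).

    Combines cross-entropy on hard labels with a KL divergence that
    matches the student's temperature-softened output distribution
    to the teacher's. T and alpha are hypothetical hyperparameters.
    """
    # Standard hard-label cross-entropy on the student's raw logits.
    ce = F.cross_entropy(student_logits, labels)
    # KL divergence between temperature-softened distributions; the
    # T**2 factor keeps gradient magnitudes comparable across temperatures.
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T ** 2)
    return alpha * ce + (1.0 - alpha) * kd
```

In this formulation, `alpha = 1.0` recovers ordinary supervised training, while smaller values weight the teacher's soft targets more heavily.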
