Unlocking Tuning-free Generalization: Minimizing the PAC-Bayes Bound with Trainable Priors

Zhang, Xitong; Ghosh, Avrajit; Liu, Guangliang; Wang, Rongrong

arXiv.org Machine Learning 

It is widely recognized that the generalization ability of neural networks can be greatly enhanced by carefully tuning the training procedure. The current state-of-the-art training approach uses stochastic gradient descent (SGD) or Adam together with additional regularization techniques such as weight decay, dropout, or noise injection. Optimal generalization can only be achieved by extensively tuning a multitude of hyper-parameters, which is time-consuming and requires an additional validation dataset. To address this issue, we present a nearly tuning-free PAC-Bayes training framework that requires no extra regularization. This framework achieves test performance comparable to that of SGD/Adam, even when the latter are optimized through a complete grid search and supplemented with additional regularization terms.

To understand the underlying benefits of these strategies, numerous studies have examined them individually. For instance, it has been shown that larger learning rates (Cohen et al., 2021; Barrett & Dherin, 2020), momentum (Ghosh et al., 2022), smaller batch sizes (Lee & Jang, 2022), and batch normalization (Luo et al., 2018) each induce a higher degree of implicit regularization on the sharpness of the loss function, yielding better generalization. Additionally, the intensity of explicit regularization techniques such as weight decay (Loshchilov & Hutter, 2017), dropout (Wei et al., 2020), parameter noise injection (Neelakantan et al., 2015; Orvieto et al., 2022), and label noise (Damian et al., 2021) can significantly affect generalization. Despite these observations and explanations, it is unclear why seeking optimal combinations of these regularizations is still crucial in practice.
Adjusting the intensity of each regularization for different scenarios can be tedious, especially since previous research has indicated that some techniques can conflict with each other (Li et al., 2019). We summarize this challenge for conventional training in (Q1). Alternatively, PAC-Bayes generalization bounds provide foundational insights into generalization in the absence of validation and testing data (Shawe-Taylor & Williamson, 1997). Jiang et al. (2019) further suggest that PAC-Bayes bounds are among the best for evaluating generalization capabilities. Although PAC-Bayes bounds were traditionally used only in the post-training stage for quality control (Vapnik, 1998; McAllester, 1999), recent work (Dziugaite & Roy, 2017b) has opened the door to using these bounds during the training phase. They showed that one can train a network directly by optimizing the PAC-Bayes bound, a strategy we refer to as PAC-Bayes training, and obtain reasonable performance.
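
To make the idea of an objective built from a PAC-Bayes bound concrete, the following is a minimal sketch of a classical McAllester-style bound for a loss bounded in [0, 1], with diagonal Gaussian prior and posterior over the weights. It is not the bound or the training procedure proposed in this paper; the function names, the Gaussian parameterization, and the default delta are assumptions chosen for illustration only.

```python
import math


def kl_diag_gaussians(mu_q, sigma_q, mu_p, sigma_p):
    """KL(Q || P) for diagonal Gaussians Q = N(mu_q, sigma_q^2), P = N(mu_p, sigma_p^2).

    Hypothetical helper: all arguments are equal-length sequences of floats,
    and every standard deviation must be strictly positive.
    """
    total = 0.0
    for mq, sq, mp, sp in zip(mu_q, sigma_q, mu_p, sigma_p):
        total += math.log(sp / sq) + (sq ** 2 + (mq - mp) ** 2) / (2.0 * sp ** 2) - 0.5
    return total


def pac_bayes_bound(empirical_risk, kl, n, delta=0.05):
    """Classical McAllester-style upper bound on the expected risk.

    Assumes the loss takes values in [0, 1]. With probability at least
    1 - delta over the draw of n training samples:
        risk <= empirical_risk + sqrt((KL + ln(2 * sqrt(n) / delta)) / (2 * n))
    """
    complexity = (kl + math.log(2.0 * math.sqrt(n) / delta)) / (2.0 * n)
    return empirical_risk + math.sqrt(complexity)


# Illustrative numbers only: a posterior that moved slightly away from the prior.
kl = kl_diag_gaussians([0.1, -0.2], [0.5, 0.5], [0.0, 0.0], [1.0, 1.0])
bound = pac_bayes_bound(empirical_risk=0.08, kl=kl, n=50_000)
```

In a training setting of this kind, the posterior parameters would be updated by gradient descent on the bound itself rather than on the empirical risk alone, so the KL term plays the role that hand-tuned regularizers play in conventional training.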
