risk bound
Knowledge Distillation in Wide Neural Networks: Risk Bound, Data Efficiency and Imperfect Teacher
Knowledge distillation is a strategy of training a student network with guide of the soft output from a teacher network. It has been a successful method of model compression and knowledge transfer. However, currently knowledge distillation lacks a convincing theoretical understanding. On the other hand, recent finding on neural tangent kernel enables us to approximate a wide neural network with a linear model of the network's random features. In this paper, we theoretically analyze the knowledge distillation of a wide neural network. First we provide a transfer risk bound for the linearized model of the network. Then we propose a metric of the task's training difficulty, called data inefficiency. Based on this metric, we show that for a perfect teacher, a high ratio of teacher's soft labels can be beneficial. Finally, for the case of imperfect teacher, we find that hard labels can correct teacher's wrong prediction, which explains the practice of mixing hard and soft labels.
Risk Bounds of Multi-Pass SGD for Least Squares in the Interpolation Regime
Stochastic gradient descent (SGD) has achieved great success due to its superior performance in both optimization and generalization. Most of existing generalization analyses are made for single-pass SGD, which is a less practical variant compared to the commonly-used multi-pass SGD. Besides, theoretical analyses for multi-pass SGD often concern a worst-case instance in a class of problems, which may be pessimistic to explain the superior generalization ability for some particular problem instance. The goal of this paper is to provide an instance-dependent excess risk bound of multi-pass SGD for least squares in the interpolation regime, which is expressed as a function of the iteration number, stepsize, and data covariance. We show that the excess risk of SGD can be exactly decomposed into the excess risk of GD and a positive fluctuation error, suggesting that SGD always performs worse, instance-wisely, than GD, in generalization. On the other hand, we show that although SGD needs more iterations than GD to achieve the same level of excess risk, it saves the number of stochastic gradient evaluations, and therefore is preferable in terms of computational time.
Risk Bounds for Over-parameterized Maximum Margin Classification on Sub-Gaussian Mixtures
Modern machine learning systems such as deep neural networks are often highly over-parameterized so that they can fit the noisy training data exactly, yet they can still achieve small test errors in practice. In this paper, we study this benign overfitting phenomenon of the maximum margin classifier for linear classification problems. Specifically, we consider data generated from sub-Gaussian mixtures, and provide a tight risk bound for the maximum margin linear classifier in the over-parameterized setting. Our results precisely characterize the condition under which benign overfitting can occur in linear classification problems, and improve on previous work. They also have direct implications for over-parameterized logistic regression.
Review for NeurIPS paper: Knowledge Distillation in Wide Neural Networks: Risk Bound, Data Efficiency and Imperfect Teacher
Weaknesses: Cons: - The paper assumes the student network f is overparameterized. However, there is no definition for f. There is also no comment on why you choose an overparameterized f (other than making it easier to analyze). In practice, student networks are usually small, which is mentioned in your introduction. Then, why is it meaningful to look at an overparameterized student network? More importantly, the paper assumes convergence when f is overparameterized, which is in fact not guaranteed.
Review for NeurIPS paper: Knowledge Distillation in Wide Neural Networks: Risk Bound, Data Efficiency and Imperfect Teacher
Knowledge distillation is not well understood and the reviewers agree that there's value in studying this topic. The results make wide network and NTK assumptions, which some reviewers (and this AC) eschew as unrealistic, however it's still exciting to see a theoretical step forward on such an opaque issue.
Knowledge Distillation in Wide Neural Networks: Risk Bound, Data Efficiency and Imperfect Teacher
Knowledge distillation is a strategy of training a student network with guide of the soft output from a teacher network. It has been a successful method of model compression and knowledge transfer. However, currently knowledge distillation lacks a convincing theoretical understanding. On the other hand, recent finding on neural tangent kernel enables us to approximate a wide neural network with a linear model of the network's random features. In this paper, we theoretically analyze the knowledge distillation of a wide neural network.
Risk Bounds of Multi-Pass SGD for Least Squares in the Interpolation Regime
Stochastic gradient descent (SGD) has achieved great success due to its superior performance in both optimization and generalization. Most of existing generalization analyses are made for single-pass SGD, which is a less practical variant compared to the commonly-used multi-pass SGD. Besides, theoretical analyses for multi-pass SGD often concern a worst-case instance in a class of problems, which may be pessimistic to explain the superior generalization ability for some particular problem instance. The goal of this paper is to provide an instance-dependent excess risk bound of multi-pass SGD for least squares in the interpolation regime, which is expressed as a function of the iteration number, stepsize, and data covariance. We show that the excess risk of SGD can be exactly decomposed into the excess risk of GD and a positive fluctuation error, suggesting that SGD always performs worse, instance-wisely, than GD, in generalization.
Risk Bounds for Over-parameterized Maximum Margin Classification on Sub-Gaussian Mixtures
Modern machine learning systems such as deep neural networks are often highly over-parameterized so that they can fit the noisy training data exactly, yet they can still achieve small test errors in practice. In this paper, we study this "benign overfitting" phenomenon of the maximum margin classifier for linear classification problems. Specifically, we consider data generated from sub-Gaussian mixtures, and provide a tight risk bound for the maximum margin linear classifier in the over-parameterized setting. Our results precisely characterize the condition under which benign overfitting can occur in linear classification problems, and improve on previous work. They also have direct implications for over-parameterized logistic regression.
On the Complexity of Linear Prediction: Risk Bounds, Margin Bounds, and Regularization
We provide sharp bounds for Rademacher and Gaussian complexities of (constrained) linear classes. These bounds make short work of providing a number of corollaries including: risk bounds for linear prediction (including settings where the weight vectors are constrained by either L_2 or L_1 constraints), margin bounds (including both L_2 and L_1 margins, along with more general notions based on relative entropy), a proof of the PAC-Bayes theorem, and L_2 covering numbers (with L_p norm constraints and relative entropy constraints). In addition to providing a unified analysis, the results herein provide some of the sharpest risk and margin bounds (improving upon a number of previous results). Interestingly, our results show that the uniform convergence rates of empirical risk minimization algorithms tightly match the regret bounds of online learning algorithms for linear prediction (up to a constant factor of 2).
Shifting, One-Inclusion Mistake Bounds and Tight Multiclass Expected Risk Bounds
Under the prediction model of learning, a prediction strategy is presented with an i.i.d. By exploiting the structure of F, Haussler et al. achieved a VC(F)/n bound for the natural one-inclusion prediction strategy, improving on bounds implied by PAC-type results by a O(log n) factor. The key data structure in their result is the natural subgraph of the hypercube--the one-inclusion graph; the key step is a d VC(F) bound on one-inclunion graph density. The first main result of this s /n -1 paper is a density bound of n d-1 ( d) d, which positively resolves a conjecture of Kuzmin & Warmuth relating to their unlabeled Peeling compression scheme and also leads to an improved mistake bound for the randomized (deterministic) one-inclusion strategy for all d (for d (n)). The proof uses a new form of VC-invariant shifting and a group-theoretic symmetrization.