SGD Algorithms based on Incomplete U-statistics: Large-Scale Minimization of Empirical Risk

Neural Information Processing Systems

In many learning problems, ranging from clustering to ranking through metric learning, empirical estimates of the risk functional consist of an average over tuples (e.g., pairs or triplets) of observations, rather than over individual observations. In this paper, we focus on how to best implement a stochastic approximation approach to solve such risk minimization problems. We argue that in the large-scale setting, gradient estimates should be obtained by sampling tuples of data points with replacement (incomplete U-statistics) instead of sampling data points without replacement (complete U-statistics based on subsamples). We develop a theoretical framework accounting for the substantial impact of this strategy on the generalization ability of the prediction model returned by the Stochastic Gradient Descent (SGD) algorithm. It reveals that the method we promote achieves a much better trade-off between statistical accuracy and computational cost. Beyond the rate bound analysis, experiments on AUC maximization and metric learning provide strong empirical evidence of the superiority of the proposed approach.
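The two gradient estimators contrasted in this abstract can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the pairwise loss (|x_i - x_j| * theta - 1)^2, the scalar parameter theta, the step size, and all function names are assumptions chosen only to keep the example short.

```python
import itertools
import random


def pair_grad(theta, xi, xj):
    """Gradient in theta of the illustrative pairwise loss (|xi - xj| * theta - 1)**2."""
    d = abs(xi - xj)
    return 2.0 * (d * theta - 1.0) * d


def complete_grad(theta, data, m, rng):
    """Complete U-statistic on a subsample: draw m points without
    replacement, then average the pairwise gradient over all m(m-1)/2 pairs."""
    sub = rng.sample(data, m)
    pairs = list(itertools.combinations(sub, 2))
    return sum(pair_grad(theta, a, b) for a, b in pairs) / len(pairs)


def incomplete_grad(theta, data, n_pairs, rng):
    """Incomplete U-statistic: draw n_pairs pairs with replacement from the
    set of all pairs of distinct points, then average the pairwise gradient."""
    n = len(data)
    total = 0.0
    for _ in range(n_pairs):
        i, j = rng.sample(range(n), 2)  # two distinct indices form one pair
        total += pair_grad(theta, data[i], data[j])
    return total / n_pairs


def sgd(grad_fn, data, budget, steps=300, lr=0.5, seed=0):
    """Plain SGD on theta; `budget` is m for complete_grad and the number
    of sampled pairs for incomplete_grad."""
    rng = random.Random(seed)
    theta = 0.0
    for _ in range(steps):
        theta -= lr * grad_fn(theta, data, budget, rng)
    return theta
```

For a like-for-like comparison, the per-step cost can be matched: `sgd(complete_grad, data, budget=10)` and `sgd(incomplete_grad, data, budget=45)` both evaluate 45 pairwise gradients per iteration, but the second spreads them over pairs drawn from the whole data set rather than from 10 points.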


Ensemble Distillation for Robust Model Fusion in Federated Learning

Tao Lin

Neural Information Processing Systems

Federated Learning (FL) is a machine learning setting where many devices collaboratively train a machine learning model while keeping the training data decentralized.




A Appendix

Neural Information Processing Systems

We first give a derivation of the equivalence between label smoothing regularization and Eq. 7. Evidently, the objective does not regularize confidence diversity. "Scale both" corresponds to the originally proposed distillation objective in which both teacher and ... Plots of test accuracy and ECE against the amount of temperature scaling applied are shown in Figure 1. First, we observe that models trained with student scaling have ECE almost identical to that of the teacher models. In direct contrast, the student models trained without student scaling generally achieve much lower calibration error than their teachers. This coupled effect could be the reason for the observed conflict between ECE and accuracy.
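As a reference point for the two quantities discussed here, the sketch below shows a temperature-scaled softmax and a binned expected calibration error (ECE). It is a generic illustration, not the appendix's code: the function names and the choice of 10 equal-width confidence bins are assumptions.

```python
import math


def softmax_with_temperature(logits, T=1.0):
    """Softmax of logits / T; T > 1 flattens the distribution, T < 1 sharpens it."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp((z - m) / T) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]


def expected_calibration_error(probs, labels, n_bins=10):
    """Binned ECE: sum over bins of (bin size / n) * |accuracy - mean confidence|.

    probs is a list of per-example probability vectors, labels a list of
    integer class indices. n_bins=10 is an assumed default."""
    n = len(labels)
    # Per bin: [example count, summed confidence, summed correctness]
    bins = [[0, 0.0, 0.0] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        conf = max(p)
        pred = p.index(conf)
        b = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[b][0] += 1
        bins[b][1] += conf
        bins[b][2] += 1.0 if pred == y else 0.0
    return sum(abs(acc - conf) / n for count, conf, acc in bins if count)
```

Raising `T` lowers the top-class confidence without changing the predicted class, which is why temperature scaling can change ECE while leaving accuracy untouched.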




Convergence rates of sub-sampled Newton methods

Neural Information Processing Systems

In this regime, algorithms which utilize sub-sampling techniques are known to be effective. In this paper, we use sub-sampling techniques together with low-rank approximation to design a new randomized batch algorithm whose convergence rate is comparable to that of Newton's method, yet whose per-iteration cost is much smaller. The proposed algorithm is robust in terms of starting point and step size, and enjoys a composite convergence rate, namely, quadratic convergence at the start and linear convergence when the iterate is close to the minimizer. We develop its theoretical analysis, which also allows us to select near-optimal algorithm parameters. Our theoretical results can also be used to obtain convergence rates of previously proposed sub-sampling based algorithms. We demonstrate how our results apply to well-known machine learning problems. Lastly, we evaluate the performance of our algorithm on several datasets under various scenarios.
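The sub-sampling idea can be illustrated with a plain sub-sampled Newton step: the gradient uses all data, while the Hessian is estimated from a random subset. The sketch below is not the paper's algorithm and omits its low-rank approximation step entirely; the two-dimensional L2-regularized logistic regression objective, the closed-form 2x2 solve, and all names and parameter values are assumptions made for brevity.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def full_gradient(w, X, y, lam):
    """Gradient of the regularized logistic loss over all n points (labels in {0, 1})."""
    n = len(X)
    g = [lam * w[0], lam * w[1]]
    for (x0, x1), yi in zip(X, y):
        r = sigmoid(x0 * w[0] + x1 * w[1]) - yi
        g[0] += r * x0 / n
        g[1] += r * x1 / n
    return g


def subsampled_hessian(w, X, sample_size, lam, rng):
    """Hessian estimated from sample_size points drawn without replacement.

    Returns the entries (h00, h01, h11) of the symmetric 2x2 matrix."""
    h00 = h01 = h11 = 0.0
    for i in rng.sample(range(len(X)), sample_size):
        x0, x1 = X[i]
        p = sigmoid(x0 * w[0] + x1 * w[1])
        c = p * (1.0 - p) / sample_size
        h00 += c * x0 * x0
        h01 += c * x0 * x1
        h11 += c * x1 * x1
    return h00 + lam, h01, h11 + lam


def newton_step(w, X, y, sample_size, lam, rng, step=1.0):
    """One update w - step * H_S^{-1} g, using the sub-sampled Hessian H_S."""
    g = full_gradient(w, X, y, lam)
    h00, h01, h11 = subsampled_hessian(w, X, sample_size, lam, rng)
    det = h00 * h11 - h01 * h01  # positive because lam > 0 makes H_S positive definite
    d0 = (h11 * g[0] - h01 * g[1]) / det
    d1 = (h00 * g[1] - h01 * g[0]) / det
    return [w[0] - step * d0, w[1] - step * d1]
```

Each step costs one pass over the data for the gradient but only `sample_size` outer products for the curvature, which is where the per-iteration saving over exact Newton comes from in this simplified setting.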