Scalable Kernel Learning via the Discriminant Information
Al, Mert, Hou, Zejiang, Kung, Sun-Yuan
For commonly used kernels such as the Gaussian, the gradient computations consist mainly of matrix products and linear system solutions, so they can be sped up significantly with GPU-accelerated linear system solvers. For instance, our implementation took less than 80 milliseconds to compute DI/KDI gradients on an NVIDIA P100 GPU with feature dimensionalities up to 2000 and batch sizes up to 4000, using Gaussian kernels on the 3 datasets considered.

In common learning methodologies, where a linear predictor is trained in conjunction with a parametric non-linear mapping, the overall objective is to minimize a loss function averaged over the entire training sample, i.e., to minimize the expected loss over a single empirical distribution. However, since DI directly measures the loss of the best linear predictor on a batch, stochastic gradient methods take on a different interpretation when utilizing this objective. Because each mini-batch represents a different empirical distribution, DI-based training instead aims to find a feature mapping that adapts to various empirical distributions, which can reduce overfitting, analogous to how bagging improves generalization [27].
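As a minimal sketch of why the gradient cost is dominated by matrix products and linear solves, the following PyTorch snippet evaluates a DI-style batch objective of the assumed scatter-ratio form tr((S̄ + ρI)⁻¹ S_B); the function name `di_objective`, the feature map, the ridge parameter `rho`, and all variable names are illustrative assumptions, not the authors' released implementation.

```python
# Hedged sketch: DI-style objective on a mini-batch of mapped features,
# assuming DI = tr((S_bar + rho*I)^{-1} S_B) with centered features and
# one-hot labels. Backpropagating through torch.linalg.solve keeps the
# gradient computation as matrix products plus linear system solutions,
# which run entirely on the GPU.
import torch

def di_objective(phi_x: torch.Tensor, y_onehot: torch.Tensor, rho: float = 1e-3) -> torch.Tensor:
    """phi_x: (N, d) mapped batch features; y_onehot: (N, C) one-hot labels."""
    n, d = phi_x.shape
    phi_c = phi_x - phi_x.mean(dim=0, keepdim=True)           # center the batch features
    s_bar = phi_c.T @ phi_c / n                                # total scatter, (d, d)
    # Between-class scatter S_B = Phi_c^T Y (Y^T Y)^{-1} Y^T Phi_c / n
    # (small jitter keeps Y^T Y invertible if a class is missing from the batch)
    yty = y_onehot.T @ y_onehot + 1e-8 * torch.eye(y_onehot.shape[1],
                                                   device=phi_x.device, dtype=phi_x.dtype)
    s_b = (phi_c.T @ y_onehot) @ torch.linalg.solve(yty, y_onehot.T @ phi_c) / n
    reg = rho * torch.eye(d, device=phi_x.device, dtype=phi_x.dtype)
    # tr((S_bar + rho*I)^{-1} S_B): one GPU solve plus matmuls
    return torch.trace(torch.linalg.solve(s_bar + reg, s_b))
```

In a stochastic training loop, the negative of this quantity would be backpropagated through the feature-map parameters with a standard optimizer, so each mini-batch step reduces to the solves and matrix products described above.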
Sep-23-2019