Yadav, Mohit
Kernel Interpolation with Sparse Grids
Yadav, Mohit, Sheldon, Daniel, Musco, Cameron
Structured kernel interpolation (SKI) accelerates Gaussian process (GP) inference by interpolating the kernel covariance function using a dense grid of inducing points, whose corresponding kernel matrix is highly structured and thus amenable to fast linear algebra. Unfortunately, SKI scales poorly in the dimension of the input points, since the dense grid size grows exponentially with the dimension. To mitigate this issue, we propose the use of sparse grids within the SKI framework. These grids enable accurate interpolation, but with a number of points growing more slowly with dimension. We contribute a novel nearly linear time matrix-vector multiplication algorithm for the sparse grid kernel matrix. Next, we describe how sparse grids can be combined with an efficient interpolation scheme based on simplices. With these changes, we demonstrate that SKI can be scaled to higher dimensions while maintaining accuracy.
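To make the SKI structure described in this abstract concrete, below is a minimal NumPy sketch of the standard dense-grid SKI matrix-vector product, K v ≈ W (K_uu (W^T v)), in one dimension with linear interpolation weights. It illustrates only the generic SKI factorization, not the paper's sparse-grid matvec algorithm or simplicial interpolation; the function names (`linear_interp_weights`, `ski_matvec`) and the toy setup are illustrative assumptions.

```python
import numpy as np

def rbf_kernel(a, b, lengthscale=1.0):
    """Squared-exponential kernel between two 1-D point sets."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

def linear_interp_weights(x, grid):
    """Linear interpolation weights from data points x onto a regular 1-D grid.
    Returned dense here for clarity; SKI stores W as a sparse matrix."""
    n, m = len(x), len(grid)
    h = grid[1] - grid[0]
    idx = np.clip(((x - grid[0]) / h).astype(int), 0, m - 2)
    frac = (x - grid[idx]) / h
    W = np.zeros((n, m))
    W[np.arange(n), idx] = 1.0 - frac
    W[np.arange(n), idx + 1] = frac
    return W

def ski_matvec(W, K_uu, v):
    """Approximate K v using the SKI factorization K ~ W K_uu W^T."""
    return W @ (K_uu @ (W.T @ v))

# toy check against the exact kernel matvec
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 1.0, 200))
grid = np.linspace(0.0, 1.0, 50)
W = linear_interp_weights(x, grid)
K_uu = rbf_kernel(grid, grid, lengthscale=0.3)
v = rng.standard_normal(200)
print(np.max(np.abs(ski_matvec(W, K_uu, v) - rbf_kernel(x, x, lengthscale=0.3) @ v)))
```

In SKI the grid kernel matrix K_uu is Toeplitz (or Kronecker-of-Toeplitz in higher dimensions), so its matvec runs in O(m log m); the paper's contribution is an analogous fast matvec when the dense grid is replaced by a sparse grid.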
Faster Kernel Interpolation for Gaussian Processes
Yadav, Mohit, Sheldon, Daniel, Musco, Cameron
A key challenge in scaling Gaussian process (GP) regression to massive datasets is that exact inference requires computation with a dense n x n kernel matrix, where n is the number of data points. Significant work focuses on approximating the kernel matrix via interpolation using a smaller set of m inducing points. Structured kernel interpolation (SKI) is among the most scalable methods: by placing inducing points on a dense grid and using structured matrix algebra, SKI achieves per-iteration time of O(n + m log m) for approximate inference. This linear scaling in n enables inference for very large datasets; however, the cost is incurred per iteration, which remains a limitation for extremely large n. We show that the SKI per-iteration time can be reduced to O(m log m) after a single O(n)-time precomputation step by reframing SKI as solving a natural Bayesian linear regression problem with a fixed set of m compact basis functions. With per-iteration complexity independent of the dataset size n for a fixed grid, our method scales to truly massive datasets. We demonstrate speedups in practice for a wide range of m and n and apply the method to GP inference on a three-dimensional weather radar dataset with over 100 million points.
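The sketch below illustrates, under loose assumptions about the paper's formulation, how the O(n) work can be pushed into a one-time precomputation: a single pass over the data forms the sparse m x m matrix W^T W and the m-vector W^T y, after which each conjugate-gradient iteration touches only m-sized quantities (a structured K_uu matvec would run in O(m log m); a dense matmul stands in for it in the toy). The function names and the exact m x m system solved are assumptions for illustration, not the paper's algorithm.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import LinearOperator, cg

def precompute_stats(W, y):
    """Single O(n) pass over the data: afterwards, iterative inference only
    touches the sparse m x m matrix W^T W and the length-m vector W^T y."""
    return (W.T @ W).tocsr(), W.T @ y

def solve_inducing_weights(K_uu_matvec, WtW, Wty, noise_var, m):
    """Solve the symmetric m x m system
        (K_uu W^T W K_uu + noise_var * K_uu) alpha = K_uu W^T y
    by CG, then return the posterior mean of the inducing weights u = K_uu alpha
    in the Bayesian linear regression view f = W u, u ~ N(0, K_uu)."""
    def matvec(v):
        Kv = K_uu_matvec(v)
        return K_uu_matvec(WtW @ Kv) + noise_var * Kv
    A = LinearOperator((m, m), matvec=matvec)
    alpha, _ = cg(A, K_uu_matvec(Wty))
    return K_uu_matvec(alpha)

# toy usage; a dense grid kernel stands in for the structured O(m log m) matvec
rng = np.random.default_rng(0)
n, m = 1000, 40
grid = np.linspace(0.0, 1.0, m)
K_uu = np.exp(-0.5 * ((grid[:, None] - grid[None, :]) / 0.2) ** 2) + 1e-6 * np.eye(m)
# sparse interpolation matrix W (nearest-grid-point weights, for brevity)
cols = np.clip(np.round(rng.uniform(0, 1, n) * (m - 1)).astype(int), 0, m - 1)
W = csr_matrix((np.ones(n), (np.arange(n), cols)), shape=(n, m))
y = rng.standard_normal(n)
WtW, Wty = precompute_stats(W, y)
u = solve_inducing_weights(lambda v: K_uu @ v, WtW, Wty, noise_var=0.1, m=m)
print(u.shape)  # (m,); the approximate posterior mean at the data points is W @ u
```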
Regularization and Learning an Ensemble of RNNs by Decorrelating Representations
Yadav, Mohit (TCS Research, New Delhi) | Agarwal, Sakshi (IIT Kharagpur)
Recurrent Neural Networks (RNNs) and their variants (such as LSTMs and GRUs) have been remarkably successful at machine-learning tasks on diverse kinds of sequential data (e.g., text, time-series, etc.). However, training of RNNs continues to be a challenge due to difficulties stemming from regularization and the highly non-convex optimization involved. In this paper, we propose to regularize the training of RNNs by encouraging higher decorrelation in the hidden representations. The cost function is devised to minimize the non-diagonal elements of the correlation matrix computed over the hidden representations of RNNs, along with the usual training-accuracy term, thereby penalizing redundancy in the learned model. Furthermore, we propose to utilize the idea of decorrelating representations in learning an ensemble of RNNs, in order to maximize diversity in the resulting models, thus encouraging every individual network of the ensemble to gain abilities that are complementary to the ensemble. Extensive experiments are presented on various datasets with different architectures of RNNs. Results are offered for multiple tasks and show that the proposed methods yield a significant improvement when compared with state-of-the-art methods.
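As a rough illustration of the decorrelation penalty described above, the PyTorch snippet below sums the squared off-diagonal entries of the correlation matrix of a batch of hidden representations, to be added to the usual task loss with a regularization weight. The normalization, the choice of which hidden states to stack, and the variable names are assumptions for illustration; the paper's exact formulation may differ.

```python
import torch

def decorrelation_penalty(hidden):
    """Sum of squared off-diagonal entries of the correlation matrix of
    hidden representations. `hidden` has shape (num_samples, hidden_dim);
    for an RNN one would typically stack hidden states across time steps."""
    h = hidden - hidden.mean(dim=0, keepdim=True)       # center each unit
    h = h / (h.std(dim=0, keepdim=True) + 1e-8)          # normalize to unit variance
    corr = (h.T @ h) / h.shape[0]                        # hidden_dim x hidden_dim correlation matrix
    off_diag = corr - torch.diag(torch.diag(corr))       # zero out the diagonal
    return (off_diag ** 2).sum()

# hypothetical training step: add the penalty to the usual task loss
# loss = task_loss + reg_weight * decorrelation_penalty(stacked_hidden_states)
```

The same penalty can be applied across the hidden states of different ensemble members (rather than the units of a single network) to encourage diversity among the ensemble's models, as the abstract describes.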