Fagan, Francois
Robust Implicit Backpropagation
Fagan, Francois, Iyengar, Garud
Arguably the biggest challenge in applying neural networks is tuning the hyperparameters, in particular the learning rate. The sensitivity to the learning rate is due to the reliance on backpropagation to train the network. In this paper we present the first application of Implicit Stochastic Gradient Descent (ISGD) to train neural networks, a method known in convex optimization to be unconditionally stable and robust to the learning rate. Our key contribution is a novel layer-wise approximation of ISGD which makes its updates tractable for neural networks. Experiments show that our method is more robust to high learning rates and generally outperforms standard backpropagation on a variety of tasks.
Unbiased scalable softmax optimization
Fagan, Francois, Iyengar, Garud
Recent neural network and language models rely on softmax distributions with an extremely large number of categories. Since calculating the softmax normalizing constant in this context is prohibitively expensive, there is a growing literature of efficiently computable but biased estimates of the softmax. In this paper we propose the first unbiased algorithms for maximizing the softmax likelihood whose work per iteration is independent of the number of classes and datapoints (and no extra work is required at the end of each epoch). We show that our proposed unbiased methods comprehensively outperform the state-of-the-art on seven real world datasets.
TripleSpin - a generic compact paradigm for fast machine learning computations
Choromanski, Krzysztof, Fagan, Francois, Gouy-Pailler, Cedric, Morvan, Anne, Sarlos, Tamas, Atif, Jamal
We present a generic compact computational framework relying on structured random matrices that can be applied to speed up several machine learning algorithms with almost no loss of accuracy. The applications include new fast LSH-based algorithms, efficient kernel computations via random feature maps, convex optimization algorithms, quantization techniques and many more. Certain models of the presented paradigm are even more compressible since they apply only bit matrices. This makes them suitable for deploying on mobile devices. All our findings come with strong theoretical guarantees. In particular, as a byproduct of the presented techniques and by using relatively new Berry-Esseen-type CLT for random vectors, we give the first theoretical guarantees for one of the most efficient existing LSH algorithms based on the $\textbf{HD}_{3}\textbf{HD}_{2}\textbf{HD}_{1}$ structured matrix ("Practical and Optimal LSH for Angular Distance"). These guarantees as well as theoretical results for other aforementioned applications follow from the same general theoretical principle that we present in the paper. Our structured family contains as special cases all previously considered structured schemes, including the recently introduced $P$-model. Experimental evaluation confirms the accuracy and efficiency of TripleSpin matrices.
Fast nonlinear embeddings via structured matrices
Choromanski, Krzysztof, Fagan, Francois
We present a new paradigm for speeding up randomized computations of several frequently used functions in machine learning. In particular, our paradigm can be applied for improving computations of kernels based on random embeddings. Above that, the presented framework covers multivariate randomized functions. As a byproduct, we propose an algorithmic approach that also leads to a significant reduction of space complexity. Our method is based on careful recycling of Gaussian vectors into structured matrices that share properties of fully random matrices. The quality of the proposed structured approach follows from combinatorial properties of the graphs encoding correlations between rows of these structured matrices. Our framework covers as special cases already known structured approaches such as the Fast Johnson-Lindenstrauss Transform, but is much more general since it can be applied also to highly nonlinear embeddings. We provide strong concentration results showing the quality of the presented paradigm.