

 Mathematical & Statistical Methods


Jeff Goldblum should make a film about this legendary mathematician

New Scientist

Paul Erdős was one of the most prolific mathematicians to ever live, known for showing up at the door of others in the field and declaring they should host and feed him while they do maths together. I come to you with something a little different for my latest maths column - a plea to Hollywood to make a comedy biopic about one of the greatest mathematicians of all time, Paul Erdős. Why is Erdős (pronounced "air-dish") deserving of such acclaim? With almost 1500 papers to his name, he is probably the most prolific mathematician that ever lived, and possibly that will ever live. Unsurprisingly, with that many papers, he is known for his work across many areas of maths, from probability to number theory to graph theory.




Escaping Saddle-Point Faster under Interpolation-like Conditions

Neural Information Processing Systems

One of the fundamental aspects of over-parametrized models is that they are capable of interpolating the training data. We show that, under interpolation-like assumptions satisfied by the stochastic gradients in an overparametrization setting, the first-order oracle complexity of the Perturbed Stochastic Gradient Descent (PSGD) algorithm to reach an ε-local-minimizer matches the corresponding deterministic rate of O(1/ε²).
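To make the idea of perturbing SGD near a saddle point concrete, here is a minimal sketch. It is not the algorithm analyzed in the paper. The toy objective, the multiplicative noise model (used as a stand-in for an interpolation-like condition, since the noise vanishes wherever the true gradient does), and all names and parameters (`perturbed_sgd`, `radius`, `g_tol`, `noise`) are assumptions made for illustration.

```python
import math
import random


def grad(x, y):
    # Gradient of the assumed toy objective f(x, y) = x^2/2 - y^2/2 + y^4/4,
    # which has a strict saddle at (0, 0) and local minima at (0, 1) and (0, -1).
    return x, -y + y ** 3


def stochastic_grad(x, y, rng, noise=0.5):
    # Multiplicative noise: the stochastic gradient is exactly zero wherever
    # the true gradient is zero (an assumed interpolation-like noise model).
    gx, gy = grad(x, y)
    scale = 1.0 + noise * rng.uniform(-1.0, 1.0)
    return scale * gx, scale * gy


def perturbed_sgd(x, y, steps=2000, lr=0.05, radius=1e-2, g_tol=1e-3,
                  seed=0, perturb=True):
    rng = random.Random(seed)
    for _ in range(steps):
        gx, gy = stochastic_grad(x, y, rng)
        if perturb and math.hypot(gx, gy) < g_tol:
            # Small gradient: we may be at a saddle, so jump to a point
            # drawn uniformly from a small disc around the current iterate.
            angle = rng.uniform(0.0, 2.0 * math.pi)
            r = radius * math.sqrt(rng.random())
            x += r * math.cos(angle)
            y += r * math.sin(angle)
            continue
        # Otherwise take an ordinary SGD step.
        x -= lr * gx
        y -= lr * gy
    return x, y


# Started exactly at the saddle, plain SGD never moves under this noise model,
# while the perturbed variant leaves the saddle and settles near a minimum.
stuck = perturbed_sgd(0.0, 0.0, perturb=False)
escaped = perturbed_sgd(0.0, 0.0)
```

Because the perturbation also fires near a minimum, the final iterate hovers within roughly `radius` of it; practical variants usually limit how often perturbations are applied.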




Escaping Saddle Points with Compressed SGD

Neural Information Processing Systems

Stochastic Gradient Descent (SGD) and its variants are the main workhorses of modern machine learning. Distributed implementations of SGD on a cluster of machines with a central server and a large number of workers are frequently used in practice due to the massive size of the data. In distributed SGD each machine holds a copy of the model and the computation proceeds in rounds. In every round, each worker finds a stochastic gradient based on its batch of examples, the server averages these stochastic gradients to obtain the gradient of the entire batch, makes an SGD step, and broadcasts the updated model parameters to the workers. With a large number of workers, computation parallelizes efficiently while communication becomes the main bottleneck [Chilimbi et al., 2014, Strom, 2015], since each worker needs to send its gradients to the server and receive the updated model parameters. Common solutions for this problem include: local SGD and its variants, when each machine performs multiple local steps before communication [Stich, 2018]; decentralized architectures which allow pairwise communication between the workers [McMahan et al., 2017] and gradient compression, when a compressed version of the gradient is communicated instead of the full gradient [Bernstein et al., 2018, Stich et al., 2018, Karimireddy et al., 2019]. In this work, we consider the latter approach, which we refer to as compressed SGD. Most machine learning models can be described by a d-dimensional vector of parameters x and the model quality can be estimated as a function f(x).
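The round structure described above (workers compute stochastic gradients, compress them, and the server averages and steps) can be sketched as follows. This is a single-process simulation under stated assumptions, not the method of the paper: the compressor is top-k sparsification with a per-worker error memory, the objective is a toy least-squares loss, and every name (`top_k`, `local_gradient`, `compressed_sgd`, `worker_data`) is hypothetical.

```python
import random


def top_k(vec, k):
    """Keep the k largest-magnitude entries of vec and zero out the rest."""
    order = sorted(range(len(vec)), key=lambda i: abs(vec[i]), reverse=True)
    keep = set(order[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(vec)]


def local_gradient(x, batch):
    """Gradient of the batch average of 0.5 * ||x - a||^2 over examples a."""
    return [sum(x[j] - a[j] for a in batch) / len(batch) for j in range(len(x))]


def compressed_sgd(worker_data, dim, k, lr=0.02, rounds=2000,
                   batch_size=2, seed=0):
    rng = random.Random(seed)
    x = [0.0] * dim  # model parameters, conceptually replicated on each worker
    errors = [[0.0] * dim for _ in worker_data]  # per-worker error memory
    for _ in range(rounds):
        messages = []
        for w, data in enumerate(worker_data):
            batch = rng.sample(data, batch_size)
            g = local_gradient(x, batch)
            # Add back whatever earlier rounds dropped, then compress.
            p = [errors[w][j] + lr * g[j] for j in range(dim)]
            c = top_k(p, k)
            errors[w] = [p[j] - c[j] for j in range(dim)]
            messages.append(c)  # only k nonzero entries would be sent
        # Server: average the compressed updates and take the step.
        for j in range(dim):
            x[j] -= sum(m[j] for m in messages) / len(messages)
    return x


# Assumed toy data: three workers, three 3-dimensional examples each,
# whose overall mean (the minimizer of f) is (3, -2, 1).
worker_data = [
    [[3.1, -2.0, 1.0], [2.9, -1.9, 1.1], [3.0, -2.1, 0.9]],
    [[3.0, -2.0, 1.1], [3.1, -2.1, 1.0], [2.9, -1.9, 0.9]],
    [[2.9, -2.1, 1.0], [3.0, -1.9, 0.9], [3.1, -2.0, 1.1]],
]
x_compressed = compressed_sgd(worker_data, dim=3, k=1)
```

With `k` equal to the dimension nothing is dropped and the loop reduces to ordinary distributed SGD. With smaller `k` each worker sends fewer coordinates per round, and the error memory carries the dropped coordinates forward so they are sent in later rounds.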