Optimization


Importance of Loss Function in Machine Learning

#artificialintelligence

Assume you are given the task of filling a bag with 10 kg of sand. You keep adding sand until the weighing machine reads exactly 10 kg, and you take some out if the reading exceeds 10 kg. Just like that weighing machine, your loss function outputs a higher number the further your predictions are from the target. As you experiment with your algorithm to improve your model, the loss function tells you whether you are getting anywhere. "The function we want to minimize or maximize is called the objective function or criterion. When we are minimizing it, we may also call it the cost function, loss function, or error function" - Source. At its core, a loss function is a measure of how well your prediction model is able to predict the expected outcome (or value).
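
To make the idea concrete, here is a minimal sketch of one common loss function, the mean squared error, which grows as predictions drift away from the targets. The function name and example values below are illustrative, not taken from the article.

```python
import numpy as np

def mse_loss(y_true, y_pred):
    """Mean squared error: the average squared gap between targets and predictions."""
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)

# Predictions close to the targets give a small loss ...
print(mse_loss([10.0, 10.0], [9.8, 10.1]))   # 0.025
# ... while predictions that are far off give a large one.
print(mse_loss([10.0, 10.0], [7.0, 13.0]))   # 9.0
```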


Inside the Mind and Methodology of a Data Scientist - Birst

#artificialintelligence

When you hear about Data Science, Big Data, Analytics, Artificial Intelligence, Machine Learning, or Deep Learning, you may end up feeling a bit confused about what these terms mean. And it doesn't help reduce the confusion when every tech vendor rebrands their products as AI. So, what do these terms really mean? What are the overlaps and differences? And most importantly, what can they do for your business?


Essential Math for Data Science Machine Learning Analytikus United States

#artificialintelligence

The full article also features courses that you could attend to learn the topics listed below, as well as numerous comments. We also added a few topics that we think are important but missing from the original article.



Heavy-ball Algorithms Always Escape Saddle Points

arXiv.org Machine Learning

Nonconvex optimization algorithms with random initialization have attracted increasing attention recently. It has been shown that many first-order methods with random starting points always avoid saddle points. In this paper, we answer a question: can nonconvex heavy-ball algorithms with random initialization avoid saddle points? The answer is yes! Directly applying the existing proof technique to the heavy-ball algorithms is hard because each heavy-ball iteration depends on both the current and the previous point, so the algorithm cannot be formulated as an iteration x_{k+1} = g(x_k) under some mapping g. To this end, we design a new mapping on a new space; after a suitable transformation, the heavy-ball algorithm can be interpreted as iterations of this mapping. Theoretically, we prove that heavy-ball gradient descent enjoys a larger step size than gradient descent for escaping saddle points. The heavy-ball proximal point algorithm is also considered, and we prove that it, too, always escapes saddle points.
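
For reference, a minimal sketch of the heavy-ball (momentum) update the abstract refers to. The step size, momentum value, and toy objective below are illustrative assumptions, not the settings analyzed in the paper.

```python
import numpy as np

def heavy_ball(grad, x0, step=0.1, momentum=0.9, iters=50):
    """Heavy-ball iteration: x_{k+1} = x_k - step * grad(x_k) + momentum * (x_k - x_{k-1}).
    Note the dependence on both the current and the previous point."""
    x_prev = np.array(x0, dtype=float)
    x = x_prev.copy()
    for _ in range(iters):
        x_next = x - step * grad(x) + momentum * (x - x_prev)
        x_prev, x = x, x_next
    return x

# Toy example: f(x, y) = x^2 - y^2 has a saddle point at the origin.
# Started from a small random point, the iterate contracts along x but moves
# away along the negative-curvature direction y (f is unbounded below there).
grad_f = lambda v: np.array([2 * v[0], -2 * v[1]])
print(heavy_ball(grad_f, np.random.randn(2) * 1e-3))
```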


How Uber Eats Uses Machine Learning to Estimate Delivery Times - The New Stack

#artificialintelligence

Estimating the perfect times for drivers to pick up food delivery orders from a number of different restaurants can be one of the most difficult computational problems. Think of it like the Traveling Salesperson NP-hard combinatorial optimization problem: the customer wants food delivered in a timely manner, and the delivery person wants the food ready when they roll up. If the estimates are off by even a tiny bit, then customers are unhappy and delivery people will work elsewhere. Yet ride-sharing service Uber is building a global service, called Uber Eats, that will rely on accurate predictions to succeed. The secret to its success will be machine learning, built on the company's in-house ML platform, nicknamed Michelangelo.


Direction Matters: On Influence-Preserving Graph Summarization and Max-cut Principle for Directed Graphs

arXiv.org Machine Learning

Summarizing large-scale directed graphs into small-scale representations is a useful but less studied problem setting. Conventional clustering approaches, which are based on "Min-Cut"-style criteria, compress both the vertices and edges of the graph into communities, which leads to a loss of directed-edge information. On the other hand, compressing the vertices while preserving the directed-edge information provides a way to learn a small-scale representation of a directed graph. The reconstruction error, which measures how much edge information the summarized graph preserves, can be used to learn such a representation. Compared to the original graphs, the summarized graphs are easier to analyze and can extract group-level features, which is useful for efficient interventions on population behavior. In this paper, we present a model, based on minimizing the reconstruction error under non-negativity constraints, that relates to a "Max-Cut" criterion and simultaneously identifies the compressed nodes and the directed compressed relations between these nodes. A multiplicative update algorithm with column-wise normalization is proposed. We further provide theoretical results on the identifiability of the model and on the convergence of the proposed algorithm. Experiments are conducted to demonstrate the accuracy and robustness of the proposed method.
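
The abstract does not spell out the updates, so the following is only a generic sketch of a multiplicative-update scheme with column-wise normalization for a non-negative reconstruction objective of the form ||A - P S P^T||_F^2, where P assigns nodes to super-nodes and S holds the directed relations between them. The factor names, the gradient-split update rule, and the toy usage are assumptions for illustration, not the paper's algorithm.

```python
import numpy as np

def summarize_digraph(A, k, iters=200, eps=1e-9, seed=0):
    """Sketch: non-negative summarization A ≈ P @ S @ P.T via the standard
    gradient-split multiplicative heuristic (ratio of the negative to the
    positive part of the gradient), with column-wise normalization of P."""
    n = A.shape[0]
    rng = np.random.default_rng(seed)
    P = rng.random((n, k))          # soft assignment of n nodes to k super-nodes
    S = rng.random((k, k))          # directed relations between super-nodes
    for _ in range(iters):
        PtP = P.T @ P
        # Update S: negative gradient part over positive gradient part.
        S *= (P.T @ A @ P) / (PtP @ S @ PtP + eps)
        # Update P likewise.
        num = A @ P @ S.T + A.T @ P @ S
        den = P @ (S @ PtP @ S.T + S.T @ PtP @ S) + eps
        P *= num / den
        # Column-wise normalization keeps each assignment column on a fixed scale.
        P /= (P.sum(axis=0, keepdims=True) + eps)
    return P, S

# Toy usage on a random directed weight matrix (illustrative only).
A = np.random.default_rng(1).random((20, 20))
P, S = summarize_digraph(A, k=3)
```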


Practical Newton-Type Distributed Learning using Gradient Based Approximations

arXiv.org Machine Learning

We study distributed algorithms for expected loss minimization where the datasets are large and have to be stored on different machines. Often we deal with minimizing the average of a set of convex functions, where each function is the empirical risk of the corresponding part of the data. In the distributed setting, where the individual data instances can be accessed only on the local machines, there is a series of rounds of local computation followed by communication among the machines. Since the cost of communication is usually higher than that of local computation, it is important to reduce it as much as possible. However, we should not let this make the local computation so expensive that it becomes a burden in practice. Using second-order methods can make the algorithms converge faster and decrease the amount of communication needed. There have been some successful attempts at developing distributed second-order methods. Although these methods show fast convergence, their local computation is expensive and leaves room for improvement in practical use. In this study we modify an existing approach, DANE (Distributed Approximate NEwton), to improve the computational cost while maintaining accuracy. We tackle this problem by using iterative methods to solve the local subproblems approximately instead of providing exact solutions in each round of communication. We study how different iterative methods affect the behavior of the algorithm and try to provide an appropriate tradeoff between the amount of local computation and the required amount of communication. We demonstrate the practicality of our algorithm and compare it to existing distributed gradient-based methods such as SGD.
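
To illustrate the idea of an inexact local solve, here is a minimal single-process sketch of one DANE-style round in which each machine runs only a few plain gradient steps on its local subproblem instead of solving it exactly. The subproblem form follows the original DANE formulation; the solver choice, parameters, and function names are illustrative assumptions rather than the paper's method.

```python
import numpy as np

def inexact_dane_round(w_t, local_grads, local_grad_fns, mu=1.0, eta=1.0,
                       inner_steps=10, inner_lr=0.1):
    """One DANE-style round with inexact local solves (sketch).

    Each machine i approximately minimizes its local subproblem
        h_i(w) = f_i(w) - (grad f_i(w_t) - eta * grad f(w_t))^T w + (mu/2) ||w - w_t||^2
    with a few gradient steps, and the results are averaged."""
    global_grad = np.mean(local_grads, axis=0)       # communicated once per round
    new_iterates = []
    for g_i, grad_f_i in zip(local_grads, local_grad_fns):
        w = w_t.copy()
        for _ in range(inner_steps):                 # inexact local solve
            grad_h = grad_f_i(w) - (g_i - eta * global_grad) + mu * (w - w_t)
            w -= inner_lr * grad_h
        new_iterates.append(w)
    return np.mean(new_iterates, axis=0)             # second communication: averaging
```

Swapping the inner gradient loop for a different iterative solver, such as conjugate gradient on a quadratic approximation, gives the kind of trade-off between local computation and communication that the abstract describes.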


Stochastic algorithms with geometric step decay converge linearly on sharp functions

arXiv.org Machine Learning

Stochastic (sub)gradient methods require step size schedule tuning to perform well in practice. Classical tuning strategies decay the step size polynomially and lead to optimal sublinear rates on (strongly) convex problems. An alternative schedule, popular in nonconvex optimization, is called geometric step decay and proceeds by halving the step size after every few epochs. In recent work, geometric step decay was shown to improve exponentially upon classical sublinear rates for the class of sharp convex functions. In this work, we ask whether geometric step decay similarly improves stochastic algorithms for the class of sharp nonconvex problems. Such losses feature in modern statistical recovery problems and lead to a new challenge not present in the convex setting: the region of convergence is local, so one must bound the probability of escape. Our main result shows that for a large class of stochastic, sharp, nonsmooth, and nonconvex problems, a geometric step decay schedule endows well-known algorithms with a local linear rate of convergence to global minimizers. This guarantee applies to the stochastic projected subgradient, proximal point, and prox-linear algorithms. As an application of our main result, we analyze two statistical recovery tasks, phase retrieval and blind deconvolution, and match the best known guarantees under Gaussian measurement models while establishing new guarantees under heavy-tailed distributions.
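
As a quick illustration of the schedule itself, here is a minimal sketch of a stochastic subgradient loop with geometric step decay. The initial step size, halving period, and toy problem are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def subgradient_geometric_decay(subgrad, x0, step0=1.0, halve_every=5,
                                epochs=40, iters_per_epoch=100, seed=0):
    """Stochastic subgradient method where the step size is halved
    after every `halve_every` epochs (geometric step decay)."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    step = step0
    for epoch in range(epochs):
        for _ in range(iters_per_epoch):
            x -= step * subgrad(x, rng)
        if (epoch + 1) % halve_every == 0:
            step *= 0.5                      # geometric decay
    return x

# Toy sharp (nonsmooth) loss f(x) = ||x||_1 with noisy subgradients.
noisy_subgrad = lambda x, rng: np.sign(x) + 0.1 * rng.standard_normal(x.shape)
print(subgradient_geometric_decay(noisy_subgrad, np.ones(3)))
```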


Distributed Inexact Successive Convex Approximation ADMM: Analysis-Part I

arXiv.org Machine Learning

In this two-part work, we propose an algorithmic framework for solving non-convex problems whose objective function is the sum of a number of smooth component functions plus a convex (possibly non-smooth) and/or smooth (possibly non-convex) regularization function. The proposed algorithm incorporates ideas from several existing approaches, such as the alternating direction method of multipliers (ADMM), successive convex approximation (SCA), distributed and asynchronous algorithms, and inexact gradient methods. Different from a number of existing approaches, however, the proposed framework is flexible enough to incorporate a class of non-convex objective functions, to allow distributed operation with and without a fusion center, and to include variance-reduced methods as special cases. Remarkably, the proposed algorithms are robust to uncertainties arising from random, deterministic, and adversarial sources. Part I of the paper develops two variants of the algorithm under very mild assumptions and establishes first-order convergence-rate guarantees. The proof developed here allows for generic errors and delays, paving the way for the variance-reduced, asynchronous, and stochastic implementations outlined and evaluated in Part II.
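
To make the SCA ingredient concrete, here is a minimal single-machine sketch: at each step the smooth, possibly non-convex part is replaced by a strongly convex surrogate (its linearization plus a proximal term), and the resulting convex subproblem with an L1 regularizer has a closed-form soft-thresholding solution. The objective, parameters, and names are illustrative assumptions, not the paper's framework.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (closed-form solution of the L1 subproblem)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def sca_l1(grad_f, x0, lam=0.1, tau=2.0, iters=200, step=1.0):
    """Successive convex approximation sketch for min_x f(x) + lam * ||x||_1:
    linearize the smooth f around x_k, add (tau/2) ||y - x_k||^2, solve the
    convex surrogate exactly, then move toward its minimizer."""
    x = np.array(x0, dtype=float)
    for _ in range(iters):
        # argmin_y grad_f(x)^T (y - x) + (tau/2) ||y - x||^2 + lam ||y||_1
        x_hat = soft_threshold(x - grad_f(x) / tau, lam / tau)
        x = x + step * (x_hat - x)          # step toward the surrogate minimizer
    return x

# Hypothetical smooth non-convex term f(x) = sum(x^2 - 0.1 * x^4) near the origin.
grad_f = lambda x: 2 * x - 0.4 * x ** 3
print(sca_l1(grad_f, np.array([0.5, -0.8, 0.2])))
```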