Escaping Saddle Points in Constrained Optimization

In this paper, we study the problem of escaping from saddle points in smooth nonconvex optimization problems subject to a convex set $\mathcal{C}$. We propose a generic framework that yields convergence to a second-order stationary point of the problem, if the convex set $\mathcal{C}$ is simple for a quadratic objective function. Specifically, our results hold if one can find a $\rho$-approximate solution of a quadratic program subject to $\mathcal{C}$ in polynomial time, where $\rho<1$ is a positive constant that depends on the structure of the set $\mathcal{C}$. Under this condition, we show that the sequence of iterates generated by the proposed framework reaches an $(\epsilon,\gamma)$-second-order stationary point (SOSP) in at most $\mathcal{O}(\max\{\epsilon^{-2},\rho^{-3}\gamma^{-3}\})$ iterations. We further characterize the overall complexity of reaching an SOSP when the convex set $\mathcal{C}$ can be written as a set of quadratic constraints and the objective function Hessian has a specific structure over the convex set $\mathcal{C}$.

Loss function for Logistic Regression

If we are doing binary classification with logistic regression, we often use the cross-entropy function as our loss function. However, if we are doing linear regression, we often use squared error as our loss function. Question: Are there any specific reasons for using the cross-entropy function instead of squared error or the classification error in logistic regression? I read somewhere that, if we use squared error for binary classification, the resulting loss function would be non-convex. Is this the only reason, or is there a deeper reason that I am missing?
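The non-convexity mentioned in the question can be checked numerically. The sketch below is an illustration, not part of the original question. It assumes a single training example with label $y = 1$ and treats each loss as a function of the logit $z$ (the input to the sigmoid). The function names and the finite-difference step `h` are choices made for this example. A negative second difference anywhere means the loss is not convex in $z$.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cross_entropy(z, y):
    # Binary cross-entropy of the prediction sigmoid(z) against label y.
    p = sigmoid(z)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def squared_error(z, y):
    # Squared error of the prediction sigmoid(z) against label y.
    return (sigmoid(z) - y) ** 2

def second_difference(loss, z, y, h=1e-3):
    # Central finite-difference estimate of the second derivative in z.
    return (loss(z + h, y) - 2 * loss(z, y) + loss(z - h, y)) / (h * h)

# Logits from -6 to 6 in steps of 0.5 (an arbitrary grid for illustration).
zs = [k / 2 for k in range(-12, 13)]
ce_curvatures = [second_difference(cross_entropy, z, 1) for z in zs]
se_curvatures = [second_difference(squared_error, z, 1) for z in zs]

# Small tolerance absorbs floating-point noise in the finite differences.
ce_is_convex = all(c >= -1e-6 for c in ce_curvatures)
se_is_convex = all(c >= -1e-6 for c in se_curvatures)

print("cross-entropy curvature non-negative on grid:", ce_is_convex)
print("squared-error curvature non-negative on grid:", se_is_convex)
```

On this grid the cross-entropy curvature stays non-negative, while the squared-error curvature turns negative for sufficiently negative logits, that is, where the model is confidently wrong about a positive example.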

AI Notes: Parameter optimization in neural networks - deeplearning.ai

In machine learning, you start by defining a task and a model. The model consists of an architecture and parameters. For a given architecture, the values of the parameters determine how accurately the model performs the task. But how do you find good values? By defining a loss function that evaluates how well the model performs.

#005A Logistic Regression from scratch Master Data Science

In this post we will talk about applying gradient descent on $$m$$ training examples. But first, what is gradient descent? Gradient descent is an iterative optimization algorithm that attempts to find a minimum of a function (the global minimum when the function is convex). It works by calculating the gradient, which gives the direction the model's parameters should move in to reduce errors (the differences between the actual $$y$$ and the predicted $$\hat{y}$$). Now let's remind ourselves what the cost function is.
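As a rough illustration of gradient descent over $$m$$ training examples, here is a minimal sketch in plain Python. It assumes a single input feature, the cross-entropy cost averaged over the examples, and a made-up toy dataset. The names `cost` and `gradient_descent`, the learning rate, and the number of steps are assumptions for this example and are not taken from the post.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cost(w, b, xs, ys):
    # Average cross-entropy between labels y and predictions y_hat.
    m = len(xs)
    total = 0.0
    for x, y in zip(xs, ys):
        y_hat = sigmoid(w * x + b)
        total += -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))
    return total / m

def gradient_descent(xs, ys, learning_rate=0.1, steps=1000):
    m = len(xs)
    w, b = 0.0, 0.0
    for _ in range(steps):
        dw, db = 0.0, 0.0
        # Accumulate the gradient contribution of every training example.
        for x, y in zip(xs, ys):
            error = sigmoid(w * x + b) - y  # y_hat - y
            dw += error * x
            db += error
        # Step against the averaged gradient.
        w -= learning_rate * dw / m
        b -= learning_rate * db / m
    return w, b

# Toy data (hypothetical): larger x tends to mean class 1, with some overlap.
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 1, 0, 1, 1]

initial_cost = cost(0.0, 0.0, xs, ys)
w, b = gradient_descent(xs, ys)
final_cost = cost(w, b, xs, ys)
print("cost before:", initial_cost, "cost after:", final_cost)
```

The inner loop visits each of the $$m$$ examples once per step. A vectorized version would replace it with array operations, but the explicit loop makes the averaging over examples easier to see.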

Solving Non-smooth Constrained Programs with Lower Complexity than $\mathcal{O}(1/\varepsilon)$: A Primal-Dual Homotopy Smoothing Approach

We propose a new primal-dual homotopy smoothing algorithm for a linearly constrained convex program, where neither the primal nor the dual function has to be smooth or strongly convex. The best known iteration complexity for solving such a non-smooth problem is $\mathcal{O}(\varepsilon^{-1})$. This result improves upon the $\mathcal{O}(\varepsilon^{-1})$ convergence time bound achieved by existing distributed optimization algorithms. Simulation experiments also demonstrate the performance of our proposed algorithm. Papers published at the Neural Information Processing Systems Conference.