Optimization
Fast Low-Rank Matrix Learning with Nonconvex Regularization
Yao, Quanming, Kwok, James T., Zhong, Wenliang
Low-rank modeling has a lot of important applications in machine learning, computer vision and social network analysis. While the matrix rank is often approximated by the convex nuclear norm, the use of nonconvex low-rank regularizers has demonstrated better recovery performance. However, the resultant optimization problem is much more challenging. A very recent state-of-the-art is based on the proximal gradient algorithm. However, it requires an expensive full SVD in each proximal step. In this paper, we show that for many commonly-used nonconvex low-rank regularizers, a cutoff can be derived to automatically threshold the singular values obtained from the proximal operator. This allows the use of power method to approximate the SVD efficiently. Besides, the proximal operator can be reduced to that of a much smaller matrix projected onto this leading subspace. Convergence, with a rate of O(1/T) where T is the number of iterations, can be guaranteed. Extensive experiments are performed on matrix completion and robust principal component analysis. The proposed method achieves significant speedup over the state-of-the-art. Moreover, the matrix solution obtained is more accurate and has a lower rank than that of the traditional nuclear norm regularizer.
A simulation-based genetic algorithm approach for remanufacturing process planning and scheduling
We consider integrated process planning and scheduling for remanufacturing. Two potentially conflicting objective functions are considered simultaneously. A simulation-based genetic algorithm approach is developed. Key parameters of the algorithm have been fine-tuned. Extensive computational experiments and evaluations have been performed. Remanufacturing has attracted growing attention in recent years because of its energy-saving and emission-reduction potential.
Convergence rates of sub-sampled Newton methods
Erdogdu, Murat A., Montanari, Andrea
We consider the problem of minimizing a sum of $n$ functions over a convex parameter set $\mathcal{C} \subset \mathbb{R}^p$ where $n\gg p\gg 1$. In this regime, algorithms which utilize sub-sampling techniques are known to be effective. In this paper, we use sub-sampling techniques together with low-rank approximation to design a new randomized batch algorithm which possesses comparable convergence rate to Newton's method, yet has much smaller per-iteration cost. The proposed algorithm is robust in terms of starting point and step size, and enjoys a composite convergence rate, namely, quadratic convergence at start and linear convergence when the iterate is close to the minimizer. We develop its theoretical analysis which also allows us to select near-optimal algorithm parameters. Our theoretical results can be used to obtain convergence rates of previously proposed sub-sampling based algorithms as well. We demonstrate how our results apply to well-known machine learning problems. Lastly, we evaluate the performance of our algorithm on several datasets under various scenarios.
Optimization over Sparse Symmetric Sets via a Nonmonotone Projected Gradient Method
We consider the problem of minimizing a Lipschitz differentiable function over a class of sparse symmetric sets that has wide applications in engineering and science. For this problem, it is known that any accumulation point of the classical projected gradient (PG) method with a constant stepsize $1/L$ satisfies the $L$-stationarity optimality condition that was introduced in [3]. In this paper we introduce a new optimality condition that is stronger than the $L$-stationarity optimality condition. We also propose a nonmonotone projected gradient (NPG) method for this problem by incorporating some support-changing and coordintate-swapping strategies into a projected gradient method with variable stepsizes. It is shown that any accumulation point of NPG satisfies the new optimality condition and moreover it is a coordinatewise stationary point. Under some suitable assumptions, we further show that it is a global or a local minimizer of the problem. Numerical experiments are conducted to compare the performance of PG and NPG. The computational results demonstrate that NPG has substantially better solution quality than PG, and moreover, it is at least comparable to, but sometimes can be much faster than PG in terms of speed.
Optimal prediction for sparse linear models? Lower bounds for coordinate-separable M-estimators
Zhang, Yuchen, Wainwright, Martin J., Jordan, Michael I.
For the problem of high-dimensional sparse linear regression, it is known that an $\ell_0$-based estimator can achieve a $1/n$ "fast" rate on the prediction error without any conditions on the design matrix, whereas in absence of restrictive conditions on the design matrix, popular polynomial-time methods only guarantee the $1/\sqrt{n}$ "slow" rate. In this paper, we show that the slow rate is intrinsic to a broad class of M-estimators. In particular, for estimators based on minimizing a least-squares cost function together with a (possibly non-convex) coordinate-wise separable regularizer, there is always a "bad" local optimum such that the associated prediction error is lower bounded by a constant multiple of $1/\sqrt{n}$. For convex regularizers, this lower bound applies to all global optima. The theory is applicable to many popular estimators, including convex $\ell_1$-based methods as well as M-estimators based on nonconvex regularizers, including the SCAD penalty or the MCP regularizer. In addition, for a broad class of nonconvex regularizers, we show that the bad local optima are very common, in that a broad class of local minimization algorithms with random initialization will typically converge to a bad solution.
Newton-Stein Method: An optimization method for GLMs via Stein's Lemma
We consider the problem of efficiently computing the maximum likelihood estimator in Generalized Linear Models (GLMs) when the number of observations is much larger than the number of coefficients ($n \gg p \gg 1$). In this regime, optimization algorithms can immensely benefit from approximate second order information. We propose an alternative way of constructing the curvature information by formulating it as an estimation problem and applying a Stein-type lemma, which allows further improvements through sub-sampling and eigenvalue thresholding. Our algorithm enjoys fast convergence rates, resembling that of second order methods, with modest per-iteration cost. We provide its convergence analysis for the general case where the rows of the design matrix are samples from a sub-gaussian distribution. We show that the convergence has two phases, a quadratic phase followed by a linear phase. Finally, we empirically demonstrate that our algorithm achieves the highest performance compared to various algorithms on several datasets.
Reinforcement Learning with Parameterized Actions
Masson, Warwick, Ranchod, Pravesh, Konidaris, George
We introduce a model-free algorithm for learning in Markov decision processes with parameterized actions--discrete actions with continuous parameters. At each step the agent must select both which action to use and which parameters to use with that action. We introduce the Q-PAMDP algorithm for learning in these domains, show that it converges to a local optimum, and compare it to direct policy search in the goalscoring and Platform domains.
Barrier Frank-Wolfe for Marginal Inference
Krishnan, Rahul G., Lacoste-Julien, Simon, Sontag, David
We introduce a globally-convergent algorithm for optimizing the tree-reweighted (TRW) variational objective over the marginal polytope. The algorithm is based on the conditional gradient method (Frank-Wolfe) and moves pseudomarginals within the marginal polytope through repeated maximum a posteriori (MAP) calls. This modular structure enables us to leverage black-box MAP solvers (both exact and approximate) for variational inference, and obtains more accurate results than tree-reweighted algorithms that optimize over the local consistency relaxation. Theoretically, we bound the sub-optimality for the proposed algorithm despite the TRW objective having unbounded gradients at the boundary of the marginal polytope. Empirically, we demonstrate the increased quality of results found by tightening the relaxation over the marginal polytope as well as the spanning tree polytope on synthetic and real-world instances.
Sparse Recovery via Partial Regularization: Models, Theory and Algorithms
In the context of sparse recovery, it is known that most of existing regularizers such as $\ell_1$ suffer from some bias incurred by some leading entries (in magnitude) of the associated vector. To neutralize this bias, we propose a class of models with partial regularizers for recovering a sparse solution of a linear system. We show that every local minimizer of these models is sufficiently sparse or the magnitude of all its nonzero entries is above a uniform constant depending only on the data of the linear system. Moreover, for a class of partial regularizers, any global minimizer of these models is a sparsest solution to the linear system. We also establish some sufficient conditions for local or global recovery of the sparsest solution to the linear system, among which one of the conditions is weaker than the best known restricted isometry property (RIP) condition for sparse recovery by $\ell_1$. In addition, a first-order feasible augmented Lagrangian (FAL) method is proposed for solving these models, in which each subproblem is solved by a nonmonotone proximal gradient (NPG) method. Despite the complication of the partial regularizers, we show that each proximal subproblem in NPG can be solved as a certain number of one-dimensional optimization problems, which usually have a closed-form solution. We also show that any accumulation point of the sequence generated by FAL is a first-order stationary point of the models. Numerical results on compressed sensing and sparse logistic regression demonstrate that the proposed models substantially outperform the widely used ones in the literature in terms of solution quality.
Parallel Predictive Entropy Search for Batch Global Optimization of Expensive Objective Functions
Shah, Amar, Ghahramani, Zoubin
We develop parallel predictive entropy search (PPES), a novel algorithm for Bayesian optimization of expensive black-box objective functions. At each iteration, PPES aims to select a batch of points which will maximize the information gain about the global maximizer of the objective. Well known strategies exist for suggesting a single evaluation point based on previous observations, while far fewer are known for selecting batches of points to evaluate in parallel. The few batch selection schemes that have been studied all resort to greedy methods to compute an optimal batch. To the best of our knowledge, PPES is the first non-greedy batch Bayesian optimization strategy. We demonstrate the benefit of this approach in optimization performance on both synthetic and real world applications, including problems in machine learning, rocket science and robotics.