Goto

Collaborating Authors

 Optimization


A Robust Multi-Batch L-BFGS Method for Machine Learning

arXiv.org Machine Learning

This paper describes an implementation of the L-BFGS method designed to deal with two adversarial situations. The first occurs in distributed computing environments where some of the computational nodes devoted to the evaluation of the function and gradient are unable to return results on time. A similar challenge occurs in a multi-batch approach in which the data points used to compute function and gradients are purposely changed at each iteration to accelerate the learning process. Difficulties arise because L-BFGS employs gradient differences to update the Hessian approximations, and when these gradients are computed using different data points the updating process can be unstable. This paper shows how to perform stable quasi-Newton updating in the multi-batch setting, studies the convergence properties for both convex and nonconvex functions, and illustrates the behavior of the algorithm in a distributed computing platform on binary classification logistic regression and neural network training problems that arise in machine learning.


Stochastic Alternating Direction Method of Multipliers with Variance Reduction for Nonconvex Optimization

arXiv.org Machine Learning

In the paper, we study the stochastic alternating direction method of multipliers (ADMM) for the nonconvex optimizations, and propose three classes of the nonconvex stochastic ADMM with variance reduction, based on different reduced variance stochastic gradients. Specifically, the first class called the nonconvex stochastic variance reduced gradient ADMM (SVRG-ADMM), uses a multi-stage scheme to progressively reduce the variance of stochastic gradients. The second is the nonconvex stochastic average gradient ADMM (SAG-ADMM), which additionally uses the old gradients estimated in the previous iteration. The third called SAGA-ADMM is an extension of the SAG-ADMM method. Moreover, under some mild conditions, we establish the iteration complexity bound of $O(1/\epsilon)$ of the proposed methods to obtain an $\epsilon$-stationary solution of the nonconvex optimizations. In particular, we provide a general framework to analyze the iteration complexity of these nonconvex stochastic ADMM methods with variance reduction. Finally, some numerical experiments demonstrate the effectiveness of our methods.


Accelerated Stochastic Mirror Descent Algorithms For Composite Non-strongly Convex Optimization

arXiv.org Machine Learning

We consider the problem of minimizing the sum of an average function of a large number of smooth convex components and a general, possibly non-differentiable, convex function. Although many methods have been proposed to solve this problem with the assumption that the sum is strongly convex, few methods support the non-strongly convex cases. Adding a small quadratic regularization is a common trick used to tackle non-strongly convex problems; however, it may worsen the quality of solutions or weaken the performance of the algorithms. Avoiding this trick, we propose a new accelerated stochastic mirror descent method for solving the problem without the strongly convex assumption. Our method extends the deterministic accelerated proximal gradient methods of Paul Tseng and can be applied even when proximal points are computed inexactly. Our direct algorithms can be proven to achieve the optimal convergence rate $O(\frac{1}{k^2})$ under a suitable choice of the errors in calculating the proximal points. We also propose a scheme for solving the problem when the component functions are non-smooth and finally apply the new algorithms to a class of composite convex concave optimization problems.


High-Dimensional Regression with Binary Coefficients. Estimating Squared Error and a Phase Transition

arXiv.org Machine Learning

We consider a sparse linear regression model Y=X\beta^{*}+W where X has a Gaussian entries, W is the noise vector with mean zero Gaussian entries, and \beta^{*} is a binary vector with support size (sparsity) k. Using a novel conditional second moment method we obtain a tight up to a multiplicative constant approximation of the optimal squared error \min_{\beta}\|Y-X\beta\|_{2}, where the minimization is over all k-sparse binary vectors \beta. The approximation reveals interesting structural properties of the underlying regression problem. In particular, a) We establish that n^*=2k\log p/\log (2k/\sigma^{2}+1) is a phase transition point with the following "all-or-nothing" property. When n exceeds n^{*}, (2k)^{-1}\|\beta_{2}-\beta^*\|_0\approx 0, and when n is below n^{*}, (2k)^{-1}\|\beta_{2}-\beta^*\|_0\approx 1, where \beta_2 is the optimal solution achieving the smallest squared error. With this we prove that n^{*} is the asymptotic threshold for recovering \beta^* information theoretically. b) We compute the squared error for an intermediate problem \min_{\beta}\|Y-X\beta\|_{2} where minimization is restricted to vectors \beta with \|\beta-\beta^{*}\|_0=2k \zeta, for \zeta\in [0,1]. We show that a lower bound part \Gamma(\zeta) of the estimate, which corresponds to the estimate based on the first moment method, undergoes a phase transition at three different thresholds, namely n_{\text{inf,1}}=\sigma^2\log p, which is information theoretic bound for recovering \beta^* when k=1 and \sigma is large, then at n^{*} and finally at n_{\text{LASSO/CS}}. c) We establish a certain Overlap Gap Property (OGP) on the space of all binary vectors \beta when n\le ck\log p for sufficiently small constant c. We conjecture that OGP is the source of algorithmic hardness of solving the minimization problem \min_{\beta}\|Y-X\beta\|_{2} in the regime n


Towards Decoding as Continuous Optimization in Neural Machine Translation

arXiv.org Artificial Intelligence

We propose a novel decoding approach for neural machine translation (NMT) based on continuous optimisation. We convert decoding - basically a discrete optimization problem - into a continuous optimization problem. The resulting constrained continuous optimisation problem is then tackled using gradient-based methods. Our powerful decoding framework enables decoding intractable models such as the intersection of left-to-right and right-to-left (bidirectional) as well as source-to-target and target-to-source (bilingual) NMT models. Our empirical results show that our decoding framework is effective, and leads to substantial improvements in translations generated from the intersected models where the typical greedy or beam search is not feasible. We also compare our framework against reranking, and analyse its advantages and disadvantages.


Fast Algorithms for Demixing Sparse Signals from Nonlinear Observations

arXiv.org Machine Learning

We study the problem of demixing a pair of sparse signals from noisy, nonlinear observations of their superposition. Mathematically, we consider a nonlinear signal observation model, $y_i = g(a_i^Tx) + e_i, \ i=1,\ldots,m$, where $x = \Phi w+\Psi z$ denotes the superposition signal, $\Phi$ and $\Psi$ are orthonormal bases in $\mathbb{R}^n$, and $w, z\in\mathbb{R}^n$ are sparse coefficient vectors of the constituent signals, and $e_i$ represents the noise. Moreover, $g$ represents a nonlinear link function, and $a_i\in\mathbb{R}^n$ is the $i$-th row of the measurement matrix, $A\in\mathbb{R}^{m\times n}$. Problems of this nature arise in several applications ranging from astronomy, computer vision, and machine learning. In this paper, we make some concrete algorithmic progress for the above demixing problem. Specifically, we consider two scenarios: (i) the case when the demixing procedure has no knowledge of the link function, and (ii) the case when the demixing algorithm has perfect knowledge of the link function. In both cases, we provide fast algorithms for recovery of the constituents $w$ and $z$ from the observations. Moreover, we support these algorithms with a rigorous theoretical analysis, and derive (nearly) tight upper bounds on the sample complexity of the proposed algorithms for achieving stable recovery of the component signals. We also provide a range of numerical simulations to illustrate the performance of the proposed algorithms on both real and synthetic signals and images.


The inflation technique solves completely the classical inference problem

arXiv.org Machine Learning

The causal inference problem consists in determining whether a probability distribution over a set of observed variables is compatible with a given causal structure. In [arXiv:1609.00672], one of us introduced a hierarchy of necessary linear programming constraints which all the observed distributions compatible with the considered causal structure must satisfy. In this work, we prove that the inflation hierarchy is complete, i.e., any distribution of the observed variables which does not admit a realization within the considered causal structure will fail one of the inflation tests. More quantitatively, we show that any distribution of measurable events satisfying the $n^{th}$ inflation test is $O\left(\frac{1}{\sqrt{n}}\right)$-close in Euclidean norm to a distribution realizable within the given causal structure. In addition, we show that the corresponding $n^{th}$-order relaxation of the dual problem consisting in maximizing a $k^{th}$ degree polynomial on the observed variables is $O\left(\frac{k^2}{n}\right)$-close to the optimal solution.


Sparsity by Worst-Case Penalties

arXiv.org Machine Learning

This paper proposes a new interpretation of sparse penalties such as the elastic-net and the group-lasso. Beyond providing a new viewpoint on these penalization schemes, our approach results in a unified optimization strategy. Our experiments demonstrate that this strategy, implemented on the elastic-net, is computationally extremely efficient for small to medium size problems. Our accompanying software solves problems very accurately, at machine precision, in the time required to get a rough estimate with competing state-of-the-art algorithms. We illustrate on real and artificial datasets that this accuracy is required to for the correctness of the support of the solution, which is an important element for the interpretability of sparsity-inducing penalties.


Adaptive ADMM with Spectral Penalty Parameter Selection

arXiv.org Artificial Intelligence

The alternating direction method of multipliers (ADMM) is a versatile tool for solving a wide range of constrained optimization problems, with differentiable or non-differentiable objective functions. Unfortunately, its performance is highly sensitive to a penalty parameter, which makes ADMM often unreliable and hard to automate for a non-expert user. We tackle this weakness of ADMM by proposing a method to adaptively tune the penalty parameters to achieve fast convergence. The resulting adaptive ADMM (AADMM) algorithm, inspired by the successful Barzilai-Borwein spectral method for gradient descent, yields fast convergence and relative insensitivity to the initial stepsize and problem scaling.


Adopting the Cascade Model in Ad Auctions: Efficiency Bounds and Truthful Algorithmic Mechanisms

Journal of Artificial Intelligence Research

Sponsored Search Auctions (SSAs) are one of the most successful applications of microeconomic mechanisms, with a revenue of about $72 billion in the US alone in 2016. However, the problem of designing the best economic mechanism for sponsored search auctions is far from being solved, and, given the amount at stake, it is no surprise that it has received growing attention over the past few years. The most common auction mechanism for SSAs is the Generalized Second Price (GSP). However, the GSP is known not to be truthful: the agents participating in the auction might have an incentive to report false values, generating economic inefficiency and suboptimal revenues in turn. Superior, efficient truthful mechanisms, such as the Vickrey-Clarke-Groves (VCG) auction, are well known in the literature. However, while the VCG auction is currently adopted for the strictly related scenario of contextual advertising, e.g., by Google and Facebook, companies are reluctant to extend it to SSAs, fearing prohibitive switching costs. Other than truthfulness, two issues are of paramount importance in designing effective SSAs. First, the choice of the user model; not only does an accurate user model better target ads to users, it also is a critical factor in reducing the inefficiency of the mechanism. Often an antagonist to this, the second issue is the running time of the mechanism, given the performance pressure these mechanisms undertake in real-world applications. In our work, we argue in favor of adopting the VCG mechanism based on the cascade model with ad/position externalities (APDC-VCG). Our study includes both the derivation of inefficiency bounds and the design and the experimental evaluation of exact and approximate algorithms.