
Semi-Proximal Mirror-Prox for Nonsmooth Composite Minimization

Neural Information Processing Systems

We propose a new first-order optimization algorithm to solve high-dimensional non-smooth composite minimization problems. Typical examples of such problems have an objective that decomposes into a non-smooth empirical risk part and a non-smooth regularization penalty. The proposed algorithm, called Semi-Proximal Mirror-Prox, leverages the saddle point representation of one part of the objective while handling the other part via linear minimization over the domain. The algorithm stands in contrast to more classical proximal gradient algorithms with smoothing, which require the computation of a proximal operator at each iteration and can therefore be impractical for high-dimensional problems. We establish the theoretical convergence rate of Semi-Proximal Mirror-Prox, which exhibits optimal complexity bounds.
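To make the "saddle point" ingredient concrete, here is a minimal sketch of the extragradient step (the Euclidean special case of Mirror-Prox) on a toy bilinear problem min_x max_y x^T A y over unit Euclidean balls. The problem, step size, and function names are illustrative assumptions, not the paper's algorithm.

```python
import numpy as np

def project_ball(v, radius=1.0):
    norm = np.linalg.norm(v)
    return v if norm <= radius else v * (radius / norm)

def extragradient(A, x0, y0, steps=500, eta=0.1):
    x, y = x0.copy(), y0.copy()
    for _ in range(steps):
        # look-ahead step using gradients at the current point
        xh = project_ball(x - eta * (A @ y))
        yh = project_ball(y + eta * (A.T @ x))
        # actual update using gradients evaluated at the look-ahead point
        x = project_ball(x - eta * (A @ yh))
        y = project_ball(y + eta * (A.T @ xh))
    return x, y

x, y = extragradient(np.eye(2), np.array([1.0, 0.0]), np.array([0.0, 1.0]))
# the iterates spiral in toward the saddle point (0, 0) of x^T y
```

The look-ahead (extrapolation) step is what distinguishes Mirror-Prox-type methods from plain gradient descent-ascent, which would cycle on this bilinear example instead of converging.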


Conditional Gradients for the Approximate Vanishing Ideal

Wirth, Elias, Pokutta, Sebastian

arXiv.org Artificial Intelligence

The vanishing ideal of a set of points $X\subseteq \mathbb{R}^n$ is the set of polynomials that evaluate to $0$ over all points $\mathbf{x} \in X$ and admits an efficient representation by a finite set of polynomials called generators. To accommodate the noise in the data set, we introduce the pairwise conditional gradients approximate vanishing ideal algorithm (PCGAVI) that constructs a set of generators of the approximate vanishing ideal. The constructed generators capture polynomial structures in data and give rise to a feature map that can, for example, be used in combination with a linear classifier for supervised learning. In PCGAVI, we construct the set of generators by solving constrained convex optimization problems with the pairwise conditional gradients algorithm. Thus, PCGAVI not only constructs few but also sparse generators, making the corresponding feature transformation robust and compact. Furthermore, we derive several learning guarantees for PCGAVI that make the algorithm theoretically better motivated than related generator-constructing methods.
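The following sketch illustrates the feature-map idea (not the PCGAVI algorithm itself): generators are represented as coefficient vectors over a fixed monomial basis, and their absolute values on the data serve as features. The degree-2 basis and the circle example are illustrative assumptions.

```python
import numpy as np

def monomials_deg2(X):
    # basis [1, x1, x2, x1^2, x1*x2, x2^2] for 2-d points (assumed for this sketch)
    x1, x2 = X[:, 0], X[:, 1]
    return np.stack([np.ones_like(x1), x1, x2, x1**2, x1 * x2, x2**2], axis=1)

def feature_map(X, generators):
    # each row of `generators` is one polynomial's coefficient vector;
    # an (approximately) vanishing generator yields near-zero features on X
    return np.abs(monomials_deg2(X) @ generators.T)

# points on the unit circle: g(x) = x1^2 + x2^2 - 1 vanishes on all of them
theta = np.linspace(0, 2 * np.pi, 8, endpoint=False)
X = np.stack([np.cos(theta), np.sin(theta)], axis=1)
g = np.array([[-1.0, 0, 0, 1.0, 0, 1.0]])  # coefficients of -1 + x1^2 + x2^2
print(feature_map(X, g).max())  # ≈ 0: the generator vanishes on X
```

A linear classifier on such features can separate classes whose points lie on different algebraic varieties, which is the role the constructed generators play in the supervised pipeline described above.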


Constrained Stochastic Nonconvex Optimization with State-dependent Markov Data

Roy, Abhishek, Balasubramanian, Krishnakumar, Ghadimi, Saeed

arXiv.org Artificial Intelligence

We study stochastic optimization algorithms for constrained nonconvex stochastic optimization problems with Markovian data. In particular, we focus on the case when the transition kernel of the Markov chain is state-dependent. Such stochastic optimization problems arise in various machine learning problems including strategic classification and reinforcement learning. For this problem, we study both projection-based and projection-free algorithms. In both cases, we establish that the number of calls to the stochastic first-order oracle to obtain an appropriately defined $\epsilon$-stationary point is of the order $\mathcal{O}(1/\epsilon^{2.5})$. In the projection-free setting we additionally establish that the number of calls to the linear minimization oracle is of order $\mathcal{O}(1/\epsilon^{5.5})$. We also empirically demonstrate the performance of our algorithm on the problem of strategic classification with neural networks.
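A toy sketch of the projection-based variant: projected stochastic gradient steps where samples come from a two-state Markov chain rather than i.i.d. draws (state-independent transitions here, for simplicity). The chain, objective, and step-size schedule are all illustrative assumptions, not the paper's algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
P = np.array([[0.9, 0.1], [0.2, 0.8]])   # Markov transition kernel
means = np.array([-1.0, 2.0])            # per-state sample means
# stationary distribution is (2/3, 1/3), so the stationary mean is 0

def project_interval(w, radius=5.0):
    # projection onto the feasible set, here the interval [-radius, radius]
    return max(-radius, min(radius, w))

w, state = 0.0, 0
for t in range(1, 20001):
    state = rng.choice(2, p=P[state])        # Markov data stream
    sample = means[state] + rng.normal()
    grad = w - sample                         # gradient of 0.5 * (w - sample)^2
    w = project_interval(w - grad / np.sqrt(t))  # projected stochastic step
# w drifts toward 0, the minimizer under the chain's stationary distribution
```

The point of the sketch is that gradients are computed on correlated, non-i.i.d. samples; the analysis in the paper quantifies the extra oracle calls this correlation costs.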


Second-order Conditional Gradients

Carderera, Alejandro, Pokutta, Sebastian

arXiv.org Machine Learning

An immensely powerful approach when $X = \mathbb{R}^n$ is to construct a second-order approximation to $f(x)$ at the current iterate using first- and second-order information, denoted by $\hat{f}(x)$, and move in the direction that minimizes this approximation, giving rise to the family of methods known as Newton methods (Kantorovich, 1948). A damped variant of Newton's method applied to the minimization of a self-concordant function converges globally and exhibits quadratic local convergence when the iterates are close enough to the optimum (Nesterov & Nemirovskii, 1994). The global convergence of this method also extends to strongly convex and smooth functions (Nesterov & Nemirovskii, 1994; Nesterov, 2013). Using a cubic-regularized version of Newton's method, global convergence can be extended to a broader class of functions than the self-concordant or strongly convex and smooth ones (Nesterov & Polyak, 2006). When $X \subset \mathbb{R}^n$ is a convex set, one can use a constrained analog of these methods (Levitin & Polyak, 1966), in which a quadratic approximation to the function is minimized over $X$ at each iteration.
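A minimal sketch of that constrained analog: at each outer iterate, the local quadratic model of $f$ is minimized over the feasible set $X$, here via Frank-Wolfe steps, since linear minimization over a box is cheap. The toy objective and all parameters are illustrative assumptions, not the paper's method.

```python
import numpy as np

c = np.array([2.0, -3.0])   # f(x) = 0.5 * ||x - c||^2, unconstrained minimizer outside X
H = np.eye(2)               # Hessian of f (constant for this quadratic toy objective)
lo, hi = -1.0, 1.0          # feasible set X = [lo, hi]^2

def lmo_box(g):
    # linear minimization oracle: argmin over X of <g, s>
    return np.where(g > 0, lo, hi)

x = np.zeros(2)
for _ in range(10):                        # outer "Newton" iterations
    y = x.copy()
    for k in range(25):                    # inner Frank-Wolfe on the quadratic model
        model_grad = (x - c) + H @ (y - x)  # gradient of the model at y
        s = lmo_box(model_grad)
        y = y + (2.0 / (k + 2)) * (s - y)
    x = y
# x ends at [1, -1], the minimizer of f over the box
```

Because only a linear minimization oracle over $X$ is needed, the inner loop avoids projections entirely, which is the defining feature of conditional gradient methods.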


Multi-output Polynomial Networks and Factorization Machines

Blondel, Mathieu, Niculae, Vlad, Otsuka, Takuma, Ueda, Naonori

Neural Information Processing Systems

Factorization machines and polynomial networks are supervised polynomial models based on an efficient low-rank decomposition. We extend these models to the multi-output setting, i.e., for learning vector-valued functions, with application to multi-class or multi-task problems. We cast this as the problem of learning a 3-way tensor whose slices share a common basis and propose a convex formulation of that problem. We then develop an efficient conditional gradient algorithm and prove its global convergence, despite the fact that it involves a non-convex basis selection step. On classification tasks, we show that our algorithm achieves excellent accuracy with much sparser models than existing methods. On recommendation system tasks, we show how to combine our algorithm with a reduction from ordinal regression to multi-output classification and show that the resulting algorithm outperforms simple baselines in terms of ranking accuracy.
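The algorithmic workhorse above is a conditional gradient (Frank-Wolfe) method, whose linear minimization step adds a single "atom" per iteration and therefore yields sparse iterates. A toy illustration on least squares over an $\ell_1$ ball, where the atoms are signed coordinates (the paper's actual atoms are shared rank-one basis vectors of a 3-way tensor, which this sketch does not attempt to reproduce):

```python
import numpy as np

def frank_wolfe_l1(A, b, radius=1.0, steps=100):
    x = np.zeros(A.shape[1])
    for k in range(steps):
        g = A.T @ (A @ x - b)            # gradient of 0.5 * ||Ax - b||^2
        i = int(np.argmax(np.abs(g)))    # LMO over the l1 ball selects one coordinate
        s = np.zeros_like(x)
        s[i] = -radius * np.sign(g[i])   # the chosen signed atom
        x = x + (2.0 / (k + 2)) * (s - x)
    return x

x = frank_wolfe_l1(np.eye(5), np.array([3.0, 0.0, 0.0, 0.0, 0.0]))
# only one coordinate is ever activated: x converges to [1, 0, 0, 0, 0]
```

Since each iteration touches at most one new atom, the number of nonzero components after $k$ steps is at most $k$, which is the mechanism behind the "much sparser models" claim.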

