Goto

Collaborating Authors

 Optimization


Rank Ordering Constraints Elimination with Application for Kernel Learning

AAAI Conferences

A number of machine learning domains,such as information retrieval, recommender systems, kernel learning, neural network-biological systems etc,deal with importance scores. Very often, there existsome prior knowledge that could help improve the performance.In many cases, these prior knowledge manifest themselves in the rank ordering constraints.These inequality constraints are usually very difficult to deal with in optimization.In this paper, we provide a slack variable transformation methods, which effectively eliminatesthe rank ordering inequality constraints, and thus simplify the learning task significantly.We apply this transformation in kernel learning problem, and also provide an efficient algorithm tosolved the transformed system. On seven datasets,our approach reduces the computational time by orders of magnitudes as compared to the current standardquadratically constrained quadratic programming(QCQP) optimization approach.


Latent Smooth Skeleton Embedding

AAAI Conferences

Existing methods mostly rely on distances (or similarities) In many fields of science and engineering, one is often to model the intrinsic structure of data. They either provide confronted with the problem of dimensionality reduction a similarity matrix as a prior (Belkin and Niyogi 2001; (Burges 2009; Van der Maaten, Postma, and van den Herik Schölkopf, Smola, and Muller 1999), or learn a similarity 2009). The problem aims to extract low-dimensional structures measurement based on a subset of distances in a local from high-dimensional datasets, which are generally region (Elhamifar and Vidal 2011; Saul and Roweis characterized by much fewer degrees of freedom than actual 2003), or directly learn a kernel matrix from data (Weinberger, number of features. Packer, and Saul 2005; Xiao, Sun, and Boyd 2006; In this paper, we are particularly interested in unveiling a Mao and Tsang 2010). These distances become unreliable if smooth skeleton structure in a latent space from data with the data is noisy. Moreover, they lack the ability to model a noise. Figure 1 illustrates an intuitive example in which synthetic smooth skeleton from noisy data. As shown in Figure 1, the data points are drawn from a smooth circle with noises strict distance preservation in maximum variance unfolding in two-dimensional space. It is challenging to recover the (MVU) (Weinberger, Sha, and Saul 2004) fails to capture the circle (Figures 1(c) and 1(d)) from the noisy data without smooth circle from the data (see Figure 1 (b)).


Adaptive Proximal Average Approximation for Composite Convex Minimization

AAAI Conferences

We propose a fast first-order method to solve multi-term nonsmooth composite convex minimization problems by employing a recent proximal average approximation technique and a novel adaptive parameter tuning technique. Thanks to this powerful parameter tuning technique, the proximal gradient step can be performed with a much larger stepsize in the algorithm implementation compared with the prior PA-APG method, which is the core to enable significant improvements in practical performance. Moreover, by choosing the approximation parameter adaptively, the proposed method is shown to enjoy the O(1/k) iteration complexity theoretically without needing any extra computational cost, while the PA-APG method incurs much more iterations for convergence. The preliminary experimental results on overlapping group Lasso and graph-guided fused Lasso problems confirm our theoretic claim well, and indicate that the proposed method is almost five times faster than the state-of-the-art PA-APG method and therefore suitable for higher-precision required optimization.


A General Framework for Sparsity Regularized Feature Selection via Iteratively Reweighted Least Square Minimization

AAAI Conferences

A variety of feature selection methods based on sparsity regularization have been developed with different loss functions and sparse regularization functions. Capitalizing on the existing sparsity regularized feature selection methods, we propose a general sparsity feature selection (GSR-FS) algorithm that optimizes a ℓ 2, r (0 <  r ≤ 2) based loss function with a ℓ 2, p -norm (0 < p ≤ 2) sparse regularization. The ℓ 2, r - norm (0 < 𝑟 ≤ 2) based loss function brings flexibility to balance data-fitting and robustness to outliers by tuning its parameter, and the ℓ 2, p -norm (0 < p ≤ 1) based regularization function is able to boost the sparsity for feature selection. To solve the optimization problem with multiple non-smooth and non-convex functions when , we develop an efficient solver under the general umbrella of Iterative Reweighted Least Square (IRLS) algorithms. Our algorithm has been proved to converge with a theoretical convergence order of min(2 – r, 2 – p ) at least . The experimental results have demonstrated that our method could achieve competitive feature selection performance on publicly available datasets compared with state-of-the-art feature selection methods, with reduced computational cost.


Top-k Hierarchical Classification

AAAI Conferences

This paper studies a top-k hierarchical classification problem. In top-k classification, one is allowed to make k predictions and no penalty is incurred if at least one of k predictions is correct. In hierarchical classification, classes form a structured hierarchy, and misclassification costs depend on the relation between the correct class and the incorrect class in the hierarchy. Despite that the fact that both top-k classification and hierarchical classification have gained increasing interests, the two problems have always been studied separately. In this paper, we define a top-k hierarchical loss function using a real world application. We provide the Bayes-optimal solution that minimizes the expected top-k hierarchical misclassification cost. Via numerical experiments, we show that our solution outperforms two baseline methods that address only one of the two issues.


Lifted Inference for Convex Quadratic Programs

AAAI Conferences

Symmetry is the essential element of lifted inferencethat has recently demonstrated the possibility to perform very efficient inference in highly-connected, but symmetric probabilistic models. This raises the question, whether this holds for optimization problems in general.Here we show that for a large classof optimization methods this is actually the case.Specifically, we introduce the concept of fractionalsymmetries of convex quadratic programs (QPs),which lie at the heart of many AI and machine learning approaches,and exploit it to lift, i.e., to compress QPs.These lifted QPs can then be tackled with the usual optimization toolbox (off-the-shelf solvers, cutting plane algorithms,stochastic gradients etc.). If the original QP exhibitssymmetry, then the lifted one will generallybe more compact, and hence more efficient to solve.


Generalization Error Bounds for Optimization Algorithms via Stability

AAAI Conferences

Many machine learning tasks can be formulated as Regularized Empirical Risk Minimization (R-ERM), and solved by optimization algorithms such as gradient descent (GD), stochastic gradient descent (SGD), and stochastic variance reduction (SVRG). Conventional analysis on these optimization algorithms focuses on their convergence rates during the training process, however, people in the machine learning community may care more about the generalization performance of the learned model on unseen test data. In this paper, we investigate on this issue, by using stability as a tool. In particular, we decompose the generalization error for R-ERM, and derive its upper bound for both convex and nonconvex cases. In convex cases, we prove that the generalization error can be bounded by the convergence rate of the optimization algorithm and the stability of the R-ERM process, both in expectation (in the order of 𝒪(1/ n )+ 𝔼ρ( T )), where ρ( T ) is the convergence error and T is the number of iterations) and in high probability (in the order of 𝒪(log{1/δ / √ n + ρ( T ) with probability 1 – δ). For nonconvex cases, we can also obtain a similar expected generalization error bound. Our theorems indicate that 1) along with the training process, the generalization error will decrease for all the optimization algorithms under our investigation; 2) Comparatively speaking, SVRG has better generalization ability than GD and SGD. We have conducted experiments on both convex and nonconvex problems, and the experimental results verify our theoretical findings.


Semi-Supervised Classifications via Elastic and Robust Embedding

AAAI Conferences

Transductive semi-supervised learning can only predict labels for unlabeled data appearing in training data, and can not predict labels for testing data never appearing in training set. To handle this out-of-sample problem, many inductive methods make a constraint such that the predicted label matrix should be exactly equal to a linear model. In practice, this constraint might be too rigid to capture the manifold structure of data. In this paper, we relax this rigid constraint and propose to use an elastic constraint on the predicted label matrix such that the manifold structure can be better explored. Moreover, since unlabeled data are often very abundant in practice and usually there are some outliers, we use a non-squared loss instead of the traditional squared loss to learn a robust model. The derived problem, although is convex, has so many nonsmooth terms, which make it very challenging to solve. In the paper, we propose an efficient optimization algorithm to solve a more general problem, based on which we find the optimal solution to the derived problem.


A Two-Stage Approach for Learning a Sparse Model with Sharp Excess Risk Analysis

AAAI Conferences

This paper aims to provide a sharp excess risk guarantee for learning a sparse linear model without any assumptions about the strong convexity of the expected loss and the sparsity of the optimal solution in hindsight. Given a target level ε for the excess risk, an interesting question to ask is how many examples and how large the support set of the solution are enough for learning a good model with the target excess risk. To answer these questions, we present a two-stage algorithm that (i) in the first stage an epoch based stochastic optimization algorithm is exploited with an established O(1/ε) bound on the sample complexity; and (ii) in the second stage a distribution dependent randomized sparsification is presented with an O(1/ε) bound on the sparsity (referred to as support complexity) of the resulting model. Compared to previous works, our contributions lie at (i) we reduce the order of the sample complexity from O(1/ε2) to O(1/ε) without the strong convexity assumption; and (ii) we reduce the constant in O(1/ε) for the sparsity by exploring the distribution dependent sampling.


Multivariate Hawkes Processes for Large-Scale Inference

AAAI Conferences

In this paper, we present a framework for fitting multivariate Hawkes processes for large-scale problems, both in the number of events in the observed history n and the number of event types d (i.e. dimensions). The proposed Scalable Low-Rank Hawkes Process (SLRHP) framework introduces a low-rank approximation of the kernel matrix that allows to perform the nonparametric learning of the d 2 triggering kernels in at most O ( ndr 2 ) operations, where r is the rank of the approximation ( r ≪ d, n ). This comes as a major improvement to the existing state-of-the-art inference algorithms that require O ( nd 2 ) operations. Furthermore, the low-rank approximation allows SLRHP to learn representative patterns of interaction between event types, which is usually valuable for the analysis of complex processes in real-world networks.