Optimization
Rank Ordering Constraints Elimination with Application for Kernel Learning
Xie, Ying (Anhui University) | Ding, Chris H. Q. (University of Texas at Arlington) | Gong, Yihong (Xian Jiaotong University) | Wu, Zongze (Guangdong University of Technology)
A number of machine learning domains,such as information retrieval, recommender systems, kernel learning, neural network-biological systems etc,deal with importance scores. Very often, there existsome prior knowledge that could help improve the performance.In many cases, these prior knowledge manifest themselves in the rank ordering constraints.These inequality constraints are usually very difficult to deal with in optimization.In this paper, we provide a slack variable transformation methods, which effectively eliminatesthe rank ordering inequality constraints, and thus simplify the learning task significantly.We apply this transformation in kernel learning problem, and also provide an efficient algorithm tosolved the transformed system. On seven datasets,our approach reduces the computational time by orders of magnitudes as compared to the current standardquadratically constrained quadratic programming(QCQP) optimization approach.
Latent Smooth Skeleton Embedding
Wang, Li (University of Illinois at Chicago) | Mao, Qi (HERE Company) | Tsang, Ivor W. (University of Technoloy Sydney)
Existing methods mostly rely on distances (or similarities) In many fields of science and engineering, one is often to model the intrinsic structure of data. They either provide confronted with the problem of dimensionality reduction a similarity matrix as a prior (Belkin and Niyogi 2001; (Burges 2009; Van der Maaten, Postma, and van den Herik Schölkopf, Smola, and Muller 1999), or learn a similarity 2009). The problem aims to extract low-dimensional structures measurement based on a subset of distances in a local from high-dimensional datasets, which are generally region (Elhamifar and Vidal 2011; Saul and Roweis characterized by much fewer degrees of freedom than actual 2003), or directly learn a kernel matrix from data (Weinberger, number of features. Packer, and Saul 2005; Xiao, Sun, and Boyd 2006; In this paper, we are particularly interested in unveiling a Mao and Tsang 2010). These distances become unreliable if smooth skeleton structure in a latent space from data with the data is noisy. Moreover, they lack the ability to model a noise. Figure 1 illustrates an intuitive example in which synthetic smooth skeleton from noisy data. As shown in Figure 1, the data points are drawn from a smooth circle with noises strict distance preservation in maximum variance unfolding in two-dimensional space. It is challenging to recover the (MVU) (Weinberger, Sha, and Saul 2004) fails to capture the circle (Figures 1(c) and 1(d)) from the noisy data without smooth circle from the data (see Figure 1 (b)).
Adaptive Proximal Average Approximation for Composite Convex Minimization
Shen, Li (South China University of Technology) | Liu, Wei (Tencent AI Lab) | Huang, Junzhou (University of Texas at Arlington) | Jiang, Yu-Gang (Fudan University) | Ma, Shiqian (The Chinese University of Hong Kong)
We propose a fast first-order method to solve multi-term nonsmooth composite convex minimization problems by employing a recent proximal average approximation technique and a novel adaptive parameter tuning technique. Thanks to this powerful parameter tuning technique, the proximal gradient step can be performed with a much larger stepsize in the algorithm implementation compared with the prior PA-APG method, which is the core to enable significant improvements in practical performance. Moreover, by choosing the approximation parameter adaptively, the proposed method is shown to enjoy the O(1/k) iteration complexity theoretically without needing any extra computational cost, while the PA-APG method incurs much more iterations for convergence. The preliminary experimental results on overlapping group Lasso and graph-guided fused Lasso problems confirm our theoretic claim well, and indicate that the proposed method is almost five times faster than the state-of-the-art PA-APG method and therefore suitable for higher-precision required optimization.
A General Framework for Sparsity Regularized Feature Selection via Iteratively Reweighted Least Square Minimization
Peng, Hanyang (Institute of Automation, Chinese Academy of Sciences and University of Chinese Academy of Sciences) | Fan, Yong (Perelman School of Medicine, University of Pennsylvania,)
A variety of feature selection methods based on sparsity regularization have been developed with different loss functions and sparse regularization functions. Capitalizing on the existing sparsity regularized feature selection methods, we propose a general sparsity feature selection (GSR-FS) algorithm that optimizes a ℓ 2, r (0 < r ≤ 2) based loss function with a ℓ 2, p -norm (0 < p ≤ 2) sparse regularization. The ℓ 2, r - norm (0 < 𝑟 ≤ 2) based loss function brings flexibility to balance data-fitting and robustness to outliers by tuning its parameter, and the ℓ 2, p -norm (0 < p ≤ 1) based regularization function is able to boost the sparsity for feature selection. To solve the optimization problem with multiple non-smooth and non-convex functions when , we develop an efficient solver under the general umbrella of Iterative Reweighted Least Square (IRLS) algorithms. Our algorithm has been proved to converge with a theoretical convergence order of min(2 – r, 2 – p ) at least . The experimental results have demonstrated that our method could achieve competitive feature selection performance on publicly available datasets compared with state-of-the-art feature selection methods, with reduced computational cost.
Top-k Hierarchical Classification
This paper studies a top-k hierarchical classification problem. In top-k classification, one is allowed to make k predictions and no penalty is incurred if at least one of k predictions is correct. In hierarchical classification, classes form a structured hierarchy, and misclassification costs depend on the relation between the correct class and the incorrect class in the hierarchy. Despite that the fact that both top-k classification and hierarchical classification have gained increasing interests, the two problems have always been studied separately. In this paper, we define a top-k hierarchical loss function using a real world application. We provide the Bayes-optimal solution that minimizes the expected top-k hierarchical misclassification cost. Via numerical experiments, we show that our solution outperforms two baseline methods that address only one of the two issues.
Lifted Inference for Convex Quadratic Programs
Mladenov, Martin ( Technische Universität Dortmund ) | Kleinhans, Leonard (Technische Universität Dortmund) | Kersting, Kristian (Technische Universität Dortmund)
Symmetry is the essential element of lifted inferencethat has recently demonstrated the possibility to perform very efficient inference in highly-connected, but symmetric probabilistic models. This raises the question, whether this holds for optimization problems in general.Here we show that for a large classof optimization methods this is actually the case.Specifically, we introduce the concept of fractionalsymmetries of convex quadratic programs (QPs),which lie at the heart of many AI and machine learning approaches,and exploit it to lift, i.e., to compress QPs.These lifted QPs can then be tackled with the usual optimization toolbox (off-the-shelf solvers, cutting plane algorithms,stochastic gradients etc.). If the original QP exhibitssymmetry, then the lifted one will generallybe more compact, and hence more efficient to solve.
Generalization Error Bounds for Optimization Algorithms via Stability
Meng, Qi (Peking University) | Wang, Yue (Beijing Jiaotong University) | Chen, Wei (Microsoft Research) | Wang, Taifeng (Microsoft Research) | Ma, Zhi-Ming (Chinese Academy of Mathematics and Systems Science) | Liu, Tie-Yan (Microsoft Research)
Many machine learning tasks can be formulated as Regularized Empirical Risk Minimization (R-ERM), and solved by optimization algorithms such as gradient descent (GD), stochastic gradient descent (SGD), and stochastic variance reduction (SVRG). Conventional analysis on these optimization algorithms focuses on their convergence rates during the training process, however, people in the machine learning community may care more about the generalization performance of the learned model on unseen test data. In this paper, we investigate on this issue, by using stability as a tool. In particular, we decompose the generalization error for R-ERM, and derive its upper bound for both convex and nonconvex cases. In convex cases, we prove that the generalization error can be bounded by the convergence rate of the optimization algorithm and the stability of the R-ERM process, both in expectation (in the order of 𝒪(1/ n )+ 𝔼ρ( T )), where ρ( T ) is the convergence error and T is the number of iterations) and in high probability (in the order of 𝒪(log{1/δ / √ n + ρ( T ) with probability 1 – δ). For nonconvex cases, we can also obtain a similar expected generalization error bound. Our theorems indicate that 1) along with the training process, the generalization error will decrease for all the optimization algorithms under our investigation; 2) Comparatively speaking, SVRG has better generalization ability than GD and SGD. We have conducted experiments on both convex and nonconvex problems, and the experimental results verify our theoretical findings.
Semi-Supervised Classifications via Elastic and Robust Embedding
Liu, Yun (University of Texas at Arlington) | Guo, Yiming (Illinois Institute of Technology) | Wang, Hua (Colorado School of Mines) | Nie, Feiping (University of Texas at Arlington) | Huang, Heng (University of Texas at Arlington)
Transductive semi-supervised learning can only predict labels for unlabeled data appearing in training data, and can not predict labels for testing data never appearing in training set. To handle this out-of-sample problem, many inductive methods make a constraint such that the predicted label matrix should be exactly equal to a linear model. In practice, this constraint might be too rigid to capture the manifold structure of data. In this paper, we relax this rigid constraint and propose to use an elastic constraint on the predicted label matrix such that the manifold structure can be better explored. Moreover, since unlabeled data are often very abundant in practice and usually there are some outliers, we use a non-squared loss instead of the traditional squared loss to learn a robust model. The derived problem, although is convex, has so many nonsmooth terms, which make it very challenging to solve. In the paper, we propose an efficient optimization algorithm to solve a more general problem, based on which we find the optimal solution to the derived problem.
A Two-Stage Approach for Learning a Sparse Model with Sharp Excess Risk Analysis
Li, Zhe (The University of Iowa) | Yang, Tianbao (The University of Iowa) | Zhang, Lijun (Nanjing University) | Jin, Rong (Alibaba Group)
This paper aims to provide a sharp excess risk guarantee for learning a sparse linear model without any assumptions about the strong convexity of the expected loss and the sparsity of the optimal solution in hindsight. Given a target level ε for the excess risk, an interesting question to ask is how many examples and how large the support set of the solution are enough for learning a good model with the target excess risk. To answer these questions, we present a two-stage algorithm that (i) in the first stage an epoch based stochastic optimization algorithm is exploited with an established O(1/ε) bound on the sample complexity; and (ii) in the second stage a distribution dependent randomized sparsification is presented with an O(1/ε) bound on the sparsity (referred to as support complexity) of the resulting model. Compared to previous works, our contributions lie at (i) we reduce the order of the sample complexity from O(1/ε2) to O(1/ε) without the strong convexity assumption; and (ii) we reduce the constant in O(1/ε) for the sparsity by exploring the distribution dependent sampling.
Multivariate Hawkes Processes for Large-Scale Inference
Lemonnier, Rémi (Université Paris-Saclay) | Scaman, Kevin (Université Paris-Saclay and Microsoft Research) | Kalogeratos, Argyris (Université Paris-Saclay)
In this paper, we present a framework for fitting multivariate Hawkes processes for large-scale problems, both in the number of events in the observed history n and the number of event types d (i.e. dimensions). The proposed Scalable Low-Rank Hawkes Process (SLRHP) framework introduces a low-rank approximation of the kernel matrix that allows to perform the nonparametric learning of the d 2 triggering kernels in at most O ( ndr 2 ) operations, where r is the rank of the approximation ( r ≪ d, n ). This comes as a major improvement to the existing state-of-the-art inference algorithms that require O ( nd 2 ) operations. Furthermore, the low-rank approximation allows SLRHP to learn representative patterns of interaction between event types, which is usually valuable for the analysis of complex processes in real-world networks.