Optimization
Derivative-Free Optimization via Classification
Yu, Yang (Nanjing University) | Qian, Hong (Nanjing University) | Hu, Yi-Qi (Nanjing University)
Many randomized heuristic derivative-free optimization methods share a framework that iteratively learns a model for promising search areas and samples solutions from the model. This paper studies a particular setting of such framework, where the model is implemented by a classification model discriminating good solutions from bad ones. This setting allows a general theoretical characterization, where critical factors to the optimization are discovered. We also prove that optimization problems with Local Lipschitz continuity can be solved in polynomial time by proper configurations of this framework. Following the critical factors, we propose the randomized coordinate shrinking classification algorithm to learn the model, forming the RACOS algorithm, for optimization in continuous and discrete domains. Experiments on the testing functions as well as on the machine learning tasks including spectral clustering and classification with Ramp loss demonstrate the effectiveness of RACOS.
Constrained Submodular Minimization for Missing Labels and Class Imbalance in Multi-label Learning
Wu, Baoyuan (King Abdullah University of Science and Technology (KAUST)) | Lyu, Siwei ( State University of New York at Albany ) | Ghanem, Bernard (King Abdullah University of Science and Technology (KAUST))
Although many handle missing labels and class imbalance jointly. We formulate multi-label learning methods have been proposed in recent the problem as a transductive learning problem that years, a main challenge remains for this problem, i.e., the include five components that are label consistency, instancelevel lack of completely labeled training instances. This is important and class-level label smoothness, and two types of class because in many real life applications, most training cardinality (lower and upper) bounds. The first three components instances are only partially labeled, while other labels are are used to propagate the label information from the not provided or missing. One such example is image annotation, provided labels to missing labels, and the latter two components a human labeler can only feasibly annotates each are included to handle two types of the class imbalance training image with a subset of tags, especially when the problem. We first formulate a unified model that combines number of classes/tags is large. Learning from such partially these components as a constrained submodular minimization labeled instances is referred to as the multi-label learning problem (CSM). However, due to the class cardinality with missing labels (MLML) problem (Wu et al. 2014; constraint, it is a NPhard problem.
Unsupervised Feature Selection on Networks: A Generative View
Wei, Xiaokai (University of Illinois at Chicago) | Cao, Bokai (University of Illinois at Chicago) | Yu, Philip S. (University of Illinois at Chicago and Tsinghua University)
In the past decade, social and information networks have become prevalent, and research on the network data has attracted much attention. Besides the link structure, network data are often equipped with the content information (i.e, node attributes) that is usually noisy and characterized by high dimensionality. As the curse of dimensionality could hamper the performance of many machine learning tasks on networks (e.g., community detection and link prediction), feature selection can be a useful technique for alleviating such issue. In this paper, we investigate the problem of unsupervised feature selection on networks. Most existing feature selection methods fail to incorporate the linkage information, and the state-of-the-art approaches usually rely on pseudo labels generated from clustering. Such cluster labels may be far from accurate and can mislead the feature selection process. To address these issues, we propose a generative point of view for unsupervised features selection on networks that can seamlessly exploit the linkage and content information in a more effective manner. We assume that the link structures and node content are generated from a succinct set of high-quality features, and we find these features through maximizing the likelihood of the generation process. Experimental results on three real-world datasets show that our approach can select more discriminative features than state-of-the-art methods.
Co-Regularized PLSA for Multi-Modal Learning
Wang, Xin (State University of New York at Albany) | Chang, MingChing (State University of New York at Albany) | Ying, Yiming (State University of New York at Albany) | Lyu, Siwei (State University of New York at Albany)
Many learning problems in real world applications involve rich datasets comprising multiple information modalities. In this work, we study co-regularized PLSA(coPLSA) as an efficient solution to probabilistic topic analysis of multi-modal data. In coPLSA, similarities between topic compositions of a data entity across different data modalities are measured with divergences between discrete probabilities, which are incorporated as a co-regularizer to augment individual PLSA models over each data modality. We derive efficient iterative learning algorithms for coPLSA with symmetric KL, L2 and L1 divergences as co-regularizers, in each case the essential optimization problem affords simple numerical solutions that entail only matrix arithmetic operations and numerical solution of 1D nonlinear equations. We evaluate the performance of the coPLSA algorithms on text/image cross-modal retrieval tasks, on which they show competitive performance with state-of-the-art methods.
Semi-Supervised Dictionary Learning via Structural Sparse Preserving
Wang, Di (Wenzhou University) | Zhang, Xiaoqin (Wenzhou University) | Fan, Mingyu (Wenzhou University) | Ye, Xiuzi (Wenzhou University)
While recent techniques for discriminative dictionary learning have attained promising results on the classification tasks, their performance is highly dependent on the number of labeled samples available for training. However, labeling samples is expensive and time consuming due to the significant human effort involved. In this paper, we present a novel semi- supervised dictionary learning method which utilizes the structural sparse relationships between the labeled and unlabeled samples. Specifically, by connecting the sparse reconstruction coefficients on both the original samples and dictionary, the unlabeled samples can be automatically grouped to the different labeled samples, and the grouped samples share a small number of atoms in the dictionary via mixed l2p- norm regularization. This makes the learned dictionary more representative and discriminative since the shared atoms are learned by using the labeled and unlabeled samples potentially from the same class. Minimizing the derived objective function is a challenging task because it is non-convex and highly non-smooth. We propose an efficient optimization algorithm to solve the problem based on the block coordinate descent method. Moreover, we have a rigorous proof of the convergence of the algorithm. Extensive experiments are presented to show the superior performance of our method in classification applications.
Multitask Generalized Eigenvalue Program
Wang, Boyu (McGill University) | Pineau, Joelle (McGill University) | Balle, Borja (Lancaster University)
We present a novel multitask learning framework called multitask generalized eigenvalue program (MTGEP), which jointly solves multiple related generalized eigenvalue problems (GEPs). This framework is quite general and can be applied to many eigenvalue problems in machine learning and pattern recognition, ranging from supervised learning to unsupervised learning, such as principal component analysis (PCA), Fisher discriminant analysis (FDA), common spatial pattern (CSP), and so on. The core assumption of our approach is that the leading eigenvectors of related GEPs lie in some subspace that can be approximated by a sparse linear combination of basis vectors. As a result, these GEPs can be jointly solved by a sparse coding approach. Empirical evaluation with both synthetic and benchmark real world datasets validates the efficacy and efficiency of the proposed techniques, especially for grouped multitask GEPs.
The Hidden Convexity of Spectral Clustering
Voss, James (Ohio State University) | Belkin, Mikhail (Ohio State University) | Rademacher, Luis (Ohio State University)
In recent years, spectral clustering has become a standard method for data analysis used in a broad range of applications. In this paper we propose a new class of algorithms for multiway spectral clustering based on optimization of a certain "contrast function" over the unit sphere. These algorithms, partly inspired by certain Indepenent Component Analysis techniques, are simple, easy to implement and efficient. Geometrically, the proposed algorithms can be interpreted as hidden basis recovery by means of function optimization. We give a complete characterization of the contrast functions admissible for provable basis recovery. We show how these conditions can be interpreted as a "hidden convexity" of our optimization problem on the sphere; interestingly, we use efficient convex maximization rather than the more common convex minimization. We also show encouraging experimental results on real and simulated data.
Metric Learning for Ordinal Data
Shi, Yuan (University of Southern California) | Li, Wenzhe (University of Southern California) | Sha, Fei (University of Southern California)
A large amount of ordinal-valued data exist in many domains, including medical and health science, social science, economics, political science, etc. Unlike image and speech datasets of real-valued data, learning with ordinal variables (i.e., features) presents unique challenges. In particular, the nominal differences between those feature values, which are just ranks, do not necessarily correspond to the real distances between the corresponding categories. Given their wide existence, it is imperative to develop machine learning algorithms that specifically address the need to model and infer with such data. In this paper, we present a novel metric learning algorithm that takes into consideration the nature of ordinal data. Our approach treats ordinal values as latent variables in intervals. Our algorithm then learns what those intervals are as well as distance metrics to measure distances between latent variables in those intervals. We derive the corresponding optimization algorithm and demonstrate how that can be solved effectively. Experimental results show that the proposed approach significantly improves baselines that do not explicitly model ordinal features.
Scaling Simultaneous Optimistic Optimization for High-Dimensional Non-Convex Functions with Low Effective Dimensions
Qian, Hong (Nanjing University) | Yu, Yang (Nanjing University)
Simultaneous optimistic optimization (SOO) is a recently proposed global optimization method with a strong theoretical foundation. Previous studies have shown that SOO has a good performance in low-dimensional optimization problems, however, its performance is unsatisfactory when the dimensionality is high. This paper adapts random embedding to scaling SOO, resulting in the RESOO algorithm. We prove that the simple regret of RESOO depends only on the effective dimension of the problem, while that of SOO depends on the dimension of the solution space. Empirically, on some high-dimensional non-convex testing functions as well as hyper-parameter tuning tasks for multi-class support vector machines, RESOO shows significantly improved performance from SOO.
Reinforcement Learning with Parameterized Actions
Masson, Warwick (University of the Witwatersrand) | Ranchod, Pravesh (University of the Witwatersrand ) | Konidaris, George (Duke University)
We introduce a model-free algorithm for learning in Markov decision processes with parameterized actions—discrete actions with continuous parameters. At each step the agent must select both which action to use and which parameters to use with that action. We introduce the Q-PAMDP algorithm for learning in these domains, show that it converges to a local optimum, and compare it to direct policy search in the goal-scoring and Platform domains.