Support Vector Machines
Building Regression Models in R using Support Vector Regression
The article studies the advantage of Support Vector Regression (SVR) over Simple Linear Regression (SLR) models. SVR uses the same basic idea as Support Vector Machine (SVM), a classification algorithm, but applies it to predict real values rather than a class. SVR acknowledges the presence of non-linearity in the data and provides a proficient prediction model. Along with the thorough understanding of SVR, we also provide the reader with hands on experience of preparing the model on R. We perform SLR and SVR on the same dataset and make a comparison. The article is organized as follows; Section 1 provides a quick review of SLR and its implementation on R. Section 2 discusses the theoretical aspects of SVR and the steps to fit SVR on R. It also covers the basics of tuning SVR model.
Credible Review Detection with Limited Information using Consistency Analysis
Mukherjee, Subhabrata, Dutta, Sourav, Weikum, Gerhard
Online reviews provide viewpoints on the strengths and shortcomings of products/services, influencing potential customers' purchasing decisions. However, the proliferation of non-credible reviews -- either fake (promoting/ demoting an item), incompetent (involving irrelevant aspects), or biased -- entails the problem of identifying credible reviews. Prior works involve classifiers harnessing rich information about items/users -- which might not be readily available in several domains -- that provide only limited interpretability as to why a review is deemed non-credible. This paper presents a novel approach to address the above issues. We utilize latent topic models leveraging review texts, item ratings, and timestamps to derive consistency features without relying on item/user histories, unavailable for "long-tail" items/users. We develop models, for computing review credibility scores to provide interpretable evidence for non-credible reviews, that are also transferable to other domains -- addressing the scarcity of labeled data. Experiments on real-world datasets demonstrate improvements over state-of-the-art baselines.
Generalized support vector regression: duality and tensor-kernel representation
Salzo, Saverio, Suykens, Johan A. K.
In this paper we study the variational problem associated to support vector regression in Banach function spaces. Using the Fenchel-Rockafellar duality theory, we give explicit formulation of the dual problem as well as of the related optimality conditions. Moreover, we provide a new computational framework for solving the problem which relies on a tensor-kernel representation. This analysis overcomes the typical difficulties connected to learning in Banach spaces. We finally present a large class of tensor-kernels to which our theory fully applies: power series tensor kernels. This type of kernels describe Banach spaces of analytic functions and include generalizations of the exponential and polynomial kernels as well as, in the complex case, generalizations of the Szeg\"o and Bergman kernels.
One-Class Semi-Supervised Learning: Detecting Linearly Separable Class by its Mean
Bauman, Evgeny, Bauman, Konstantin
In this paper, we presented a novel semi-supervised one-class classification algorithm which assumes that class is linearly separable from other elements. We proved theoretically that class is linearly separable if and only if it is maximal by probability within the sets with the same mean. Furthermore, we presented an algorithm for identifying such linearly separable class utilizing linear programming. We described three application cases including an assumption of linear separability, Gaussian distribution, and the case of linear separability in transformed space of kernel functions. Finally, we demonstrated the work of the proposed algorithm on the USPS dataset and analyzed the relationship of the performance of the algorithm and the size of the initially labeled sample.
Strictly Proper Kernel Scoring Rules and Divergences with an Application to Kernel Two-Sample Hypothesis Testing
We study strictly proper scoring rules in the Reproducing Kernel Hilbert Space. We propose a general Kernel Scoring rule and associated Kernel Divergence. We consider conditions under which the Kernel Score is strictly proper. We then demonstrate that the Kernel Score includes the Maximum Mean Discrepancy as a special case. We also consider the connections between the Kernel Score and the minimum risk of a proper loss function. We show that the Kernel Score incorporates more information pertaining to the projected embedded distributions compared to the Maximum Mean Discrepancy. Finally, we show how to integrate the information provided from different Kernel Divergences, such as the proposed Bhattacharyya Kernel Divergence, using a one-class classifier for improved two-sample hypothesis testing results.
Performance Limits of Stochastic Sub-Gradient Learning, Part I: Single Agent Case
The minimization of non-differentiable convex cost functions is a critical step in the solution of many design problems [3]-[5], including the design of sparse-aware (LASSO) solutions [6], [7], support-vector machine (SVM) learners [8]-[12], or total-variation-based image denoising solutions [13], [14]. Several powerful techniques have been proposed in the literature to deal with the non-differentiability aspect of the problem formulation, including methods that employ sub-gradient iterations [3]-[5], cutting-plane techniques [15], or proximal iterations [16], [17]. This work focuses on the class of sub-gradient methods for the reasons explained in the sequel. The sub-gradient technique is closely related to the traditional gradient-descent method [3], [4] where the actual gradient is replaced by a sub-gradient at points of nondifferentiability. It is one of the simplest methods in current practice but is known to suffer from slow convergence. For instance, it is shown in [5] that, for convex cost functions, the optimal convergence rate that can be delivered by sub-gradient methods in deterministic optimization problems cannot be faster than O(1/ i) under worst case conditions, where i is the iteration index.
Fast Kronecker product kernel methods via generalized vec trick
Airola, Antti, Pahikkala, Tapio
Kronecker product kernel provides the standard approach in the kernel methods literature for learning from graph data, where edges are labeled and both start and end vertices have their own feature representations. The methods allow generalization to such new edges, whose start and end vertices do not appear in the training data, a setting known as zero-shot or zero-data learning. Such a setting occurs in numerous applications, including drug-target interaction prediction, collaborative filtering and information retrieval. Efficient training algorithms based on the so-called vec trick, that makes use of the special structure of the Kronecker product, are known for the case where the training data is a complete bipartite graph. In this work we generalize these results to non-complete training graphs. This allows us to derive a general framework for training Kronecker product kernel methods, as specific examples we implement Kronecker ridge regression and support vector machine algorithms. Experimental results demonstrate that the proposed approach leads to accurate models, while allowing order of magnitude improvements in training and prediction time.
Ranking to Learn: Feature Ranking and Selection via Eigenvector Centrality
In an era where accumulating data is easy and storing it inexpensive, feature selection plays a central role in helping to reduce the high-dimensionality of huge amounts of otherwise meaningless data. In this paper, we propose a graph-based method for feature selection that ranks features by identifying the most important ones into arbitrary set of cues. Mapping the problem on an affinity graph-where features are the nodes-the solution is given by assessing the importance of nodes through some indicators of centrality, in particular, the Eigen-vector Centrality (EC). The gist of EC is to estimate the importance of a feature as a function of the importance of its neighbors. Ranking central nodes individuates candidate features, which turn out to be effective from a classification point of view, as proved by a thoroughly experimental section. Our approach has been tested on 7 diverse datasets from recent literature (e.g., biological data and object recognition, among others), and compared against filter, embedded and wrappers methods. The results are remarkable in terms of accuracy, stability and low execution time.
Large-Scale Online Semantic Indexing of Biomedical Articles via an Ensemble of Multi-Label Classification Models
Papanikolaou, Yannis, Tsoumakas, Grigorios, Laliotis, Manos, Markantonatos, Nikos, Vlahavas, Ioannis
Background: In this paper we present the approaches and methods employed in order to deal with a large scale multi-label semantic indexing task of biomedical papers. This work was mainly implemented within the context of the BioASQ challenge of 2014. Methods: The main contribution of this work is a multi-label ensemble method that incorporates a McNemar statistical significance test in order to validate the combination of the constituent machine learning algorithms. Some secondary contributions include a study on the temporal aspects of the BioASQ corpus (observations apply also to the BioASQ's super-set, the PubMed articles collection) and the proper adaptation of the algorithms used to deal with this challenging classification task. Results: The ensemble method we developed is compared to other approaches in experimental scenarios with subsets of the BioASQ corpus giving positive results. During the BioASQ 2014 challenge we obtained the first place during the first batch and the third in the two following batches. Our success in the BioASQ challenge proved that a fully automated machine-learning approach, which does not implement any heuristics and rule-based approaches, can be highly competitive and outperform other approaches in similar challenging contexts.
Artificial Intelligence: Monitoring the Monitors
Time for a sobering moment of truth -- very little of what's being positioned as the "new" in Machine Learning is really "New Math" at all. Artificial Neural Networks (ANN) themselves are a decades-old concept, and even then the basis of that approach shares large commonalities with other statistical techniques that are themselves decades older. Consider that the most common task for most ANNs today is classification (fraud detection, threat detection, etc.), which is germane to any number of previously established supervised classification techniques such as logistic regression, support vector machines, etc., that have been circulating in the statistical realm for decades. While the computational approaches have algebraic differences, they can generally be shown to (in the limit) converge to a small set of solutions that are largely interchangeable or at least statistically indistinguishable insofar as the models are used appropriately, underlying data structures are respected by the model choice, and each model is shown the same set of information. What actually IS new is the ability to quickly and easily perform these modeling computations at the scale to which we now can.