Statistical Learning
Context-Independent Claim Detection for Argument Mining
Lippi, Marco (University of Bologna) | Torroni, Paolo (University of Bologna)
Argumentation mining aims to automatically identify structured argument data from unstructured natural language text. This challenging, multi-faceted task is recently gaining a growing attention, especially due to its many potential applications. One particularly important aspect of argumentation mining is claim identification. Most of the current approaches are engineered to address specific domains. However, argumentative sentences are often characterized by common rhetorical structures, independently of the domain. We thus propose a method that exploits structured parsing information to detect claims without resorting to contextual information, and yet achieve a performance comparable to that of state-of-the-art methods that heavily rely on the context.
Characterization of Scoring Rules with Distances: Application to the Clustering of Rankings
Viappiani, Paolo (Sorbonne Universities and CNRS)
Positional scoring rules are often used for rank aggregation. In this work we study how scoring rules can be formulated as the minimization of some distance measures between rankings, and we also consider a new family of aggregation methods, called biased scoring rules. This work extends a previous known observation connecting Borda count with the minimization of the sum of the Spearman distances (calculated with respect to a set of input rankings). In particular we consider generalizations of the Spearman distance that can give different weights to items and positions; we also handle the case of incomplete rank data. This has applications in the clustering of rank data, where two main steps need to be performed: aggregating rankings of the same cluster into a representative ranking (the cluster's centroid) and assigning each ranking to its closest centroid. Using the proper combination of scoring rules (for aggregation) and distances (for assignment), it is possible to perform clustering in a computationally efficient way and as well account for specific desired behaviors (give more weight to top positions, bias the centroids in favor of particular items).
A Generalized Kernel Approach to Structured Output Learning
Kadri, Hachem, Ghavamzadeh, Mohammad, Preux, Philippe
We study the problem of structured output learning from a regression perspective. We first provide a general formulation of the kernel dependency estimation (KDE) approach to this problem using operator-valued kernels. Our formulation overcomes the two main limitations of the original KDE approach, namely the decoupling between outputs in the image space and the inability to use a joint feature space. We then propose a covariance-based operator-valued kernel that allows us to take into account the structure of the kernel feature space. This kernel operates on the output space and only encodes the interactions between the outputs without any reference to the input space. To address this issue, we introduce a variant of our KDE method based on the conditional covariance operator that in addition to the correlation between the outputs takes into account the effects of the input variables. Finally, we evaluate the performance of our KDE approach on three structured output problems, and compare it to the state-of-the-art kernelbased structured output regression methods.
Parallel MMF: a Multiresolution Approach to Matrix Computation
Kondor, Risi, Teneva, Nedelina, Mudrakarta, Pramod K.
Multiresolution Matrix Factorization (MMF) was recently introduced as a method for finding multiscale structure and defining wavelets on graphs/matrices. In this paper we derive pMMF, a parallel algorithm for computing the MMF factorization. Empirically, the running time of pMMF scales linearly in the dimension for sparse matrices. We argue that this makes pMMF a valuable new computational primitive in its own right, and present experiments on using pMMF for two distinct purposes: compressing matrices and preconditioning large sparse linear systems.
Certifying and removing disparate impact
Feldman, Michael, Friedler, Sorelle, Moeller, John, Scheidegger, Carlos, Venkatasubramanian, Suresh
What does it mean for an algorithm to be biased? In U.S. law, unintentional bias is encoded via disparate impact, which occurs when a selection process has widely different outcomes for different groups, even as it appears to be neutral. This legal determination hinges on a definition of a protected class (ethnicity, gender, religious practice) and an explicit description of the process. When the process is implemented using computers, determining disparate impact (and hence bias) is harder. It might not be possible to disclose the process. In addition, even if the process is open, it might be hard to elucidate in a legal setting how the algorithm makes its decisions. Instead of requiring access to the algorithm, we propose making inferences based on the data the algorithm uses. We make four contributions to this problem. First, we link the legal notion of disparate impact to a measure of classification accuracy that while known, has received relatively little attention. Second, we propose a test for disparate impact based on analyzing the information leakage of the protected class from the other data attributes. Third, we describe methods by which data might be made unbiased. Finally, we present empirical evidence supporting the effectiveness of our test for disparate impact and our approach for both masking bias and preserving relevant information in the data. Interestingly, our approach resembles some actual selection practices that have recently received legal scrutiny.
The Role of Principal Angles in Subspace Classification
Huang, Jiaji, Qiu, Qiang, Calderbank, Robert
Abstract--Subspace models play an important role in a wide range of signal processing tasks, and this paper explores how the pairwise geometry of subspaces influences the probability of misclassification. When the mismatch between the signal and the model is vanishingly small, the probability of misclassification is determined by the product of the sines of the principal angles between subspaces. When the mismatch is more significant, the probability of misclassification is determined by the sum of the squares of the sines of the principal angles. Reliability of classification is derived in terms of the distribution of signal energy across principal vectors. Larger principal angles lead to smaller classification error, motivating a linear transform that optimizes principal angles. The transform presented here (TRAIT) preserves some specific characteristic of each individual class, and this approach is shown to be complementary to a previously developed transform (LRT) that enlarges inter-class distance while suppressing intra-class dispersion. Theoretical results are supported by demonstration of superior classification accuracy on synthetic and measured data even in the presence of significant model mismatch. IGNALS that are nominally high dimensional often exhibit a low dimensional geometric structure. For example, fixed-pose images of human faces are recorded using more than 1000 pixels, but can be represented by a 9-dimensional harmonic subspace [1]. Motion trajectories of a rigid body might be recorded by hundreds of sensors, but must intrinsically be represented by a 4-dimensional subspace [2]. There are many more examples where a low-dimensional subspace model captures intrinsic geometric structure, ranging from user ratings in a recommendation system [3] to signals emitted by multiple sources impinging at an antenna array [4]. The subspace geometry has assisted tasks of interest to both signal processing [5], [6] and machine learning communities [7], [8]. It can be used to approximate a nonlinear manifold by fitting mixture components to local patches of the manifold [5], [9], hence providing a high fidelity representation of a wide variety of signal geometries.
ALEVS: Active Learning by Statistical Leverage Sampling
Active learning aims to obtain a classifier of high accuracy by using fewer label requests in comparison to passive learning by selecting effective queries. Many active learning methods have been developed in the past two decades, which sample queries based on informativeness or representativeness of unlabeled data points. In this work, we explore a novel querying criterion based on statistical leverage scores. The statistical leverage scores of a row in a matrix are the squared row-norms of the matrix containing its (top) left singular vectors and is a measure of influence of the row on the matrix. Leverage scores have been used for detecting high influential points in regression diagnostics (Chatterjee & Hadi, 1986) and have been recently shown to be useful for data analysis (Drineas et al., 2008) and randomized low-rank matrix approximation algorithms (Gittens & Mahoney, 2013). We explore how sampling data instances with high statistical leverage scores perform in active learning. Our empirical comparison on several binary classification datasets indicate that querying high leverage points is an effective strategy.
An SVM-like Approach for Expectile Regression
Farooq, Muhammad, Steinwart, Ingo
In standard nonparametric regression analysis, most of the methods developed so far are based on the least square loss function for estimating conditional expectations. In many applications, however, it is required to study conditional distributions beyond means. A nice tool for this purpose was offered by [20] in the form of quantile regression, which allows both the location and the spread of the response variable to be studied by using asymmetric least absolute deviation loss function (ALAD). We refer the reader to [19, 37, 9, 33] and references therein, for details description and different estimation methods for quantile regression.
Rich Component Analysis
In many settings, we have multiple data sets (also called views) that capture different and overlapping aspects of the same phenomenon. We are often interested in finding patterns that are unique to one or to a subset of the views. For example, we might have one set of molecular observations and one set of physiological observations on the same group of individuals, and we want to quantify molecular patterns that are uncorrelated with physiology. Despite being a common problem, this is highly challenging when the correlations come from complex distributions. In this paper, we develop the general framework of Rich Component Analysis (RCA) to model settings where the observations from different views are driven by different sets of latent components, and each component can be a complex, high-dimensional distribution. We introduce algorithms based on cumulant extraction that provably learn each of the components without having to model the other components. We show how to integrate RCA with stochastic gradient descent into a meta-algorithm for learning general models, and demonstrate substantial improvement in accuracy on several synthetic and real datasets in both supervised and unsupervised tasks. Our method makes it possible to learn latent variable models when we don't have samples from the true model but only samples after complex perturbations.
Practical Inexact Proximal Quasi-Newton Method with Global Complexity Analysis
Scheinberg, Katya, Tang, Xiaocheng
Recently several methods were proposed for sparse optimization which make careful use of second-order information [10, 28, 16, 3] to improve local convergence rates. These methods construct a composite quadratic approximation using Hessian information, optimize this approximation using a first-order method, such as coordinate descent and employ a line search to ensure sufficient descent. Here we propose a general framework, which includes slightly modified versions of existing algorithms and also a new algorithm, which uses limited memory BFGS Hessian approximations, and provide a novel global convergence rate analysis, which covers methods that solve subproblems via coordinate descent.