plugin classifier
Binary Classification with Bounded Abstention Rate
Shekhar, Shubhanshu, Ghavamzadeh, Mohammad, Javidi, Tara
We consider the problem of binary classification with abstention in the relatively less studied \emph{bounded-rate} setting. We begin by obtaining a characterization of the Bayes optimal classifier for an arbitrary input-label distribution $P_{XY}$. Our result generalizes and provides an alternative proof for the result first obtained by \cite{chow1957optimum}, and then re-derived by \citet{denis2015consistency}, under a continuity assumption on $P_{XY}$. We then propose a plug-in classifier that employs unlabeled samples to decide the region of abstention and derive an upper-bound on the excess risk of our classifier under standard \emph{H\"older smoothness} and \emph{margin} assumptions. Unlike the plug-in rule of \citet{denis2015consistency}, our constructed classifier satisfies the abstention constraint with high probability and can also deal with discontinuities in the empirical cdf. We also derive lower-bounds that demonstrate the minimax near-optimality of our proposed algorithm. To address the excessive complexity of the plug-in classifier in high dimensions, we propose a computationally efficient algorithm that builds upon prior work on convex loss surrogates, and obtain bounds on its excess risk in the \emph{realizable} case. We empirically compare the performance of the proposed algorithm with a baseline on a number of UCI benchmark datasets.
A plug-in approach to maximising precision at the top and recall at the top
Information retrieval and binary classification can be considered equivalent problems in principle. Information retrieval means to mark documents in a set of candidate documents as relevant or non-relevant for some question, on the basis of the properties of the documents. For binary classification, the problem is to distinguish between the'positive' and'negative' instances from a dataset, based on the features of the instances. Hence, from an abstract point of view, information retrieval is a special case of binary classification, with the documents being instances, the document properties being features and'relevant' being translated as'positive'. In practice, however, the general concepts from binary classification are not always helpful for information retrieval applications. The fact that often the proportion of relevant documents in a set of documents subject to a search is small or even very small is only one of the reasons for information retrieval to be considered a field of research for its own. As a consequence, some performance measures for information retrieval methods differ from those in use for binary classifiers or are called by different names. Precision and recall are possibly the most popular performance measures(see Chapter 8 of Manning et al., 2008, for a list of performance measures) for information retrieval methods: - Precision is the proportion of documents (instances) that are truly relevant (positive) among those documents which have been predicted relevant (positive). The term precision is also commonly used (with the same meaning) in binary classification.
Pointwise adaptation via stagewise aggregation of local estimates for multiclass classification
Puchkin, Nikita, Spokoiny, Vladimir
We consider a problem of multiclass classification, where the training sample $S_n = \{(X_i, Y_i)\}_{i=1}^n$ is generated from the model $\mathbb p(Y = m | X = x) = \theta_m(x)$, $1 \leq m \leq M$, and $\theta_1(x), \dots, \theta_M(x)$ are unknown Lipschitz functions. Given a test point $X$, our goal is to estimate $\theta_1(X), \dots, \theta_M(X)$. An approach based on nonparametric smoothing uses a localization technique, i.e. the weight of observation $(X_i, Y_i)$ depends on the distance between $X_i$ and $X$. However, local estimates strongly depend on localizing scheme. In our solution we fix several schemes $W_1, \dots, W_K$, compute corresponding local estimates $\widetilde\theta^{(1)}, \dots, \widetilde\theta^{(K)}$ for each of them and apply an aggregation procedure. We propose an algorithm, which constructs a convex combination of the estimates $\widetilde\theta^{(1)}, \dots, \widetilde\theta^{(K)}$ such that the aggregated estimate behaves approximately as well as the best one from the collection $\widetilde\theta^{(1)}, \dots, \widetilde\theta^{(K)}$. We also study theoretical properties of the procedure, prove oracle results and establish rates of convergence under mild assumptions.
Scalable Optimization of Multivariate Performance Measures in Multi-instance Multi-label Learning
Aggarwal, Apoorv (Indian Institute of Technology Bombay) | Ghoshal, Sandip (Indian Institute of Technology Bombay) | Shetty, Ankith M. S. (Indian Institute of Technology Bombay) | Sinha, Suhit (Indian Institute of Technology Bombay) | Ramakrishnan, Ganesh (Indian Institute of Technology Bombay) | Kar, Purushottam (Indian Institute of Technology Kanpur) | Jain, Prateek (Microsoft Research )
The problem of multi-instance multi-label learning (MIML) requires a bag of instances to be assigned a set of labels most relevant to the bag as a whole. The problem finds numerous applications in machine learning, computer vision, and natural language processing settings where only partial or distant supervision is available. We present a novel method for optimizing multivariate performance measures in the MIML setting. Our approach MIML-perf uses a novel plug-in technique and offers a seamless way to optimize a vast variety of performance measures such as macro and micro-F measure, average precision, which are performance measures of choice in multi-label learning domains. MIML-perf offers two key benefits over the state of the art. Firstly, across a diverse range of benchmark tasks, ranging from relation extraction to text categorization and scene classification, MIML-perf offers superior performance as compared to state of the art methods designed specifically for these tasks. Secondly, MIML-perf operates with significantly reduced running times as compared to other methods, often by an order of magnitude or more.
Learning From Non-iid Data: Fast Rates for the One-vs-All Multiclass Plug-in Classifiers
Dinh, Vu, Ho, Lam Si Tung, Cuong, Nguyen Viet, Nguyen, Duy, Nguyen, Binh T.
We prove new fast learning rates for the one-vs-all multiclass plug-in classifiers trained either from exponentially strongly mixing data or from data generated by a converging drifting distribution. These are two typical scenarios where training data are not iid. The learning rates are obtained under a multiclass version of Tsybakov's margin assumption, a type of low-noise assumption, and do not depend on the number of classes. Our results are general and include a previous result for binary-class plug-in classifiers with iid data as a special case. In contrast to previous works for least squares SVMs under the binary-class setting, our results retain the optimal learning rate in the iid case.
On a Theory of Nonparametric Pairwise Similarity for Clustering: Connecting Clustering to Classification
Yang, Yingzhen, Liang, Feng, Yan, Shuicheng, Wang, Zhangyang, Huang, Thomas S.
Pairwise clustering methods partition the data space into clusters by the pairwise similarity between data points. The success of pairwise clustering largely depends on the pairwise similarity function defined over the data points, where kernel similarity is broadly used. In this paper, we present a novel pairwise clustering framework by bridging the gap between clustering and multi-class classification. This pairwise clustering framework learns an unsupervised nonparametric classifier from each data partition, and search for the optimal partition of the data by minimizing the generalization error of the learned classifiers associated with the data partitions. We consider two nonparametric classifiers in this framework, i.e. the nearest neighbor classifier and the plug-in classifier. Modeling the underlying data distribution by nonparametric kernel density estimation, the generalization error bounds for both unsupervised nonparametric classifiers are the sum of nonparametric pairwise similarity terms between the data points for the purpose of clustering. Under uniform distribution, the nonparametric similarity terms induced by both unsupervised classifiers exhibit a well known form of kernel similarity. We also prove that the generalization error bound for the unsupervised plug-in classifier is asymptotically equal to the weighted volume of cluster boundary for Low Density Separation, a widely used criteria for semi-supervised learning and clustering. Based on the derived nonparametric pairwise similarity using the plug-in classifier, we propose a new nonparametric exemplar-based clustering method with enhanced discriminative capability, whose superiority is evidenced by the experimental results.