Goto

Collaborating Authors

 Nearest Neighbor Methods


Online Ranking with Concept Drifts in Streaming Data

arXiv.org Machine Learning

Two important problems in preference elicitation are rank aggregation and label ranking. Rank aggregation consists of finding a ranking that best summarizes a collection of preferences of some agents. The latter, label ranking, aims at learning a mapping between data instances and rankings defined over a finite set of categories or labels. This problem can effectively model many real application scenarios such as recommender systems. However, even when the preferences of populations usually change over time, related literature has so far addressed both problems over non-evolving preferences. This work deals with the problems of rank aggregation and label ranking over non-stationary data streams. In this context, there is a set of $n$ items and $m$ agents which provide their votes by casting a ranking of the $n$ items. The rankings are noisy realizations of an unknown probability distribution that changes over time. Our goal is to learn, in an online manner, the current ground truth distribution of rankings. We begin by proposing an aggregation function called Forgetful Borda (FBorda) that, using a forgetting mechanism, gives more importance to recently observed preferences. We prove that FBorda is a consistent estimator of the Kemeny ranking and lower bound the number of samples needed to learn the distribution while guaranteeing a certain level of confidence. Then, we develop a $k$-nearest neighbor classifier based on the proposed FBorda aggregation algorithm for the label ranking problem and demonstrate its accuracy in several scenarios of label ranking problem over evolving preferences.


The Basics: KNN for classification and regression

#artificialintelligence

Data science or applied statistics courses typically start with linear models, but in its way, K-nearest neighbors is probably the simplest widely used model conceptually. KNN models are really just technical implementations of a common intuition, that things that share similar features tend to be, well, similar. This is hardly a deep insight, yet these practical implementations can be extremely powerful, and, crucially for someone approaching an unknown dataset, can handle non-linearities without any complicated data-engineering or model set up. As an illustrative example, let's consider the simplest case of using a KNN model as a classifier. Let's say you have data points that fall into one of three classes.


Supervised Learning Approach to Approximate Nearest Neighbor Search

arXiv.org Machine Learning

Approximate nearest neighbor search is a classic algorithmic problem where the goal is to design an efficient index structure for fast approximate nearest neighbor queries. We show that it can be framed as a classification problem and solved by training a suitable multi-label classifier and using it as an index. Compared to the existing algorithms, this supervised learning approach has several advantages: it enables adapting an index to the query distribution when the query distribution and the corpus distribution differ; it allows using training sets larger than the corpus; and in principle it enables using any multi-label classifier for approximate nearest neighbor search. We demonstrate these advantages on multiple synthetic and real-world data sets by using a random forest and an ensemble of random projection trees as the base classifiers. Introduction In k -nearest neighbor ( k -nn) search, k points that are nearest to the query point are retrieved from the corpus. Approximate nearest neighbor search is used to speed up k -nn search in applications where fast response times are critical, such as in computer vision, robotics, and recommendation systems. Traditionally, approximate nearest neighbor search is approached as a problem in algorithms and data structures. Space-partitioning methods--trees, hashing, and quantization--divide the space according to a geometric criterion. For instance, k -d trees (Bentley 1975) and principal component trees (McNames 2001) are grown by hierarchically partitioning the space along the maximum variance directions of the corpus.


Benefit of Interpolation in Nearest Neighbor Algorithms

arXiv.org Machine Learning

The over-parameterized models attract much attention in the era of data science and deep learning. It is empirically observed that although these models, e.g. deep neural networks, over-fit the training data, they can still achieve small testing error, and sometimes even {\em outperform} traditional algorithms which are designed to avoid over-fitting. The major goal of this work is to sharply quantify the benefit of data interpolation in the context of nearest neighbors (NN) algorithm. Specifically, we consider a class of interpolated weighting schemes and then carefully characterize their asymptotic performances. Our analysis reveals a U-shaped performance curve with respect to the level of data interpolation, and proves that a mild degree of data interpolation {\em strictly} improves the prediction accuracy and statistical stability over those of the (un-interpolated) optimal $k$NN algorithm. This theoretically justifies (predicts) the existence of the second U-shaped curve in the recently discovered double descent phenomenon. Note that our goal in this study is not to promote the use of interpolated-NN method, but to obtain theoretical insights on data interpolation inspired by the aforementioned phenomenon.


Meta-Neighborhoods

arXiv.org Machine Learning

Traditional methods for training neural networks use training data just once, as it is discarded after training. Instead, in this work we also leverage the training data during testing to adjust the network and gain more expressivity. Our approach, named Meta-Neighborhoods, is developed under a multi-task learning framework and is a generalization of k-nearest neighbors methods. It can flexibly adapt network parameters w.r.t. different query data using their respective local neighborhood information. Local information is learned and stored in a dictionary of learnable neighbors rather than directly retrieved from the training set for greater flexibility and performance. The network parameters and the dictionary are optimized end-to-end via meta-learning. Extensive experiments demonstrate that Meta-Neighborhoods consistently improved classification and regression performance across various network architectures and datasets. We also observed superior improvements than other state-of-the-art meta-learning methods designed to improve supervised learning.


k-Relevance Vectors for Pattern Classification

arXiv.org Machine Learning

This study combines two different learning paradigms, k-nearest neighbor (k-NN) rule, as memory-based learning paradigm and relevance vector machines (RVM), as statistical learning paradigm. This combination is performed in kernel space and is called k-relevance vector (k-RV). The purpose is to improve the performance of k-NN rule. The proposed model significantly prunes irrelevant attributes. We also introduced a new parameter, responsible for early stopping of iterations in RVM. We show that the new parameter improves the classification accuracy of k-RV. Intensive experiments are conducted on several classification datasets from University of California Irvine (UCI) repository and two real datasets from computer vision domain. The performance of k-RV is highly competitive compared to a few state-of-the-arts in terms of classification accuracy.


Detecting Adversarial Samples Using Influence Functions and Nearest Neighbors

arXiv.org Machine Learning

Deep neural networks (DNNs) are notorious for their vulnerability to adversarial attacks, which are small perturbations added to their input images to mislead their prediction. Detection of adversarial examples is, therefore, a fundamental requirement for robust classification frameworks. In this work, we present a method for detecting such adversarial attacks, which is suitable for any pre-trained neural network classifier. We use influence functions to measure the impact of every training sample on the validation set data. From the influence scores, we find the most supportive training samples for any given validation example. A k-nearest neighbor (k-NN) model fitted on the DNN's activation layers is employed to search for the ranking of these supporting training samples. We observe that these samples are highly correlated with the nearest neighbors of the normal inputs, while this correlation is much weaker for adversarial inputs. We train an adversarial detector using the k-NN ranks and distances and show that it successfully distinguishes adversarial examples, getting state-of-the-art results on four attack methods with three datasets.


Exercise Classification with Machine Learning (Part I)

#artificialintelligence

The first post will focus on a more algorithmic approach using k-Nearest Neighbors to classify an unknown video, and in the second post, we'll look at an exclusively machine learning (ML) approach. Code for everything we're going to cover can be found on this GitHub repository. The algorithmic approach (Part I) is written in Swift and is available as a CocoaPod. The ML approach (Part II) is written in Python/TensorFlow and can be found as part of the GitHub repository. We want to build a system which takes as input a video of a person performing an exercise and outputs a class label which describes the video.


Student Performance Prediction with Optimum Multilabel Ensemble Model

arXiv.org Machine Learning

One of the important measures of quality of education is the performance of students in the academic settings. Nowadays, abundant data is stored in educational institutions about students which can help to discover insight on how students are learning and how to improve their performance ahead of time using data mining techniques. In this paper, we developed a student performance prediction model that predicts the performance of high school students for the next semester for five courses. We modeled our prediction system as a multi-label classification task and used support vector machine (SVM), Random Forest (RF), K-nearest Neighbors (KNN), and Mult-layer perceptron (MLP) as base-classifiers to train our model. We further improved the performance of the prediction model using state-of-the-art partitioning schemes to divide the label space into smaller spaces and use Label Powerset (LP) transformation method to transform each labelset into a multi-class classification task. The proposed model achieved better performance in terms of different evaluation metrics when compared to other multi-label learning tasks such as binary relevance and classifier chains.


Rates of Convergence for Large-scale Nearest Neighbor Classification

arXiv.org Machine Learning

Nearest neighbor is a popular class of classification methods with many desirable properties. For a large data set which cannot be loaded into the memory of a single machine due to computation, communication, privacy, or ownership limitations, we consider the divide and conquer scheme: the entire data set is divided into small subsamples, on which nearest neighbor predictions are made, and then a final decision is reached by aggregating the predictions on subsamples by majority voting. We name this method the big Nearest Neighbor (bigNN) classifier, and provide its rates of convergence under minimal assumptions, in terms of both the excess risk and the classification instability, which are proven to be the same rates as the oracle nearest neighbor classifier and cannot be improved. To significantly reduce the prediction time that is required for achieving the optimal rate, we also consider the pre-training acceleration technique applied to the bigNN method, with proven convergence rate. We find that in the distributed setting, the optimal choice of the neighbor k should scale with both the total sample size and the number of partitions, and there is a theoretical upper limit for the latter. Numerical studies have verified the theoretical findings.