Nearest Neighbor Methods


Machine Learning in R for beginners

#artificialintelligence

Machine learning is a branch of computer science that studies the design of algorithms that can learn. Typical machine learning tasks are concept learning, function learning or "predictive modeling", clustering, and finding predictive patterns. These tasks are learned from available data, observed through experience or instruction, for example. The hope is that incorporating this experience into its tasks will eventually improve the learning. The ultimate goal is to improve learning to the point that it becomes automatic, so that humans no longer need to interfere. This short tutorial introduces you to the basics of machine learning in R: more specifically, it shows you how to use R to work with the well-known machine learning algorithm called "KNN" or k-nearest neighbors. Additionally, this tutorial also covers how to use caret to do machine learning in R. If you're interested in following a course, consider checking out our Introduction to Machine Learning with R or DataCamp's Unsupervised Learning in R course!
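
As a taste of what the tutorial covers, here is a minimal sketch of that workflow on R's built-in iris data; the split proportion, the choice k = 3 and the cross-validation settings are illustrative assumptions, not values taken from the tutorial itself.

# Minimal kNN workflow sketch on the built-in iris data
library(class)   # provides knn()
library(caret)   # provides createDataPartition() and train()

set.seed(1234)

# Split the data into training and test sets
idx       <- createDataPartition(iris$Species, p = 0.7, list = FALSE)
train_set <- iris[idx, ]
test_set  <- iris[-idx, ]

# Plain kNN from the class package (k = 3 chosen for illustration)
pred <- knn(train = train_set[, 1:4], test = test_set[, 1:4],
            cl = train_set$Species, k = 3)
table(pred, test_set$Species)

# The same model via caret, which tunes k by cross-validation
fit <- train(Species ~ ., data = train_set, method = "knn",
             trControl = trainControl(method = "cv", number = 5))
predict(fit, newdata = test_set)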


K-Nearest Neighbors on Wisconsin Breast Cancer Data

#artificialintelligence

First, let's set things up in R by loading the necessary package and importing the data. The class package will be used to run the k-nearest neighbors algorithm. We will also set a specific seed so that you can reproduce the results in R yourself. After importing the data, let's take a look at the basic structure of the dataset. The first variable, id, is simply a unique identifier for each observation, so we will remove it for the purposes of our analysis.
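
A sketch of those steps might look like the following; the file name, the column layout (an id column followed by a diagnosis column) and the choice of k are assumptions made for illustration, not details taken from the article.

library(class)
set.seed(123)

# Hypothetical file name; the data is assumed to have an id column and a diagnosis column
wbcd <- read.csv("wisc_bc_data.csv", stringsAsFactors = FALSE)
str(wbcd)           # inspect the basic structure of the dataset
wbcd <- wbcd[, -1]  # drop id, which is only a unique identifier for each observation

# Rescale the numeric predictors, then split into training and test sets
normalize <- function(x) (x - min(x)) / (max(x) - min(x))
wbcd_n    <- as.data.frame(lapply(wbcd[, -1], normalize))
train_idx <- sample(nrow(wbcd_n), floor(0.7 * nrow(wbcd_n)))

# Run k-nearest neighbors (k = 21 is an illustrative choice) and inspect the results
pred <- knn(train = wbcd_n[train_idx, ], test = wbcd_n[-train_idx, ],
            cl = factor(wbcd$diagnosis[train_idx]), k = 21)
table(pred, wbcd$diagnosis[-train_idx])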


Fast k-Nearest Neighbour Search via Dynamic Continuous Indexing

arXiv.org Artificial Intelligence

Existing methods for retrieving k-nearest neighbours suffer from the curse of dimensionality. We argue this is caused in part by inherent deficiencies of space partitioning, which is the underlying strategy used by most existing methods. We devise a new strategy that avoids partitioning the vector space and present a novel randomized algorithm that runs in time linear in the dimensionality of the space and sub-linear in the intrinsic dimensionality and the size of the dataset, and takes space constant in the dimensionality of the space and linear in the size of the dataset. The proposed algorithm allows fine-grained control over accuracy and speed on a per-query basis, automatically adapts to variations in data density, supports dynamic updates to the dataset and is easy to implement. We show appealing theoretical properties and demonstrate empirically that the proposed algorithm outperforms locality-sensitive hashing (LSH) in terms of approximation quality, speed and space efficiency.
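
The abstract does not spell out the data structures involved, but the flavor of avoiding space partitioning can be conveyed with a much-simplified sketch: rank the stored points along a handful of random one-dimensional projections and verify only the highest-ranked candidates with true distances. This is an illustrative simplification, not the authors' algorithm.

set.seed(42)
n <- 1000; d <- 50; m <- 10          # dataset size, dimensionality, number of projections
X <- matrix(rnorm(n * d), n, d)
U <- matrix(rnorm(d * m), d, m)      # random projection directions
P <- X %*% U                         # n x m projected coordinates

knn_by_projection_rank <- function(q, k = 5, candidates_per_proj = 50) {
  qp <- as.vector(q %*% U)
  # For each projection, keep the points whose projections lie closest to the query's
  cand <- unique(unlist(lapply(seq_len(m), function(j)
    order(abs(P[, j] - qp[j]))[1:candidates_per_proj])))
  # Verify the candidates with true Euclidean distances
  dists <- sqrt(rowSums((X[cand, , drop = FALSE] -
                         matrix(q, length(cand), d, byrow = TRUE))^2))
  cand[order(dists)[1:k]]            # indices of the k approximate nearest neighbours
}

knn_by_projection_rank(rnorm(d))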


A Bayes consistent 1-NN classifier

arXiv.org Machine Learning

We show that a simple modification of the 1-nearest neighbor classifier yields a strongly Bayes consistent learner. Prior to this work, the only strongly Bayes consistent proximity-based method was the k-nearest neighbor classifier, for k growing appropriately with sample size. We will argue that a margin-regularized 1-NN enjoys considerable statistical and algorithmic advantages over the k-NN classifier. These include user-friendly finite-sample error bounds, as well as time- and memory-efficient learning and test-point evaluation algorithms with a principled speed-accuracy tradeoff. Encouraging empirical results are reported.


Doing Machine Learning (K Nearest Neighbors) in a Messenger Chatbot

#artificialintelligence

Chatbots are taking over the world. Companies want them, developers are creating them, and, more importantly, people are eager to try them. The use cases and experiences a chatbot can offer seem to be unlimited -- there are bots for managing group purchases, bots for helping people obtain their immigration visas, and even bots for tracking flights. So, eventually the hype got me and I decided to build my first chatbot. My idea for this chatbot was to see how I could manually train and use a machine learning model through a chat-based interface, while trying out several ideas and concepts that could enhance the experience of using a machine learning model in this way.


Minimax-optimal semi-supervised regression on unknown manifolds

arXiv.org Machine Learning

We consider semi-supervised regression when the predictor variables are drawn from an unknown manifold. A simple two step approach to this problem is to: (i) estimate the manifold geodesic distance between any pair of points using both the labeled and unlabeled instances; and (ii) apply a k nearest neighbor regressor based on these distance estimates. We prove that given sufficiently many unlabeled points, this simple method of geodesic kNN regression achieves the optimal finite-sample minimax bound on the mean squared error, as if the manifold were known. Furthermore, we show how this approach can be efficiently implemented, requiring only O(k N log N) operations to estimate the regression function at all N labeled and unlabeled points. We illustrate this approach on two datasets with a manifold structure: indoor localization using WiFi fingerprints and facial pose estimation. In both cases, geodesic kNN is more accurate and much faster than the popular Laplacian eigenvector regressor.
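
Using synthetic data, the two-step recipe can be sketched as follows; the graph construction details (a symmetric weighted kNN graph and shortest-path distances via igraph) are reasonable but assumed choices, not necessarily those of the paper.

library(igraph)
set.seed(1)

n <- 300; k_graph <- 8; k_reg <- 5
X <- matrix(runif(n * 2), n, 2)                 # stand-in for labeled + unlabeled points
y <- sin(4 * X[, 1]) + rnorm(n, sd = 0.1)       # responses; only some are observed
labeled <- sample(n, 50)

# Step (i): estimate geodesic distances from a kNN graph built on all points
D <- as.matrix(dist(X))
A <- matrix(0, n, n)
for (i in 1:n) {
  nb <- order(D[i, ])[2:(k_graph + 1)]
  A[i, nb] <- D[i, nb]; A[nb, i] <- D[i, nb]    # symmetric weighted kNN graph
}
g <- graph_from_adjacency_matrix(A, mode = "undirected", weighted = TRUE)
G <- distances(g)                               # shortest-path (geodesic) distance estimates

# Step (ii): kNN regression at the unlabeled points using the geodesic distances
predict_geodesic_knn <- function(i) {
  nb <- labeled[order(G[i, labeled])[1:k_reg]]
  mean(y[nb])
}
preds <- sapply(setdiff(1:n, labeled), predict_geodesic_knn)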


Optimization of distributions differences for classification

arXiv.org Machine Learning

In this paper we introduce a new classification algorithm called Optimization of Distributions Differences (ODD). The algorithm aims to find a transformation from the feature space to a new space in which instances of the same class are as close as possible to one another while the gravity centers of the classes are as far as possible from one another. This aim is formulated as a multiobjective optimization problem that is solved by a hybrid of an evolutionary strategy and the quasi-Newton method. The choice of the transformation function is flexible and can be any continuous space function; we experiment with a linear and a non-linear transformation in this paper. We show that the algorithm can outperform six other state-of-the-art classification methods, namely naive Bayes, support vector machines, linear discriminant analysis, multi-layer perceptrons, decision trees, and k-nearest neighbors, on 12 standard classification datasets. Our results show that the method is less sensitive to an imbalanced number of instances than these methods. We also show that ODD maintains its performance across these datasets better than the other classification methods, and hence offers better generalization ability.
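
The core objective can be illustrated with a rough sketch that keeps only the quasi-Newton part (R's optim with BFGS), a linear transformation, and a single scalarized ratio objective; the actual ODD algorithm additionally uses an evolutionary strategy and a multiobjective formulation, so this is an assumption-laden simplification.

set.seed(7)
X <- scale(as.matrix(iris[, 1:4])); y <- iris$Species
d <- ncol(X); p <- 2                 # target dimensionality, chosen for illustration

# Compact same-class instances relative to how far apart the class gravity centers are
odd_objective <- function(w) {
  W <- matrix(w, d, p)
  Z <- X %*% W
  centers <- sapply(levels(y), function(cl) colMeans(Z[y == cl, , drop = FALSE]))
  within  <- mean(sapply(levels(y), function(cl)
    rowSums((Z[y == cl, , drop = FALSE] -
             matrix(centers[, cl], sum(y == cl), p, byrow = TRUE))^2)))
  between <- mean(dist(t(centers))^2)
  within / (between + 1e-8)          # small within-class spread, large center separation
}

fit <- optim(rnorm(d * p), odd_objective, method = "BFGS")
W   <- matrix(fit$par, d, p)         # transformation to the new space; classify (e.g. kNN) there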


RIPML: A Restricted Isometry Property based Approach to Multilabel Learning

arXiv.org Machine Learning

The multilabel learning problem with large numbers of labels, features, and data points has generated tremendous interest recently. A recurring theme in these problems is that only a few labels are active in any given data point compared to the total number of labels. However, only a small amount of existing work takes direct advantage of this inherent extreme sparsity in the label space. By virtue of the Restricted Isometry Property (RIP), satisfied by many random ensembles, we propose a novel procedure for multilabel learning known as RIPML. During the training phase of RIPML, labels are projected onto a random low-dimensional subspace, followed by solving a least-squares problem in this subspace. Inference is done by a k-nearest neighbor (kNN) based approach. We demonstrate the effectiveness of RIPML by conducting extensive simulations and comparing results with state-of-the-art linear dimensionality reduction based approaches.
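
On synthetic data, the training and inference pipeline just described might look like the sketch below; the particular random ensemble, the ridge regularization, and the neighbor-aggregation step are assumptions made for illustration rather than the exact choices in RIPML.

set.seed(11)
n <- 500; d <- 30; L <- 200; m <- 20              # samples, features, labels, projected dim
X   <- matrix(rnorm(n * d), n, d)
Y   <- matrix(rbinom(n * L, 1, 0.02), n, L)       # sparse binary label matrix
Phi <- matrix(rnorm(m * L) / sqrt(m), m, L)       # random projection satisfying RIP w.h.p.

# Training: project labels onto the low-dimensional subspace, solve a least-squares problem
Z <- Y %*% t(Phi)                                           # n x m projected labels
W <- solve(crossprod(X) + diag(1e-3, d), crossprod(X, Z))   # ridge-regularized solution

# Inference: kNN among training points in the projected space, aggregate original labels
predict_multilabel_sketch <- function(x_new, k = 5) {
  z_hat <- as.vector(x_new %*% W)
  nb <- order(colSums((t(X %*% W) - z_hat)^2))[1:k]
  colMeans(Y[nb, , drop = FALSE])                 # label scores from the k neighbors
}
predict_multilabel_sketch(rnorm(d))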


Parameter Free Large Margin Nearest Neighbor for Distance Metric Learning

AAAI Conferences

We introduce a novel supervised metric learning algorithm named parameter-free large margin nearest neighbor (PFLMNN), which can be seen as an improvement of the classical large margin nearest neighbor (LMNN) algorithm. The contributions of our work are twofold. First, our method discards the cost term in LMNN that shrinks the distances between an inquiry input and its k target neighbors (the k nearest neighbors with the same label as the inquiry input), and focuses only on pushing the imposters (the samples with labels different from the inquiry input) out of the neighborhood of the inquiry. As a result, our method has no parameter that needs to be tuned on a validation set, which makes it more convenient to use. Second, by leveraging the geometric information of the imposters, we construct a novel cost function that penalizes small distances between each inquiry and its imposters. Unlike LMNN, which considers every imposter located in the neighborhood of each inquiry, our method only takes care of the nearest imposter, because once the nearest imposter is pushed out of the neighborhood of its inquiry, all other imposters are out as well. In this way, our model has far fewer constraints than LMNN, which makes it much easier to find the optimal distance metric. Consequently, our method not only learns a better distance metric than LMNN, but also runs faster. Extensive experiments on data sets of various sizes and difficulties show that, compared with LMNN, PFLMNN achieves better classification results.


Dynamic time warping distance for message propagation classification in Twitter

arXiv.org Machine Learning

Social message classification is a research domain that has attracted the attention of many researchers in recent years. Indeed, a social message differs from ordinary text because of special characteristics such as its shortness, so the development of new approaches for processing social messages is now essential to make their classification more efficient. In this paper, we are mainly interested in classifying social messages based on their spreading on online social networks (OSN). We propose a new distance metric based on the Dynamic Time Warping distance and use it with probabilistic and evidential k-nearest neighbors (k-NN) classifiers to classify the propagation networks (PrNets) of messages. A propagation network is a directed acyclic graph (DAG) used to record the propagation traces of a message, the traversed links and their types. We tested the proposed metric with the chosen k-NN classifiers on real-world propagation traces collected from the Twitter social network and obtained good classification accuracies.
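
As a self-contained illustration of the general idea (assuming simple numeric propagation profiles, e.g. cumulative retweet counts over time, rather than the paper's full PrNet structure and evidential classifier), one can compute pairwise DTW distances and run a basic majority-vote k-NN over them:

# Dynamic Time Warping distance between two numeric series (classic dynamic program)
dtw_dist <- function(a, b) {
  n <- length(a); m <- length(b)
  D <- matrix(Inf, n + 1, m + 1); D[1, 1] <- 0
  for (i in 1:n) for (j in 1:m) {
    cost <- abs(a[i] - b[j])
    D[i + 1, j + 1] <- cost + min(D[i, j + 1], D[i + 1, j], D[i, j])
  }
  D[n + 1, m + 1]
}

# k-NN classification of a test series using DTW distances and a majority vote
knn_dtw <- function(train_series, train_labels, test_series, k = 3) {
  d <- sapply(train_series, dtw_dist, b = test_series)
  names(which.max(table(train_labels[order(d)[1:k]])))
}

# Toy usage: DTW handles series of different lengths
train_series <- list(c(1, 2, 4, 8, 16), c(1, 1, 2, 2, 3, 3), c(2, 4, 8, 16, 32))
labels <- c("viral", "slow", "viral")
knn_dtw(train_series, labels, test_series = c(1, 3, 7, 15))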