Goto

Collaborating Authors

 Nearest Neighbor Methods


K-Nearest Neighbors explained Codementor

#artificialintelligence

Here on Codementor I usually see lots of students and developers trying to get into Machine Learning confused with complicated topics they are facing at the very beginning of their journey. I want to make a deep yet understandable introduction to the algorithm which is so simple and elegant that you would like it. If you are a Machine Learning engineer but have a limited understanding of this one, it could be also useful to read it. I was working as a software developer for years and everyone around me was talking about this brand new data science and machine learning thing (I understood that there is nothing new on this planet later), so I've decided to take masters studies in the University to get known to it. Our first module was a general introductory course to Data Science and I remember myself sitting and trying to understand what's going on.


Estimation of Information Theoretic Measures for Continuous Random Variables

Neural Information Processing Systems

We analyze the estimation of information theoretic measures of continuous random variables such as: differential entropy, mutual information or Kullback-Leibler divergence. The objective of this paper is two-fold. First, we prove that the information theoretic measure estimates using the k-nearest-neighbor density estimation with fixed k converge almost surely, even though the k-nearest-neighbor density estimation with fixed k does not converge to its true measure. Second, we show that the information theoretic measure estimates do not converge for k growing linearly with the number of samples. Nevertheless, these nonconvergent estimates can be used for solving the two-sample problem and assessing if two random variables are independent.


Learning a Distance Metric from a Network

Neural Information Processing Systems

Many real-world networks are described by both connectivity information and features for every node. To better model and understand these networks, we present structure preserving metric learning (SPML), an algorithm for learning a Mahalanobis distance metric from a network such that the learned distances are tied to the inherent connectivity structure of the network. Like the graph embedding algorithm structure preserving embedding, SPML learns a metric which is structure preserving, meaning a connectivity algorithm such as k-nearest neighbors will yield the correct connectivity when applied using the distances from the learned metric. We show a variety of synthetic and real-world experiments where SPML predicts link patterns from node features more accurately than standard techniques. We further demonstrate a method for optimizing SPML based on stochastic gradient descent which removes the running-time dependency on the size of the network and allows the method to easily scale to networks of thousands of nodes and millions of edges.


Diffusion Decision Making for Adaptive k-Nearest Neighbor Classification

Neural Information Processing Systems

We show that conventional k-nearest neighbor classification can be viewed as a special problem of the diffusion decision model in the asymptotic situation. Applying the optimal strategy associated with the diffusion decision model, an adaptive rule is developed for determining appropriate values of k in k-nearest neighbor classification. Making use of the sequential probability ratio test (SPRT) and Bayesian analysis, we propose five different criteria for adaptively acquiring nearest neighbors. Experiments with both synthetic and real datasets demonstrate the effectivness of our classification criteria. Papers published at the Neural Information Processing Systems Conference.


Efficient anomaly detection using bipartite k-NN graphs

Neural Information Processing Systems

Learning minimum volume sets of an underlying nominal distribution is a very effective approach to anomaly detection. Several approaches to learning minimum volume sets have been proposed in the literature, including the K-point nearest neighbor graph (K-kNNG) algorithm based on the geometric entropy minimization (GEM) principle [4]. The K-kNNG detector, while possessing several desirable characteristics, suffers from high computation complexity, and in [4] a simpler heuristic approximation, the leave-one-out kNNG (L1O-kNNG) was proposed. In this paper, we propose a novel bipartite k-nearest neighbor graph (BP-kNNG) anomaly detection scheme for estimating minimum volume sets. Our bipartite estimator retains all the desirable theoretical properties of the K-kNNG, while being computationally simpler than the K-kNNG and the surrogate L1O-kNNG detectors.


Phase transition in the family of p-resistances

Neural Information Processing Systems

We study the family of p-resistances on graphs for p 1. We prove that for any fixed graph, for p 1, the p-resistance coincides with the shortest path distance, for p 2 it coincides with the standard resistance distance, and for p it converges to the inverse of the minimal s-t-cut in the graph. Secondly, we consider the special case of random geometric graphs (such as k-nearest neighbor graphs) when the number n of vertices in the graph tends to infinity. We prove that an interesting phase-transition takes place. There exist two critical thresholds p * and p ** such that if p p *, then the p-resistance depends on meaningful global properties of the graph, whereas if p p **, it only depends on trivial local quantities and does not convey any useful information. We can explicitly compute the critical values: p * 1 1/(d-1) and p ** 1 1/(d-2) where d is the dimension of the underlying space (we believe that the fact that there is a small gap between p * and p ** is an artifact of our proofs.


Heterogeneous-Neighborhood-based Multi-Task Local Learning Algorithms

Neural Information Processing Systems

All the existing multi-task local learning methods are defined on homogeneous neighborhood which consists of all data points from only one task. In this paper, different from existing methods, we propose local learning methods for multi-task classification and regression problems based on heterogeneous neighborhood which is defined on data points from all tasks. Specifically, we extend the k-nearest-neighbor classifier by formulating the decision function for each data point as a weighted voting among the neighbors from all tasks where the weights are task-specific. By defining a regularizer to enforce the task-specific weight matrix to approach a symmetric one, a regularized objective function is proposed and an efficient coordinate descent method is developed to solve it. For regression problems, we extend the kernel regression to multi-task setting in a similar way to the classification case. Experiments on some toy data and real-world datasets demonstrate the effectiveness of our proposed methods.


k*-Nearest Neighbors: From Global to Local

Neural Information Processing Systems

The weighted k-nearest neighbors algorithm is one of the most fundamental non-parametric methods in pattern recognition and machine learning. The question of setting the optimal number of neighbors as well as the optimal weights has received much attention throughout the years, nevertheless this problem seems to have remained unsettled. In this paper we offer a simple approach to locally weighted regression/classification, where we make the bias-variance tradeoff explicit. Our formulation enables us to phrase a notion of optimal weights, and to efficiently find these weights as well as the optimal number of neighbors efficiently and adaptively, for each data point whose value we wish to estimate. The applicability of our approach is demonstrated on several datasets, showing superior performance over standard locally weighted methods.


Rapid Distance-Based Outlier Detection via Sampling

Neural Information Processing Systems

Distance-based approaches to outlier detection are popular in data mining, as they do not require to model the underlying probability distribution, which is particularly challenging for high-dimensional data. We present an empirical comparison of various approaches to distance-based outlier detection across a large number of datasets. We report the surprising observation that a simple, sampling-based scheme outperforms state-of-the-art techniques in terms of both efficiency and effectiveness. To better understand this phenomenon, we provide a theoretical analysis why the sampling-based approach outperforms alternative methods based on k-nearest neighbor search. Papers published at the Neural Information Processing Systems Conference.


Density estimation from unweighted k-nearest neighbor graphs: a roadmap

Neural Information Processing Systems

Consider an unweighted k-nearest neighbor graph on n points that have been sampled i.i.d. We prove how one can estimate the density p just from the unweighted adjacency matrix of the graph, without knowing the points themselves or their distance or similarity scores. The key insights are that local differences in link numbers can be used to estimate some local function of p, and that integrating this function along shortest paths leads to an estimate of the underlying density. Papers published at the Neural Information Processing Systems Conference.