Nearest Neighbor Methods


Facebook V: Predicting Check Ins, Winner's Interview: 1st Place, Tom Van de Wiele

#artificialintelligence

From May to July 2016, over one thousand Kagglers competed in Facebook's fifth recruitment competition: Predicting Check-Ins. In this challenge, Kagglers were required to predict the most probable check-in locations occurring in artificial time and space. As the first place winner, Tom Van de Wiele, notes in this winner's interview, the uniquely designed test dataset contained about one trillion place-observation combinations, posing a significant challenge to competitors. Tom describes how he quickly rocketed from his first Getting Started competition on Kaggle to first place in Facebook V, using k-nearest neighbors and XGBoost to extract remarkable insight from data consisting only of x, y coordinates, time, and accuracy. I have completed two Master's programs at two different Belgian universities (Leuven and Ghent), one in Computer Science (2010) and one in Statistics (2016).
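
The winning pipeline itself is not reproduced here, but a minimal sketch of the kind of nearest-neighbor candidate generation the interview describes might look as follows (the synthetic data, grid size, and two-stage split are illustrative assumptions, not the actual solution):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical check-in data: (x, y) positions plus a place_id label per row.
# The scaling and column layout are assumptions, not the winner's features.
rng = np.random.default_rng(0)
train_xy = rng.uniform(0, 10, size=(10_000, 2))   # x, y coordinates (km)
train_place = rng.integers(0, 500, size=10_000)   # place_id labels
test_xy = rng.uniform(0, 10, size=(5, 2))

# Stage 1: k-nearest neighbors in (x, y) space generates a short list of
# candidate places for each test check-in.
nn = NearestNeighbors(n_neighbors=50).fit(train_xy)
_, idx = nn.kneighbors(test_xy)
candidates = [np.unique(train_place[row]) for row in idx]

# Stage 2 (not shown): rank the candidates with a learned model such as
# XGBoost using richer features (time of day, accuracy, popularity, ...).
print(candidates[0][:10])
```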


1.6. Nearest Neighbors -- scikit-learn 0.17.1 documentation

#artificialintelligence

Unsupervised nearest neighbors is the foundation of many other learning methods, notably manifold learning and spectral clustering. Supervised neighbors-based learning comes in two flavors: classification for data with discrete labels, and regression for data with continuous labels. The principle behind nearest neighbor methods is to find a predefined number of training samples closest in distance to the new point, and predict the label from these. The number of samples can be a user-defined constant (k-nearest neighbor learning), or vary based on the local density of points (radius-based neighbor learning). The distance can, in general, be any metric measure: standard Euclidean distance is the most common choice. Neighbors-based methods are known as non-generalizing machine learning methods, since they simply "remember" all of their training data (possibly transformed into a fast indexing structure such as a Ball Tree or KD Tree).
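
A minimal sketch of both flavors with scikit-learn's estimators (the toy data is made up; only the calls to KNeighborsClassifier and NearestNeighbors come from the documented API):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier, NearestNeighbors

# Toy data: 300 samples, 2 informative features, discrete labels.
X, y = make_classification(n_samples=300, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Supervised flavor: predict a label from the k closest training samples.
clf = KNeighborsClassifier(n_neighbors=5, algorithm="kd_tree")
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))

# Unsupervised flavor: query the neighbors themselves (the building block
# used by manifold learning and spectral clustering).
nn = NearestNeighbors(n_neighbors=3, algorithm="ball_tree").fit(X_train)
distances, indices = nn.kneighbors(X_test[:2])
print(distances, indices)
```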


An approach to dealing with missing values in heterogeneous data using k-nearest neighbors

arXiv.org Machine Learning

Techniques such as clustering, neural networks and decision making usually rely on algorithms that are not well suited to dealing with missing values. However, real-world data frequently contains such cases. The simplest solution is either to substitute them with a best-guess value or to disregard the missing values entirely. Unfortunately, both approaches can lead to biased results. In this paper, we propose a technique for dealing with missing values in heterogeneous data using imputation based on the k-nearest neighbors algorithm. It can handle real (henceforth referred to as crisp), interval and fuzzy data. The effectiveness of the algorithm is tested on several datasets and the numerical results are promising.
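
The paper's handling of interval and fuzzy data is not reproduced here, but for ordinary crisp (real-valued) features the same basic idea is available off the shelf as sklearn.impute.KNNImputer; a minimal sketch:

```python
import numpy as np
from sklearn.impute import KNNImputer

# Small crisp (real-valued) dataset with missing entries marked as NaN.
X = np.array([[1.0, 2.0, np.nan],
              [3.0, 4.0, 3.0],
              [np.nan, 6.0, 5.0],
              [8.0, 8.0, 7.0]])

# Each missing value is replaced by the (distance-weighted) mean of that
# feature over the k nearest neighbors, measured on the observed features.
imputer = KNNImputer(n_neighbors=2, weights="distance")
print(imputer.fit_transform(X))
```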


Content-based image retrieval tutorial

arXiv.org Machine Learning

This paper functions as a tutorial for individuals interested in entering the field of information retrieval but unsure where to begin. It describes two fundamental yet efficient image retrieval techniques: k-nearest neighbors (kNN) and support vector machines (SVM). The goal is to provide the reader with both the theoretical and practical aspects in order to acquire a better understanding. Along with this tutorial we have also developed the equivalent software using the MATLAB environment in order to illustrate the techniques, so that the reader can have a hands-on experience.
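
The accompanying software is in MATLAB; as a rough Python analogue, the kNN half of a content-based retrieval system boils down to indexing feature vectors and returning the closest ones to a query (feature extraction is faked here with random vectors, which stand in for the descriptors the tutorial computes):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Stand-in for real image descriptors (e.g. color histograms): 1,000 database
# images and one query image, each represented as a 128-dimensional vector.
rng = np.random.default_rng(42)
database_features = rng.normal(size=(1000, 128))
query_feature = rng.normal(size=(1, 128))

# Index the database and retrieve the 5 most similar images by Euclidean
# distance; the returned indices identify which stored images to display.
index = NearestNeighbors(n_neighbors=5, metric="euclidean").fit(database_features)
distances, image_ids = index.kneighbors(query_feature)
print(image_ids[0], distances[0])
```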


Demystifying Fixed k-Nearest Neighbor Information Estimators

arXiv.org Machine Learning

Estimating mutual information from i.i.d. samples drawn from an unknown joint density function is a basic statistical problem of broad interest with multitudinous applications. The most popular estimator is one proposed by Kraskov, Stögbauer and Grassberger (KSG) in 2004; it is nonparametric and based on the distances of each sample to its $k^{\rm th}$ nearest neighboring sample, where $k$ is a fixed small integer. Despite its widespread use (it is part of several scientific software packages), the theoretical properties of this estimator have been largely unexplored. In this paper we demonstrate that the estimator is consistent and also identify an upper bound on the rate of convergence of the bias as a function of the number of samples. We argue that the superior performance of the KSG estimator stems from a curious "correlation boosting" effect, and we build on this intuition to modify the KSG estimator in novel ways to construct a superior estimator. As a byproduct of our investigations, we obtain nearly tight rates of convergence of the $\ell_2$ error of the well-known fixed $k$-nearest neighbor estimator of differential entropy by Kozachenko and Leonenko.
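
For reference, a rough NumPy/SciPy sketch of the original fixed-k KSG estimator (algorithm 1 of Kraskov et al., not the modified estimator this paper proposes):

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma

def ksg_mutual_information(x, y, k=3):
    """KSG (algorithm 1) mutual information estimate in nats.

    x: (n, d_x) samples, y: (n, d_y) samples, jointly drawn."""
    n = x.shape[0]
    xy = np.hstack([x, y])

    # Chebyshev (max-norm) distance to the k-th nearest neighbour in joint space.
    eps = cKDTree(xy).query(xy, k=k + 1, p=np.inf)[0][:, -1]

    # n_x(i), n_y(i): marginal neighbours strictly closer than eps_i (the point
    # itself is subtracted; a tiny shrink of the radius mimics the strict
    # inequality in the original algorithm).
    x_tree, y_tree = cKDTree(x), cKDTree(y)
    nx = np.array([len(x_tree.query_ball_point(x[i], eps[i] * (1 - 1e-10), p=np.inf)) - 1
                   for i in range(n)])
    ny = np.array([len(y_tree.query_ball_point(y[i], eps[i] * (1 - 1e-10), p=np.inf)) - 1
                   for i in range(n)])

    return digamma(k) + digamma(n) - np.mean(digamma(nx + 1) + digamma(ny + 1))

# Sanity check: y = x + noise with noise variance 0.25 has MI = 0.5*log(5) ~ 0.80.
rng = np.random.default_rng(0)
x = rng.normal(size=(2000, 1))
y = x + 0.5 * rng.normal(size=(2000, 1))
print(ksg_mutual_information(x, y))
```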


How To Use Classification Machine Learning Algorithms in Weka - Machine Learning Mastery

#artificialintelligence

Weka makes a large number of classification algorithms available; this breadth is one of the benefits of using the Weka platform to work through your machine learning problems. In this post you will discover how to use 5 top machine learning algorithms in Weka, taking a tour of each classification algorithm in turn.


Analysis of k-Nearest Neighbor Distances with Application to Entropy Estimation

arXiv.org Machine Learning

Estimating entropy and mutual information consistently is important for many machine learning applications. The Kozachenko-Leonenko (KL) estimator (Kozachenko & Leonenko, 1987) is a widely used nonparametric estimator for the entropy of multivariate continuous random variables, as well as the basis of the mutual information estimator of Kraskov et al. (2004), perhaps the most widely used estimator of mutual information in this setting. Despite the practical importance of these estimators, major theoretical questions regarding their finite-sample behavior remain open. This paper proves finite-sample bounds on the bias and variance of the KL estimator, showing that it achieves the minimax convergence rate for certain classes of smooth functions. In proving these bounds, we analyze the finite-sample behavior of k-nearest neighbor (k-NN) distance statistics (on which the KL estimator is based). We derive concentration inequalities for k-NN distances and a general expectation bound for statistics of k-NN distances, which may be useful for other analyses of k-NN methods.
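
A compact Python sketch of the KL estimator the paper analyzes (generalized to the k-th Euclidean nearest-neighbor distance; this is an illustration, not code from the paper):

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma, gammaln

def kl_entropy(x, k=1):
    """Kozachenko-Leonenko k-NN estimate of differential entropy in nats.

    x: (n, d) samples; assumes no duplicate points (log of a zero distance)."""
    n, d = x.shape
    # Euclidean distance from each sample to its k-th nearest neighbour.
    r = cKDTree(x).query(x, k=k + 1)[0][:, -1]
    # Log-volume of the d-dimensional unit Euclidean ball.
    log_vd = (d / 2) * np.log(np.pi) - gammaln(d / 2 + 1)
    return digamma(n) - digamma(k) + log_vd + d * np.mean(np.log(r))

# Quick check on a 2D standard normal: true entropy is log(2*pi*e) ~ 2.84 nats.
rng = np.random.default_rng(0)
print(kl_entropy(rng.normal(size=(5000, 2)), k=3))
```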


K-Nearest Neighbor Machine Learning algorithm

#artificialintelligence

The German credit dataset can be downloaded from the UC Irvine Machine Learning Repository; the task is to predict whether or not a loan applicant defaulted.
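
A minimal sketch of fitting kNN to this task with scikit-learn; the OpenML dataset name "credit-g" (a mirror of the UCI German credit data), the choice of k, and the preprocessing are assumptions, not taken from the original post:

```python
from sklearn.compose import make_column_transformer
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Fetch the German credit data from OpenML (requires network access).
data = fetch_openml("credit-g", version=1, as_frame=True)
X, y = data.data, data.target  # target is "good"/"bad" credit risk

# One-hot encode categorical columns and scale numeric ones so that the
# Euclidean distances used by kNN are meaningful.
categorical = X.select_dtypes(include=["category", "object"]).columns
numeric = X.select_dtypes(exclude=["category", "object"]).columns
pre = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore"), categorical),
    (StandardScaler(), numeric),
)

model = make_pipeline(pre, KNeighborsClassifier(n_neighbors=15))
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model.fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))
```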


I am looking for Supervised Learning project to work upon? • /r/MachineLearning

@machinelearnbot

Hey, thank you for the input, but I am new to the field. I started this summer. Until now I have only worked on linear/logistic regression and k-nearest neighbors. I'm looking forward to working on a project with these algorithms before I go deeper into the field.


Bayesian Optimization of Machine Learning Models

#artificialintelligence

Many predictive and machine learning models have structural or tuning parameters that cannot be directly estimated from the data. For example, when using a K-nearest neighbors model, there is no analytical estimator for K (the number of neighbors). Typically, resampling is used to get good performance estimates of the model for a given set of candidate values of K, and the value associated with the best results is used. This is basically a grid search procedure. However, there are other approaches that can be used.
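
The grid-search-with-resampling baseline described above can be written in a few lines with scikit-learn (a sketch on a stock dataset; Bayesian optimization itself would replace the fixed grid with a sequential model-based search, e.g. via a package such as scikit-optimize):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Resampling (5-fold cross-validation) scores each candidate K; the value of K
# with the best cross-validated accuracy is kept.
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
grid = {"kneighborsclassifier__n_neighbors": [1, 3, 5, 7, 11, 15, 21, 31]}
search = GridSearchCV(pipe, grid, cv=5, scoring="accuracy")
search.fit(X, y)

print("best K:", search.best_params_)
print("cross-validated accuracy:", round(search.best_score_, 3))
```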