Nearest Neighbor Methods


Benefit of Interpolation in Nearest Neighbor Algorithms

arXiv.org Machine Learning

Over-parameterized models attract much attention in the era of data science and deep learning. It has been empirically observed that although these models, e.g. deep neural networks, over-fit the training data, they can still achieve small testing error, and sometimes even {\em outperform} traditional algorithms that are designed to avoid over-fitting. The major goal of this work is to sharply quantify the benefit of data interpolation in the context of the nearest neighbors (NN) algorithm. Specifically, we consider a class of interpolated weighting schemes and carefully characterize their asymptotic performance. Our analysis reveals a U-shaped performance curve with respect to the level of data interpolation, and proves that a mild degree of data interpolation {\em strictly} improves both the prediction accuracy and the statistical stability over those of the (un-interpolated) optimal $k$NN algorithm. This theoretically justifies (predicts) the existence of the second U-shaped curve in the recently discovered double descent phenomenon. Note that our goal in this study is not to promote the use of the interpolated-NN method, but to obtain theoretical insight into data interpolation, inspired by the aforementioned phenomenon.
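
The abstract does not spell out the weighting schemes analyzed in the paper, so the following is only an illustrative sketch: a distance-power-weighted kNN classifier in the spirit of interpolated-NN, where an exponent gamma controls the level of interpolation (gamma = 0 recovers the plain kNN vote, and larger gamma forces the fit to reproduce training labels at zero distance). The function name and the specific weight formula are assumptions for illustration, not the paper's exact scheme.

```python
# Illustrative sketch only: distance-power interpolating weights for kNN.
import numpy as np

def interpolated_knn_predict(X_train, y_train, x_query, k=10, gamma=1.0, eps=1e-12):
    """Predict a binary label in {0, 1} for x_query."""
    dists = np.linalg.norm(X_train - x_query, axis=1)
    nn = np.argsort(dists)[:k]                      # indices of the k nearest neighbors
    w = 1.0 / np.maximum(dists[nn], eps) ** gamma   # weights blow up at distance 0 when gamma > 0
    w /= w.sum()
    return int(w @ y_train[nn] >= 0.5)              # weighted vote; gamma = 0 gives the usual kNN vote

# Toy usage
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 0.3 * rng.normal(size=200) > 0).astype(int)
print(interpolated_knn_predict(X, y, np.array([0.5, -0.2]), k=15, gamma=2.0))
```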


Exercise Classification with Machine Learning (Part I)

#artificialintelligence

The first post will focus on a more algorithmic approach using k-Nearest Neighbors to classify an unknown video, and in the second post, we'll look at an exclusively machine learning (ML) approach. Code for everything we're going to cover can be found on this GitHub repository. The algorithmic approach (Part I) is written in Swift and is available as a CocoaPod. The ML approach (Part II) is written in Python/TensorFlow and can be found as part of the GitHub repository. We want to build a system which takes as input a video of a person performing an exercise and outputs a class label which describes the video.


Rates of Convergence for Large-scale Nearest Neighbor Classification

arXiv.org Machine Learning

Nearest neighbor is a popular class of classification methods with many desirable properties. For a large data set that cannot be loaded into the memory of a single machine due to computation, communication, privacy, or ownership limitations, we consider the divide-and-conquer scheme: the entire data set is divided into small subsamples, on which nearest neighbor predictions are made, and a final decision is then reached by aggregating the predictions on subsamples via majority voting. We name this method the big Nearest Neighbor (bigNN) classifier and provide its rates of convergence under minimal assumptions, in terms of both the excess risk and the classification instability; these are proven to match the rates of the oracle nearest neighbor classifier and cannot be improved. To significantly reduce the prediction time required to achieve the optimal rate, we also consider a pre-training acceleration technique applied to the bigNN method, with a proven convergence rate. We find that in the distributed setting, the optimal choice of the number of neighbors k should scale with both the total sample size and the number of partitions, and that there is a theoretical upper limit for the latter. Numerical studies have verified the theoretical findings.
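
A minimal sketch of the divide-and-conquer recipe described above, assuming scikit-learn's KNeighborsClassifier for the local predictions: the data are split into disjoint subsamples, a kNN prediction is made on each, and the final label is the majority vote over subsamples. The helper name and the binary-label simplification are illustrative choices, not the paper's implementation.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def bignn_predict(X, y, X_test, n_partitions=8, k=5, seed=0):
    """Divide-and-conquer kNN: local kNN on each partition, then majority vote."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))                 # random split into disjoint subsamples
    votes = np.zeros((len(X_test), 2))            # binary labels {0, 1} assumed for simplicity
    for part in np.array_split(idx, n_partitions):
        clf = KNeighborsClassifier(n_neighbors=k).fit(X[part], y[part])
        pred = clf.predict(X_test)
        votes[np.arange(len(X_test)), pred] += 1  # each partition casts one vote per test point
    # Per the abstract, the optimal k should scale with both the total sample
    # size and the number of partitions; k is left as a plain parameter here.
    return votes.argmax(axis=1)
```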


Understanding Supervised Learning In One Article

#artificialintelligence

As you might know, supervised machine learning is one of the most commonly used and successful types of machine learning. In this article, we will describe supervised learning in more detail and explain several popular supervised learning algorithms. Remember that supervised learning is used whenever we want to predict a certain outcome from a given input, and we have examples of input/output pairs. We build a machine learning model from these input/output pairs, which comprise our training set. Our goal is to make accurate predictions for new, never-before-seen data. Supervised learning often requires human effort to build the training set, but afterwards automates and often speeds up an otherwise laborious or infeasible task. There are two major types of supervised machine learning problems, called classification and regression. In classification, the goal is to predict a class label, which is a choice from a predefined list of possibilities.
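
As a concrete illustration of the input/output-pair setup described above (an assumed example using scikit-learn, not code from the article): fit a classifier on a training set of labeled examples, then predict labels for new, never-before-seen inputs.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)                                   # inputs and class labels
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)   # learn from input/output pairs
print("accuracy on unseen data:", model.score(X_test, y_test))      # generalization to new inputs
```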


Efficient Task-Specific Data Valuation for Nearest Neighbor Algorithms

#artificialintelligence

Given a data set D containing millions of data points and a data consumer who is willing to pay $X to train a machine learning (ML) model over D, how should we distribute this $X to each data point to reflect its "value"? In this paper, we define the "relative value of data" via the Shapley value, as it uniquely possesses properties with appealing real-world interpretations, such as fairness, rationality and decentralizability. For general, bounded utility functions, the Shapley value is known to be challenging to compute: to get Shapley values for all N data points, it requires O(2^N) model evaluations for exact computation and O(N log N) for (ε, δ)-approximation. In this paper, we focus on one popular family of ML models relying on K-nearest neighbors (KNN). The most surprising result is that for unweighted KNN classifiers and regressors, the Shapley value of all N data points can be computed, exactly, in O(N log N) time -- an exponential improvement on computational complexity!


Efficient Task-Specific Data Valuation for Nearest Neighbor Algorithms

arXiv.org Machine Learning

Given a data set $\mathcal{D}$ containing millions of data points and a data consumer who is willing to pay \$$X$ to train a machine learning (ML) model over $\mathcal{D}$, how should we distribute this \$$X$ to each data point to reflect its "value"? In this paper, we define the "relative value of data" via the Shapley value, as it uniquely possesses properties with appealing real-world interpretations, such as fairness, rationality and decentralizability. For general, bounded utility functions, the Shapley value is known to be challenging to compute: to get Shapley values for all $N$ data points, it requires $O(2^N)$ model evaluations for exact computation and $O(N\log N)$ for $(\epsilon, \delta)$-approximation. In this paper, we focus on one popular family of ML models relying on $K$-nearest neighbors ($K$NN). The most surprising result is that for unweighted $K$NN classifiers and regressors, the Shapley value of all $N$ data points can be computed, exactly, in $O(N\log N)$ time -- an exponential improvement on computational complexity! Moreover, for $(\epsilon, \delta)$-approximation, we are able to develop an algorithm based on Locality Sensitive Hashing (LSH) with only sublinear complexity $O(N^{h(\epsilon,K)}\log N)$ when $\epsilon$ is not too small and $K$ is not too large. We empirically evaluate our algorithms on up to $10$ million data points, and even our exact algorithm is up to three orders of magnitude faster than the baseline approximation algorithm. The LSH-based approximation algorithm can accelerate the value calculation process even further. We then extend our algorithms to other scenarios, such as (1) weighted $K$NN classifiers, (2) the case where different data points are clustered by different data curators, and (3) the case where data analysts who provide computation also require proper valuation.
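
The abstract does not state the algorithm behind the exact $O(N\log N)$ result, but for unweighted $K$NN classification it reduces to sorting the training points by distance to each test point and filling in all $N$ Shapley values with a single backward recursion. The sketch below follows that sort-then-recurse structure as we understand it from the paper; the exact recursion should be checked against the paper's Theorem 1 before use.

```python
import numpy as np

def knn_shapley_single_test(X_train, y_train, x_test, y_test, K):
    """Shapley value of each training point for one test point under the unweighted
    KNN utility. Sorting dominates the cost, giving O(N log N) per test point."""
    N = len(X_train)
    order = np.argsort(np.linalg.norm(X_train - x_test, axis=1))  # nearest first
    match = (y_train[order] == y_test).astype(float)              # label agreement with the test label
    s = np.zeros(N)
    s[N - 1] = match[N - 1] / N                                   # value of the farthest point
    for i in range(N - 2, -1, -1):                                # backward recursion over ranks
        s[i] = s[i + 1] + (match[i] - match[i + 1]) / K * min(K, i + 1) / (i + 1)
    out = np.zeros(N)
    out[order] = s                                                # map back to original indexing
    return out
```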


Chatter Detection in Turning Using Machine Learning and Similarity Measures of Time Series via Dynamic Time Warping

arXiv.org Machine Learning

Chatter detection from sensor signals has been an active field of research. While some success has been reported using several featurization tools and machine learning algorithms, existing methods have several drawbacks, such as manual preprocessing and the need for a large data set. In this paper, we present an alternative approach for chatter detection based on the K-Nearest Neighbor (kNN) algorithm for classification and Dynamic Time Warping (DTW) as a time series similarity measure. The time series used are acceleration signals acquired from the tool holder in a series of turning experiments. Our results show that this approach achieves detection accuracies that in most cases outperform existing methods. We compare our results to the traditional methods based on the Wavelet Packet Transform (WPT) and the Ensemble Empirical Mode Decomposition (EEMD), as well as to the more recent Topological Data Analysis (TDA) based approach. We show that in three out of four cutting configurations our DTW-based approach attains the highest average classification rate, reaching in one case as high as 99% accuracy. Our approach does not require feature extraction, is capable of reusing a classifier across different cutting configurations, and uses reasonably sized training sets. Although the resulting high accuracy of our approach is associated with high computational cost, this is specific to the DTW implementation that we used. Specifically, we highlight available, very fast DTW implementations that can even run on small consumer electronics. Therefore, further code optimization and the significantly reduced computational effort during the implementation phase make our approach a viable option for in-process chatter detection.
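
A minimal sketch of how the two ingredients fit together, assuming a plain dynamic-programming DTW and a 1-NN rule over raw signals; as the abstract notes, a practical deployment would use one of the much faster DTW implementations.

```python
import numpy as np

def dtw_distance(a, b):
    """Classic O(len(a)*len(b)) dynamic time warping distance between two 1-D signals."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def nn_dtw_classify(train_signals, train_labels, query):
    """1-NN under DTW: return the label of the training signal closest to the query."""
    dists = [dtw_distance(query, s) for s in train_signals]
    return train_labels[int(np.argmin(dists))]
```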


K Nearest Neighbor Algorithm In Python - Towards Data Science

#artificialintelligence

K-Nearest Neighbors, or KNN for short, is one of the simplest machine learning algorithms and is used in a wide array of institutions. KNN is a non-parametric, lazy learning algorithm. When we say a technique is non-parametric, it means that it does not make any assumptions about the underlying data. In other words, it classifies a point based on its proximity to other data points, regardless of what feature the numerical values represent. Being a lazy learning algorithm implies that there is little to no training phase.
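
A from-scratch sketch of the lazy-learning behaviour described above (class and method names are hypothetical, not the article's code): "training" only stores the data, and prediction is a majority vote over the k closest stored points.

```python
from collections import Counter
import numpy as np

class KNN:
    def __init__(self, k=3):
        self.k = k

    def fit(self, X, y):                                 # "training" is just memorization
        self.X, self.y = np.asarray(X), np.asarray(y)
        return self

    def predict_one(self, x):
        dists = np.linalg.norm(self.X - x, axis=1)       # distance to every stored point
        nearest = self.y[np.argsort(dists)[:self.k]]     # labels of the k closest points
        return Counter(nearest).most_common(1)[0][0]     # majority vote

    def predict(self, X):
        return np.array([self.predict_one(x) for x in np.asarray(X)])
```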


Feature space transformations and model selection to improve the performance of classifiers

arXiv.org Machine Learning

Improving the performance of classifiers is the realm of prototype selection and kernel transformations. Prototype selection has been used to reduce the space complexity of k-Nearest Neighbors classifiers and to improve their accuracy, while kernel transformations enhance the performance of linear classifiers by converting a non-linearly separable problem into a linear one in the transformed space. Our proposal combines, in a model selection scheme, these transformations with classic algorithms such as Naïve Bayes and k-Nearest Neighbors to produce a competitive classifier. We analyzed our approach on different classification problems and compared it to state-of-the-art classifiers. The results show that the proposed methodology is competitive, obtaining the lowest (best) rank among the classifiers being compared.
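
A rough illustration of this kind of model-selection scheme using scikit-learn building blocks as stand-ins: a kernel feature-space transformation (a Nystroem map) composed with classic classifiers (Naive Bayes and kNN), with cross-validation choosing among the combinations. The transformation, search space, and data set are assumptions for illustration, not the paper's configuration, and prototype selection is omitted.

```python
from sklearn.datasets import load_digits
from sklearn.kernel_approximation import Nystroem
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

X, y = load_digits(return_X_y=True)

# Pipeline: feature-space transformation followed by a classic classifier.
pipe = Pipeline([("transform", Nystroem(random_state=0)), ("clf", GaussianNB())])

# Model selection over both the transformation parameters and the classifier choice.
param_grid = [
    {"transform__gamma": [0.01, 0.1], "clf": [GaussianNB()]},
    {"transform__gamma": [0.01, 0.1], "clf": [KNeighborsClassifier()],
     "clf__n_neighbors": [1, 5, 15]},
]
search = GridSearchCV(pipe, param_grid, cv=5).fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```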