Nearest Neighbor Methods
When are Non-Parametric Methods Robust?
Bhattacharjee, Robi, Chaudhuri, Kamalika
A growing body of research has shown that many classifiers are susceptible to {\em adversarial examples} -- small strategic modifications to test inputs that lead to misclassification. In this work, we study general non-parametric methods, with a view towards understanding when they are robust to these modifications. We establish general conditions under which non-parametric methods are r-consistent -- in the sense that they converge to optimally robust and accurate classifiers in the large sample limit. Concretely, our results show that when data is well-separated, nearest neighbors and kernel classifiers are r-consistent, while histograms are not. For general data distributions, we prove that preprocessing by Adversarial Pruning (Yang et al., 2019) -- which makes the data well-separated -- followed by nearest neighbors or kernel classifiers also leads to r-consistency.
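To make the preprocessing step concrete, here is a minimal sketch of the adversarial pruning idea: drop points until no two points of different classes lie within distance 2r of each other, then fit a nearest neighbor classifier on what remains. The greedy max-degree removal below is an assumption for illustration; the original paper formulates this as an optimal vertex-cover-style problem on the conflict graph.

```python
import numpy as np
from scipy.spatial.distance import cdist

def adversarial_prune(X, y, r):
    """Greedy sketch: repeatedly drop the point involved in the most
    cross-class pairs closer than 2r, until the classes are 2r-separated.
    (Illustrative approximation, not the paper's exact algorithm.)"""
    X, y = np.asarray(X), np.asarray(y)
    keep = np.ones(len(X), dtype=bool)
    # conflict[i, j] is True when points i and j have different labels
    # and are close enough for an r-perturbation to reach both
    conflict = (cdist(X, X) < 2 * r) & (y[:, None] != y[None, :])
    while True:
        degrees = (conflict & keep[None, :] & keep[:, None]).sum(axis=1)
        i = degrees.argmax()
        if degrees[i] == 0:  # no remaining cross-class pair within 2r
            break
        keep[i] = False
    return X[keep], y[keep]
```

A 1-nearest-neighbor or kernel classifier fit on the pruned data is then the classifier to which the r-consistency result applies.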
Aspect Term Extraction using Graph-based Semi-Supervised Learning
Ansari, Gunjan, Saxena, Chandni, Ahmad, Tanvir, Doja, M. N.
Aspect-based sentiment analysis is a major subarea of sentiment analysis. Many supervised and unsupervised approaches have been proposed in the past for detecting and analyzing the sentiment of aspect terms. In this paper, a graph-based semi-supervised learning approach for aspect term extraction is proposed. In this approach, every identified token in the review document is classified as an aspect or non-aspect term, starting from a small set of labeled tokens, using the label spreading algorithm. k-Nearest Neighbor (kNN) graph sparsification is employed in the proposed approach to make it more time- and memory-efficient. The proposed work is further extended to determine the polarity of the opinion words associated with the identified aspect terms in the review sentences, in order to generate a visual aspect-based summary of review documents. The experimental study is conducted on benchmark and crawled datasets from the restaurant and laptop domains with varying numbers of labeled instances. The results show that the proposed approach achieves good precision, recall, and accuracy with limited availability of labeled data.
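As a sketch of the core step, scikit-learn's LabelSpreading with a kNN kernel performs exactly this kind of graph-sparsified label propagation. The feature vectors and sizes below are placeholders; in the paper, each node is a token from a review document.

```python
import numpy as np
from sklearn.semi_supervised import LabelSpreading

# Toy stand-in for token feature vectors (e.g., word embeddings).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))          # 200 tokens, 50-dim features
y = np.full(200, -1)                    # -1 marks unlabeled tokens
y[:10] = rng.integers(0, 2, size=10)    # small seed set: 1=aspect, 0=non-aspect

# The kNN kernel sparsifies the similarity graph, as in the proposed approach
model = LabelSpreading(kernel='knn', n_neighbors=7, alpha=0.2)
model.fit(X, y)
pred = model.transduction_              # propagated aspect/non-aspect labels
```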
Density estimation from unweighted k-nearest neighbor graphs: a roadmap
Luxburg, Ulrike Von, Alamgir, Morteza
Consider an unweighted k-nearest neighbor graph on n points that have been sampled i.i.d. from some density p. We show how one can estimate p just from the unweighted adjacency matrix of the graph, without knowing the points themselves or their distances or similarity scores. The key insights are that local differences in link numbers can be used to estimate some local function of p, and that integrating this function along shortest paths leads to an estimate of the underlying density.
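For context, the quantity the paper recovers is what the classical k-NN density estimator computes when the point coordinates are available. The sketch below shows that baseline; it is explicitly not the paper's adjacency-matrix-only method, whose whole point is to work without the distances used here.

```python
import numpy as np
from scipy.special import gamma
from sklearn.neighbors import NearestNeighbors

def knn_density(X, k):
    """Classical k-NN density estimate: p(x) ~ k / (n * V_d * r_k(x)^d),
    where r_k(x) is the distance to the k-th nearest neighbor and V_d is
    the volume of the unit d-ball."""
    n, d = X.shape
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)  # +1 to skip the point itself
    dist, _ = nn.kneighbors(X)
    r_k = dist[:, -1]
    V_d = np.pi ** (d / 2) / gamma(d / 2 + 1)
    return k / (n * V_d * r_k ** d)
```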
A novel spike-and-wave automatic detection in EEG signals
Quintero-Rincón, Antonio, Muro, Valeria, D'Giano, Carlos, Prendes, Jorge, Batatia, Hadj
Spike-and-wave discharge (SWD) pattern classification in electroencephalography (EEG) signals is a key problem in signal processing. It is particularly important to develop an automatic SWD detection method for long-term EEG recordings, since the task of marking the patterns manually is time-consuming, difficult, and error-prone. This paper presents a new detection method with low computational complexity that can be easily trained if standard medical protocols are respected. The detection procedure is as follows: First, each EEG signal is divided into several time segments, and the Morlet 1-D decomposition is applied to each segment. Then three parameters are extracted from the wavelet coefficients of each segment: scale (using a generalized Gaussian statistical model), variance, and median. This is followed by a k-nearest neighbors (k-NN) classifier that detects the spike-and-wave pattern in each EEG channel from these three parameters. A total of 106 spike-and-wave and 106 non-spike-and-wave segments were used for training, while 69 new annotated EEG segments from six subjects were used for classification. In these circumstances, the proposed methodology achieved 100% accuracy. These results open new research opportunities into the underlying causes of so-called absence epilepsy in long-term EEG recordings.
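A minimal sketch of the per-segment feature extraction and classification stages, assuming PyWavelets and scikit-learn. The mean-absolute-deviation scale proxy below stands in for the paper's generalized Gaussian fit, and the scale range is an arbitrary choice.

```python
import numpy as np
import pywt
from sklearn.neighbors import KNeighborsClassifier

def segment_features(segment, scales=np.arange(1, 32)):
    """Three features per EEG segment from its Morlet CWT coefficients.
    The mean absolute deviation is a simple stand-in (assumption) for the
    scale parameter of the paper's generalized Gaussian model."""
    coefs, _ = pywt.cwt(segment, scales, 'morl')
    c = coefs.ravel()
    scale = np.mean(np.abs(c - np.mean(c)))
    return [scale, np.var(c), np.median(c)]

def train_detector(segments, labels, k=3):
    """segments: annotated EEG time segments; labels: 1 = SWD, 0 = non-SWD."""
    feats = np.array([segment_features(s) for s in segments])
    return KNeighborsClassifier(n_neighbors=k).fit(feats, labels)
```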
Democratization Social trading into Digital Banking using ML - K Nearest Neighbors
Social trading is an alternative way of trading that involves looking at what other traders are doing and comparing and copying their techniques and strategies. It allows traders to trade online with the help of others, and some claim it shortens the learning curve from novice to experienced trader: by copying trades, traders can learn which strategies work and which do not. Social trading is used for speculation; in a moral context, speculative practices are viewed negatively and are to be avoided, the idea being that individuals should instead maintain a long-term horizon and avoid any kind of short-term speculation. Consider, for instance, eToro, one of the biggest social trading platforms.
KNN visualization in just 13 lines of code
Let's play around with datasets to visualize how the decision boundary changes as 'k' changes. Let's have a quick review… The K Nearest Neighbor (KNN) algorithm is very simple, easy to understand, versatile, and one of the topmost machine learning algorithms. In k-NN classification, the output is a class membership. An object is classified by a plurality vote of its neighbours, with the object being assigned to the class most common among its k nearest neighbours (k is a positive integer, typically small). If k = 1, then the object is simply assigned to the class of its single nearest neighbour.
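Here is one way to produce those decision-boundary plots with scikit-learn and matplotlib; the moons dataset and the three k values are arbitrary choices for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.neighbors import KNeighborsClassifier

X, y = make_moons(n_samples=200, noise=0.3, random_state=0)
fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, k in zip(axes, [1, 15, 50]):
    clf = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    # Classify every point on a grid to reveal the decision boundary
    xx, yy = np.meshgrid(np.linspace(-2, 3, 200), np.linspace(-2, 2, 200))
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    ax.contourf(xx, yy, Z, alpha=0.3)
    ax.scatter(X[:, 0], X[:, 1], c=y, s=10)
    ax.set_title(f'k = {k}')
plt.show()
```

Small k gives a jagged boundary that chases individual points; large k smooths the boundary at the cost of missing fine structure.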
Online Ranking with Concept Drifts in Streaming Data
Irurozki, Ekhine, Lobo, Jesus, Perez, Aritz, Del Ser, Javier
Two important problems in preference elicitation are rank aggregation and label ranking. Rank aggregation consists of finding a ranking that best summarizes a collection of preferences of some agents. The latter, label ranking, aims at learning a mapping between data instances and rankings defined over a finite set of categories or labels. This problem can effectively model many real application scenarios such as recommender systems. However, even though the preferences of populations usually change over time, the related literature has so far addressed both problems only over non-evolving preferences. This work deals with the problems of rank aggregation and label ranking over non-stationary data streams. In this context, there is a set of $n$ items and $m$ agents that provide their votes by casting a ranking of the $n$ items. The rankings are noisy realizations of an unknown probability distribution that changes over time. Our goal is to learn, in an online manner, the current ground truth distribution of rankings. We begin by proposing an aggregation function called Forgetful Borda (FBorda) that, using a forgetting mechanism, gives more importance to recently observed preferences. We prove that FBorda is a consistent estimator of the Kemeny ranking and lower bound the number of samples needed to learn the distribution while guaranteeing a certain level of confidence. Then, we develop a $k$-nearest neighbor classifier based on the proposed FBorda aggregation algorithm for the label ranking problem and demonstrate its accuracy in several label ranking scenarios over evolving preferences.
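A minimal sketch of the forgetting idea behind FBorda: fold each incoming ranking's Borda scores into a running total that decays geometrically, so recent votes dominate. The decay factor rho and the exact update rule are assumptions for illustration, not the paper's specification.

```python
import numpy as np

class ForgetfulBorda:
    """Borda aggregation with exponential forgetting (illustrative sketch)."""

    def __init__(self, n_items, rho=0.95):
        self.n = n_items
        self.rho = rho              # decay on past scores (assumption)
        self.scores = np.zeros(n_items)

    def update(self, ranking):
        """ranking[j] = item placed at position j (0 = most preferred)."""
        borda = np.empty(self.n)
        borda[np.asarray(ranking)] = np.arange(self.n - 1, -1, -1)
        self.scores = self.rho * self.scores + borda

    def aggregate(self):
        """Current estimate of the central (Kemeny) ranking."""
        return np.argsort(-self.scores)
```

With rho = 1 this reduces to the ordinary Borda count; smaller rho forgets old preferences faster, which is what lets the estimate track a drifting distribution.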
The Basics: KNN for classification and regression
Data science or applied statistics courses typically start with linear models, but conceptually, K-nearest neighbors is probably the simplest widely used model. KNN models are really just technical implementations of a common intuition: that things sharing similar features tend to be, well, similar. This is hardly a deep insight, yet these practical implementations can be extremely powerful, and, crucially for someone approaching an unknown dataset, they can handle non-linearities without any complicated data engineering or model setup. As an illustrative example, let's consider the simplest case of using a KNN model as a classifier. Let's say you have data points that fall into one of three classes.
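In scikit-learn, both the classification and regression variants are a few lines each; the three-class blobs and the sine-curve regression target below are toy stand-ins for the scenario described above:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

# Classification: data points falling into one of three classes
X, y = make_blobs(n_samples=300, centers=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
print('accuracy:', clf.score(X_te, y_te))

# Regression: the prediction is the average of the k neighbours' targets
rng = np.random.default_rng(0)
x = rng.uniform(0, 5, size=(200, 1))
t = np.sin(x).ravel() + rng.normal(scale=0.1, size=200)
reg = KNeighborsRegressor(n_neighbors=5).fit(x, t)
print('f(2.5) ~', reg.predict([[2.5]])[0])
```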
Supervised Learning Approach to Approximate Nearest Neighbor Search
Hyvönen, Ville, Jääsaari, Elias, Roos, Teemu
Approximate nearest neighbor search is a classic algorithmic problem where the goal is to design an efficient index structure for fast approximate nearest neighbor queries. We show that it can be framed as a classification problem and solved by training a suitable multi-label classifier and using it as an index. Compared to the existing algorithms, this supervised learning approach has several advantages: it enables adapting an index to the query distribution when the query distribution and the corpus distribution differ; it allows using training sets larger than the corpus; and in principle it enables using any multi-label classifier for approximate nearest neighbor search. We demonstrate these advantages on multiple synthetic and real-world data sets by using a random forest and an ensemble of random projection trees as the base classifiers.

Introduction: In k-nearest neighbor (k-NN) search, the k points that are nearest to the query point are retrieved from the corpus. Approximate nearest neighbor search is used to speed up k-NN search in applications where fast response times are critical, such as in computer vision, robotics, and recommendation systems. Traditionally, approximate nearest neighbor search is approached as a problem in algorithms and data structures. Space-partitioning methods -- trees, hashing, and quantization -- divide the space according to a geometric criterion. For instance, k-d trees (Bentley 1975) and principal component trees (McNames 2001) are grown by hierarchically partitioning the space along the maximum variance directions of the corpus.
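A simplified, single-label sketch of the classification framing: partition the corpus, train a classifier to predict which cell contains a query's nearest neighbor, and search only that cell at query time. The KMeans partition and the single-label random forest below are assumptions for illustration; the paper uses random projection trees and a multi-label formulation.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
corpus = rng.normal(size=(10_000, 20))

# Partition the corpus into cells (stand-in for the paper's trees)
cells = KMeans(n_clusters=50, n_init=10, random_state=0).fit_predict(corpus)

# Label each training query by the cell of its true nearest neighbor,
# then learn the query -> cell mapping
queries = rng.normal(size=(2_000, 20))
_, idx = NearestNeighbors(n_neighbors=1).fit(corpus).kneighbors(queries)
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(queries, cells[idx.ravel()])

def ann_query(q, k=10):
    """Approximate k-NN: scan only the corpus points in the predicted cell."""
    cell = clf.predict(q.reshape(1, -1))[0]
    candidates = np.flatnonzero(cells == cell)
    d = np.linalg.norm(corpus[candidates] - q, axis=1)
    return candidates[np.argsort(d)[:k]]
```

Because the classifier is trained on queries rather than on the corpus alone, the index adapts to the query distribution, which is the advantage the abstract highlights.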