Nearest Neighbor Methods
Exploring the Characterization and Classification of EEG Signals for a Computer-Aided Epilepsy Diagnosis System
Epilepsy occurs when the localized electrical activity of neurons suffers from an imbalance. One of the most suitable methods for diagnosing and monitoring it is the analysis of electroencephalographic (EEG) signals. Although there is a wide range of alternatives for characterizing and classifying EEG signals for epilepsy analysis, many key aspects related to accuracy and physiological interpretation are still considered open issues. This work performs an exploratory study to identify the most adequate frequently used methods for characterizing and classifying epileptic seizures. To this end, a comparative study is carried out on several subsets of features using four representative classifiers: Linear Discriminant Analysis (LDA), Quadratic Discriminant Analysis (QDA), K-Nearest Neighbor (KNN), and Support Vector Machine (SVM).
Calculate Similarity -- the most relevant Metrics in a Nutshell
Many data science techniques are based on measuring similarity and dissimilarity between objects. For example, K-Nearest-Neighbors uses similarity to classify new data objects. In unsupervised learning, K-Means is a clustering method that uses Euclidean distance to measure how far each cluster centroid is from its assigned data points. Recommendation engines use neighborhood-based collaborative filtering methods, which identify an individual's neighbors based on similarity or dissimilarity to other users. In this blog post I will take a look at the most relevant similarity metrics in practice. Measuring similarity between objects can be performed in a number of ways.
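As a small illustration of two commonly used measures, the sketch below computes the Euclidean distance mentioned above and, as a second example, the cosine similarity. The function names are chosen here for illustration and do not come from the post, and cosine similarity is an assumption about which other metrics the post covers.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# A 3-4-5 right triangle gives a distance of exactly 5.0.
print(euclidean_distance([0, 0], [3, 4]))
# Parallel vectors have a cosine similarity close to 1.
print(cosine_similarity([1, 2], [2, 4]))
```

Euclidean distance is sensitive to vector magnitude, while cosine similarity only compares direction, which is one reason the choice of metric matters for neighbor-based methods.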
CyberPoint · Blog · Using Compression to Compare Objects
In my previous blog post, I discussed our endeavor to benefit from unsupervised learning on CyberPoint's malware dataset. One of the more intriguing tools I played with during that effort was the normalized compression distance (NCD). It works by approximating the normalized Kolmogorov distance. The Kolmogorov distance between two objects is actually pretty easy to conceptualize -- it is the length of the shortest program that can transform one object into the other. Unlike many popular similarity measures, this provides a universal notion of similarity by quantifying the difference between two objects without restricting the type of difference.
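A minimal sketch of NCD follows, using the standard formula NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed length. The choice of zlib as the compressor, the helper names, and the sample data are assumptions made for illustration; the post does not say which compressor was used.

```python
import random
import zlib

def compressed_size(data: bytes) -> int:
    """Length of the zlib-compressed input; a stand-in for Kolmogorov complexity."""
    return len(zlib.compress(data, 9))

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

# Hypothetical sample data: repetitive text versus pseudo-random bytes.
text = b"the quick brown fox jumps over the lazy dog " * 20
rng = random.Random(0)
noise = bytes(rng.getrandbits(8) for _ in range(900))

print(ncd(text, text))   # small: the second copy adds little new information
print(ncd(text, noise))  # larger: the two inputs share almost no structure
```

Because a real compressor only approximates the ideal one, values are not guaranteed to be exactly 0 for identical inputs or to stay at or below 1.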
Consistent recovery threshold of hidden nearest neighbor graphs
Ding, Jian, Wu, Yihong, Xu, Jiaming, Yang, Dana
November 20, 2019. Abstract: Motivated by applications such as discovering strong ties in social networks and assembling genome subsequences in biology, we study the problem of recovering a hidden $2k$-nearest neighbor (NN) graph in an $n$-vertex complete graph, whose edge weights are independent and distributed according to $P_n$ for edges in the hidden $2k$-NN graph and $Q_n$ otherwise. We focus on two types of asymptotic recovery guarantees as $n \to \infty$: (1) exact recovery: all edges are classified correctly with probability tending to one; (2) almost exact recovery: the expected number of misclassified edges is $o(nk)$. We show that the maximum likelihood estimator achieves (1) exact recovery for $2 \le k \le n^{o(1)}$ if $\liminf \frac{2\alpha_n}{\log n} > 1$; (2) almost exact recovery for $1 \le k \le o\left(\frac{\log n}{\log \log n}\right)$ if $\liminf \frac{k D(P_n \| Q_n)}{\log n} > 1$, where $\alpha_n \triangleq -2 \log \int \sqrt{dP_n \, dQ_n}$ is the Rényi divergence of order $\frac{1}{2}$ and $D(P_n \| Q_n)$ is the Kullback-Leibler divergence.
Justification-Based Reliability in Machine Learning
Virani, Nurali, Iyer, Naresh, Yang, Zhaoyuan
With the advent of Deep Learning, the field of machine learning (ML) has surpassed human-level performance on diverse classification tasks. At the same time, there is a stark need to characterize and quantify the reliability of a model's prediction on individual samples. This is especially true when such models are applied in the safety-critical domains of industrial control and healthcare. To address this need, we link the question of the reliability of a model's individual prediction to the epistemic uncertainty of that prediction. More specifically, we extend the theory of Justified True Belief (JTB) in epistemology, created to study the validity and limits of human-acquired knowledge, towards characterizing the validity and limits of knowledge in supervised classifiers. We present an analysis of neural network classifiers linking the reliability of their predictions on an input to characteristics of the support gathered from the input and latent spaces of the network. We hypothesize that the JTB analysis exposes the epistemic uncertainty (or ignorance) of a model with respect to its inference, thereby allowing the inference to be only as strong as the justification permits. We explore various forms of support (e.g., k-nearest neighbors (k-NN) and l_p-norm based) generated for an input, using the training data to construct a justification for the prediction on that input. Through experiments conducted on simulated and real datasets, we demonstrate that our approach can provide reliability for individual predictions and characterize regions where such reliability cannot be ascertained.
An Empirical and Comparative Analysis of Data Valuation with Scalable Algorithms
Jia, Ruoxi, Sun, Xuehui, Xu, Jiacen, Zhang, Ce, Li, Bo, Song, Dawn
This paper focuses on valuating training data for supervised learning tasks and studies the Shapley value, a data value notion that originated in cooperative game theory. The Shapley value defines a unique value distribution scheme that satisfies a set of appealing properties desired of a data value notion. However, the Shapley value requires exponential complexity to calculate exactly. Existing approximation algorithms, although achieving great improvement over the exact algorithm, rely on retraining models multiple times, and thus remain limited when applied to larger-scale learning tasks and real-world datasets. In this work, we develop a simple and efficient heuristic for data valuation based on the Shapley value with complexity independent of the model size. The key idea is to approximate the model via a $K$-nearest neighbor ($K$NN) classifier, which has a locality structure that can lead to efficient Shapley value calculation. We evaluate the utility of the values produced by the $K$NN proxies in various settings, including label noise correction, watermark detection, data summarization, active data acquisition, and domain adaptation. Extensive experiments demonstrate that our algorithm achieves at least comparable utility to the values produced by existing algorithms while offering a significant efficiency improvement. Moreover, we theoretically analyze the Shapley value and justify its advantage over the leave-one-out error as a data value measure.
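To make the locality idea concrete, here is a sketch of a closed-form recursion for Shapley values under an unweighted $K$NN utility with a single test point. It assumes the utility of a training subset is the fraction of its $K$ nearest points (fewer if the subset is smaller) whose label matches the test label. The function name, argument names, and this utility definition are assumptions for illustration, and the abstract does not confirm that this is the exact procedure the paper uses.

```python
import math

def knn_shapley(train_points, train_labels, test_point, test_label, k):
    """Shapley value of each training point for one test point (assumed utility)."""
    n = len(train_points)
    # Rank training points from nearest to farthest from the test point.
    order = sorted(range(n), key=lambda i: math.dist(train_points[i], test_point))
    match = [1.0 if train_labels[i] == test_label else 0.0 for i in order]

    values = [0.0] * n
    # The farthest point gets its label match divided by n.
    values[order[-1]] = match[-1] / n
    # Walk inward: each value differs from its farther neighbor's by a local term.
    for pos in range(n - 2, -1, -1):
        rank = pos + 1  # 1-based rank of the point at this position
        values[order[pos]] = (
            values[order[pos + 1]]
            + (match[pos] - match[pos + 1]) / k * min(k, rank) / rank
        )
    return values

# Hypothetical 1-D example: four training points, one test point, K = 2.
points = [(0.0,), (1.0,), (2.5,), (4.0,)]
labels = [1, 0, 1, 0]
print(knn_shapley(points, labels, (0.2,), 1, k=2))
```

The recursion needs only one sort plus a linear pass, so no model is retrained; the values sum to the utility of the full training set.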
An "outside the box" solution for imbalanced data classification
Jegierski, Hubert, Saganowski, Stanisław
A common problem in real-world data sets is class imbalance, which can significantly affect the classification abilities of classifiers. Numerous methods have been proposed to cope with this problem; however, even state-of-the-art methods offer a limited improvement (if any) for data sets with critically under-represented minority classes. For such problematic cases, an "outside the box" solution is required. Therefore, we propose a novel technique, called enrichment, which uses the information (observations) from external data set(s). We present three approaches to implementing the enrichment technique: (1) selecting observations randomly, (2) iteratively choosing observations that improve the classification result, (3) adding observations that help the classifier to better determine the border between classes. We then thoroughly analyze the developed solutions on ten real-world data sets to experimentally validate their usefulness. On average, our best approach improves the classification quality by 27\%, and in the best case, by an outstanding 66\%. We also compare our technique with the universally applicable state-of-the-art methods. We find that our technique surpasses the existing methods, performing, on average, 21\% better. The advantage is especially noticeable for the smallest data sets, for which existing methods failed, while our solutions achieved the best results. Additionally, our technique applies to both multi-class and binary classification tasks. It can also be combined with other techniques dealing with the class imbalance problem.
Democratization Social trading into Digital Banking using ML - K Nearest Neighbors
Social trading is an alternative way of trading in which traders look at what other traders are doing and compare and copy their techniques and strategies. Social trading allows traders to trade online with the help of others and, some have claimed, shortens the learning curve from novice to experienced trader. By copying trades, traders can learn which strategies work and which do not. Social trading is used for speculation; in a moral context, speculative practices are viewed negatively and as something each individual should avoid, maintaining instead a long-term horizon and avoiding any type of short-term speculation. For instance, consider eToro, one of the biggest social trading platforms.
KNN visualization in just 13 lines of code
Let's play around with datasets to visualize how the decision boundary changes as 'k' changes. Let's have a quick review… The K Nearest Neighbor (KNN) algorithm is simple, easy to understand, versatile, and one of the top machine learning algorithms. In k-NN classification, the output is a class membership. An object is classified by a plurality vote of its neighbours, with the object being assigned to the class most common among its k nearest neighbours (k is a positive integer, typically small). If k = 1, then the object is simply assigned to the class of that single nearest neighbour.
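The plurality vote described above can be sketched in a few lines of plain Python. This is an illustrative example only, not the article's 13-line visualization code; the function name `knn_predict` and the toy data are made up for the example.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Return the most common label among the k training points nearest to query."""
    # Pair each training point's distance to the query with its label, nearest first.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    # Count the labels of the k nearest neighbours and take the plurality winner.
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Toy data: two well-separated clusters labelled "a" and "b".
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(points, labels, (0.5, 0.5), k=3))
print(knn_predict(points, labels, (5.2, 5.1), k=1))
```

With k = 1 the prediction is just the label of the single closest point; larger k smooths the decision boundary because more neighbours take part in the vote.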