Nearest Neighbor Methods
Defending Against Adversarial Examples with K-Nearest Neighbor
Robustness is an increasingly important property of machine learning models as they become more and more prevalent. We propose a defense against adversarial examples based on a k-nearest neighbor (kNN) on the intermediate activation of neural networks. With our models, the mean perturbation norm required to fool our MNIST model is 3.07 and 2.30 on CIFAR-10. Additionally, we propose a simple certifiable lower bound on the l2-norm of the adversarial perturbation using a more specific version of our scheme, a 1-NN on representations learned by a Lipschitz network. Our model provides a nontrivial average lower bound of the perturbation norm, comparable to other schemes on MNIST with similar clean accuracy.
Defending Against Adversarial Examples with K-Nearest Neighbor
Robustness is an increasingly important property of machine learning models as they become more and more prevalent. We propose a defense against adversarial examples based on a k-nearest neighbor (kNN) on the intermediate activation of neural networks. With our models, the mean perturbation norm required to fool our MNIST model is 3.07 and 2.30 on CIFAR-10. Additionally, we propose a simple certifiable lower bound on the l2-norm of the adversarial perturbation using a more specific version of our scheme, a 1-NN on representations learned by a Lipschitz network. Our model provides a nontrivial average lower bound of the perturbation norm, comparable to other schemes on MNIST with similar clean accuracy.
DLIME: A Deterministic Local Interpretable Model-Agnostic Explanations Approach for Computer-Aided Diagnosis Systems
Zafar, Muhammad Rehman, Khan, Naimul Mefraz
Local Interpretable Model-Agnostic Explanations (LIME) is a popular technique used to increase the interpretability and explainability of black box Machine Learning (ML) algorithms. LIME typically generates an explanation for a single prediction by any ML model by learning a simpler interpretable model (e.g. linear classifier) around the prediction through generating simulated data around the instance by random perturbation, and obtaining feature importance through applying some form of feature selection. While LIME and similar local algorithms have gained popularity due to their simplicity, the random perturbation and feature selection methods result in "instability" in the generated explanations, where for the same prediction, different explanations can be generated. This is a critical issue that can prevent deployment of LIME in a Computer-Aided Diagnosis (CAD) system, where stability is of utmost importance to earn the trust of medical professionals. In this paper, we propose a deterministic version of LIME. Instead of random perturbation, we utilize agglomerative Hierarchical Clustering (HC) to group the training data together and K-Nearest Neighbour (KNN) to select the relevant cluster of the new instance that is being explained. After finding the relevant cluster, a linear model is trained over the selected cluster to generate the explanations. Experimental results on three different medical datasets show the superiority for Deterministic Local Interpretable Model-Agnostic Explanations (DLIME), where we quantitatively determine the stability of DLIME compared to LIME utilizing the Jaccard similarity among multiple generated explanations.
7 Steps to Mastering Intermediate Machine Learning with Python -- 2019 Edition
Are you interested in learning more about machine learning with Python? I recently wrote 7 Steps to Mastering Basic Machine Learning with Python -- 2019 Edition, a first step in an attempt to updated a pair of posts I wrote some time back (7 Steps to Mastering Machine Learning With Python and 7 More Steps to Mastering Machine Learning With Python), a pair of posts which are getting stale at this point, having been around for a few years. It's time to add on to the "basic" post with a set of steps for learning "intermediate" level machine learning with Python. We're talking "intermediate" in a relative sense, however, so do not expect to be a research-caliber machine learning engineer after getting through this post. The learning path is aimed at those with some understanding of programming, computer science concepts, and/or machine learning in an abstract sense, who are wanting to be able to use the implementations of machine learning algorithms of the prevalent Python libraries to build their own machine learning models.
7 Steps to Mastering Intermediate Machine Learning with Python -- 2019 Edition
Are you interested in learning more about machine learning with Python? I recently wrote 7 Steps to Mastering Basic Machine Learning with Python -- 2019 Edition, a first step in an attempt to updated a pair of posts I wrote some time back (7 Steps to Mastering Machine Learning With Python and 7 More Steps to Mastering Machine Learning With Python), a pair of posts which are getting stale at this point, having been around for a few years. It's time to add on to the "basic" post with a set of steps for learning "intermediate" level machine learning with Python. We're talking "intermediate" in a relative sense, however, so do not expect to be a research-caliber machine learning engineer after getting through this post. The learning path is aimed at those with some understanding of programming, computer science concepts, and/or machine learning in an abstract sense, who are wanting to be able to use the implementations of machine learning algorithms of the prevalent Python libraries to build their own machine learning models.
An enhanced KNN-based twin support vector machine with stable learning rules
Among the extensions of twin support vector machine (TSVM), some scholars have utilized K-nearest neighbor (KNN) graph to enhance TSVM's classification accuracy. However, these KNN-based TSVM classifiers have two major issues such as high computational cost and overfitting. In order to address these issues, this paper presents an enhanced regularized K-nearest neighbor based twin support vector machine (RKNN-TSVM). It has three additional advantages: (1) Weight is given to each sample by considering the distance from its nearest neighbors. This further reduces the effect of noise and outliers on the output model. (2) An extra stabilizer term was added to each objective function. As a result, the learning rules of the proposed method are stable. (3) To reduce the computational cost of finding KNNs for all the samples, location difference of multiple distances based k-nearest neighbors algorithm (LDMDBA) was embedded into the learning process of the proposed method. The extensive experimental results on several synthetic and benchmark datasets show the effectiveness of our proposed RKNN-TSVM in both classification accuracy and computational time. Moreover, the largest speedup in the proposed method reaches to 14 times.
Defending Against Adversarial Examples with K-Nearest Neighbor
Sitawarin, Chawin, Wagner, David
Robustness is an increasingly important property of machine learning models as they become more and more prevalent. We propose a defense against adversarial examples based on a k-nearest neighbor (kNN) on the intermediate activation of neural networks. Our scheme surpasses state-of-the-art defenses on MNIST and CIFAR-10 against l2-perturbation by a significant margin. With our models, the mean perturbation norm required to fool our MNIST model is 3.07 and 2.30 on CIFAR-10. Additionally, we propose a simple certifiable lower bound on the l2-norm of the adversarial perturbation using a more specific version of our scheme, a 1-NN on representations learned by a Lipschitz network. Our model provides a nontrivial average lower bound of the perturbation norm, comparable to other schemes on MNIST with similar clean accuracy.
Solver Recommendation For Transport Problems in Slabs Using Machine Learning
Chen, Jinzhao, Patel, Japan K., Vasques, Richard
The use of machine learning algorithms to address classification problems is on the rise in many research areas. The current study is aimed at testing the potential of using such algorithms to auto-select the best solvers for transport problems in uniform slabs. Three solvers are used in this work: Richardson, diffusion synthetic acceleration, and nonlinear diffusion acceleration. Three parameters are manipulated to create different transport problem scenarios. Five machine learning algorithms are applied: linear discriminant analysis, K-nearest neighbors, support vector machine, random forest, and neural networks. We present and analyze the results of these algorithms for the test problems, showing that random forest and K-nearest neighbors are potentially the best suited candidates for this type of classification problem.
Distance and Similarity Measures Effect on the Performance of K-Nearest Neighbor Classifier -- A Review
Prasath, V. B. Surya, Alfeilat, Haneen Arafat Abu, Lasassmeh, Omar, Hassanat, Ahmad B. A., Tarawneh, Ahmad S.
The K-nearest neighbor (KNN) classifier is one of the simplest and most common classifiers, yet its performance competes with the most complex classifiers in the literature. The core of this classifier depends mainly on measuring the distance or similarity between the tested example and the training examples. This raises a major question about which distance measures to be used for the KNN classifier among a large number of distance and similarity measures? This review attempts to answer the previous question through evaluating the performance (measured by accuracy, precision and recall) of the KNN using a large number of distance measures, tested on a number of real world datasets, with and without adding different levels of noise. The experimental results show that the performance of KNN classifier depends significantly on the distance used, the results showed large gaps between the performances of different distances. We found that a recently proposed non-convex distance performed the best when applied on most datasets comparing to the other tested distances. In addition, the performance of the KNN degraded only about $20\%$ while the noise level reaches $90\%$, this is true for all the distances used. This means that the KNN classifier using any of the top $10$ distances tolerate noise to a certain degree. Moreover, the results show that some distances are less affected by the added noise comparing to other distances.
Intrinsic dimension estimation for locally undersampled data
Erba, Vittorio, Gherardi, Marco, Rotondo, Pietro
High-dimensional data are ubiquitous in contemporary science and finding methods to compress them is one of the primary goals of machine learning. Given a dataset lying in a high-dimensional space (in principle hundreds to several thousands of dimensions), it is often useful to project it onto a lower-dimensional manifold, without loss of information. Identifying the minimal dimension of such manifold is a challenging problem known in the literature as intrinsic dimension estimation (IDE). Traditionally, most IDE algorithms are either based on multiscale principal component analysis (PCA) or on the notion of correlation dimension (and more in general on k-nearest-neighbors distances). These methods are affected, in different ways, by a severe curse of dimensionality. In particular, none of the existing algorithms can provide accurate ID estimates in the extreme locally undersampled regime, i.e. in the limit where the number of samples in any local patch of the manifold is less than (or of the same order of) the ID of the dataset. Here we introduce a new ID estimator that leverages on simple properties of the tangent space of a manifold to overcome these shortcomings. The method is based on the full correlation integral, going beyond the limit of small radius used for the estimation of the correlation dimension. Our estimator alleviates the extreme undersampling problem, intractable with other methods. Based on this insight, we explore a multiscale generalization of the algorithm. We show that it is capable of (i) identifying multiple dimensionalities in a dataset, and (ii) providing accurate estimates of the ID of extremely curved manifolds. In particular, we test the method on manifolds generated from global transformations of high-contrast images, relevant for invariant object recognition and considered a challenge for state-of-the-art ID estimators.