Nearest Neighbor Methods
An adaptive nearest neighbor rule for classification
Balsubramani, Akshay, Dasgupta, Sanjoy, Freund, Yoav, Moran, Shay
We introduce a variant of the $k$-nearest neighbor classifier in which $k$ is chosen adaptively for each query, rather than supplied as a parameter. The choice of $k$ depends on properties of each neighborhood, and therefore may significantly vary between different points. (For example, the algorithm will use larger $k$ for predicting the labels of points in noisy regions.) We provide theory and experiments that demonstrate that the algorithm performs comparably to, and sometimes better than, $k$-NN with an optimal choice of $k$. In particular, we derive bounds on the convergence rates of our classifier that depend on a local quantity we call the `advantage' which is significantly weaker than the Lipschitz conditions used in previous convergence rate proofs. These generalization bounds hinge on a variant of the seminal Uniform Convergence Theorem due to Vapnik and Chervonenkis; this variant concerns conditional probabilities and may be of independent interest.
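As a hedged illustration of the adaptive idea (a sketch only, not the authors' exact rule; the stopping condition and the confidence constant c are assumptions), one can grow k at each query until the neighborhood's majority label clearly beats sampling noise:

```python
import numpy as np

def adaptive_knn_predict(X_train, y_train, x_query, k_max=None, c=1.0):
    """Sketch of an adaptive-k nearest neighbor rule for binary labels {0, 1}.
    k grows until the empirical label bias exceeds a 1/sqrt(k) noise band;
    the constant c is a free parameter, not the paper's calibrated choice."""
    n = len(X_train)
    k_max = k_max or n
    order = np.argsort(np.linalg.norm(X_train - x_query, axis=1))  # sort by distance
    for k in range(1, k_max + 1):
        p_hat = y_train[order[:k]].mean()        # fraction of label-1 neighbors
        if abs(p_hat - 0.5) > c / np.sqrt(k):    # confident majority: stop growing k
            return int(p_hat > 0.5)
    return int(y_train[order[:k_max]].mean() > 0.5)  # fall back to the largest k
```

In noisy regions the empirical bias stays small, so the loop keeps enlarging k, matching the behavior described above.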
Machine Learning: Building Recommender Systems
The scikit-learn library has functions that enable us to build these pipelines by concatenating various modules together. We just need to specify the modules along with the corresponding parameters. It will then build a pipeline using these modules that processes the data and trains the system. The pipeline can include modules that perform various functions like feature selection, preprocessing, random forests, clustering, and so on. In this section, we will see how to build a pipeline that selects the top K features from input data and then classifies the result using an Extremely Random Forest classifier.
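A minimal runnable sketch of such a pipeline (the synthetic dataset and parameter values are illustrative stand-ins, not the book's exact example):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import Pipeline

# Synthetic data standing in for the input; shapes are illustrative.
X, y = make_classification(n_samples=200, n_features=20, n_informative=5,
                           random_state=7)

# Chain feature selection and classification into a single pipeline.
pipeline = Pipeline([
    ('selector', SelectKBest(f_classif, k=5)),   # keep the top K features
    ('erf', ExtraTreesClassifier(n_estimators=60, random_state=7)),
])
pipeline.fit(X, y)
print('Training score:', pipeline.score(X, y))
print('Selected features:',
      pipeline.named_steps['selector'].get_support(indices=True))
```

Each step's output feeds the next, so the Extremely Random Forest only ever sees the K selected features.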
Ignorance-Aware Approaches and Algorithms for Prototype Selection in Machine Learning
Terziyan, Vagan, Nikulin, Anton
Operating with ignorance is an important concern in Machine Learning research, especially when the objective is to discover knowledge from imperfect data. Data mining (driven by appropriate knowledge discovery tools) is about processing available (observed, known, and understood) samples of data in order to build a model (e.g., a classifier) that can handle data samples which are not yet observed, known, or understood. These tools traditionally take samples of the available data (known facts) as an input for learning. We want to challenge the indispensability of this approach, and we suggest considering things the other way around. What if the task were instead: how do we learn a model based on our ignorance, i.e., by processing the shape of the 'voids' within the available data space? Can we improve traditional classification by also modeling the ignorance? In this paper, we provide algorithms for discovering and visualizing ignorance zones in two-dimensional data spaces, and we design two ignorance-aware smart prototype selection techniques (incremental and adversarial) to improve the performance of nearest neighbor classifiers. We present experiments with artificial and real datasets to test the usefulness of ignorance discovery in machine learning.
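As a loose illustration of the 'voids' idea (this grid-based sketch is our assumption, not the paper's algorithm), one can flag regions of a 2D data space that lie far from every observed sample:

```python
import numpy as np

def ignorance_zones_2d(X, grid_size=50, radius=0.1):
    """Illustrative 2D 'ignorance zone' map: grid cells farther than
    `radius` from every observed sample are flagged as unexplored.
    X is assumed scaled to the unit square [0, 1]^2; grid_size and
    radius are arbitrary illustrative choices."""
    xs = np.linspace(0.0, 1.0, grid_size)
    gx, gy = np.meshgrid(xs, xs)
    grid = np.column_stack([gx.ravel(), gy.ravel()])
    # Distance from each grid cell to its nearest observed sample.
    dists = np.linalg.norm(grid[:, None, :] - X[None, :, :], axis=2).min(axis=1)
    return (dists > radius).reshape(grid_size, grid_size)  # True = ignorance zone
```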
Adversarial Training with Voronoi Constraints
Khoury, Marc, Hadfield-Menell, Dylan
Adversarial examples are a pervasive phenomenon of machine learning models where seemingly imperceptible perturbations to the input lead to misclassifications for otherwise statistically accurate models. We propose a geometric framework, drawing on tools from the manifold reconstruction literature, to analyze the high-dimensional geometry of adversarial examples. In particular, we highlight the importance of codimension: for low-dimensional data manifolds embedded in high-dimensional space there are many directions off the manifold in which an adversary could construct adversarial examples. Adversarial examples are a natural consequence of learning a decision boundary that classifies the low-dimensional data manifold well, but classifies points near the manifold incorrectly. Using our geometric framework we prove that adversarial training is sample inefficient, and show sufficient sampling conditions under which nearest neighbor classifiers and ball-based adversarial training are robust. Finally we introduce adversarial training with Voronoi constraints, which replaces the norm ball constraint with the Voronoi cell for each point in the training set. We show that adversarial training with Voronoi constraints produces robust models which significantly improve over the state-of-the-art on MNIST and are competitive on CIFAR-10.
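The Voronoi constraint itself is easy to state in code. Below is a minimal membership check (a sketch only; the full method constrains adversarial-training iterates to the cell, and any projection step is omitted here):

```python
import numpy as np

def in_voronoi_cell(x_adv, x, X_train):
    """A perturbed point x_adv satisfies the Voronoi constraint iff it is
    still at least as close to its source x as to any other training point,
    i.e., it lies in the Voronoi cell of x with respect to X_train."""
    d_source = np.linalg.norm(x_adv - x)
    d_others = np.linalg.norm(X_train - x_adv, axis=1)
    # If x itself appears in X_train, its distance equals d_source and passes.
    return bool(np.all(d_source <= d_others + 1e-12))
```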
Indoor Localization for IoT Using Adaptive Feature Selection: A Cascaded Machine Learning Approach
AlHajri, Mohamed I., Ali, Nazar T., Shubair, Raed M.
Evolving Internet-of-Things (IoT) applications often require sensor-based indoor tracking and positioning, whose performance is significantly improved by identifying the type of the surrounding indoor environment. This identification is of high importance since it leads to higher localization accuracy. This paper presents a novel method based on a cascaded two-stage machine learning approach for highly accurate and robust localization in indoor environments using adaptive selection and combination of RF features. In the proposed method, machine learning is first used to identify the type of the surrounding indoor environment. In the second stage, machine learning is employed to identify the most appropriate selection and combination of RF features that yield the highest localization accuracy. The analysis is based on the k-Nearest Neighbor (k-NN) algorithm applied to a real dataset generated from practical measurements of the RF signal in realistic indoor environments. Received Signal Strength, Channel Transfer Function, and Frequency Coherence Function are the primary RF features being explored and combined. Numerical investigations demonstrate that prediction based on the concatenation of primary RF features is enhanced significantly, with localization accuracy improving by at least 50% and in some cases by more than 70%.
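A schematic of the cascade (the environment labels, feature-block choices, and classifier objects below are hypothetical placeholders, not the paper's learned configuration):

```python
import numpy as np

# Hypothetical stage-2 feature combinations per environment type.
BEST_FEATURES = {
    'office':   ['rss', 'ctf'],
    'corridor': ['rss', 'fcf'],
    'hall':     ['rss', 'ctf', 'fcf'],
}

def cascaded_localize(sample, env_clf, loc_clfs):
    """Two-stage cascade: stage 1 predicts the indoor environment type,
    stage 2 runs the k-NN localizer trained on the RF feature combination
    chosen for that environment. `sample` maps block name -> feature vector;
    env_clf and loc_clfs are assumed to be pre-trained classifiers, and
    using only RSS in stage 1 is an illustrative choice."""
    env = env_clf.predict([sample['rss']])[0]                    # stage 1
    x = np.concatenate([sample[b] for b in BEST_FEATURES[env]])  # stage 2 features
    return loc_clfs[env].predict([x])[0]                         # predicted location
```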
Stochastic Optimization of Sorting Networks via Continuous Relaxations
Grover, Aditya, Wang, Eric, Zweig, Aaron, Ermon, Stefano
Sorting input objects is an important step in many machine learning pipelines. However, the sorting operator is non-differentiable with respect to its inputs, which prohibits end-to-end gradient-based optimization. In this work, we propose NeuralSort, a general-purpose continuous relaxation of the output of the sorting operator from permutation matrices to the set of unimodal row-stochastic matrices, where every row sums to one and has a distinct arg max. This relaxation permits straight-through optimization of any computational graph involving a sorting operation. Further, we use this relaxation to enable gradient-based stochastic optimization over the combinatorially large space of permutations by deriving a reparameterized gradient estimator for the Plackett-Luce family of distributions over permutations. We demonstrate the usefulness of our framework on three tasks that require learning semantic orderings of high-dimensional objects, including a fully differentiable, parameterized extension of the k-nearest neighbors algorithm.

Learning to automatically sort objects is useful in many machine learning applications, such as top-k multi-class classification (Berrada et al., 2018), ranking documents for information retrieval (Liu et al., 2009), and multi-object target tracking in computer vision (Bar-Shalom & Li, 1995). Such algorithms typically require learning informative representations of complex, high-dimensional data, such as images, before sorting and subsequent downstream processing. For instance, the k-nearest neighbors image classification algorithm, which orders the neighbors based on distances in the canonical pixel basis, can be highly suboptimal for classification (Weinberger et al., 2006). Deep neural networks can instead be used to learn representations, but these representations cannot be optimized end-to-end for a downstream sorting-based objective, since the sorting operator is not differentiable with respect to its input. In this work, we seek to remedy this shortcoming by proposing NeuralSort, a continuous relaxation of the sorting operator that is differentiable almost everywhere with respect to the inputs. The output of any sorting algorithm can be viewed as a permutation matrix, which is a square matrix with entries in {0, 1} such that every row and every column sums to 1. Instead of a permutation matrix, NeuralSort returns a unimodal row-stochastic matrix: a square matrix with positive real entries, where each row sums to 1 and has a distinct arg max.
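The relaxation admits a compact closed form. Following our reading of the published paper (the exact scaling below should be treated as a reconstruction and checked against the original), with $A_s$ the matrix of absolute pairwise score differences, the $i$-th row of the relaxed matrix is $\mathrm{softmax}\big(((n + 1 - 2i)\,s - A_s \mathbf{1})/\tau\big)$. A NumPy sketch:

```python
import numpy as np

def neuralsort(s, tau=1.0):
    """Continuous relaxation of the (descending) sorting operator.
    Input: scores s of shape (n,). Output: an n x n unimodal
    row-stochastic matrix; as tau -> 0 it approaches the exact
    permutation matrix that sorts s. Reconstructed from the paper's
    description; verify the scaling against the original."""
    n = len(s)
    A = np.abs(s[:, None] - s[None, :])          # A[j, k] = |s_j - s_k|
    i = np.arange(1, n + 1)[:, None]             # row index i = 1..n
    logits = ((n + 1 - 2 * i) * s[None, :] - A.sum(axis=1)[None, :]) / tau
    e = np.exp(logits - logits.max(axis=1, keepdims=True))   # row-wise softmax
    return e / e.sum(axis=1, keepdims=True)

P = neuralsort(np.array([1.0, 3.0, 2.0]), tau=0.1)
print(P.argmax(axis=1))  # [1 2 0]: rows point at the 1st, 2nd, 3rd largest score
```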
Higher Accurate Recognition of Handwritten Pashto Letters through Zoning Feature by using K-Nearest Neighbour and Artificial Neural Network
Khan, Sulaiman, Ali, Hazrat, Ullah, Zahid, Minallah, Nasru, Maqsood, Shahid, Hafeez, Abdul
This paper presents a recognition system for handwritten Pashto letters. Handwritten character recognition is a challenging task: these letters not only differ in shape and style but also vary among individuals. The recognition becomes further daunting due to the lack of standard datasets for inscribed Pashto letters. In this work, we have designed a database of moderate size, which encompasses a total of 4488 images, stemming from 102 distinct samples for each of the 44 letters in Pashto. The recognition framework uses a zoning feature extractor followed by K-Nearest Neighbour (KNN) and Neural Network (NN) classifiers for classifying individual letters. Based on the evaluation of the proposed system, an overall classification accuracy of approximately 70.05% is achieved using KNN, while 72% is achieved using NN.
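A zoning extractor of the kind described is short to sketch (the 4x4 grid and binarization convention here are assumptions; the paper's exact zoning configuration may differ):

```python
import numpy as np

def zoning_features(img, zones=(4, 4)):
    """Split a binarized letter image (foreground = 1) into a zones[0] x
    zones[1] grid and use each zone's ink density as one feature."""
    zh, zw = img.shape[0] // zones[0], img.shape[1] // zones[1]
    feats = [img[r * zh:(r + 1) * zh, c * zw:(c + 1) * zw].mean()
             for r in range(zones[0]) for c in range(zones[1])]
    return np.array(feats)  # length zones[0] * zones[1], fed to KNN or NN
```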
SANNS: Scaling Up Secure Approximate k-Nearest Neighbors Search
Chen, Hao, Chillotti, Ilaria, Dong, Yihe, Poburinnaya, Oxana, Razenshteyn, Ilya, Riazi, M. Sadegh
We present new secure protocols for approximate $k$-nearest neighbor search ($k$-NNS) over the Euclidean distance in the semi-honest model. Our implementation is able to handle massive datasets efficiently. On the algorithmic front, we show a new circuit for approximate top-$k$ selection from $n$ numbers that is built from merely $O(n + \mathrm{poly}(k))$ comparators. Using this circuit as a subroutine, we design new approximate $k$-NNS algorithms and two corresponding secure protocols: 1) an optimized linear scan; 2) a clustering-based sublinear-time algorithm. Our secure protocols utilize a combination of additively homomorphic encryption, garbled circuits, and Oblivious RAM. Along the way, we introduce various optimizations to these primitives, which drastically improve concrete efficiency. We evaluate the new protocols empirically and show that they are able to handle datasets that are significantly larger than in prior work. For instance, running on two standard Azure instances within the same availability zone, for a dataset of $96$-dimensional descriptors of $10\,000\,000$ images, we can find $10$ nearest neighbors with average accuracy $0.9$ in under $10$ seconds, improving upon prior work by at least two orders of magnitude.
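Stripped of the cryptography, the clustering-based approach follows a familiar plaintext skeleton (this sketch shows only that skeleton; the cluster count and probing depth are illustrative, and the secure protocol evaluates the analogous computation under encryption):

```python
import numpy as np
from sklearn.cluster import KMeans

def clustered_knns(X, q, k=10, n_clusters=100, n_probe=5):
    """Approximate k-NNS: scan only the n_probe clusters whose centroids
    are closest to the query q, instead of the full dataset X."""
    km = KMeans(n_clusters=n_clusters, n_init=4, random_state=0).fit(X)
    near = np.argsort(np.linalg.norm(km.cluster_centers_ - q, axis=1))[:n_probe]
    cand = np.flatnonzero(np.isin(km.labels_, near))   # candidate point indices
    d = np.linalg.norm(X[cand] - q, axis=1)
    return cand[np.argsort(d)[:k]]                     # approximate k nearest
```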
Unveiling phase transitions with machine learning
Canabarro, Askery, Fanchini, Felipe Fernandes, Malvezzi, André Luiz, Pereira, Rodrigo, Chaves, Rafael
The classification of phase transitions is a central and challenging task in condensed matter physics. Typically, it relies on the identification of order parameters and the analysis of singularities in the free energy and its derivatives. Here, we propose an alternative framework to identify quantum phase transitions, employing both unsupervised and supervised machine learning techniques. Using the axial next-nearest neighbor Ising (ANNNI) model as a benchmark, we show how unsupervised learning can detect three phases (ferromagnetic, paramagnetic, and a cluster of the antiphase with the floating phase) as well as two distinct regions within the paramagnetic phase. Employing supervised learning, we show that transfer learning becomes possible: a machine trained only with nearest-neighbour interactions can learn to identify a new type of phase occurring when next-nearest-neighbour interactions are introduced. All our results rely on few, low-dimensional input data (up to twelve lattice sites), thus providing a computationally friendly and general framework for the study of phase transitions in many-body systems.
Comparison of Possibilistic Fuzzy Local Information C-Means and Possibilistic K-Nearest Neighbors for Synthetic Aperture Sonar Image Segmentation
Peeples, Joshua, Cook, Matthew, Suen, Daniel, Zare, Alina, Keller, James
Synthetic aperture sonar (SAS) imagery can generate high-resolution images of the seafloor. Thus, segmentation algorithms can be used to partition the images into different seafloor environments. In this paper, we compare two possibilistic segmentation approaches. Possibilistic approaches can detect novel or outlier environments as well as well-known classes. The Possibilistic Fuzzy Local Information C-Means (PFLICM) algorithm has been previously applied to segment SAS imagery. Additionally, the Possibilistic K-Nearest Neighbors (PKNN) algorithm has been used in other domains such as landmine detection and hyperspectral imagery. Specifically, we compare the segmentation performance of a semi-supervised approach using PFLICM and a supervised method using Possibilistic K-NN. We include final segmentation results on multiple SAS images and a quantitative assessment of each algorithm.
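To make the contrast concrete, here is a hedged sketch of the possibilistic idea behind PKNN (the typicality kernel and the scale parameter eta are our illustrative choices, not the exact PKNN formulation): per-class typicalities need not sum to one, so a sample far from every class scores uniformly low and can be flagged as a novel environment.

```python
import numpy as np

def possibilistic_knn(X_train, y_train, x, k=5, eta=1.0):
    """Return a per-class typicality for sample x based on its k nearest
    training samples of each class. Unlike probabilistic k-NN, the scores
    are not normalized across classes, so 'low everywhere' = outlier."""
    typ = {}
    for c in np.unique(y_train):
        d = np.linalg.norm(X_train[y_train == c] - x, axis=1)
        d_k = np.sort(d)[:k]                  # k nearest same-class distances
        typ[c] = float(np.mean(1.0 / (1.0 + (d_k / eta) ** 2)))
    return typ
```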