Nearest Neighbor Methods
FastIF: Scalable Influence Functions for Efficient Model Interpretation and Debugging
Guo, Han, Rajani, Nazneen Fatema, Hase, Peter, Bansal, Mohit, Xiong, Caiming
Influence functions approximate the 'influences' of training data-points for test predictions and have a wide variety of applications. Despite the popularity, their computational cost does not scale well with model and training data size. We present FastIF, a set of simple modifications to influence functions that significantly improves their run-time. We use k-Nearest Neighbors (kNN) to narrow the search space down to a subset of good candidate data points, identify the configurations that best balance the speed-quality trade-off in estimating the inverse Hessian-vector product, and introduce a fast parallel variant. Our proposed method achieves about 80x speedup while being highly correlated with the original influence values. With the availability of the fast influence functions, we demonstrate their usefulness in four applications. First, we examine whether influential data-points can 'explain' test time behavior using the framework of simulatability. Second, we visualize the influence interactions between training and test data-points. Third, we show that we can correct model errors by additional fine-tuning on certain influential data-points, improving the accuracy of a trained MNLI model by 2.6% on the HANS challenge set using a small number of gradient updates. Finally, we experiment with a data-augmentation setup where we use influence functions to search for new data-points unseen during training to improve model performance. Overall, our fast influence functions can be efficiently applied to large models and datasets, and our experiments demonstrate the potential of influence functions in model interpretation and correcting model errors. Code is available at https://github.com/salesforce/fast-influence-functions
Bandit-Based Monte Carlo Optimization for Nearest Neighbors
Bagaria, Vivek, Baharav, Tavor Z., Kamath, Govinda M., Tse, David N.
The celebrated Monte Carlo method estimates an expensive-to-compute quantity by random sampling. Bandit-based Monte Carlo optimization is a general technique for computing the minimum of many such expensive-to-compute quantities by adaptive random sampling. The technique converts an optimization problem into a statistical estimation problem which is then solved via multi-armed bandits. We apply this technique to solve the problem of high-dimensional k-nearest neighbors, developing an algorithm which we prove is able to identify exact nearest neighbors with high probability. We show that under regularity assumptions on a dataset of n points in d-dimensional space, the complexity of our algorithm scales logarithmically with the dimension of the data as $O((n+d)\log^2 (\frac{nd}{\delta}))$ for error probability $\delta$, rather than linearly as in exact computation requiring O(nd). We corroborate our theoretical results with numerical simulations, showing that our algorithm outperforms both exact computation and state-of-the-art algorithms such as kGraph, NGT, and LSH on real datasets.
How to use Machine Learning for Anomaly Detection and Conditional Monitoring - KDnuggets
Before doing any data analysis, the need to find out any outliers in a dataset arises. These outliers are known as anomalies. This article explains the goals of anomaly detection and outlines the approaches used to solve specific use cases for anomaly detection and condition monitoring. The main goal of Anomaly Detection analysis is to identify the observations that do not adhere to general patterns considered as normal behavior. For instance, Figure 1 shows anomalies in the classification and regression problems.
Balancing Geometry and Density: Path Distances on High-Dimensional Data
Little, Anna, McKenzie, Daniel, Murphy, James
New geometric and computational analyses of power-weighted shortest-path distances (PWSPDs) are presented. By illuminating the way these metrics balance density and geometry in the underlying data, we clarify their key parameters and discuss how they may be chosen in practice. Comparisons are made with related data-driven metrics, which illustrate the broader role of density in kernel-based unsupervised and semi-supervised machine learning. Computationally, we relate PWSPDs on complete weighted graphs to their analogues on weighted nearest neighbor graphs, providing high probability guarantees on their equivalence that are near-optimal. Connections with percolation theory are developed to establish estimates on the bias and variance of PWSPDs in the finite sample setting. The theoretical results are bolstered by illustrative experiments, demonstrating the versatility of PWSPDs for a wide range of data settings. Throughout the paper, our results require only that the underlying data is sampled from a low-dimensional manifold, and depend crucially on the intrinsic dimension of this manifold, rather than its ambient dimension.
KNN Classification with One-step Computation
KNN classification is a query triggered yet improvisational learning mode, in which they are carried out only when a test data is predicted that set a suitable K value and search the K nearest neighbors from the whole training sample space, referred them to the lazy part of KNN classification. This lazy part has been the bottleneck problem of applying KNN classification. In this paper, a one-step computation is proposed to replace the lazy part of KNN classification. The one-step computation actually transforms the lazy part to a matrix computation as follows. Given a test data, training samples are first applied to fit the test data with the least squares loss function. And then, a relationship matrix is generated by weighting all training samples according to their influence on the test data. Finally, a group lasso is employed to perform sparse learning of the relationship matrix. In this way, setting K value and searching K nearest neighbors are both integrated to a unified computation. In addition, a new classification rule is proposed for improving the performance of one-step KNN classification. The proposed approach is experimentally evaluated, and demonstrated that the one-step KNN classification is efficient and promising.
5 Steps to Build a KNN Classifier
The k-nearest neighbor algorithm is applied to different classification and regression problems. The closest k training samples are used to predict the class of new input data, i.e., the most similar samples already known are used to classify an unknown data sample. Since the sci-kit library provides all the necessary tools to work on this algorithm, you can use these 5 steps to build your own KNN classifier in Python! As usual, start with importing all necessary libraries needed. This command builds an easy to handle data frame and decreases the complexity of working on the data set.
How to Choose the Best Nearest Neighbors Algorithm
In my previous post [KNN is Dead!], I have compared an ANN algorithm called HNSW with sklearn's KNN and proved that HNSW has vastly superior performance with a 380X speed up while delivering 99.3% of the same results. As a data scientist, I am a huge proponent of making data-driven decisions, as I mentioned in How to Choose the Best Keras Pre-Trained Model. So, in this post, I'll demonstrate a data-driven way to decide which ANN algorithm is the best choice for your custom dataset by using the excellent ann-benchmarks GitHub repository. The ann-benchmarks code compares multiple ANN algorithms by plotting each algorithm's Recall vs Queries per second.
Similarity measure for aggregated fuzzy numbers from interval-valued data
Gunn, Justin Kane, Khorshidi, Hadi Akbarzadeh, Aickelin, Uwe
Areas covering algorithms that commonly require measurements of similarity within data include classification, ranking, decision-making and pattern-matching. A similarity measure can effectively substitute for a distance measure (e.g. Euclidean distance), making data types with defined similarity measures compatible with methods such as K-Nearest Neighbour [1, 2] and TOPSIS [3, 4, 5]. This study proposes a similarity measure for aggregate fuzzy numbers constructed from interval-valued data using the Interval Agreement Approach (IAA), that is when given two such fuzzy numbers the degree of similarity regarding them is computed. The experimental interval-valued data in recent literature is often an alternative representation of expert opinion.
Radar Artifact Labeling Framework (RALF): Method for Plausible Radar Detections in Datasets
Isele, Simon T., Schilling, Marcel P., Klein, Fabian E., Saralajew, Sascha, Zoellner, J. Marius
Research on localization and perception for Autonomous Driving is mainly focused on camera and LiDAR datasets, rarely on radar data. Manually labeling sparse radar point clouds is challenging. For a dataset generation, we propose the cross sensor Radar Artifact Labeling Framework (RALF). Automatically generated labels for automotive radar data help to cure radar shortcomings like artifacts for the application of artificial intelligence. RALF provides plausibility labels for radar raw detections, distinguishing between artifacts and targets. The optical evaluation backbone consists of a generalized monocular depth image estimation of surround view cameras plus LiDAR scans. Modern car sensor sets of cameras and LiDAR allow to calibrate image-based relative depth information in overlapping sensing areas. K-Nearest Neighbors matching relates the optical perception point cloud with raw radar detections. In parallel, a temporal tracking evaluation part considers the radar detections' transient behavior. Based on the distance between matches, respecting both sensor and model uncertainties, we propose a plausibility rating of every radar detection. We validate the results by evaluating error metrics on semi-manually labeled ground truth dataset of $3.28\cdot10^6$ points. Besides generating plausible radar detections, the framework enables further labeled low-level radar signal datasets for applications of perception and Autonomous Driving learning tasks.
Machine Learning -- K-Nearest Neighbors algorithm with Python
'K-Nearest Neighbors (KNN) is a model that classifies data points based on the points that are most similar to it. It uses test data to make an "educated guess" on what an unclassified point should be classified as' We will be building our KNN model using python's most popular machine learning package'scikit-learn'. Scikit-learn provides data scientists with various tools for performing machine learning tasks. For our KNN model, we are going to use the'KNeighborsClassifier' algorithm which is readily available in scikit-learn package. Finally, we will evaluate our KNN model predictions using the'accuracy score' function in scikit-learn.