Nearest Neighbor Methods
Diffusion Decision Making for Adaptive k-Nearest Neighbor Classification
We show that conventional k-nearest neighbor classification can be viewed as a special case of the diffusion decision model in the asymptotic regime. By applying the optimal strategy associated with the diffusion decision model, an adaptive rule is developed for determining appropriate values of k in k-nearest neighbor classification. Making use of the sequential probability ratio test (SPRT) and Bayesian analysis, we propose five different criteria for adaptively acquiring nearest neighbors. Experiments with both synthetic and real datasets demonstrate the effectiveness of our classification criteria.
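The abstract above describes acquiring neighbors one at a time and stopping once the accumulated evidence crosses a decision boundary, as in a diffusion decision process. The following Python sketch illustrates that idea for binary classification; the fixed integer threshold and the `k_max` cap are illustrative stand-ins for the SPRT and Bayesian stopping criteria developed in the paper, not the paper's actual rules.

```python
import numpy as np

def sprt_adaptive_knn_predict(X_train, y_train, x, k_max=50, threshold=3):
    """Binary-classification sketch: acquire neighbors in order of distance and
    stop early once the vote difference (a simple SPRT-like statistic) crosses
    a fixed boundary, in the spirit of a diffusion decision process.
    Labels are assumed to be in {0, 1}; `threshold` and `k_max` are illustrative."""
    # Sort training points by distance to the query once; "acquire" them in order.
    order = np.argsort(np.linalg.norm(X_train - x, axis=1))
    evidence = 0  # random-walk statistic: +1 for class 1, -1 for class 0
    for i, idx in enumerate(order[:k_max], start=1):
        evidence += 1 if y_train[idx] == 1 else -1
        if abs(evidence) >= threshold:       # SPRT-style stopping boundary
            return int(evidence > 0), i      # decision and number of neighbors used
    return int(evidence > 0), k_max          # fall back to a majority vote at k_max
```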
Non-linear Metric Learning
How to compare examples is a fundamental question in machine learning. If an algorithm could perfectly determine whether two examples were semantically similar or dissimilar, most subsequent machine learning tasks would become trivial (e.g., a nearest neighbor classifier would achieve perfect results). Guided by this motivation, a surge of recent research [10, 13, 15, 24, 31, 32] has focused on Mahalanobis metric learning. The resulting methods greatly improve the performance of metric-dependent algorithms, such as k-means clustering and kNN classification, and have gained popularity in many research areas and applications within and beyond machine learning. One reason for this success is the out-of-the-box usability and robustness of several popular methods for learning these linear metrics.
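Because a Mahalanobis metric with matrix M = LᵀL is equivalent to Euclidean distance after the linear map x ↦ Lx, any learned linear metric can be dropped into a standard kNN classifier. The sketch below assumes L has already been produced by some metric-learning method (LMNN, ITML, etc.) and is not tied to any particular algorithm from the cited work.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def knn_with_linear_metric(X_train, y_train, X_test, L=None, k=5):
    """Run kNN under a learned Mahalanobis metric M = L^T L by mapping all
    points through L and using ordinary Euclidean distance afterwards.
    L defaults to the identity (plain Euclidean kNN) as a placeholder."""
    d = X_train.shape[1]
    L = np.eye(d) if L is None else L
    clf = KNeighborsClassifier(n_neighbors=k)
    clf.fit(X_train @ L.T, y_train)
    return clf.predict(X_test @ L.T)
```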
Rapid Distance-Based Outlier Detection via Sampling
Distance-based approaches to outlier detection are popular in data mining, as they do not require modeling the underlying probability distribution, which is particularly challenging for high-dimensional data. We present an empirical comparison of various approaches to distance-based outlier detection across a large number of datasets. We report the surprising observation that a simple, sampling-based scheme outperforms state-of-the-art techniques in terms of both efficiency and effectiveness. To better understand this phenomenon, we provide a theoretical analysis of why the sampling-based approach outperforms alternative methods based on k-nearest neighbor search.
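A minimal sketch of the kind of sampling-based scheme the abstract refers to: score each point by its distance to the nearest member of a small random sample of the data. The sample size of 20 is an illustrative choice, not a recommendation from the paper.

```python
import numpy as np

def sampling_outlier_scores(X, sample_size=20, rng=None):
    """Score each point by its distance to the nearest neighbor inside a small
    random sample of the data; larger scores indicate stronger outliers."""
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    sample_idx = rng.choice(n, size=min(sample_size, n), replace=False)
    S = X[sample_idx]
    # Pairwise distances from every point to the sample (shape: n x sample_size).
    d = np.linalg.norm(X[:, None, :] - S[None, :, :], axis=2)
    # A point drawn into the sample has distance 0 to itself; mask that out.
    d[sample_idx, np.arange(len(sample_idx))] = np.inf
    return d.min(axis=1)
```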
Reviews
Summary: This paper presents a multitask learning method which entails jointly solving a collection of k-nearest neighbor (kNN) based prediction tasks, leveraging the relationships among the tasks. Whereas single-task learning for kNN would only consider neighbors from the task to which the test point belongs (referred to as the "homogeneous neighborhood" in the paper), the multitask variant proposed here considers neighbors from all tasks (referred to as the "heterogeneous neighborhood" in the paper), suitably weighting the contribution of each neighbor by the pairwise similarity between the task the test point belongs to and the task the neighbor belongs to. The pairwise task similarities are learned from data. Experimental results show that the proposed method performs better than a kNN-based multitask learning method and a global multitask learning method that learns a common feature representation of all tasks and learns predictors using that representation. Quality: The proposed model makes sense, especially the way a local learning problem (neighborhood-based kNN) has been reformulated as a global learning problem (like SVM) and then cast as a standard global multitask learning problem. Clarity: The paper is well-written and the idea is easy to follow.
Heterogeneous-Neighborhood-based Multi-Task Local Learning Algorithms
All existing multi-task local learning methods are defined on a homogeneous neighborhood, which consists of data points from only one task. In this paper, differently from existing methods, we propose local learning methods for multi-task classification and regression problems based on a heterogeneous neighborhood, which is defined on data points from all tasks. Specifically, we extend the k-nearest-neighbor classifier by formulating the decision function for each data point as a weighted voting among the neighbors from all tasks, where the weights are task-specific. By defining a regularizer that enforces the task-specific weight matrix to approach a symmetric one, a regularized objective function is proposed and an efficient coordinate descent method is developed to solve it. For regression problems, we extend kernel regression to the multi-task setting in a similar way to the classification case. Experiments on toy data and real-world datasets demonstrate the effectiveness of our proposed methods.
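A rough sketch of the heterogeneous-neighborhood voting rule for binary classification follows, assuming the task-similarity weight matrix W has already been learned; the regularized objective and coordinate-descent training described above are not reproduced here.

```python
import numpy as np

def multitask_knn_predict(X, y, tasks, x_query, task_query, W, k=5):
    """Heterogeneous-neighborhood kNN sketch: neighbors are drawn from all
    tasks and each vote is weighted by W[task_query, task_of_neighbor].
    `tasks` gives the task index of each training point; labels are in {-1, +1}.
    W is a learned task-similarity matrix and is simply passed in here."""
    dist = np.linalg.norm(X - x_query, axis=1)
    nn = np.argsort(dist)[:k]                      # k nearest points across all tasks
    score = sum(W[task_query, tasks[i]] * y[i] for i in nn)
    return 1 if score >= 0 else -1
```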
Near-optimal Anomaly Detection in Graphs using Lovász Extended Scan Statistic
The detection of anomalous activity in graphs is a statistical problem that arises in many applications, such as network surveillance, disease outbreak detection, and activity monitoring in social networks. Beyond its wide applicability, graph-structured anomaly detection serves as a case study in the difficulty of balancing computational complexity with statistical power. In this work, we develop from first principles the generalized likelihood ratio test for determining if there is a well connected region of activation over the vertices in the graph in Gaussian noise. Because this test is computationally infeasible, we provide a relaxation, called the Lovász extended scan statistic (LESS), that uses submodularity to approximate the intractable generalized likelihood ratio. We demonstrate a connection between LESS and maximum a posteriori inference in Markov random fields, which provides us with a poly-time algorithm for LESS. Using electrical network theory, we are able to control Type 1 error for LESS and prove conditions under which LESS is risk consistent. Finally, we consider specific graph models: the torus, k-nearest neighbor graphs, and ε-random graphs. We show that on these graphs our results provide near-optimal performance by matching our results to known lower bounds.
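For intuition, the generalized likelihood ratio statistic for an elevated-mean cluster C in Gaussian noise is the sum of the observations over C divided by sqrt(|C|), maximized over connected vertex sets. The brute-force sketch below enumerates every connected subset and is exponential in the graph size; it only illustrates the intractable objective that LESS relaxes, and it ignores the paper's "well connected" refinement. Vertices are assumed to be labelled 0..n-1, with `adj` an adjacency dict and `y` an aligned array of observations.

```python
import itertools
import numpy as np

def connected(nodes, adj):
    """Check whether `nodes` induces a connected subgraph of adjacency dict `adj`."""
    nodes = set(nodes)
    stack, seen = [next(iter(nodes))], set()
    while stack:
        v = stack.pop()
        if v in seen:
            continue
        seen.add(v)
        stack.extend(u for u in adj[v] if u in nodes and u not in seen)
    return seen == nodes

def brute_force_scan_statistic(y, adj):
    """Exhaustive GLRT scan for an elevated-mean cluster in Gaussian noise:
    max over connected vertex sets C of sum(y[C]) / sqrt(|C|)."""
    vertices = list(adj)
    best, best_set = -np.inf, None
    for r in range(1, len(vertices) + 1):
        for C in itertools.combinations(vertices, r):
            if connected(C, adj):
                stat = y[list(C)].sum() / np.sqrt(r)
                if stat > best:
                    best, best_set = stat, C
    return best, best_set
```

For example, `brute_force_scan_statistic(np.array([0.1, 2.3, 1.8, -0.2]), {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]})` scans a four-vertex path graph.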
Discriminative Metric Learning by Neighborhood Gerrymandering
We formulate the problem of metric learning for k-nearest neighbor classification as a large-margin structured prediction problem, with a latent variable representing the choice of neighbors and the task loss directly corresponding to classification error. We describe an efficient algorithm for exact loss-augmented inference, and a fast gradient descent algorithm for learning in this model. The objective drives the metric to establish neighborhood boundaries that benefit the true class labels for the training points. Our approach, reminiscent of gerrymandering (redrawing political boundaries to advantage certain parties), is more direct in optimizing classification accuracy than previously proposed approaches. In experiments on a variety of datasets our method is shown to achieve excellent results compared to the current state of the art in metric learning.
Near-optimal sample compression for nearest neighbors
We present the first sample compression algorithm for nearest neighbors with nontrivial performance guarantees. We complement these guarantees by demonstrating almost matching hardness lower bounds, which show that our bound is nearly optimal. Our result yields new insight into margin-based nearest neighbor classification in metric spaces and allows us to significantly sharpen and simplify existing bounds. Some encouraging empirical results are also presented.
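As background on what a compression scheme for nearest neighbors does, the classic greedy condensing heuristic (Hart's condensed nearest neighbor) below shrinks a training set to a subset that still classifies the full set correctly under 1-NN. It is shown only for illustration and carries none of the margin-based guarantees or near-optimality results established in the paper.

```python
import numpy as np

def condensed_nearest_neighbor(X, y, rng=None):
    """Greedy condensing heuristic (Hart's CNN): keep adding misclassified
    points to the prototype set until 1-NN on the prototypes classifies every
    training point correctly.  Returns the indices of the kept prototypes."""
    rng = np.random.default_rng(rng)
    order = rng.permutation(len(X))
    keep = [order[0]]
    changed = True
    while changed:
        changed = False
        for i in order:
            # 1-NN prediction of point i using only the kept prototypes.
            j = keep[np.argmin(np.linalg.norm(X[keep] - X[i], axis=1))]
            if y[j] != y[i]:
                keep.append(i)
                changed = True
    return np.array(keep)
```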
On the Performance of Imputation Techniques for Missing Values on Healthcare Datasets
Joel, Luke Oluwaseye, Doorsamy, Wesley, Paul, Babu Sena
Missing values are a common characteristic of real-world datasets, especially healthcare data. This is problematic when applying machine learning algorithms to such datasets, because most machine learning models perform poorly in the presence of missing values. The aim of this study is to compare the performance of seven imputation techniques, namely Mean imputation, Median imputation, Last Observation Carried Forward (LOCF) imputation, K-Nearest Neighbor (KNN) imputation, Interpolation imputation, MissForest imputation, and Multiple Imputation by Chained Equations (MICE), on three healthcare datasets. Missing values were introduced into the datasets at rates of 10%, 15%, 20%, and 25%, and the imputation techniques were employed to impute them. Performance was evaluated using root mean squared error (RMSE) and mean absolute error (MAE). The results show that MissForest imputation performs best, followed by MICE. Additionally, we investigate whether it is better to perform feature selection before imputation or vice versa, using recall, precision, F1-score, and accuracy as metrics. Given the limited literature on this question and the ongoing debate among researchers, we hope that the results of this experiment will encourage data scientists and researchers to perform imputation before feature selection when dealing with data containing missing values.
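A minimal sketch of the evaluation protocol described above, using scikit-learn's imputers for the mean, KNN, and MICE-style (IterativeImputer) methods; MissForest, LOCF, and the other techniques require additional packages and are omitted here. The 10% masking rate mirrors the smallest setting in the study.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import SimpleImputer, KNNImputer, IterativeImputer

def compare_imputers(X_complete, missing_rate=0.10, rng=0):
    """Mask a fraction of entries at random, impute them with several methods,
    and report RMSE on the masked entries."""
    rng = np.random.default_rng(rng)
    mask = rng.random(X_complete.shape) < missing_rate
    X_missing = X_complete.copy()
    X_missing[mask] = np.nan
    imputers = {
        "mean": SimpleImputer(strategy="mean"),
        "knn": KNNImputer(n_neighbors=5),
        "mice": IterativeImputer(max_iter=10, random_state=0),
    }
    scores = {}
    for name, imp in imputers.items():
        X_imputed = imp.fit_transform(X_missing)
        rmse = np.sqrt(np.mean((X_imputed[mask] - X_complete[mask]) ** 2))
        scores[name] = rmse
    return scores
```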