Nearest Neighbor Methods


Provably Robust Metric Learning

arXiv.org Machine Learning

Metric learning is an important family of machine learning algorithms and has achieved success on several problems, including computer vision [24, 17, 18], text analysis [27], meta learning [38, 35], and others [34, 45, 47]. Given a set of training samples, metric learning aims to learn a good distance measure such that items in the same class are closer to each other in the learned metric space, which is crucial for classification and similarity search. Since this objective directly matches the assumption underlying nearest neighbor classifiers, most metric learning algorithms can be naturally and successfully combined with K-Nearest Neighbor (K-NN) classifiers. Adversarial robustness of machine learning algorithms has been studied extensively in recent years due to the need for robustness guarantees in real-world systems. It has been demonstrated that neural networks can be easily attacked by adversarial perturbations in the input space [37, 16, 2], and such perturbations can be computed efficiently in both white-box [4, 29] and black-box settings [7, 19, 9]. Therefore, many defense algorithms have been proposed to improve the robustness of neural networks [26, 29].
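
The coupling between metric learning and K-NN described above can be illustrated with a short, generic sketch. It uses scikit-learn's Neighborhood Components Analysis as an assumed stand-in for a learned metric, not the method of this particular paper.

```python
# A minimal sketch (not this paper's algorithm): pair a learned linear metric
# (Neighborhood Components Analysis) with a k-NN classifier, the standard way
# metric learning is combined with nearest neighbor classification.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier, NeighborhoodComponentsAnalysis
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Learn a transformation that pulls same-class points together, then
# classify in the transformed (learned-metric) space with 3-NN.
model = make_pipeline(
    StandardScaler(),
    NeighborhoodComponentsAnalysis(n_components=16, random_state=0),
    KNeighborsClassifier(n_neighbors=3),
)
model.fit(X_train, y_train)
print("k-NN accuracy in the learned metric space:", model.score(X_test, y_test))
```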


Smartphone Transportation Mode Recognition Using a Hierarchical Machine Learning Classifier and Pooled Features From Time and Frequency Domains

arXiv.org Machine Learning

This paper develops a novel two-layer hierarchical classifier that increases the accuracy of traditional transportation mode classification algorithms. It also improves classification accuracy by extracting new frequency-domain features. Many researchers have obtained such features from global positioning system (GPS) data; that data is excluded here, however, because using GPS can deplete the smartphone's battery and its signal may be lost in some areas. Our proposed two-layer framework differs from previous classification attempts in three distinct ways: 1) the outputs of the two layers are combined using Bayes' rule to choose the transportation mode with the largest posterior probability; 2) the framework combines the newly extracted features with traditionally used time-domain features to create a pool of features; and 3) a different subset of extracted features is used in each layer based on the modes being classified. Several machine learning techniques were used, including k-nearest neighbor, classification and regression trees, support vector machines, random forests, and a heterogeneous framework of random forest and support vector machine. Results show that the classification accuracy of the proposed framework outperforms traditional approaches. Transforming the time-domain features to the frequency domain also adds new features in a new space and provides more control over the loss of information. Consequently, combining the time-domain and frequency-domain features in a large pool and then choosing the best subset yields higher accuracy than using either domain alone. The proposed two-layer classifier obtained a maximum classification accuracy of 97.02%.
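
The fusion step in point 1) can be sketched generically: combine the class-probability outputs of two classifiers under an assumed conditional-independence (naive Bayes) model and pick the mode with the largest posterior. The classifiers and synthetic data below are placeholders, not the paper's pipeline or features.

```python
# A minimal sketch of Bayes'-rule fusion of two classifier layers, assuming
# their outputs are conditionally independent given the class. Synthetic data
# stands in for the smartphone time/frequency-domain features.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=600, n_features=20, n_classes=3,
                           n_informative=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

layer1 = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
layer2 = SVC(probability=True, random_state=0).fit(X_tr, y_tr)

prior = np.bincount(y_tr) / len(y_tr)   # P(c)
p1 = layer1.predict_proba(X_te)         # P(c | layer-1 evidence)
p2 = layer2.predict_proba(X_te)         # P(c | layer-2 evidence)

# Bayes' rule under independence: P(c | both) is proportional to
# P(c|o1) * P(c|o2) / P(c); normalize and take the argmax.
posterior = p1 * p2 / prior
posterior /= posterior.sum(axis=1, keepdims=True)
y_pred = posterior.argmax(axis=1)       # mode with largest posterior
print("fused accuracy:", (y_pred == y_te).mean())
```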


Towards Certified Robustness of Metric Learning

arXiv.org Machine Learning

Metric learning aims to learn a distance metric such that semantically similar instances are pulled together while dissimilar instances are pushed apart. Many existing methods consider maximizing, or at least constraining, a distance "margin" that separates similar and dissimilar pairs of instances to guarantee their performance on a subsequent k-nearest neighbor classifier. However, such a margin in the feature space does not necessarily lead to robustness certification, or even the anticipated generalization advantage, since a small perturbation of a test instance in the instance space could still alter the model prediction. To address this problem, we advocate penalizing small distances between training instances and their nearest adversarial examples, and we show that the resulting new approach to metric learning enjoys a larger certified neighborhood with a theoretical performance guarantee. Moreover, drawing on an intuitive geometric insight, the proposed new loss term admits an analytically elegant closed-form solution and offers great flexibility in combining it with existing metric learning methods. Extensive experiments demonstrate the superiority of the proposed method over state-of-the-art approaches in terms of both discrimination accuracy and robustness to noise.
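
The paper's closed-form loss is not reproduced here, but the quantity it penalizes can be illustrated for the simplest case. For a 1-NN classifier, the distance from a training instance to its nearest adversarial example is at least half its distance to the nearest differently-labeled training point (by the triangle inequality), so a hinge penalty on that radius is a rough, assumed stand-in for the idea.

```python
# A minimal sketch, NOT the paper's method: a certified-radius lower bound for
# 1-NN and a hinge penalty that discourages small radii. Any perturbation of a
# training point smaller than half its distance to the nearest differently-
# labeled point cannot flip the 1-NN prediction.
import numpy as np

def certified_radii(X, y):
    """Lower bound on the 1-NN certified radius of each training point."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    other = y[None, :] != y[:, None]              # differently-labeled pairs
    dist_other = np.where(other, dist, np.inf)
    return 0.5 * dist_other.min(axis=1)

def margin_penalty(X, y, target=1.0):
    """Hinge loss penalizing points whose certified radius falls below `target`."""
    return np.maximum(0.0, target - certified_radii(X, y)).mean()

X = np.array([[0.0, 0.0], [0.1, 0.0], [3.0, 0.0], [3.2, 0.1]])
y = np.array([0, 0, 1, 1])
print(certified_radii(X, y))   # roughly [1.5, 1.45, 1.45, 1.55]
print(margin_penalty(X, y))
```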


How to Scale Data With Outliers for Machine Learning

#artificialintelligence

Many machine learning algorithms perform better when numerical input variables are scaled to a standard range. This includes algorithms that use a weighted sum of the inputs, like linear regression, and algorithms that use distance measures, like k-nearest neighbors. Standardizing is a popular scaling technique that subtracts the mean from the values and divides by the standard deviation, rescaling an input variable to zero mean and unit variance (and to a standard Gaussian only if the variable was Gaussian to begin with). Standardization can become skewed or biased if the input variable contains outlier values. To overcome this, the median and interquartile range can be used instead when standardizing numerical input variables, a procedure generally referred to as robust scaling.
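
A minimal sketch of the two scalers side by side, using scikit-learn and a synthetic column with a single injected outlier.

```python
# Standardization vs. robust scaling (median / IQR) on a numeric column that
# contains an outlier. RobustScaler is scikit-learn's implementation of the
# median/interquartile-range approach described above.
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

rng = np.random.default_rng(0)
x = rng.normal(loc=10.0, scale=2.0, size=(100, 1))
x[0, 0] = 1000.0                                   # one extreme outlier

standardized = StandardScaler().fit_transform(x)   # (x - mean) / std
robust = RobustScaler().fit_transform(x)           # (x - median) / IQR

# The outlier inflates the mean and standard deviation, squashing the normal
# values toward zero; the median and IQR are barely affected.
print("standardized, non-outlier spread:", standardized[1:].std())
print("robust-scaled, non-outlier spread:", robust[1:].std())
```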


On-Device Training with Core ML - Make Your Pancakes Healthy Again!

#artificialintelligence

Backing up the model: the model stays on the device, which is great, but the user will lose the new, personalized version of the model unless we take care of that by sending it somewhere and later downloading it again. Adding a new version of the model: if the model stays and retrains on the device, what happens when we want to replace it with a new model, say an improved (but not personalized) one? If we simply swap it in, the user will lose all the personalized parts of the model and will need to start from scratch, so usually we need to keep supporting those earlier versions too.


A Preliminary Study of Spatial Bias in Knn Distance Metrics

AAAI Conferences

A machine learning algorithm for image classification exhibits spatial bias if permuting the order of image pixels significantly alters its classification accuracy. In this paper, we explore the spatial bias of a number of different distance metrics for k-nearest-neighbor image classification. One distance metric is inspired by the convolutional kernels employed in convolutional neural networks. The other metrics are based on BRIEF descriptors, which generate bit vectors corresponding to images based on comparisons of pixel intensity values. We found that the convolutional distance metric exhibited a strong positive spatial bias, as did one of the BRIEF descriptors. Another BRIEF descriptor exhibited a negative spatial bias, and the remainder exhibited little or no spatial bias. These results lay a foundation for future work that would involve larger numbers of convolutional iterations, potentially synergized with BRIEF-style image preprocessing.
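
The definition in the first sentence translates directly into a small measurement sketch. The code below uses plain Euclidean k-NN, which is permutation-invariant and should therefore show essentially no spatial bias; the paper's convolutional and BRIEF-based metrics are the ones expected to shift.

```python
# A minimal sketch of measuring spatial bias as defined above: apply one fixed
# random permutation to the pixel order of every image and compare k-NN
# accuracy before and after. Euclidean k-NN ignores pixel arrangement, so its
# bias should be near zero; convolution- or BRIEF-style metrics need not be.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

def knn_accuracy(train, test):
    clf = KNeighborsClassifier(n_neighbors=3).fit(train, y_tr)
    return clf.score(test, y_te)

perm = np.random.default_rng(0).permutation(X.shape[1])   # fixed pixel shuffle
acc_original = knn_accuracy(X_tr, X_te)
acc_permuted = knn_accuracy(X_tr[:, perm], X_te[:, perm])
print("spatial bias (accuracy drop under permutation):", acc_original - acc_permuted)
```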


Case-Based Reasoning for the Analysis of Methylation Data in Oncology

AAAI Conferences

Researchers seek to identify biological markers that accurately differentiate cancer subtypes and their severity from normal controls. One such biomarker, DNA methylation, has recently become more prevalent in genetic research studies in oncology. This paper proposes to apply these findings in a study of the diagnostic accuracy of DNA methylation signatures for classifying metastasis samples. Very high classification performance was obtained from differentially methylated positions and regions, as well as from selected gene signatures. Perfect accuracy was achieved with the top 5 feature-selected genes using three similar cases and the k-nearest neighbor classifier. This work contributes to the path toward identifying biological signatures for oncology samples using case-based reasoning.
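
The reported configuration (top 5 selected genes, three most similar cases) can be mirrored with a generic scikit-learn pipeline. Synthetic data stands in for the methylation profiles, and ANOVA F-scores are an assumed choice of feature selector, not necessarily the study's.

```python
# A minimal sketch mirroring the described setup: select the 5 most
# discriminative features ("genes"), then classify each sample from its
# 3 most similar cases with k-NN. Synthetic data replaces methylation data.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=120, n_features=200, n_informative=10,
                           random_state=0)     # stand-in for metastasis vs. control
model = make_pipeline(
    SelectKBest(f_classif, k=5),               # top 5 feature-selected "genes"
    KNeighborsClassifier(n_neighbors=3),       # three most similar cases
)
print("cross-validated accuracy:", cross_val_score(model, X, y, cv=5).mean())
```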


A Weighted Mutual k-Nearest Neighbour for Classification Mining

arXiv.org Machine Learning

kNN is a very effective instance-based learning method and is easy to implement. Because real-world data is heterogeneous, noise from many possible sources is widespread, especially in large-scale databases. To eliminate noise and counter the effect of pseudo neighbours, in this paper we propose a new learning algorithm that detects anomalies and removes pseudo neighbours from the dataset, so as to provide comparatively better results. The algorithm also tries to minimize the influence of neighbours that are distant. A concept of certainty measure is also introduced for the experimental results. The advantage of combining mutual neighbours with distance-weighted voting is that the dataset is refined after anomaly removal, and the weighting gives more consideration to the neighbours that are closer. Finally, the performance of the proposed algorithm is evaluated.
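
One simple way to realize a mutual-neighbour filter with distance-weighted voting is sketched below. It is a generic variant under stated assumptions, not necessarily the exact algorithm proposed in the paper.

```python
# A minimal sketch of mutual k-NN with distance-weighted voting: a training
# point t only votes for a query x if x would also fall inside t's own
# k-neighbourhood (discarding one-sided "pseudo neighbours"), and the
# remaining votes are weighted by inverse distance.
import numpy as np

def mutual_knn_predict(X_train, y_train, x, k=5, eps=1e-12):
    d = np.linalg.norm(X_train - x, axis=1)
    nbrs = np.argsort(d)[:k]                      # x's k nearest neighbours

    # Radius of each training point's own k-neighbourhood within the training set.
    train_d = np.linalg.norm(X_train[:, None, :] - X_train[None, :, :], axis=-1)
    kth_radius = np.sort(train_d, axis=1)[:, k]   # column 0 is the self-distance

    mutual = [i for i in nbrs if d[i] <= kth_radius[i]]
    voters = mutual if mutual else list(nbrs)     # fall back if no mutual neighbour

    votes = {}
    for i in voters:
        votes[y_train[i]] = votes.get(y_train[i], 0.0) + 1.0 / (d[i] + eps)
    return max(votes, key=votes.get)

X_train = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3], [5.0, 5.0], [5.1, 4.9]])
y_train = np.array([0, 0, 0, 1, 1])
print(mutual_knn_predict(X_train, y_train, np.array([0.15, 0.15]), k=2))  # -> 0
```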


Nearest Neighbor Classifiers over Incomplete Information: From Certain Answers to Certain Predictions

arXiv.org Machine Learning

Machine learning (ML) applications have been thriving recently, largely owing to the increasing availability of data. However, inconsistency and incomplete information are ubiquitous in real-world datasets, and their impact on ML applications remains elusive. In this paper, we present a formal study of this impact by extending the notion of Certain Answers for Codd tables, which has been explored by the database research community for decades, into the field of machine learning. Specifically, we focus on classification problems and propose the notion of "Certain Predictions" (CP) -- a test data example can be certainly predicted (CP'ed) if all possible classifiers trained on top of all possible worlds induced by the incompleteness of the data would yield the same prediction. We study two fundamental CP queries: (Q1) a checking query that determines whether a data example can be CP'ed; and (Q2) a counting query that computes the number of classifiers supporting a particular prediction (i.e., label). Given that general solutions to CP queries are, not surprisingly, hard without assumptions about the type of classifier, we further present a case study in the context of nearest neighbor (NN) classifiers, where efficient solutions to CP queries can be developed -- we show that it is possible to answer both queries in linear or polynomial time over exponentially many possible worlds. We demonstrate one example use case of CP in the important application of "data cleaning for machine learning (DC for ML)." We show that our proposed CPClean approach, built on CP, can often significantly outperform existing techniques in terms of classification accuracy, with mild manual cleaning effort.
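
The CP definition can be made concrete with a brute-force toy example over a tiny Codd table. The paper's contribution is answering these queries efficiently, which the sketch below deliberately does not attempt; the table, candidate domain, and test point are made up for illustration.

```python
# A minimal brute-force sketch of "Certain Predictions" for a 1-NN classifier:
# enumerate every possible world of a tiny incomplete training set (each missing
# cell drawn from a small candidate domain), run 1-NN in each world, and answer
# Q1 (is the prediction certain?) and Q2 (how many worlds support each label?).
# This is exponential by construction; the paper shows how to avoid that cost.
from itertools import product
from collections import Counter
import numpy as np

# Training rows; None marks a missing (Codd-table) value.
rows = [([1.0, None], 0), ([2.0, 2.0], 0), ([None, 8.0], 1), ([9.0, 9.0], 1)]
domain = [0.0, 5.0, 10.0]            # candidate fills for every missing cell
x_test = np.array([1.5, 1.5])

missing = [(i, j) for i, (feats, _) in enumerate(rows)
           for j, v in enumerate(feats) if v is None]

support = Counter()
for fill in product(domain, repeat=len(missing)):      # one possible world per fill
    X = np.array([feats[:] for feats, _ in rows], dtype=float)
    y = np.array([label for _, label in rows])
    for (i, j), v in zip(missing, fill):
        X[i, j] = v
    pred = y[np.linalg.norm(X - x_test, axis=1).argmin()]   # 1-NN in this world
    support[int(pred)] += 1

print("Q2 (worlds per label):", dict(support))
print("Q1 (certainly predicted?):", len(support) == 1)
```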


Mastering Machine Learning in Python

#artificialintelligence

Machine learning is the process of using features to predict an outcome measure, and it plays an important role in many industries. A few examples include medical diagnosis, stock price prediction, and ad promotion optimization. Machine learning draws on methods from statistics, data mining, engineering, and many other disciplines. In machine learning, we use a training set of data, in which we observe past outcomes and feature measurements, to build a model for prediction.