
 Nearest Neighbor Methods


Beginner's Guide to K-Nearest Neighbors in R: from Zero to Hero

#artificialintelligence

In the world of Machine Learning, I find that the K-Nearest Neighbors (KNN) classifier makes the most intuitive sense and is easily accessible to beginners, even without introducing any mathematical notation. To decide the label of an observation, we look at its neighbors and assign the neighbors' label to the observation of interest. Certainly, looking at only one neighbor may create bias and inaccuracy, so the KNN method has a set of rules and procedures to determine the best number of neighbors, e.g., examining k > 1 neighbors and adopting majority rule to decide the category. "To decide the label for new observations, we look at the closest neighbors." To choose the nearest neighbors, we have to define what distance is.
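The neighbor-voting idea described above can be sketched in a few lines. This is a minimal illustration in Python rather than R, assuming Euclidean distance and a plain majority vote; the function name `knn_predict` and the toy data are hypothetical and not taken from the article.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    # Pair each training point's Euclidean distance to the query with its label.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest neighbors.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Majority vote over the neighbors' labels.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy data (hypothetical): two well-separated groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels = ["a", "a", "a", "b", "b", "b"]
prediction = knn_predict(points, labels, (1.1, 1.0), k=3)
print(prediction)
```

With k = 1 the prediction depends on a single neighbor, which is the source of the bias mentioned above; a larger odd k makes the vote less sensitive to one unusual point.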


Most Popular Distance Metrics Used in KNN and When to Use Them - KDnuggets

#artificialintelligence

KNN is one of the most commonly used and simplest algorithms for finding patterns in classification and regression problems. It is a supervised algorithm, also known as a lazy learning algorithm. It works by calculating the distance from one test observation to every observation in the training dataset and then finding its K nearest neighbors. This happens for each and every test observation, and that is how it finds similarities in the data. For calculating distances, KNN uses a distance metric chosen from a list of available metrics.
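To make the role of the distance metric concrete, the sketch below shows three common choices and how the nearest neighbor can change with the metric. It assumes numeric feature tuples of equal length; the helper names (`minkowski_distance`, `chebyshev_distance`, `k_nearest`) are hypothetical, and the article's own list of metrics may differ.

```python
def minkowski_distance(a, b, p=2):
    """Minkowski distance: p=1 gives Manhattan, p=2 gives Euclidean."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

def chebyshev_distance(a, b):
    """Largest absolute difference along any single feature."""
    return max(abs(x - y) for x, y in zip(a, b))

def k_nearest(train, query, k, distance):
    """Return the indices of the k training rows closest to `query`."""
    order = sorted(range(len(train)), key=lambda i: distance(train[i], query))
    return order[:k]

def manhattan(a, b):
    return minkowski_distance(a, b, p=1)

def euclidean(a, b):
    return minkowski_distance(a, b, p=2)

# Two hypothetical training rows and a query at the origin.
train = [(0, 3), (2, 2)]
query = (0, 0)

# Manhattan: |0|+|3| = 3 versus |2|+|2| = 4, so row 0 is nearest.
nearest_manhattan = k_nearest(train, query, 1, manhattan)
# Euclidean: 3 versus sqrt(8), about 2.83, so row 1 is nearest.
nearest_euclidean = k_nearest(train, query, 1, euclidean)
# Chebyshev: 3 versus 2, so row 1 is nearest.
nearest_chebyshev = k_nearest(train, query, 1, chebyshev_distance)
```

Because the ranking of neighbors depends on the metric, features are usually put on comparable scales before any of these distances is computed.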


Manifold Partition Discriminant Analysis

arXiv.org Artificial Intelligence

We propose a novel algorithm for supervised dimensionality reduction named Manifold Partition Discriminant Analysis (MPDA). It aims to find a linear embedding space where the within-class similarity is achieved along the direction that is consistent with the local variation of the data manifold, while nearby data belonging to different classes are well separated. By partitioning the data manifold into a number of linear subspaces and utilizing the first-order Taylor expansion, MPDA explicitly parameterizes the connections of tangent spaces and represents the data manifold in a piecewise manner. While graph Laplacian methods capture only the pairwise interaction between data points, our method captures both pairwise and higher-order interactions (using regional consistency) between data points. This manifold representation can help to improve the measure of within-class similarity, which further leads to improved performance of dimensionality reduction. Experimental results on multiple real-world data sets demonstrate the effectiveness of the proposed method.


Adversarial Examples for k-Nearest Neighbor Classifiers Based on Higher-Order Voronoi Diagrams

#artificialintelligence

Adversarial examples are a widely studied phenomenon in machine learning models. While most of the attention has been focused on neural networks, other practical models also suffer from this issue. In this work, we propose an algorithm for evaluating the adversarial robustness of k-nearest neighbor classification, i.e., finding a minimum-norm adversarial example. Diverging from previous proposals, we take a geometric approach by performing a search that expands outwards from a given input point. On a high level, the search radius expands to the nearby Voronoi cells until we find a cell that classifies differently from the input point.


Adversarial Examples for $k$-Nearest Neighbor Classifiers Based on Higher-Order Voronoi Diagrams

arXiv.org Machine Learning

Adversarial examples are a widely studied phenomenon in machine learning models. While most of the attention has been focused on neural networks, other practical models also suffer from this issue. In this work, we propose an algorithm for evaluating the adversarial robustness of $k$-nearest neighbor classification, i.e., finding a minimum-norm adversarial example. Diverging from previous proposals, we take a geometric approach by performing a search that expands outwards from a given input point. On a high level, the search radius expands to the nearby Voronoi cells until we find a cell that classifies differently from the input point. To scale the algorithm to a large $k$, we introduce approximation steps that find perturbations with smaller norm, compared to the baselines, in a variety of datasets. Furthermore, we analyze the structural properties of a dataset where our approach outperforms the competition.


IAMPE: NMR-Assisted Computational Prediction of Antimicrobial Peptides

#artificialintelligence

Antimicrobial peptides (AMPs) are a focus of attention because of their therapeutic importance, as is the development of computational tools for identifying efficient antibiotics from the primary structure. Here, we utilized the 13C NMR spectra of amino acids and clustered them into various groups. These clusters were used to build feature vectors for the AMP sequences based on the composition, transition, and distribution of cluster members. These features, along with the physicochemical properties of AMPs, were exploited to learn computational models that predict active AMPs solely from their sequences. Naïve Bayes (NB), k-nearest neighbors (KNN), support-vector machine (SVM), random forest (RF), and eXtreme Gradient Boosting (XGBoost) were employed to build the classification system using the AMP datasets collected from the CAMP, LAMP, ADAM, and AntiBP databases.


Towards A Sentiment Analyzer for Low-Resource Languages

arXiv.org Artificial Intelligence

Twitter is one of the most influential social media platforms, with millions of active users. It is commonly used for microblogging, which allows users to share messages, ideas, thoughts, and more. Thus, millions of interactions, such as short messages or tweets, flow among Twitter users discussing various topics happening worldwide. This research aims to analyse the sentiment of users towards a particular trending topic that was actively and massively discussed at the time. We chose the hashtag #kpujangancurang, which was a trending topic during the Indonesian presidential election in 2019. We use the hashtag to obtain a set of data from Twitter and to investigate the positive or negative sentiment of the users from their tweets. This research uses the RapidMiner tool to collect the Twitter data and compares Naive Bayes, K-Nearest Neighbor, Decision Tree, and Multi-Layer Perceptron classification methods for classifying the sentiment of the Twitter data. There are 200 labeled data items overall in this experiment. Overall, Naive Bayes and Multi-Layer Perceptron classification outperformed the other two methods in 11 experiments with different sizes of training-testing data split. These two classifiers have the potential to be used in creating a sentiment analyzer for low-resource languages with a small corpus.


How to Identify Overfitting Machine Learning Models in Scikit-Learn

#artificialintelligence

Overfitting is a common explanation for the poor performance of a predictive model. An analysis of learning dynamics can help to identify whether a model has overfit the training dataset and may suggest an alternate configuration to use that could result in better predictive performance. Performing an analysis of learning dynamics is straightforward for algorithms that learn incrementally, like neural networks, but it is less clear how we might perform the same analysis with other algorithms that do not learn incrementally, such as decision trees, k-nearest neighbors, and other general algorithms in the scikit-learn machine learning library. In this tutorial, you will discover how to identify overfitting for machine learning models in Python.
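One common way to run such an analysis for a non-incremental model is to vary a complexity parameter and compare accuracy on the training data with accuracy on held-out data. The sketch below does this for k-nearest neighbors in plain Python. It is not the tutorial's scikit-learn code: the synthetic data, the seed, and the helper names (`knn_predict`, `accuracy`, `make_data`) are all assumptions made for illustration.

```python
import math
import random
from collections import Counter

def knn_predict(train_x, train_y, query, k):
    """Majority vote among the k training points nearest to `query`."""
    nearest = sorted(range(len(train_x)),
                     key=lambda i: math.dist(train_x[i], query))[:k]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]

def accuracy(train_x, train_y, eval_x, eval_y, k):
    """Fraction of evaluation points whose predicted label is correct."""
    hits = sum(knn_predict(train_x, train_y, x, k) == y
               for x, y in zip(eval_x, eval_y))
    return hits / len(eval_x)

rng = random.Random(0)  # fixed seed so the run is repeatable

def make_data(n):
    """Two overlapping Gaussian classes centered at (0, 0) and (2, 2)."""
    xs, ys = [], []
    for _ in range(n):
        label = rng.randint(0, 1)
        xs.append((rng.gauss(2.0 * label, 1.0), rng.gauss(2.0 * label, 1.0)))
        ys.append(label)
    return xs, ys

train_x, train_y = make_data(100)
test_x, test_y = make_data(100)

# Smaller k means a more flexible model; k=1 memorizes the training set.
results = {}
for k in (1, 5, 15, 45):
    train_acc = accuracy(train_x, train_y, train_x, train_y, k)
    test_acc = accuracy(train_x, train_y, test_x, test_y, k)
    results[k] = (train_acc, test_acc)
    print(f"k={k:2d}  train={train_acc:.2f}  test={test_acc:.2f}")
```

At k = 1 every training point is its own nearest neighbor, so training accuracy is perfect by construction. A large gap between training and test accuracy at small k, narrowing as k grows, is the usual signature of overfitting.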


Locally Adaptive Nearest Neighbors

arXiv.org Machine Learning

When training automated systems, it has been shown to be beneficial to adapt the representation of data by learning a problem-specific metric. We extend this idea and, for the widely used family of k-nearest neighbors algorithms, develop a method that allows learning locally adaptive metrics. To demonstrate important aspects of how our approach works, we conduct a number of experiments on synthetic data sets, and we show its usefulness on real-world benchmark data sets. Machine learning models increasingly pervade our daily lives in the form of recommendation systems, computer vision, driver assistance, etc., challenging us to realize seamless cooperation between human and algorithmic agents. One desirable property of predictions made by machine learning models is their transparency, expressed as a statement about which factors of a given setting have the greatest influence on the decision at hand; in particular, this requirement aligns with the EU General Data Protection Regulation, which includes a "right to explanation" [1].


ELMV: an Ensemble-Learning Approach for Analyzing Electrical Health Records with Significant Missing Values

arXiv.org Machine Learning

Real-world Electronic Health Record (EHR) data have played an important role in improving patient care and clinician experience and in providing rich information for biomedical research [1, 2, 3]. However, many EHR data contain a significant proportion of missing values, which could be as high as 50%, leading to a substantially reduced sample size even in initially large cohorts if we restrict the analysis to individuals with complete data [4, 5]. On the other hand, leaving a large portion of missing information unaddressed usually causes bias and loss of efficiency, and finally leads to inappropriate conclusions being drawn [6]. Data imputation algorithms (e.g., the scikit-learn estimators [7]) attempt to replace missing data with meaningful values, including random values, the mean or median of rows or columns, spatial-temporal regressed values, the most frequent values in the same columns, or representative values identified using k-nearest neighbor [8]. Advanced data imputation algorithms, such as Multivariate Imputation by Chained Equation (MICE) [9], have been developed to fill missing values multiple times. Leveraging the power of GPUs and big data, deep neural network models, such as Datawig [10], can produce more accurate estimates than traditional data imputation methods [11].
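As an illustration of the k-nearest-neighbor imputation mentioned above, the following sketch fills each missing entry with the mean of that column over the k most similar rows. It is a simplified, assumption-laden example rather than the scikit-learn estimator or the paper's method: missing values are encoded as `None`, similarity is the root-mean-square difference over features that both rows observe, and `knn_impute` and the sample rows are hypothetical.

```python
import math

def knn_impute(rows, k=2):
    """Fill None entries with the column mean over the k nearest donor rows."""
    def distance(a, b):
        # Compare only the features that both rows have observed.
        shared = [(x, y) for x, y in zip(a, b)
                  if x is not None and y is not None]
        if not shared:
            return math.inf  # nothing in common, treat as infinitely far
        # Root-mean-square difference, so rows sharing fewer features
        # are not favored merely for having fewer terms in the sum.
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    completed = [list(row) for row in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Donors are other rows that actually observe column j.
            donors = [other for idx, other in enumerate(rows)
                      if idx != i and other[j] is not None]
            donors.sort(key=lambda other: distance(row, other))
            nearest = donors[:k]
            if nearest:  # leave the gap if no donor exists
                completed[i][j] = sum(o[j] for o in nearest) / len(nearest)
    return completed

# Hypothetical records; the second row is missing its third feature.
records = [
    [1.0, 2.0, 3.0],
    [1.1, 2.1, None],
    [9.0, 9.0, 9.0],
    [1.2, 1.9, 3.2],
]
filled = knn_impute(records, k=2)
```

Here the two rows closest to the incomplete record are the first and the last, so the gap is filled with the mean of 3.0 and 3.2, while the distant third row is ignored. Distances are always computed on the original rows, so earlier imputations do not influence later ones.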