 Nearest Neighbor Methods


K-Nearest Neighbors Algorithm- A simple overview

#artificialintelligence

Originally published on Towards AI. K-Nearest Neighbors (KNN) is one of the simplest machine learning algorithms to understand.
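As a rough illustration of why KNN is considered simple, the sketch below classifies a query point by majority vote among its k nearest training points under Euclidean distance. The function name `knn_predict` and the toy data are hypothetical and do not come from the article.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest neighbors and count their labels.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two well-separated clusters.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels = ["a", "a", "a", "b", "b", "b"]

prediction_near_a = knn_predict(points, labels, (1.1, 1.0), k=3)
prediction_near_b = knn_predict(points, labels, (5.0, 5.0), k=1)
```

There is no training step beyond storing the data; all the work happens at query time, which is why prediction cost grows with the size of the training set.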


Detection of Malicious Websites Using Machine Learning Techniques

arXiv.org Artificial Intelligence

In detecting malicious websites, a common approach is the use of blacklists, which are not exhaustive and cannot generalize to new malicious sites. Detecting newly encountered malicious websites automatically will help reduce vulnerability to this form of attack. In this study, we explored the use of ten machine learning models to classify malicious websites based on lexical features and to understand how they generalize across datasets. Specifically, we trained, validated, and tested these models on different datasets and then carried out a cross-dataset analysis. From our analysis, we found that K-Nearest Neighbor is the only model that performs consistently well across datasets. Other models such as Random Forest, Decision Trees, Logistic Regression, and Support Vector Machines also consistently outperform a baseline model of predicting every link as malicious across all metrics and datasets. Also, we found no evidence that any subset of lexical features generalizes across models or datasets. This research should be relevant to cybersecurity professionals and academic researchers as it could form the basis for real-life detection systems or further research work.


A Machine Learning Analysis of Impact of the Covid-19 Pandemic on Alcohol Consumption Habit Changes Among Healthcare Workers in the U.S

arXiv.org Artificial Intelligence

In this paper, we discuss the impact of the COVID-19 pandemic on alcohol consumption habit changes among healthcare workers in the United States. We apply multiple supervised and unsupervised machine learning methods and models, such as Decision Trees, Logistic Regression, Naive Bayes classifier, k-Nearest Neighbors, Support Vector Machines, Multilayer perceptron, XGBoost, CatBoost, LightGBM, the Chi-Squared Test, and the mutual information method, to mental health survey data obtained from the University of Michigan Inter-University Consortium for Political and Social Research to find relationships between COVID-19-related negative effects and alcohol consumption habit changes among healthcare workers. Our findings suggest that COVID-19-related school closures, COVID-19-related work schedule changes, and COVID-19-related news exposure may lead to an increase in alcohol use among healthcare workers in the United States.


Use and Misuse of Machine Learning in Anthropology

arXiv.org Artificial Intelligence

Machine learning (ML), being now widely accessible to the research community at large, has fostered a proliferation of new and striking applications of these emergent mathematical techniques across a wide range of disciplines. In this paper, we will focus on a particular case study: the field of paleoanthropology, which seeks to understand the evolution of the human species based on biological and cultural evidence. As we will show, the easy availability of ML algorithms and lack of expertise on their proper use among the anthropological research community have led to foundational misapplications that have appeared throughout the literature. The resulting unreliable results not only undermine efforts to legitimately incorporate ML into anthropological research, but also produce potentially faulty understandings about our human evolutionary and behavioral past. The aim of this paper is to provide a brief introduction to some of the ways in which ML has been applied within paleoanthropology; we also include a survey of some basic ML algorithms for those who are not fully conversant with the field, which remains under active development. We discuss a series of missteps, errors, and violations of correct protocols of ML methods that appear disconcertingly often within the accumulating body of anthropological literature. These mistakes include use of outdated algorithms and practices; inappropriate train/test splits, sample composition, and textual explanations; as well as an absence of transparency due to the lack of data/code sharing, and the subsequent limitations imposed on independent replication. We assert that expanding samples, sharing data and code, re-evaluating approaches to peer review, and, most importantly, developing interdisciplinary teams that include experts in ML are all necessary for progress in future research incorporating ML within anthropology.


Automatic Feasibility Study via Data Quality Analysis for ML: A Case-Study on Label Noise

arXiv.org Artificial Intelligence

In our experience of working with domain experts who are using today's AutoML systems, a common problem we encountered is what we call "unrealistic expectations": users face a very challenging task with a noisy data acquisition process, yet are expected to achieve startlingly high accuracy with machine learning (ML). Many of these projects are predestined to fail from the beginning. In traditional software engineering, this problem is addressed via a feasibility study, an indispensable step before developing any software system. In this paper, we present Snoopy, with the goal of supporting data scientists and machine learning engineers in performing a systematic and theoretically founded feasibility study before building ML applications. We approach this problem by estimating the irreducible error of the underlying task, also known as the Bayes error rate (BER), which stems from data quality issues in datasets used to train or evaluate ML model artifacts. We design a practical Bayes error estimator that is compared against baseline feasibility study candidates on 6 datasets (with additional real and synthetic noise of different levels) in computer vision and natural language processing. Furthermore, by including our systematic feasibility study with additional signals into the iterative label cleaning process, we demonstrate in end-to-end experiments how users are able to save substantial labeling time and monetary cost.


Metric Effects based on Fluctuations in values of k in Nearest Neighbor Regressor

arXiv.org Artificial Intelligence

The regression branch of machine learning focuses purely on the prediction of continuous values. Supervised learning offers many regression methods, with both parametric and non-parametric models. In this paper, we examine a subtle point concerning distance-based regression. The distance-based model we use is the K-Nearest Neighbors Regressor, a supervised non-parametric method. We aim to show how the model's k parameter, and fluctuations in its value, affect the evaluation metrics. The metrics we use are Root Mean Squared Error and R-Squared goodness of fit, each visualized with respect to the value of k.
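The kind of experiment this abstract describes can be sketched as follows: fit a k-nearest-neighbors regressor for several values of k and record RMSE and R-squared on held-out points. This is a minimal pure-Python sketch under assumed conditions, not the paper's setup; the synthetic data (y = x^2 on a grid) and the helper names `knn_regress`, `rmse`, `r_squared`, and `sweep` are hypothetical.

```python
import math


def knn_regress(train_x, train_y, query, k):
    """Predict the mean target of the k training points nearest to `query`."""
    order = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - query))
    return sum(train_y[i] for i in order[:k]) / k


def rmse(y_true, y_pred):
    """Root Mean Squared Error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical 1-D data: y = x^2 on a grid; test points lie between grid nodes.
train_x = [i * 0.5 for i in range(21)]
train_y = [x * x for x in train_x]
test_x = [i * 0.5 + 0.25 for i in range(20)]
test_y = [x * x for x in test_x]


def sweep(k_values):
    """Return {k: (rmse, r_squared)} on the held-out points."""
    results = {}
    for k in k_values:
        preds = [knn_regress(train_x, train_y, x, k) for x in test_x]
        results[k] = (rmse(test_y, preds), r_squared(test_y, preds))
    return results


results = sweep([1, 2, 5, 21])
```

With k equal to the training-set size, every prediction collapses to the mean of the training targets, so R-squared on the held-out points cannot exceed zero; small k tracks the curve more closely. Plotting `results` against k would give the visual representation the abstract mentions.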


High-Order Conditional Mutual Information Maximization for dealing with High-Order Dependencies in Feature Selection

arXiv.org Artificial Intelligence

This paper presents a novel feature selection method based on the conditional mutual information (CMI). The proposed High Order Conditional Mutual Information Maximization (HOCMIM) incorporates high order dependencies into the feature selection procedure and has a straightforward interpretation due to its bottom-up derivation. The HOCMIM is derived from the CMI's chain expansion and expressed as a maximization optimization problem. The maximization problem is solved using a greedy search procedure, which speeds up the entire feature selection process. The experiments are run on a set of benchmark datasets (20 in total). The HOCMIM is compared with eighteen state-of-the-art feature selection algorithms, using the results of two supervised learning classifiers (Support Vector Machine and K-Nearest Neighbor). The HOCMIM achieves the best results in terms of accuracy and is shown to be faster than its high order feature selection counterparts.


Ex-Ante Assessment of Discrimination in Dataset

arXiv.org Artificial Intelligence

Data owners face increasing liability for how the use of their data could harm underprivileged communities. Stakeholders would like to identify the characteristics of data that lead to algorithms being biased against any particular demographic groups, for example, those defined by race, gender, age, and/or religion. Specifically, we are interested in identifying subsets of the feature space where the ground truth response function from features to observed outcomes differs across demographic groups. To this end, we propose FORESEE, a FORESt of decision trEEs algorithm, which generates a score that captures how likely an individual's response is to vary with sensitive attributes. Empirically, we find that our approach allows us to identify the individuals who are most likely to be misclassified by several classifiers, including Random Forest, Logistic Regression, Support Vector Machine, and k-Nearest Neighbors. The advantage of our approach is that it allows stakeholders to characterize risky samples that may contribute to discrimination, as well as to use FORESEE to estimate the risk of upcoming samples.


Label Flipping Data Poisoning Attack Against Wearable Human Activity Recognition System

arXiv.org Artificial Intelligence

Human Activity Recognition (HAR) is the problem of interpreting sensor data as human movement using an efficient machine learning (ML) approach. HAR systems rely on data from untrusted users, making them susceptible to data poisoning attacks. In a poisoning attack, attackers manipulate the sensor readings to contaminate the training set, misleading the HAR system into producing erroneous outcomes. This paper presents the design of a label flipping data poisoning attack for a HAR system, where the label of a sensor reading is maliciously changed in the data collection phase. Due to high noise and uncertainty in the sensing environment, such an attack poses a severe threat to the recognition system. Moreover, vulnerability to label flipping attacks is dangerous when activity recognition models are deployed in safety-critical applications. This paper sheds light on how to carry out the attack in practice through smartphone-based sensor data collection applications. To our knowledge, this is an early research work that explores attacking HAR models via label flipping poisoning. We implement the proposed attack and test it on activity recognition models based on the following machine learning algorithms: multi-layer perceptron, decision tree, random forest, and XGBoost. Finally, we evaluate the effectiveness of a K-nearest neighbors (KNN)-based defense mechanism against the proposed attack.
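To make the attack and defense concrete, the sketch below flips one training label and then applies one common form of KNN-based sanitization, relabeling each sample with the majority label of its k nearest other samples. The abstract does not specify the paper's defense, so this is an assumed variant; the feature vectors, activity names, and the helpers `flip_labels` and `knn_sanitize` are hypothetical.

```python
import math
from collections import Counter


def flip_labels(labels, indices, source, target):
    """Return a copy of `labels` where chosen `source` entries become `target`."""
    poisoned = list(labels)
    for i in indices:
        if poisoned[i] == source:
            poisoned[i] = target
    return poisoned


def knn_sanitize(points, labels, k=3):
    """Relabel each point with the majority label of its k nearest other points."""
    cleaned = []
    for i, p in enumerate(points):
        neighbors = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )[:k]
        majority = Counter(labels[j] for j in neighbors).most_common(1)[0][0]
        cleaned.append(majority)
    return cleaned


# Hypothetical 2-D feature vectors for two activities.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (0.3, 0.3),
          (3.0, 3.0), (3.1, 3.2), (3.2, 3.1), (3.3, 3.3)]
clean_labels = ["sit"] * 4 + ["walk"] * 4

# Attack: one "sit" sample is relabeled as "walk" during data collection.
poisoned_labels = flip_labels(clean_labels, indices=[1], source="sit", target="walk")

# Defense: neighbors outvote the flipped label.
recovered_labels = knn_sanitize(points, poisoned_labels, k=3)
```

This kind of sanitization only helps while flipped labels remain a local minority; if an attacker flips most labels in a neighborhood, the vote reinforces the poison instead of removing it.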


Training-Time Attacks against k-Nearest Neighbors

arXiv.org Artificial Intelligence

Nearest neighbor-based methods are commonly used for classification tasks and as subroutines of other data-analysis methods. An attacker with the capability of inserting their own data points into the training set can manipulate the inferred nearest neighbor structure. We distill this goal to the task of performing a training-set data insertion attack against $k$-Nearest Neighbor classification ($k$NN). We prove that computing an optimal training-time (a.k.a. poisoning) attack against $k$NN classification is NP-Hard, even when $k = 1$ and the attacker can insert only a single data point. We provide an anytime algorithm to perform such an attack, and a greedy algorithm for general $k$ and attacker budget. We provide theoretical bounds and empirically demonstrate the effectiveness and practicality of our methods on synthetic and real-world datasets. Empirically, we find that $k$NN is vulnerable in practice and that dimensionality reduction is an effective defense. We conclude with a discussion of open problems illuminated by our analysis.
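The $k = 1$, single-insertion setting can be illustrated with a brute-force search over a small, finite candidate set: try every (point, label) insertion and keep the one that causes the most errors on a set of evaluation points. This is only a sketch of the threat model under assumed toy data; it is not the paper's anytime or greedy algorithm, and the names `nn_predict`, `count_errors`, and `best_single_insertion` are hypothetical.

```python
import math


def nn_predict(points, labels, query):
    """1-NN prediction: the label of the closest training point."""
    best = min(range(len(points)), key=lambda i: math.dist(points[i], query))
    return labels[best]


def count_errors(points, labels, eval_points, eval_labels):
    """Number of evaluation points that 1-NN misclassifies."""
    return sum(
        nn_predict(points, labels, q) != y for q, y in zip(eval_points, eval_labels)
    )


def best_single_insertion(points, labels, eval_points, eval_labels,
                          candidates, label_set):
    """Brute-force the one (point, label) insertion that maximizes errors."""
    best = (count_errors(points, labels, eval_points, eval_labels), None)
    for c in candidates:
        for lab in label_set:
            e = count_errors(points + [c], labels + [lab], eval_points, eval_labels)
            if e > best[0]:
                best = (e, (c, lab))
    return best


# Hypothetical 1-D toy problem.
train_points = [(0.0,), (10.0,)]
train_labels = [0, 1]
eval_points = [(4.0,), (4.5,), (6.0,)]
eval_labels = [0, 0, 1]

baseline_errors = count_errors(train_points, train_labels, eval_points, eval_labels)
attack = best_single_insertion(
    train_points, train_labels, eval_points, eval_labels,
    candidates=[(4.25,), (5.0,)], label_set=[0, 1],
)
```

The search is exhaustive only over the supplied candidates. The hardness result in the abstract concerns choosing the optimal insertion in general, which this enumeration does not address.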