Nearest Neighbor Methods
Overfitting or perfect fitting? Risk bounds for classification and regression rules that interpolate
Belkin, Mikhail, Hsu, Daniel, Mitra, Partha
Many modern machine learning models are trained to achieve zero or near-zero training error in order to obtain near-optimal (but non-zero) test error. This phenomenon of strong generalization performance for "overfitted" / interpolated classifiers appears to be ubiquitous in high-dimensional data, having been observed in deep networks, kernel machines, boosting and random forests. Their performance is robust even when the data contain large amounts of label noise. Very little theory is available to explain these observations. The vast majority of theoretical analyses of generalization allow for interpolation only when there is little or no label noise. This paper takes a step toward a theoretical foundation for interpolated classifiers by analyzing local interpolating schemes, including a geometric simplicial interpolation algorithm and weighted $k$-nearest neighbor schemes. Consistency or near-consistency is proved for these schemes in classification and regression problems. These schemes have an inductive bias that benefits from higher dimension, a kind of "blessing of dimensionality". Finally, connections to kernel machines, random forests, and adversarial examples in the interpolated regime are discussed.
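To make the idea of an interpolating weighted $k$-nearest neighbor scheme concrete, here is a minimal sketch. It assumes inverse-distance weights of the form dist ** (-delta), which grow without bound as the query approaches a training point, so the predictor reproduces every training label exactly. The function name, the weight form, and the default values of k and delta are illustrative assumptions, not the scheme analyzed in the paper.

```python
import math

def weighted_knn_predict(train_x, train_y, query, k=3, delta=2.0):
    """k-NN regression with singular weights w_i = dist_i ** (-delta).

    Illustrative sketch only: the weight form and defaults are assumptions.
    """
    # Pair each training label with its distance to the query, closest first.
    by_distance = sorted(
        (math.dist(x, query), y) for x, y in zip(train_x, train_y)
    )
    nearest = by_distance[:k]
    # At a training point the weight would be infinite, so return its label.
    if nearest[0][0] == 0.0:
        return nearest[0][1]
    weights = [d ** (-delta) for d, _ in nearest]
    weighted_sum = sum(w * y for w, (_, y) in zip(weights, nearest))
    return weighted_sum / sum(weights)
```

Evaluating the function at any training input returns that input's stored label, which is the interpolation property; away from the training points it behaves like an ordinary distance-weighted average.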
Employee Attrition Prediction
Yedida, Rahul, Reddy, Rahul, Vahi, Rakshit, Jana, Rahul, GV, Abhilash, Kulkarni, Deepti
We aim to predict whether an employee of a company will leave or not, using the k-Nearest Neighbors algorithm. We use evaluation of employee performance, average monthly hours at work and number of years spent in the company, among others, as our features. Other approaches to this problem include the use of ANNs, decision trees and logistic regression. The dataset was split, using 70% for training the algorithm and 30% for testing it, achieving an accuracy of 94.32%.
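The workflow described above (a k-Nearest Neighbors classifier evaluated on a 70/30 train/test split) can be sketched as follows. The data here is synthetic, generated from two Gaussian clusters purely for illustration; it is not the employee dataset, and the resulting accuracy says nothing about the reported 94.32%. All function names and the choice k=5 are assumptions.

```python
import math
import random
from collections import Counter

def knn_classify(train, query, k=5):
    """Majority vote among the k training points closest to the query."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

def make_synthetic_data(n=200, seed=0):
    """Two Gaussian clusters standing in for real features (assumption)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        center = 0.0 if label == 0 else 4.0
        features = (rng.gauss(center, 1.0), rng.gauss(center, 1.0))
        data.append((features, label))
    return data

def split_and_score(data, train_fraction=0.7, k=5, seed=0):
    """Shuffle, split into train/test, and return sizes and test accuracy."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    train, test = shuffled[:cut], shuffled[cut:]
    correct = sum(knn_classify(train, x, k) == y for x, y in test)
    return len(train), len(test), correct / len(test)

n_train, n_test, accuracy = split_and_score(make_synthetic_data())
print(f"train={n_train} test={n_test} accuracy={accuracy:.3f}")
```

With real features on different scales (hours per month versus years at the company, for example), the features would normally be standardized before computing distances, since Euclidean distance is otherwise dominated by the largest-scale feature.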
Analyzing the Robustness of Nearest Neighbors to Adversarial Examples
Wang, Yizhen, Jha, Somesh, Chaudhuri, Kamalika
Motivated by safety-critical applications, test-time attacks on classifiers via adversarial examples have recently received a great deal of attention. However, there is a general lack of understanding of why adversarial examples arise; whether they originate due to inherent properties of data or due to lack of training samples remains ill-understood. In this work, we introduce a theoretical framework analogous to bias-variance theory for understanding these effects. We use our framework to analyze the robustness of a canonical non-parametric classifier - the k-nearest neighbors. Our analysis shows that its robustness properties depend critically on the value of k - the classifier may be inherently non-robust for small k, but its robustness approaches that of the Bayes Optimal classifier for fast-growing k. We propose a novel modified 1-nearest neighbor classifier, and guarantee its robustness in the large sample limit. Our experiments suggest that this classifier may have good robustness properties even for reasonable data set sizes.
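The dependence of robustness on k can be illustrated with a toy one-dimensional example. This is only an intuition-building sketch on hand-picked data (all points and names below are assumptions); it is not the paper's framework or its modified 1-nearest neighbor classifier. A single point with an atypical label sits among points of the other class, and a small shift of the query toward it flips the 1-NN prediction while the 3-NN prediction stays the same.

```python
from collections import Counter

def knn_label(points, query, k):
    """Majority label among the k points nearest to a scalar query."""
    nearest = sorted(points, key=lambda p: abs(p[0] - query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Five points of class 0 and one nearby point of class 1 (hand-picked).
points = [(0.0, 0), (0.2, 0), (0.4, 0), (0.6, 0), (0.8, 0), (0.5, 1)]

original, perturbed = 0.44, 0.47  # the query moves by only 0.03

flips_1nn = knn_label(points, original, 1) != knn_label(points, perturbed, 1)
flips_3nn = knn_label(points, original, 3) != knn_label(points, perturbed, 3)
print("1-NN prediction flips:", flips_1nn)
print("3-NN prediction flips:", flips_3nn)
```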
Neural Message Passing with Edge Updates for Predicting Properties of Molecules and Materials
Jørgensen, Peter Bjørn, Jacobsen, Karsten Wedel, Schmidt, Mikkel N.
Neural message passing on molecular graphs is one of the most promising methods for predicting formation energy and other properties of molecules and materials. In this work we extend the neural message passing model with an edge update network which allows the information exchanged between atoms to depend on the hidden state of the receiving atom. We benchmark the proposed model on three publicly available datasets (QM9, The Materials Project and OQMD) and show that the proposed model yields superior prediction of formation energies and other properties on all three datasets in comparison with the best published results. Furthermore we investigate different methods for constructing the graph used to represent crystalline structures and we find that using a graph based on K-nearest neighbors achieves better prediction accuracy than using maximum distance cutoff or the Voronoi tessellation graph.
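The difference between a K-nearest-neighbor graph and a maximum-distance-cutoff graph can be sketched on a handful of points. This is a simplified illustration with hypothetical function names; it ignores the periodic boundary conditions that a real crystal structure would require and does not cover the Voronoi construction.

```python
import math

def knn_edges(positions, k):
    """Directed edges (i, j) where j is one of the k nearest neighbors of i."""
    edges = []
    for i, p in enumerate(positions):
        others = sorted(
            (math.dist(p, q), j) for j, q in enumerate(positions) if j != i
        )
        edges.extend((i, j) for _, j in others[:k])
    return edges

def cutoff_edges(positions, max_distance):
    """Directed edges (i, j) for every pair closer than max_distance."""
    return [
        (i, j)
        for i, p in enumerate(positions)
        for j, q in enumerate(positions)
        if i != j and math.dist(p, q) <= max_distance
    ]

# Three closely spaced atoms and one far away (made-up coordinates).
positions = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
print("kNN edges:", knn_edges(positions, k=2))
print("cutoff edges:", cutoff_edges(positions, max_distance=1.5))
```

In the kNN graph every atom receives exactly k incoming messages regardless of local density, whereas with a fixed cutoff the distant atom ends up with no neighbors at all. That fixed neighbor count is one plausible reason a kNN graph might behave differently, though the abstract itself does not state why it performs better.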
Hedging with Machine Learning
There are many ways to reduce trading risk through hedging. Funds typically use futures and options to hedge each trade. Similar to insurance, this safety net comes at a price. Using an AI-based strategy, though, there is a way to protect a position at a much lower cost. Before McDonald's could introduce Chicken McNuggets, they had to hedge against the cost of chicken. If chicken prices rose dramatically, they would no longer be able to offer the product.
Machine Learning Classification Algorithms using MATLAB
This course is for you if you are fascinated by the field of Machine Learning. It is designed to cover one of the most interesting areas of machine learning, called classification. I will take you step-by-step through this course and will first cover the basics of MATLAB. Following that, we will look into the details of how to use different machine learning algorithms in MATLAB. Specifically, we will be looking at the MATLAB toolbox called Statistics and Machine Learning Toolbox. We will implement some of the most commonly used classification algorithms, such as K-Nearest Neighbor, Naive Bayes, Discriminant Analysis, Decision Trees, Support Vector Machines, Error-Correcting Output Codes, and Ensembles. Following that, we will look at how to cross-validate these models and how to evaluate their performance.
A Comparative Study of Classification Techniques in Data Mining Algorithms
Classification is used to determine which group each data instance belongs to within a given dataset. It is used for classifying data into different classes according to some constraints. Several major kinds of classification algorithms, including C4.5, ID3, k-nearest neighbor classifier, Naive Bayes, SVM, and ANN, are used for classification. Generally, a classification technique follows one of three approaches: Statistical, Machine Learning, or Neural Network. Considering these approaches, this paper provides an inclusive survey of different classification algorithms and their features and limitations.
A Beginner's Guide to Machine Learning (in Python)
In this course, you will learn the basics of Machine Learning and Data Mining; almost everything you need to get started. You will understand what Big Data is and what Data Science and Data Analytics are. You will learn algorithms such as Linear Regression, Logistic Regression, Support Vector Machine, K-Nearest Neighbor, Decision Trees, and Neural Networks. You'll also understand how to combine algorithms into ensembles. Data preprocessing is also covered: you will learn how to clean your data, transform it, handle categorical features, and handle unbalanced data.
The Enemy of My Enemy Is My Friend: Class-to-Class Weighting in K-Nearest Neighbors Algorithm
Ye, Xiaomeng (Indiana University Bloomington)
The K-nearest neighbors algorithm (k-NN) is widely used in instance-based learning and case-based reasoning. The basic k-NN approach has been refined and augmented in many ways, including the use of local weighting, asymmetric metrics, and class-specific weighting, which enables the use of different similarity criteria for each class. This paper extends class-specific weighting with a method we call class-to-class (C2C) weighting. Beyond class-specific weighting, which learns feature weightings to identify the most similar cases to a class, C2C weighting also focuses on learning differences between classes to potentially apply those differences to classification. Once C2C weighting has learned how class C_1 is different from class C_2, if a new case differs from a C_1 case in a way similar to the way C_2 cases differ from C_1 cases, then the new case is assigned to class C_2. C2C offers two potential advantages. First, unlike global weighting, it is robust to deletion of the cases in a given class, because non-native class weightings can still make relatively good predictions. We demonstrate experimentally that this can be true even when a whole class of cases is dropped. Additionally, C2C might provide a new potential form of explainability, in explaining classifications based on patterns of differences. Preliminary results suggest that in normal settings C2C offers accuracy comparable to standard methods, though slightly lower. However, with our initial learning method, the native class weightings of C2C weighting are easily skewed and can lead to worse performance than traditional global weightings. We argue this is not an intrinsic flaw in C2C weighting, but rather an issue in the combination of C2C weighting with global weighting, and propose an approach to address this issue.