Nearest Neighbor Methods
Adaptive $k$-nearest neighbor classifier based on the local estimation of the shape operator
Levada, Alexandre Luís Magalhães, Nielsen, Frank, Haddad, Michel Ferreira Cardia
The $k$-nearest neighbor ($k$-NN) algorithm is one of the most popular methods for nonparametric classification. However, a relevant limitation concerns the definition of the number of neighbors $k$. This parameter exerts a direct impact on several properties of the classifier, such as the bias-variance tradeoff, smoothness of decision boundaries, robustness to noise, and class imbalance handling. In the present paper, we introduce a new adaptive $k$-nearest neighbours ($kK$-NN) algorithm that explores the local curvature at a sample to adaptively define the neighborhood size. The rationale is that points with low curvature could have larger neighborhoods (locally, the tangent space approximates well the underlying data shape), whereas points with high curvature could have smaller neighborhoods (locally, the tangent space is a loose approximation). We estimate the local Gaussian curvature by computing an approximation to the local shape operator in terms of the local covariance matrix as well as the local Hessian matrix. Results on many real-world datasets indicate that the new $kK$-NN algorithm yields superior balanced accuracy compared to the established $k$-NN method and also another adaptive $k$-NN algorithm. This is particularly evident when the number of samples in the training data is limited, suggesting that the $kK$-NN is capable of learning more discriminant functions with less data considering many relevant cases.
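The idea of shrinking the neighborhood where curvature is high can be illustrated with a minimal sketch. It assumes a curvature score in [0, 1] is already available for the query point; the abstract's shape-operator estimator is not reproduced here, and the function names and the linear mapping from curvature to $k$ are illustrative assumptions rather than the authors' method.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k):
    """Plain k-NN majority vote with Euclidean distance."""
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def adaptive_k(curvature, k_min, k_max):
    """Map a curvature score in [0, 1] to a neighborhood size.

    Low curvature gives k close to k_max, high curvature gives k close to k_min.
    The linear map is an assumption made for illustration only.
    """
    c = min(max(curvature, 0.0), 1.0)  # clamp the score to [0, 1]
    return max(1, round(k_max - c * (k_max - k_min)))


def adaptive_knn_predict(train_X, train_y, query, curvature, k_min=1, k_max=9):
    """k-NN vote whose k depends on the (assumed given) local curvature score."""
    return knn_predict(train_X, train_y, query, adaptive_k(curvature, k_min, k_max))
```

With `k_min=1` and `k_max=9`, a flat region (score 0) votes over nine neighbors, while a sharply curved region (score 1) falls back to the single nearest neighbor.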
Classification and Prediction of Heart Diseases using Machine Learning Algorithms
Osei-Nkwantabisa, Akua Sekyiwaa, Ntumy, Redeemer
Heart disease is a serious worldwide health issue because it claims the lives of many people who might have been treated if the disease had been identified earlier. The leading cause of death in the world is cardiovascular disease, usually referred to as heart disease. Creating reliable, effective, and precise predictions for these diseases is one of the biggest issues facing the medical world today. Although there are tools for predicting heart diseases, they are either expensive or challenging to apply for determining a patient's risk. The aim of this research was to identify the best classifier for predicting and detecting heart disease. This experiment examined a range of machine learning approaches, including Logistic Regression, K-Nearest Neighbor, Support Vector Machine, and Artificial Neural Networks, to determine which machine learning algorithm was most effective at predicting heart diseases. The data set for this study came from the UCI heart disease repository, one of the most frequently used data sets for this purpose. The K-Nearest Neighbor technique was shown to be the most effective machine learning algorithm for determining whether a patient has heart disease. It will be beneficial to conduct further studies on the application of additional machine learning algorithms for heart disease prediction.
NoPhish: Efficient Chrome Extension for Phishing Detection Using Machine Learning Techniques
Thaqi, Leand, Halili, Arbnor, Vishi, Kamer, Rexha, Blerim
The growth of digitalization services via web browsers has simplified our daily routine of doing business. But at the same time, it has made the web browser very attractive for several cyber-attacks. Web phishing is a well-known cyberattack in which attackers masquerade as trustworthy web servers to obtain sensitive user information such as credit card numbers, bank information, personal IDs, social security numbers, and usernames and passwords. In recent years many techniques have been developed to identify the authentic web pages that users visit and warn them when the webpage is phishing. In this paper, we have developed an extension for Chrome, the most popular web browser, that serves as a middleware between the user and phishing websites. The Chrome extension named "NoPhish" identifies a phishing webpage based on several Machine Learning techniques. We have used the training dataset from "PhishTank" and extracted the 22 most popular features as rated by the Alexa database. The training algorithms used are Random Forest, Support Vector Machine, and k-Nearest Neighbor. The performance results show that Random Forest delivers the best precision.
Data is missing again -- Reconstruction of power generation data using $k$-Nearest Neighbors and spectral graph theory
Pierrot, Amandine, Pinson, Pierre
The risk of missing data and subsequent incomplete data records at wind farms increases with the number of turbines and sensors. We propose here an imputation method that blends data-driven concepts with expert knowledge, by using the geometry of the wind farm in order to provide better estimates when performing Nearest Neighbor imputation. Our method relies on learning Laplacian eigenmaps out of the graph of the wind farm through spectral graph theory. These learned representations can be based on the wind farm layout only, or additionally account for information provided by collected data. The related weighted graph is allowed to change with time and can be tracked in an online fashion. Application to the Westermost Rough offshore wind farm shows significant improvement over approaches that do not account for the wind farm layout information.
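The nearest-neighbor imputation step can be sketched as follows. The sketch assumes each turbine already has coordinates in some embedding space: these could be raw layout coordinates or precomputed Laplacian-eigenmap coordinates, and the eigenmap computation itself is left out. The function name `knn_impute` and the use of a plain mean over the neighbors are illustrative assumptions, not the authors' estimator.

```python
import math


def knn_impute(embedding, readings, k=2):
    """Fill missing readings (None) with the mean of the k nearest turbines.

    embedding: one coordinate tuple per turbine (assumed precomputed).
    readings:  one value per turbine at a single time step, None if missing.
    Only turbines with an observed reading are used as donors.
    """
    filled = list(readings)
    donors_all = [j for j, r in enumerate(readings) if r is not None]
    for i, value in enumerate(readings):
        if value is not None:
            continue
        # Rank donor turbines by distance in the embedding space.
        donors = sorted(donors_all, key=lambda j: math.dist(embedding[i], embedding[j]))
        nearest = donors[:k]
        if nearest:  # leave the gap if no turbine has data
            filled[i] = sum(readings[j] for j in nearest) / len(nearest)
    return filled
```

Swapping the contents of `embedding` is the only change needed to move from a layout-only representation to one that also reflects collected data.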
Machine Learning-Based Research on the Adaptability of Adolescents to Online Education
With the rapid advancement of internet technology, the adaptability of adolescents to online learning has emerged as a focal point of interest within the educational sphere. However, the academic community's efforts to develop predictive models for adolescent online learning adaptability require further refinement and expansion. Utilizing data from the "Chinese Adolescent Online Education Survey" spanning the years 2014 to 2016, this study implements five machine learning algorithms - logistic regression, K-nearest neighbors, random forest, XGBoost, and CatBoost - to analyze the factors influencing adolescent online learning adaptability and to determine the model best suited for prediction. The research reveals that the duration of courses, the financial status of the family, and age are the primary factors affecting students' adaptability in online learning environments. Additionally, age significantly impacts students' adaptive capacities. Among the predictive models, the random forest, XGBoost, and CatBoost algorithms demonstrate superior forecasting capabilities, with the random forest model being particularly adept at capturing the characteristics of students' adaptability.
Benchmarking ML Approaches to UWB-Based Range-Only Posture Recognition for Human Robot-Interaction
Salimi, Salma, Salimpour, Sahar, Queralta, Jorge Peña, Bessa, Wallace Moreira, Westerlund, Tomi
Human pose estimation involves detecting and tracking the positions of various body parts using input data from sources such as images, videos, or motion and inertial sensors. This paper presents a novel approach to human pose estimation using machine learning algorithms to predict human posture and translate them into robot motion commands using ultra-wideband (UWB) nodes, as an alternative to motion sensors. The study utilizes five UWB sensors implemented on the human body to enable the classification of still poses and more robust posture recognition. This approach ensures effective posture recognition across a variety of subjects. These range measurements serve as input features for posture prediction models, which are implemented and compared for accuracy. For this purpose, machine learning algorithms including K-Nearest Neighbors (KNN), Support Vector Machine (SVM), and deep Multi-Layer Perceptron (MLP) neural network are employed and compared in predicting corresponding postures. We demonstrate the proposed approach for real-time control of different mobile/aerial robots with inference implemented in a ROS 2 node. Experimental results demonstrate the efficacy of the approach, showcasing successful prediction of human posture and corresponding robot movements with high accuracy.
Pathfinding with Lazy Successor Generation
We study a pathfinding problem where only locations (i.e., vertices) are given, and edges are implicitly defined by an oracle answering the connectivity of two locations. Despite its simple structure, this problem becomes non-trivial with a massive number of locations, due to posing a huge branching factor for search algorithms. Limiting the number of successors, such as with nearest neighbors, can reduce search efforts but compromises completeness. Instead, we propose a novel LaCAS* algorithm, which does not generate successors all at once but gradually generates successors as the search progresses. This scheme is implemented with k-nearest neighbors search on a k-d tree. LaCAS* is a complete and anytime algorithm that eventually converges to the optima. Extensive evaluations demonstrate the efficacy of LaCAS*, e.g., solving complex pathfinding instances quickly, where conventional methods falter.
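The lazy-generation idea alone can be sketched with a Python generator. This is not LaCAS*: the search loop is left out, and a brute-force heap stands in for the k-d tree named in the abstract. The function name `lazy_successors`, the `connected` oracle signature, and the `batch` parameter are assumptions made for illustration.

```python
import heapq
import math


def lazy_successors(vertex, locations, connected, batch=2):
    """Yield successors of `vertex` in small batches, nearest first.

    connected(a, b) is the oracle from the problem statement; it is only
    called for candidates that are actually popped, so a caller that stops
    early never pays for the remaining locations.
    """
    heap = [(math.dist(vertex, p), p) for p in locations if p != vertex]
    heapq.heapify(heap)
    while heap:
        out = []
        while heap and len(out) < batch:
            _, candidate = heapq.heappop(heap)
            if connected(vertex, candidate):
                out.append(candidate)
        if out:
            yield out
```

A search procedure would call `next()` on this generator only when it needs more successors for a vertex, which keeps the effective branching factor small while still allowing every connected neighbor to be reached eventually.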
SHEDAD: SNN-Enhanced District Heating Anomaly Detection for Urban Substations
van Dreven, Jonne, Cheddad, Abbas, Alawadi, Sadi, Ghazi, Ahmad Nauman, Koussa, Jad Al, Vanhoudt, Dirk
District Heating (DH) systems are essential for energy-efficient urban heating. However, despite the advancements in automated fault detection and diagnosis (FDD), DH still faces challenges in operational faults that impact efficiency. This study introduces the Shared Nearest Neighbor Enhanced District Heating Anomaly Detection (SHEDAD) approach, designed to approximate the DH network topology and allow for local anomaly detection without disclosing sensitive information, such as substation locations. The approach leverages a multi-adaptive k-Nearest Neighbor (k-NN) graph to improve the initial neighborhood creation. Moreover, it introduces a merging technique that reduces noise and eliminates trivial edges. We use the Median Absolute Deviation (MAD) and modified z-scores to flag anomalous substations. The results reveal that SHEDAD outperforms traditional clustering methods, achieving significantly lower intra-cluster variance and distance. Additionally, SHEDAD effectively isolates and identifies two distinct categories of anomalies: supply temperatures and substation performance. We identified 30 anomalous substations and reached a sensitivity of approximately 65\% and specificity of approximately 97\%. By focusing on this subset of poor-performing substations in the network, SHEDAD enables more targeted and effective maintenance interventions, which can reduce energy usage while optimizing network performance.
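The MAD-based flagging step can be sketched as below. The constant 0.6745 and the cutoff of 3.5 are the commonly used convention for modified z-scores; the abstract does not state which threshold SHEDAD uses, so both are assumptions here, as are the function names.

```python
from statistics import median


def modified_z_scores(values):
    """Modified z-score: 0.6745 * (x - median) / MAD."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # Degenerate case (at least half the values equal the median):
        # return zeros instead of dividing by zero. This is a simplification.
        return [0.0 for _ in values]
    return [0.6745 * (v - med) / mad for v in values]


def flag_anomalies(values, threshold=3.5):
    """Return indices whose absolute modified z-score exceeds the threshold."""
    scores = modified_z_scores(values)
    return [i for i, z in enumerate(scores) if abs(z) > threshold]
```

Because both the center and the spread are medians, a single extreme substation does not inflate the scale estimate the way it would with a mean and standard deviation.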
MAC protocol classification in the ISM band using machine learning methods
Rashidpour, Hanieh, Bahramgiri, Hossein
With the emergence of new technologies and a growing number of wireless networks, we face the problem of radio spectrum shortages. As a result, identifying the wireless channel spectrum to exploit the channel's idle state while also boosting network security is a pivotal issue. Detecting and classifying protocols in the MAC sublayer enables Cognitive Radio users to improve spectrum utilization and minimize potential interference. In this paper, we classify the Wi-Fi and Bluetooth protocols, which are the most widely used MAC sublayer protocols in the ISM radio band. With the advent of various wireless technologies, especially in the 2.4 GHz frequency band, the ISM frequency spectrum has become crowded and high-traffic, leading to a lack of spectrum resources and to user interference. Therefore, identifying and classifying protocols is an effective and useful method. We apply two machine learning algorithms, Support Vector Machine and K-Nearest Neighbors, to classify protocols into three classes: Wi-Fi, Wi-Fi Beacon, and Bluetooth. To capture the signals, we use the USRP N210 Software Defined Radio device and sample the real data in the indoor environment in different conditions of the presence and absence of transmitters and receivers for these two protocols. By assembling this dataset and studying the time and frequency features of the protocols, we extract the frame width and the silence gap between the two frames as time features and the PAPR of each frame as a power feature. By comparing the output of the protocol classification in different conditions and also adding Gaussian noise, it was found that the nonlinear SVM with an RBF kernel and the KNN classifier have the best performance, with 97.83% and 98.12% classification accuracy, respectively.
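The power feature mentioned above, the peak-to-average power ratio (PAPR) of a frame, can be computed from complex baseband samples as in the sketch below. The function name `papr_db` and the choice to report the ratio in decibels are assumptions; the abstract does not say how the authors scale the feature.

```python
import math


def papr_db(samples):
    """Peak-to-average power ratio of one frame, in dB.

    samples: complex (or real) baseband samples of a single frame.
    Assumes the frame is non-empty and has non-zero average power.
    """
    powers = [abs(s) ** 2 for s in samples]
    mean_power = sum(powers) / len(powers)
    return 10 * math.log10(max(powers) / mean_power)
```

A constant-envelope frame gives 0 dB, while a frame whose energy sits in a few large peaks gives a higher value, which is what makes PAPR usable as a feature for separating modulation schemes.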
Simply Trainable Nearest Neighbour Machine Translation with GPU Inference
Amer, Hossam, Abouelenin, Abdelrahman, Maher, Mohamed, Narouz, Evram, Afify, Mohamed, Awadallah, Hany
Nearest neighbor machine translation is a successful approach for fast domain adaptation, which interpolates the pre-trained transformers with domain-specific token-level k-nearest-neighbor (kNN) retrieval without retraining. Despite kNN MT's success, searching a large reference corpus and using a fixed interpolation between the kNN and pre-trained model lead to computational complexity and translation quality challenges. Among other papers, Dai et al. (2023) proposed methods to obtain a small number of reference samples dynamically, for which they introduced a distance-aware interpolation method using an equation that includes free parameters. This paper proposes a simply trainable nearest neighbor machine translation and carries out inference experiments on GPU. Similar to Dai et al. (2023), we first adaptively construct a small datastore for each input sentence. Second, we train a single-layer network for the interpolation coefficient between the knnMT and pre-trained result to automatically interpolate in different domains. Experimental results on different domains show that our proposed method either improves or sometimes maintains the translation quality of methods in Dai et al. (2023) while being automatic. In addition, our GPU inference results demonstrate that knnMT can be integrated into GPUs with a drop of only 5% in terms of speed.
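The token-level interpolation between the retrieval distribution and the pre-trained model can be sketched as follows, using the usual kNN-MT formulation in which retrieved neighbors are weighted by a softmax over negative distances. The interpolation coefficient `lam` is passed in as a plain number here; in the abstract it is predicted by a trained single-layer network, which this sketch does not implement. The function names and the `temperature` parameter are illustrative assumptions.

```python
import math


def knn_distribution(neighbors, temperature=1.0):
    """Turn retrieved (target_token, distance) pairs into a distribution.

    Each token's probability is proportional to the sum of exp(-d / T)
    over the retrieved neighbors that carry that token.
    """
    weights = {}
    for token, dist in neighbors:
        weights[token] = weights.get(token, 0.0) + math.exp(-dist / temperature)
    total = sum(weights.values())
    return {token: w / total for token, w in weights.items()}


def interpolate(p_model, p_knn, lam):
    """p(y) = lam * p_knn(y) + (1 - lam) * p_model(y) for every token y."""
    tokens = set(p_model) | set(p_knn)
    return {
        t: lam * p_knn.get(t, 0.0) + (1 - lam) * p_model.get(t, 0.0)
        for t in tokens
    }
```

With `lam = 0` the output is the pre-trained model's distribution unchanged, and with `lam = 1` it is the retrieval distribution alone; a learned coefficient moves between these extremes per domain.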