Goto

Collaborating Authors

 Nearest Neighbor Methods


Riemannian Metric Learning: Closer to You than You Imagine

arXiv.org Machine Learning

In recent decades, machine learning research has focused on developing vector-based representations for various types of data, including images, text, and time series [22]. Learning a meaningful representation space is a foundational task that accelerates research progress, as exemplified by the success of Large Language Models (LLMs) [182]. A complementary challenge is learning a distance function (defining a metric space) that encodes aspects of the data's internal structure. This task is known as distance metric learning, or simply metric learning [20]. Metric learning methods find applications in every field using algorithms relying on a distance such as the ubiquitous k-nearest neighbors classifier: Classification and clustering [195], recommendation systems [89], optimal transport [45], and dimension reduction [116, 186]. However, when using only a global distance, the set of available modeling tools to derive computational algorithms is limited and does not capture the intrinsic data structure. Hence, in this paper, we present a literature review of Riemannian metric learning, a generalization of metric learning that has recently demonstrated success across diverse applications, from causal inference [51, 59, 147] to generative modeling [100, 111, 170]. Unlike metric learning, Riemannian metric learning does not merely learn an embedding capturing distance information, but estimates a Riemannian metric characterizing distributions, curvature, and distances in the dataset, i.e. the Riemannian structure of the data.


Simplicial SMOTE: Oversampling Solution to the Imbalanced Learning Problem

arXiv.org Artificial Intelligence

SMOTE (Synthetic Minority Oversampling Technique) is the established geometric approach to random oversampling to balance classes in the imbalanced learning problem, followed by many extensions. Its idea is to introduce synthetic data points of the minor class, with each new point being the convex combination of an existing data point and one of its k-nearest neighbors. In this paper, by viewing SMOTE as sampling from the edges of a geometric neighborhood graph and borrowing tools from the topological data analysis, we propose a novel technique, Simplicial SMOTE, that samples from the simplices of a geometric neighborhood simplicial complex. A new synthetic point is defined by the barycentric coordinates w.r.t. a simplex spanned by an arbitrary number of data points being sufficiently close rather than a pair. Such a replacement of the geometric data model results in better coverage of the underlying data distribution compared to existing geometric sampling methods and allows the generation of synthetic points of the minority class closer to the majority class on the decision boundary. We experimentally demonstrate that our Simplicial SMOTE outperforms several popular geometric sampling methods, including the original SMOTE. Moreover, we show that simplicial sampling can be easily integrated into existing SMOTE extensions. We generalize and evaluate simplicial extensions of the classic Borderline SMOTE, Safe-level SMOTE, and ADASYN algorithms, all of which outperform their graph-based counterparts.


LNUCB-TA: Linear-nonlinear Hybrid Bandit Learning with Temporal Attention

arXiv.org Machine Learning

Existing contextual multi-armed bandit (MAB) algorithms fail to effectively capture both long-term trends and local patterns across all arms, leading to suboptimal performance in environments with rapidly changing reward structures. They also rely on static exploration rates, which do not dynamically adjust to changing conditions. To overcome these limitations, we propose LNUCB-TA, a hybrid bandit model integrating a novel nonlinear component (adaptive k-Nearest Neighbors (k-NN)) for reducing time complexity, alongside a global-and-local attention-based exploration mechanism. Our approach uniquely combines linear and nonlinear estimation techniques, with the nonlinear module dynamically adjusting k based on reward variance to enhance spatiotemporal pattern recognition. This reduces the likelihood of selecting suboptimal arms while improving reward estimation accuracy and computational efficiency. The attention-based mechanism ranks arms by past performance and selection frequency, dynamically adjusting exploration and exploitation in real time without requiring manual tuning of exploration rates. By integrating global attention (assessing all arms collectively) and local attention (focusing on individual arms), LNUCB-TA efficiently adapts to temporal and spatial complexities. Empirical results show LNUCB-TA significantly outperforms state-of-the-art linear, nonlinear, and hybrid bandits in cumulative and mean reward, convergence, and robustness across different exploration rates. Theoretical analysis further confirms its reliability with a sub-linear regret bound.


Statistically Significant $k$NNAD by Selective Inference

arXiv.org Machine Learning

In this paper, we investigate the problem of unsupervised anomaly detection using the k-Nearest Neighbor method. The k-Nearest Neighbor Anomaly Detection (kNNAD) is a simple yet effective approach for identifying anomalies across various domains and fields. A critical challenge in anomaly detection, including kNNAD, is appropriately quantifying the reliability of detected anomalies. To address this, we formulate kNNAD as a statistical hypothesis test and quantify the probability of false detection using $p$-values. The main technical challenge lies in performing both anomaly detection and statistical testing on the same data, which hinders correct $p$-value calculation within the conventional statistical testing framework. To resolve this issue, we introduce a statistical hypothesis testing framework called Selective Inference (SI) and propose a method named Statistically Significant NNAD (Stat-kNNAD). By leveraging SI, the Stat-kNNAD method ensures that detected anomalies are statistically significant with theoretical guarantees. The proposed Stat-kNNAD method is applicable to anomaly detection in both the original feature space and latent feature spaces derived from deep learning models. Through numerical experiments on synthetic data and applications to industrial product anomaly detection, we demonstrate the validity and effectiveness of the Stat-kNNAD method.


Hybrid Machine Learning Models for Intrusion Detection in IoT: Leveraging a Real-World IoT Dataset

arXiv.org Artificial Intelligence

The rapid growth of the Internet of Things (IoT) has revolutionized industries, enabling unprecedented connectivity and functionality. However, this expansion also increases vulnerabilities, exposing IoT networks to increasingly sophisticated cyberattacks. Intrusion Detection Systems (IDS) are crucial for mitigating these threats, and recent advancements in Machine Learning (ML) offer promising avenues for improvement. This research explores a hybrid approach, combining several standalone ML models such as Random Forest (RF), XGBoost, K-Nearest Neighbors (KNN), and AdaBoost, in a voting-based hybrid classifier for effective IoT intrusion detection. This ensemble method leverages the strengths of individual algorithms to enhance accuracy and address challenges related to data complexity and scalability. Using the widely-cited IoT-23 dataset, a prominent benchmark in IoT cybersecurity research, we evaluate our hybrid classifiers for both binary and multi-class intrusion detection problems, ensuring a fair comparison with existing literature. Results demonstrate that our proposed hybrid models, designed for robustness and scalability, outperform standalone approaches in IoT environments. This work contributes to the development of advanced, intelligent IDS frameworks capable of addressing evolving cyber threats.


PDA: Generalizable Detection of AI-Generated Images via Post-hoc Distribution Alignment

arXiv.org Artificial Intelligence

The rapid advancement of generative models has led to the proliferation of highly realistic AI-generated images, posing significant challenges for detection methods to generalize across diverse and evolving generative techniques. Existing approaches often fail to adapt to unknown models without costly retraining, limiting their practicability. To fill this gap, we propose Post-hoc Distribution Alignment (PDA), a novel approach for the generalizable detection for AI-generated images. The key idea is to use the known generative model to regenerate undifferentiated test images. This process aligns the distributions of the re-generated real images with the known fake images, enabling effective distinction from unknown fake images. PDA employs a two-step detection framework: 1) evaluating whether a test image aligns with the known fake distribution based on deep k-nearest neighbor (KNN) distance, and 2) re-generating test images using known generative models to create pseudo-fake images for further classification. This alignment strategy allows PDA to effectively detect fake images without relying on unseen data or requiring retraining. Extensive experiments demonstrate the superiority of PDA, achieving 96.73\% average accuracy across six state-of-the-art generative models, including GANs, diffusion models, and text-to-image models, and improving by 16.07\% over the best baseline. Through t-SNE visualizations and KNN distance analysis, we provide insights into PDA's effectiveness in separating real and fake images. Our work provides a flexible and effective solution for real-world fake image detection, advancing the generalization ability of detection systems.


Privacy-Preserving Hybrid Ensemble Model for Network Anomaly Detection: Balancing Security and Data Protection

arXiv.org Artificial Intelligence

Privacy-preserving network anomaly detection has become an essential area of research due to growing concerns over the protection of sensitive data. Traditional anomaly de- tection models often prioritize accuracy while neglecting the critical aspect of privacy. In this work, we propose a hybrid ensemble model that incorporates privacy-preserving techniques to address both detection accuracy and data protection. Our model combines the strengths of several machine learning algo- rithms, including K-Nearest Neighbors (KNN), Support Vector Machines (SVM), XGBoost, and Artificial Neural Networks (ANN), to create a robust system capable of identifying network anomalies while ensuring privacy. The proposed approach in- tegrates advanced preprocessing techniques that enhance data quality and address the challenges of small sample sizes and imbalanced datasets. By embedding privacy measures into the model design, our solution offers a significant advancement over existing methods, ensuring both enhanced detection performance and strong privacy safeguards.


Near-optimal sample compression for nearest neighbors

Neural Information Processing Systems

We present the first sample compression algorithm for nearest neighbors with nontrivial performance guarantees. We complement these guarantees by demonstrating almost matching hardness lower bounds, which show that our bound is nearly optimal. Our result yields new insight into margin-based nearest neighbor classification in metric spaces and allows us to significantly sharpen and simplify existing bounds. Some encouraging empirical results are also presented.


Finite-Sample Analysis of Fixed-k Nearest Neighbor Density Functional Estimators

Neural Information Processing Systems

We provide finite-sample analysis of a general framework for using k-nearest neighbor statistics to estimate functionals of a nonparametric continuous probability density, including entropies and divergences. Rather than plugging a consistent density estimate (which requires k as the sample size n) into the functional of interest, the estimators we consider fix k and perform a bias correction. This can be more efficient computationally, and, as we show, statistically, leading to faster convergence rates. Our framework unifies several previous estimators, for most of which ours are the first finite sample guarantees.


k*-Nearest Neighbors: From Global to Local

Neural Information Processing Systems

The weighted k-nearest neighbors algorithm is one of the most fundamental non-parametric methods in pattern recognition and machine learning. The question of setting the optimal number of neighbors as well as the optimal weights has received much attention throughout the years, nevertheless this problem seems to have remained unsettled. In this paper we offer a simple approach to locally weighted regression/classification, where we make the bias-variance tradeoff explicit. Our formulation enables us to phrase a notion of optimal weights, and to efficiently find these weights as well as the optimal number of neighbors efficiently and adaptively, for each data point whose value we wish to estimate. The applicability of our approach is demonstrated on several datasets, showing superior performance over standard locally weighted methods.