knn
SupplementaryMaterial
We adapt the convergence proof from [46], it can be summarized in two parts. This definition is the same as Def. 1, but for an arbitrary operator. Our final goal is to show that Gฮฑ(xi,xj) Dฮฑ,Ht(xi,xj), and then conclude with Thm. 1 which gives the equivalence Dฮฑ,Ht(xi,xj) d2ฮฑM(xi,xj). SincewewillletK, we ignore the second term inGฮฑ(xi,xj), as it converges to zero. Bythestronglawoflargenumbers,theconvergence is with probability one.
Regression with reject option and application to kNN
We investigate the problem of regression where one is allowed to abstain from predicting. We refer to this framework as regression with reject option as an extension of classification with reject option. In this context, we focus on the case where the rejection rate is fixed and derive the optimal rule which relies on thresholding the conditional variance function. We provide a semi-supervised estimation procedure of the optimal rule involving two datasets: a first labeled dataset is used to estimate both regression function and conditional variance function while a second unlabeled dataset is exploited to calibrate the desired rejection rate. The resulting predictor with reject option is shown to be almost as good as the optimal predictor with reject option both in terms of risk and rejection rate. We additionally apply our methodology with kNN algorithm and establish rates of convergence for the resulting kNN predictor under mild conditions. Finally, a numerical study is performed to illustrate the benefit of using the proposed procedure.
An Improved Ensemble-Based Machine Learning Model with Feature Optimization for Early Diabetes Prediction
Islam, Md. Najmul, Rimon, Md. Miner Hossain, Shamim, Shah Sadek-E-Akbor, Fahad, Zarif Mohaimen, Mony, Md. Jehadul Islam, Chowdhury, Md. Jalal Uddin
Diabetes is a serious worldwide health issue, and successful intervention depends on early detection. However, overlapping risk factors and data asymmetry make prediction difficult. To use extensive health survey data to create a machine learning framework for diabetes classification that is both accurate and comprehensible, to produce results that will aid in clinical decision-making. Using the BRFSS dataset, we assessed a number of supervised learning techniques. SMOTE and Tomek Links were used to correct class imbalance. To improve prediction performance, both individual models and ensemble techniques such as stacking were investigated. The 2015 BRFSS dataset, which includes roughly 253,680 records with 22 numerical features, is used in this study. Strong ROC-AUC performance of approximately 0.96 was attained by the individual models Random Forest, XGBoost, CatBoost, and LightGBM.The stacking ensemble with XGBoost and KNN yielded the best overall results with 94.82\% accuracy, ROC-AUC of 0.989, and PR-AUC of 0.991, indicating a favourable balance between recall and precision. In our study, we proposed and developed a React Native-based application with a Python Flask backend to support early diabetes prediction, providing users with an accessible and efficient health monitoring tool.
The Generalized Proximity Forest
Shaw, Ben, Rustad, Adam, Maia, Sofia Pelagalli, Rhodes, Jake S., Moon, Kevin R.
Abstract--Recent work has demonstrated the utility of Random Forest (RF) proximities for various supervised machine learning tasks, including outlier detection, missing data imputation, and visualization. However, the utility of the RF proximities depends upon the success of the RF model, which itself is not the ideal model in all contexts. RF proximities have recently been extended to time series by means of the distance-based Proximity Forest (PF) model, among others, affording time series analysis with the benefits of RF proximities. In this work, we introduce the generalized PF model, thereby extending RF proximities to all contexts in which supervised distance-based machine learning can occur . Additionally, we introduce a variant of the PF model for regression tasks. We also introduce the notion of using the generalized PF model as a meta-learning framework, extending supervised imputation capability to any pre-trained classifier . We experimentally demonstrate the unique advantages of the generalized PF model compared with both the RF model and the k-nearest neighbors model.
Scam Shield: Multi-Model Voting and Fine-Tuned LLMs Against Adversarial Attacks
Chang, Chen-Wei, Sarkar, Shailik, Salemi, Hossein, Kim, Hyungmin, Mitra, Shutonu, Purohit, Hemant, Zhang, Fengxiu, Hong, Michin, Cho, Jin-Hee, Lu, Chang-Tien
Scam detection remains a critical challenge in cybersecurity as adversaries craft messages that evade automated filters. We propose a Hierarchical Scam Detection System (HSDS) that combines a lightweight multi-model voting front end with a fine-tuned LLaMA 3.1 8B Instruct back end to improve accuracy and robustness against adversarial attacks. An ensemble of four classifiers provides preliminary predictions through majority vote, and ambiguous cases are escalated to the fine-tuned model, which is optimized with adversarial training to reduce misclassification. Experiments show that this hierarchical design both improves adversarial scam detection and shortens inference time by routing most cases away from the LLM, outperforming traditional machine-learning baselines and proprietary LLM baselines. The findings highlight the effectiveness of a hybrid voting mechanism and adversarial fine-tuning in fortifying LLMs against evolving scam tactics, enhancing the resilience of automated scam detection systems.
A New Perspective on Precision and Recall for Generative Models
Sykes, Benjamin, Simon, Loรฏc, Rabin, Julien, Fadili, Jalal
With the recent success of generative models in image and text, the question of their evaluation has recently gained a lot of attention. While most methods from the state of the art rely on scalar metrics, the introduction of Precision and Recall (PR) for generative model has opened up a new avenue of research. The associated PR curve allows for a richer analysis, but their estimation poses several challenges. In this paper, we present a new framework for estimating entire PR curves based on a binary classification standpoint. We conduct a thorough statistical analysis of the proposed estimates. As a byproduct, we obtain a minimax upper bound on the PR estimation risk. We also show that our framework extends several landmark PR metrics of the literature which by design are restrained to the extreme values of the curve. Finally, we study the different behaviors of the curves obtained experimentally in various settings.
Transformer-Gather, Fuzzy-Reconsider: A Scalable Hybrid Framework for Entity Resolution
Sharifi, Mohammadreza, Ahmadzadeh, Danial
Entity resolution plays a significant role in enterprise systems where data integrity must be rigorously maintained. Traditional methods often struggle with handling noisy data or semantic understanding, while modern methods suffer from computational costs or the excessive need for parallel computation. In this study, we introduce a scalable hybrid framework, which is designed to address several important problems, including scalability, noise robustness, and reliable results. We utilized a pre-trained language model to encode each structured data into corresponding semantic embedding vectors. Subsequently, after retrieving a semantically relevant subset of candidates, we apply a syntactic verification stage using fuzzy string matching techniques to refine classification on the unlabeled data. This approach was applied to a real-world entity resolution task, which exposed a linkage between a central user management database and numerous shared hosting server records. Compared to other methods, this approach exhibits an outstanding performance in terms of both processing time and robustness, making it a reliable solution for a server-side product. Crucially, this efficiency does not compromise results, as the system maintains a high retrieval recall of approximately 0.97. The scalability of the framework makes it deployable on standard CPU-based infrastructure, offering a practical and effective solution for enterprise-level data integrity auditing.
Memory Decoder: A Pretrained, Plug-and-Play Memory for Large Language Models
Cao, Jiaqi, Wang, Jiarui, Wei, Rubin, Guo, Qipeng, Chen, Kai, Zhou, Bowen, Lin, Zhouhan
Large Language Models (LLMs) have shown strong abilities in general language tasks, yet adapting them to specific domains remains a challenge. Current method like Domain Adaptive Pretraining (DAPT) requires costly full-parameter training and suffers from catastrophic forgetting. Meanwhile, Retrieval-Augmented Generation (RAG) introduces substantial inference latency due to expensive nearest-neighbor searches and longer context. This paper introduces Memory Decoder, a plug-and-play pretrained memory that enables efficient domain adaptation without changing the original model's parameters. Memory Decoder employs a small transformer decoder that learns to imitate the behavior of an external non-parametric retriever. Once trained, Memory Decoder can be seamlessly integrated with any pretrained language model that shares the same tokenizer, requiring no model-specific modifications. Experimental results demonstrate that Memory Decoder enables effective adaptation of various Qwen and Llama models to three distinct specialized domains: biomedicine, finance, and law, reducing perplexity by an average of 6.17 points. Overall, Memory Decoder introduces a novel paradigm centered on a specially pretrained memory component designed for domain-specific adaptation. This memory architecture can be integrated in a plug-and-play manner, consistently enhancing performance across multiple models within the target domain.