Support Vector Machines
The R package psvmSDR: A Unified Algorithm for Sufficient Dimension Reduction via Principal Machines
Shin, Jungmin, Shin, Seung Jun, Artemiou, Andreas
Sufficient dimension reduction (SDR), which seeks a lower-dimensional subspace of the predictors containing regression or classification information has been popular in a machine learning community. In this work, we present a new R software package psvmSDR that implements a new class of SDR estimators, which we call the principal machine (PM) generalized from the principal support vector machine (PSVM). The package covers both linear and nonlinear SDR and provides a function applicable to realtime update scenarios. The package implements the descent algorithm for the PMs to efficiently compute the SDR estimators in various situations. This easy-to-use package will be an attractive alternative to the dr R package that implements classical SDR methods.
Comparison of Epilepsy Induced by Ischemic Hypoxic Brain Injury and Hypoglycemic Brain Injury using Multilevel Fusion of Data Features
Kadem, Sameer, Sami, Noor, Elaraby, Ahmed, Alyousif, Shahad, Jalil, Mohammed, Altaee, M., Almusawi, Muntather, Ismaeel, A. Ghany, Kareem, Ali Kamil, Kamalrudin, Massila, ftaiet, Adnan Allwi
The study aims to investigate the similarities and differences in the brain damage caused by Hypoxia-Ischemia (HI), Hypoglycemia, and Epilepsy. Hypoglycemia poses a significant challenge in improving glycemic regulation for insulin-treated patients, while HI brain disease in neonates is associated with low oxygen levels. The study examines the possibility of using a combination of medical data and Electroencephalography (EEG) measurements to predict outcomes over a two-year period. The study employs a multilevel fusion of data features to enhance the accuracy of the predictions. Therefore this paper suggests a hybridized classification model for Hypoxia-Ischemia and Hypoglycemia, Epilepsy brain injury (HCM-BI). A Support Vector Machine is applied with clinical details to define the Hypoxia-Ischemia outcomes of each infant. The newborn babies are assessed every two years again to know the neural development results. A selection of four attributes is derived from the Electroencephalography records, and SVM does not get conclusions regarding the classification of diseases. The final feature extraction of the EEG signal is optimized by the Bayesian Neural Network (BNN) to get the clear health condition of Hypoglycemia and Epilepsy patients. Through monitoring and assessing physical effects resulting from Electroencephalography, The Bayesian Neural Network (BNN) is used to extract the test samples with the most log data and to report hypoglycemia and epilepsy Keywords- Hypoxia-Ischemia , Hypoglycemia , Epilepsy , Multilevel Fusion of Data Features , Bayesian Neural Network (BNN) , Support Vector Machine (SVM)
Optimizing Mortality Prediction for ICU Heart Failure Patients: Leveraging XGBoost and Advanced Machine Learning with the MIMIC-III Database
Ashrafi, Negin, Abdollahi, Armin, Zhang, Jiahong, Pishgar, Maryam
Heart failure affects millions of people worldwide, significantly reducing quality of life and leading to high mortality rates. Despite extensive research, the relationship between heart failure and mortality rates among ICU patients is not fully understood, indicating the need for more accurate prediction models. This study analyzed data from 1,177 patients over 18 years old from the MIMIC-III database, identified using ICD-9 codes. Preprocessing steps included handling missing data, removing duplicates, treating skewness, and using oversampling techniques to address data imbalances. Through rigorous feature selection using Variance Inflation Factor (VIF), expert clinical input, and ablation studies, 46 key features were identified to enhance model performance. Our analysis compared several machine learning models, including Logistic Regression, Support Vector Machine (SVM), Random Forest, LightGBM, and XGBoost. XGBoost emerged as the superior model, achieving a test AUC-ROC of 0.9228 (95\% CI 0.8748 - 0.9613), significantly outperforming our previous work (AUC-ROC of 0.8766) and the best results reported in existing literature (AUC-ROC of 0.824). The improved model's success is attributed to advanced feature selection methods, robust preprocessing techniques, and comprehensive hyperparameter optimization through Grid-Search. SHAP analysis and feature importance evaluations based on XGBoost highlighted key variables like leucocyte count and RDW, providing valuable insights into the clinical factors influencing mortality risk. This framework offers significant support for clinicians, enabling them to identify high-risk ICU heart failure patients and improve patient outcomes through timely and informed interventions.
Towards Split Learning-based Privacy-Preserving Record Linkage
Zervas, Michail, Karakasidis, Alexandros
Split Learning has been recently introduced to facilitate applications where user data privacy is a requirement. However, it has not been thoroughly studied in the context of Privacy-Preserving Record Linkage, a problem in which the same real-world entity should be identified among databases from different dataholders, but without disclosing any additional information. In this paper, we investigate the potentials of Split Learning for Privacy-Preserving Record Matching, by introducing a novel training method through the utilization of Reference Sets, which are publicly available data corpora, showcasing minimal matching impact against a traditional centralized SVM-based technique.
A Hybrid RAG System with Comprehensive Enhancement on Complex Reasoning
Yuan, Ye, Liu, Chengwu, Yuan, Jingyang, Sun, Gongbo, Li, Siqi, Zhang, Ming
Retrieval-augmented generation (RAG) is a framework enabling large language models (LLMs) to enhance their accuracy and reduce hallucinations by integrating external knowledge bases. In this paper, we introduce a hybrid RAG system enhanced through a comprehensive suite of optimizations that significantly improve retrieval quality, augment reasoning capabilities, and refine numerical computation ability. We refined the text chunks and tables in web pages, added attribute predictors to reduce hallucinations, conducted LLM Knowledge Extractor and Knowledge Graph Extractor, and finally built a reasoning strategy with all the references. We evaluated our system on the CRAG dataset through the Meta CRAG KDD Cup 2024 Competition. Both the local and online evaluations demonstrate that our system significantly enhances complex reasoning capabilities. In local evaluations, we have significantly improved accuracy and reduced error rates compared to the baseline model, achieving a notable increase in scores. In the meanwhile, we have attained outstanding results in online assessments, demonstrating the performance and generalization capabilities of the proposed system. The source code for our system is released in \url{https://gitlab.aicrowd.com/shizueyy/crag-new}.
NoPhish: Efficient Chrome Extension for Phishing Detection Using Machine Learning Techniques
Thaqi, Leand, Halili, Arbnor, Vishi, Kamer, Rexha, Blerim
The growth of digitalization services via web browsers has simplified our daily routine of doing business. But at the same time, it has made the web browser very attractive for several cyber-attacks. Web phishing is a well-known cyberattack that is used by attackers camouflaging as trustworthy web servers to obtain sensitive user information such as credit card numbers, bank information, personal ID, social security number, and username and passwords. In recent years many techniques have been developed to identify the authentic web pages that users visit and warn them when the webpage is phishing. In this paper, we have developed an extension for Chrome the most favorite web browser, that will serve as a middleware between the user and phishing websites. The Chrome extension named "NoPhish" shall identify a phishing webpage based on several Machine Learning techniques. We have used the training dataset from "PhishTank" and extracted the 22 most popular features as rated by the Alexa database. The training algorithms used are Random Forest, Support Vector Machine, and k-Nearest Neighbor. The performance results show that Random Forest delivers the best precision.
Real-Time Weather Image Classification with SVM
Ship, Eden, Spivak, Eitan, Agarwal, Shubham, Birman, Raz, Hadar, Ofer
Accurate classification of weather conditions in images is essential for enhancing the performance of object detection and classification models under varying weather conditions. This paper presents a comprehensive study on classifying weather conditions in images into four categories: rainy, low light, haze, and clear. The motivation for this work stems from the need to improve the reliability and efficiency of automated systems, such as autonomous vehicles and surveillance, which must operate under diverse weather conditions. Misclassification of weather conditions can lead to significant performance degradation in these systems, making robust weather classification crucial. Utilizing the Support Vector Machine (SVM) algorithm, our approach leverages a robust set of features, including brightness, saturation, noise level, blur metric, edge strength, motion blur, Local Binary Patterns (LBP) mean and variance for radii 1, 2, and 3, edges mean and variance, and color histogram mean and variance for blue, green, and red channels. Our SVM-based method achieved a notable accuracy of 92.8%, surpassing typical benchmarks in the literature, which range from 80% to 90% for classical machine learning methods. While deep learning methods can achieve up to 94% accuracy, our approach offers a competitive advantage in terms of computational efficiency and real-time classification capabilities. Detailed analysis of each feature's contribution highlights the effectiveness of texture, color, and edge-related features in capturing the unique characteristics of different weather conditions. This research advances the state-of-the-art in weather image classification and provides insights into the critical features necessary for accurate weather condition differentiation, underscoring the potential of SVMs in practical applications where accuracy is paramount.
GL-TSVM: A robust and smooth twin support vector machine with guardian loss function
Akhtar, Mushir, Tanveer, M., Arshad, Mohd.
Twin support vector machine (TSVM), a variant of support vector machine (SVM), has garnered significant attention due to its $3/4$ times lower computational complexity compared to SVM. However, due to the utilization of the hinge loss function, TSVM is sensitive to outliers or noise. To remedy it, we introduce the guardian loss (G-loss), a novel loss function distinguished by its asymmetric, bounded, and smooth characteristics. We then fuse the proposed G-loss function into the TSVM and yield a robust and smooth classifier termed GL-TSVM. Further, to adhere to the structural risk minimization (SRM) principle and reduce overfitting, we incorporate a regularization term into the objective function of GL-TSVM. To address the optimization challenges of GL-TSVM, we devise an efficient iterative algorithm. The experimental analysis on UCI and KEEL datasets substantiates the effectiveness of the proposed GL-TSVM in comparison to the baseline models. Moreover, to showcase the efficacy of the proposed GL-TSVM in the biomedical domain, we evaluated it on the breast cancer (BreaKHis) and schizophrenia datasets. The outcomes strongly demonstrate the competitiveness of the proposed GL-TSVM against the baseline models.
Is text normalization relevant for classifying medieval charters?
Atzenhofer-Baumgartner, Florian, Kovács, Tamás
This study examines the impact of historical text normalization on the classification of medieval charters, specifically focusing on document dating and locating. Using a data set of Middle High German charters from a digital archive, we evaluate various classifiers, including traditional and transformer-based models, with and without normalization. Our results indicate that the given normalization minimally improves locating tasks but reduces accuracy for dating, implying that original texts contain crucial features that normalization may obscure. We find that support vector machines and gradient boosting outperform other models, questioning the efficiency of transformers for this use case. Results suggest a selective approach to historical text normalization, emphasizing the significance of preserving some textual characteristics that are critical for classification tasks in document analysis.
Enhancing Customer Churn Prediction in Telecommunications: An Adaptive Ensemble Learning Approach
Shaikhsurab, Mohammed Affan, Magadum, Pramod
Customer churn, the discontinuation of services by existing customers, poses a significant challenge to the telecommunications industry. This paper proposes a novel adaptive ensemble learning framework for highly accurate customer churn prediction. The framework integrates multiple base models, including XGBoost, LightGBM, LSTM, a Multi-Layer Perceptron (MLP) neural network, and Support Vector Machine (SVM). These models are strategically combined using a stacking ensemble method, further enhanced by meta-feature generation from base model predictions. A rigorous data preprocessing pipeline, coupled with a multi-faceted feature engineering approach, optimizes model performance. The framework is evaluated on three publicly available telecom churn datasets, demonstrating substantial accuracy improvements over state-of-the-art techniques. The research achieves a remarkable 99.28% accuracy, signifying a major advancement in churn prediction.The implications of this research for developing proactive customer retention strategies withinthe telecommunications industry are discussed.