Support Vector Machines
The Role of Hyperparameters in Predictive Multiplicity
Cavus, Mustafa, Woลบnica, Katarzyna, Biecek, Przemysลaw
This paper investigates the critical role of hyperparameters in predictive multiplicity, where different machine learning models trained on the same dataset yield divergent predictions for identical inputs. These inconsistencies can seriously impact high-stakes decisions such as credit assessments, hiring, and medical diagnoses. Focusing on six widely used models for tabular data - Elastic Net, Decision Tree, k-Nearest Neighbor, Support Vector Machine, Random Forests, and Extreme Gradient Boosting - we explore how hyperparameter tuning influences predictive multiplicity, as expressed by the distribution of prediction discrepancies across benchmark datasets. Key hyperparameters such as lambda in Elastic Net, gamma in Support Vector Machines, and alpha in Extreme Gradient Boosting play a crucial role in shaping predictive multiplicity, often compromising the stability of predictions within specific algorithms. Our experiments on 21 benchmark datasets reveal that tuning these hyperparameters leads to notable performance improvements but also increases prediction discrepancies, with Extreme Gradient Boosting exhibiting the highest discrepancy and substantial prediction instability. This highlights the trade-off between performance optimization and prediction consistency, raising concerns about the risk of arbitrary predictions. These findings provide insight into how hyperparameter optimization leads to predictive multiplicity. While predictive multiplicity allows prioritizing domain-specific objectives such as fairness and reduces reliance on a single model, it also complicates decision-making, potentially leading to arbitrary or unjustified outcomes.
Single-Qudit Quantum Neural Networks for Multiclass Classification
Souza, Leandro C., Portugal, Renato
This paper proposes a single-qudit quantum neural network for multiclass classification, by using the enhanced representational capacity of high-dimensional qudit states. Our design employs an $d$-dimensional unitary operator, where $d$ corresponds to the number of classes, constructed using the Cayley transform of a skew-symmetric matrix, to efficiently encode and process class information. This architecture enables a direct mapping between class labels and quantum measurement outcomes, reducing circuit depth and computational overhead. To optimize network parameters, we introduce a hybrid training approach that combines an extended activation function -- derived from a truncated multivariable Taylor series expansion -- with support vector machine optimization for weight determination. We evaluate our model on the MNIST and EMNIST datasets, demonstrating competitive accuracy while maintaining a compact single-qudit quantum circuit. Our findings highlight the potential of qudit-based QNNs as scalable alternatives to classical deep learning models, particularly for multiclass classification. However, practical implementation remains constrained by current quantum hardware limitations. This research advances quantum machine learning by demonstrating the feasibility of higher-dimensional quantum systems for efficient learning tasks.
Optimizing Fire Safety: Reducing False Alarms Using Advanced Machine Learning Techniques
Jamal, Muhammad Hassan, Alazeb, Abdulwahab, Bakhsh, Shahid Allah, Boulila, Wadii, Shah, Syed Aziz, Khattak, Aizaz Ahmad, Khan, Muhammad Shahbaz
Fire safety practices are important to reduce the extent of destruction caused by fire. While smoke alarms help save lives, firefighters struggle with the increasing number of false alarms. This paper presents a precise and efficient Weighted ensemble model for decreasing false alarms. It estimates the density, computes weights according to the high and low-density regions, forwards the high region weights to KNN and low region weights to XGBoost and combines the predictions. The proposed model is effective at reducing response time, increasing fire safety, and minimizing the damage that fires cause. A specifically designed dataset for smoke detection is utilized to test the proposed model. In addition, a variety of ML models, such as Logistic Regression (LR), Decision Tree (DT), Random Forest (RF), Nai:ve Bayes (NB), K-Nearest Neighbour (KNN), Support Vector Machine (SVM), Extreme Gradient Boosting (XGBoost), Adaptive Boosting (ADAB), have also been utilized. To maximize the use of the smoke detection dataset, all the algorithms utilize the SMOTE re-sampling technique. After evaluating the assessment criteria, this paper presents a concise summary of the comprehensive findings obtained by comparing the outcomes of all models.
Automatic welding detection by an intelligent tool pipe inspection
Arizmendi, C J, Garcia, W L, Quintero, M A
This work provide a model based on machine learning techniques in welds recognition, based on signals obtained through in-line inspection tool called "smart pig" in Oil and Gas pipelines. The model uses a signal noise reduction phase by means of pre-processing algorithms and attribute-selection techniques. The noise reduction techniques were selected after a literature review and testing with survey data. Subsequently, the model was trained using recognition and classification algorithms, specifically artificial neural networks and support vector machines. Finally, the trained model was validated with different data sets and the performance was measured with cross validation and ROC analysis. The results show that is possible to identify welding automatically with an efficiency between 90 and 98 percent.
Machine learning algorithms to predict stroke in China based on causal inference of time series analysis
Zheng, Qizhi, Zhao, Ayang, Wang, Xinzhu, Bai, Yanhong, Wang, Zikun, Wang, Xiuying, Zeng, Xianzhang, Dong, Guanghui
Participants: This study employed a combination of Vector Autoregression (VAR) model and Graph Neural Networks (GNN) to systematically construct dynamic causal inference. Multiple classic classification algorithms were compared, including Random Forest, Logistic Regression, XGBoost, Support Vector Machine (SVM), K-Nearest Neighbor (KNN), Gradient Boosting, and Multi Layer Perceptron (MLP). The SMOTE algorithm was used to undersample a small number of samples and employed Stratified K-fold Cross Validation. Results: This study included a total of 11,789 participants, including 6,334 females (53.73%) and 5,455 males (46.27%), with an average age of 65 years. Introduction of dynamic causal inference features has significantly improved the performance of almost all models. The area under the ROC curve of each model ranged from 0.78 to 0.83, indicating significant difference (P < 0.01). Among all the models, the Gradient Boosting model demonstrated the highest performance and stability. Model explanation and feature importance analysis generated model interpretation that illustrated significant contributors associated with risks of stroke. Conclusions and Relevance: This study proposes a stroke risk prediction method that combines dynamic causal inference with machine learning models, significantly improving prediction accuracy and revealing key health factors that affect stroke. The research results indicate that dynamic causal inference features have important value in predicting stroke risk, especially in capturing the impact of changes in health status over time on stroke risk. By further optimizing the model and introducing more variables, this study provides theoretical basis and practical guidance for future stroke prevention and intervention strategies.
Leveraging Large Language Models to Address Data Scarcity in Machine Learning: Applications in Graphene Synthesis
Biswajeet, Devi Dutta, Kadkhodaei, Sara
Machine learning in materials science faces challenges due to limited experimental data, as generating synthesis data is costly and time-consuming, especially with in-house experiments. Mining data from existing literature introduces issues like mixed data quality, inconsistent formats, and variations in reporting experimental parameters, complicating the creation of consistent features for the learning algorithm. Additionally, combining continuous and discrete features can hinder the learning process with limited data. Here, we propose strategies that utilize large language models (LLMs) to enhance machine learning performance on a limited, heterogeneous dataset of graphene chemical vapor deposition synthesis compiled from existing literature. These strategies include prompting modalities for imputing missing data points and leveraging large language model embeddings to encode the complex nomenclature of substrates reported in chemical vapor deposition experiments. The proposed strategies enhance graphene layer classification using a support vector machine (SVM) model, increasing binary classification accuracy from 39% to 65% and ternary accuracy from 52% to 72%. We compare the performance of the SVM and a GPT-4 model, both trained and fine-tuned on the same data. Our results demonstrate that the numerical classifier, when combined with LLM-driven data enhancements, outperforms the standalone LLM predictor, highlighting that in data-scarce scenarios, improving predictive learning with LLM strategies requires more than simple fine-tuning on datasets. Instead, it necessitates sophisticated approaches for data imputation and feature space homogenization to achieve optimal performance. The proposed strategies emphasize data enhancement techniques, offering a broadly applicable framework for improving machine learning performance on scarce, inhomogeneous datasets.
Efficient Distributed Learning over Decentralized Networks with Convoluted Support Vector Machine
Chen, Canyi, Qiao, Nan, Zhu, Liping
Massive datasets, characterized by both large sample sizes and high-dimensional features, are increasingly prevalent across diverse fields. For example, the 1000 Genomes Project Consortium et al. (2015) study amassed genomic data from 2,504 individuals spanning 26 populations, yielding approximately 12 terabytes data. Often, such datasets are distributed across multiple locations. Fusing data together for centralized statistical analysis is somehow infeasible due to concerns over data privacy, memory and storage limitations, and bandwidth constraints. The absence of fusion centers has thus fueled interest in decentralized distributed learning--a paradigm that fully exploits distributed datasets by performing computations locally. This methodology has found successful applications in fields such as personalized medicine, edge computing, smart utilities, and dimension reduction (Li et al., 2011). A fundamental task in these applications is classification. Penalized support vector machines (SVMs) have been enduringly powerful tools for high-dimensional classification tasks, building on the seminal contributions of Boser et al. (1992) and Vapnik (2000). The standard objective function for penalized SVMs combines the hinge loss with a penalty term.
Statistical Study of Sensor Data and Investigation of ML-based Calibration Algorithms for Inexpensive Sensor Modules: Experiments from Cape Point
Barrett, Travis, Mishra, Amit Kumar
In this paper we present the statistical analysis of data from inexpensive sensors. We also present the performance of machine learning algorithms when used for automatic calibration such sensors. In this we have used low-cost Non-Dispersive Infrared CO$_2$ sensor placed at a co-located site at Cape Point, South Africa (maintained by Weather South Africa). The collected low-cost sensor data and site truth data are investigated and compared. We compare and investigate the performance of Random Forest Regression, Support Vector Regression, 1D Convolutional Neural Network and 1D-CNN Long Short-Term Memory Network models as a method for automatic calibration and the statistical properties of these model predictions. In addition, we also investigate the drift in performance of these algorithms with time.
Agile Climate-Sensor Design and Calibration Algorithms Using Machine Learning: Experiments From Cape Point
Barrett, Travis, Mishra, Amit Kumar
In this paper, we describe the design of an inexpensive and agile climate sensor system which can be repurposed easily to measure various pollutants. We also propose the use of machine learning regression methods to calibrate CO2 data from this cost-effective sensing platform to a reference sensor at the South African Weather Service's Cape Point measurement facility. We show the performance of these methods and found that Random Forest Regression was the best in this scenario. This shows that these machine learning methods can be used to improve the performance of cost-effective sensor platforms and possibly extend the time between manual calibration of sensor networks.
Accurate predictive model of band gap with selected important features based on explainable machine learning
In the rapidly advancing field of materials informatics, nonlinear machine learning models have demonstrated exceptional predictive capabilities for material properties. However, their black-box nature limits interpretability, and they may incorporate features that do not contribute to--or even deteriorate--model performance. This study employs explainable ML (XML) techniques, including permutation feature importance and the SHapley Additive exPlanation, applied to a pristine support vector regression model designed to predict band gaps at the GW level using 18 input features. Guided by XML-derived individual feature importance, a simple framework is proposed to construct reduced-feature predictive models. Model evaluations indicate that an XML-guided compact model, consisting of the top five features, achieves comparable accuracy to the pristine model on in-domain datasets while demonstrating superior generalization with lower prediction errors on out-of-domain data. Additionally, the study underscores the necessity for eliminating strongly correlated features to prevent misinterpretation and overestimation of feature importance before applying XML. This study highlights XML's effectiveness in developing simplified yet highly accurate machine learning models by clarifying feature roles.