Support Vector Machines
Explainable AI for Comparative Analysis of Intrusion Detection Models
Corea, Pap M., Liu, Yongxin, Wang, Jian, Niu, Shuteng, Song, Houbing
Explainable Artificial Intelligence (XAI) has become a widely discussed topic, the related technologies facilitate better understanding of conventional black-box models like Random Forest, Neural Networks and etc. However, domain-specific applications of XAI are still insufficient. To fill this gap, this research analyzes various machine learning models to the tasks of binary and multi-class classification for intrusion detection from network traffic on the same dataset using occlusion sensitivity. The models evaluated include Linear Regression, Logistic Regression, Linear Support Vector Machine (SVM), K-Nearest Neighbors (KNN), Random Forest, Decision Trees, and Multi-Layer Perceptrons (MLP). We trained all models to the accuracy of 90\% on the UNSW-NB15 Dataset. We found that most classifiers leverage only less than three critical features to achieve such accuracies, indicating that effective feature engineering could actually be far more important for intrusion detection than applying complicated models. We also discover that Random Forest provides the best performance in terms of accuracy, time efficiency and robustness. Data and code available at https://github.com/pcwhy/XML-IntrusionDetection.git
Optimizing PM2.5 Forecasting Accuracy with Hybrid Meta-Heuristic and Machine Learning Models
Ghafariasl, Parviz, Zeinalnezhad, Masoomeh, Ahmadishokooh, Amir
Timely alerts about hazardous air pollutants are crucial for public health. However, existing forecasting models often overlook key factors like baseline parameters and missing data, limiting their accuracy. This study introduces a hybrid approach to address these issues, focusing on forecasting hourly PM2.5 concentrations using Support Vector Regression (SVR). Meta-heuristic algorithms, Grey Wolf Optimization (GWO) and Particle Swarm Optimization (PSO), optimize SVR Hyper-parameters "C" and "Gamma" to enhance prediction accuracy. Evaluation metrics include R-squared (R2), Root Mean Square Error (RMSE), and Mean Absolute Error (MAE). Results show significant improvements with PSO-SVR (R2: 0.9401, RMSE: 0.2390, MAE: 0.1368) and GWO-SVR (R2: 0.9408, RMSE: 0.2376, MAE: 0.1373), indicating robust and accurate models suitable for similar research applications.
Predicting Fairness of ML Software Configurations
Herrera, Salvador Robles, Monjezi, Verya, Kreinovich, Vladik, Trivedi, Ashutosh, Tizpaz-Niari, Saeid
This paper investigates the relationships between hyperparameters of machine learning and fairness. Data-driven solutions are increasingly used in critical socio-technical applications where ensuring fairness is important. Rather than explicitly encoding decision logic via control and data structures, the ML developers provide input data, perform some pre-processing, choose ML algorithms, and tune hyperparameters (HPs) to infer a program that encodes the decision logic. Prior works report that the selection of HPs can significantly influence fairness. However, tuning HPs to find an ideal trade-off between accuracy, precision, and fairness has remained an expensive and tedious task. Can we predict fairness of HP configuration for a given dataset? Are the predictions robust to distribution shifts? We focus on group fairness notions and investigate the HP space of 5 training algorithms. We first find that tree regressors and XGBoots significantly outperformed deep neural networks and support vector machines in accurately predicting the fairness of HPs. When predicting the fairness of ML hyperparameters under temporal distribution shift, the tree regressors outperforms the other algorithms with reasonable accuracy. However, the precision depends on the ML training algorithm, dataset, and protected attributes. For example, the tree regressor model was robust for training data shift from 2014 to 2018 on logistic regression and discriminant analysis HPs with sex as the protected attribute; but not for race and other training algorithms. Our method provides a sound framework to efficiently perform fine-tuning of ML training algorithms and understand the relationships between HPs and fairness.
Advantages of quantum support vector machine in cross-domain classification of quantum states
Sharma, Diksha, Sabale, Vivek Balasaheb, Singh, Parvinder, Kumar, Atul
In this study, we use cross-domain classification using quantum machine learning for quantum advantages to address the entanglement versus separability paradigm. We further demonstrate the efficient classification of Bell diagonal states into zero and non-zero discord classes. The inherited structure of quantum states and its relation with a particular class of quantum states are exploited to intuitively approach the classification of different domain testing states, referred here as crossdomain classification. In addition, we extend our analysis to evaluate the robustness of our model for the analyzed problem using random unitary transformations. Using numerical analysis, our results clearly demonstrate the potential of QSVM for classifying quantum states across the multidimensional Hilbert space.
Enhanced Heart Sound Classification Using Mel Frequency Cepstral Coefficients and Comparative Analysis of Single vs. Ensemble Classifier Strategies
Rahmani, Amir Masoud, Haider, Amir, Adeli, Mohammad, Mzoughi, Olfa, Gemeay, Entesar, Mohammadi, Mokhtar, Alinejad-Rokny, Hamid, Khoshvaght, Parisa, Hosseinzadeh, Mehdi
These authors contributed equally to this work. Abstract This paper explores the efficacy of Mel Frequency Cepstral Coefficients (MFCCs) in detecting abnormal heart sounds using two classification strategies: a single classifier and an ensemble classifier approach. Heart sounds were first pre-processed to remove noise and then segmented into S1, systole, S2, and diastole intervals, with thirteen MFCCs estimated from each segment, yielding 52 MFCCs per beat. Finally, MFCCs were used for heart sound classification. For that purpose, in the single classifier strategy, the MFCCs from nine consecutive beats were averaged to classify heart sounds by a single classifier (either a support vector machine (SVM), the k nearest neighbors (kNN), or a decision tree (DT)). Conversely, the ensemble classifier strategy employed nine classifiers (either nine SVMs, nine kNN classifiers, or nine DTs) to individually assess beats as normal or abnormal, with the overall classification based on the majority vote. Both methods were tested on a publicly available phonocardiogram database. The heart sound classification accuracy was 91.95% for the SVM, 91.9% for the kNN, and 87.33% for the DT in the single classifier strategy. Also, the accuracy was 93.59% for the SVM, 91.84% for the kNN, and 92.22% for the DT in the ensemble classifier strategy. Overall, the results demonstrated that the ensemble classifier strategy improved the accuracies of the DT and the SVM by 4.89% and 1.64%, establishing MFCCs as more effective than other features, including time, time-frequency, and statistical features, evaluated in similar studies.
Spatial features of CO2 for occupancy detection in a naturally ventilated school building
Huang, Qirui, Syndicus, Marc, Frisch, Jérôme, van Treeck, Christoph
Accurate occupancy information helps to improve building energy efficiency and occupant comfort. Occupancy detection methods based on CO2 sensors have received attention due to their low cost and low intrusiveness. In naturally ventilated buildings, the accuracy of CO2-based occupancy detection is generally low in related studies due to the complex ventilation behavior and the difficulty in measuring the actual air exchange through windows. In this study, we present two novel features for occupancy detection based on the spatial distribution of the CO2 concentration. After a quantitative analysis with Support Vector Machine (SVM) as classifier, it was found that the accuracy of occupancy state detection in naturally ventilated rooms could be improved by up to 14.8 percentage points compared to the baseline, reaching 83.2 % (F1 score 0.84) without any ventilation information. With ventilation information, the accuracy reached 87.6 % (F1 score 0.89). The performance of occupancy quantity detection was significantly improved by up to 25.3 percentage points versus baseline, reaching 56 %, with root mean square error (RMSE) of 11.44 occupants, using only CO2-related features. Additional ventilation information further enhanced the performance to 61.8 % (RMSE 9.02 occupants). By incorporating spatial features, the model using only CO2-related features revealed similar performance as the model containing additional ventilation information, resulting in a better low-cost occupancy detection method for naturally ventilated buildings.
A Fast Learning-Based Surrogate of Electrical Machines using a Reduced Basis
Ribés, Alejandro, Benchekroun, Nawfal, Delagnes, Théo
A surrogate model approximates the outputs of a solver of Partial Differential Equations (PDEs) with a low computational cost. In this article, we propose a method to build learning-based surrogates in the context of parameterized PDEs, which are PDEs that depend on a set of parameters but are also temporal and spatial processes. Our contribution is a method hybridizing the Proper Orthogonal Decomposition and several Support Vector Regression machines. This method is conceived to work in real-time, thus aimed for being used in the context of digital twins, where a user can perform an interactive analysis of results based on the proposed surrogate. We present promising results on two use cases concerning electrical machines. These use cases are not toy examples but are produced an industrial computational code, they use meshes representing non-trivial geometries and contain non-linearities.
Exploring Gender-Specific Speech Patterns in Automatic Suicide Risk Assessment
Gerczuk, Maurice, Amiriparian, Shahin, Lutz, Justina, Strube, Wolfgang, Papazova, Irina, Hasan, Alkomiet, Schuller, Björn W.
In emergency medicine, timely intervention for patients at risk of suicide is often hindered by delayed access to specialised psychiatric care. To bridge this gap, we introduce a speech-based approach for automatic suicide risk assessment. Our study involves a novel dataset comprising speech recordings of 20 patients who read neutral texts. We extract four speech representations encompassing interpretable and deep features. Further, we explore the impact of gender-based modelling and phrase-level normalisation. By applying gender-exclusive modelling, features extracted from an emotion fine-tuned wav2vec2.0 model can be utilised to discriminate high- from low- suicide risk with a balanced accuracy of 81%. Finally, our analysis reveals a discrepancy in the relationship of speech characteristics and suicide risk between female and male subjects. For men in our dataset, suicide risk increases together with agitation while voice characteristics of female subjects point the other way.
An Autotuning-based Optimization Framework for Mixed-kernel SVM Classifications in Smart Pixel Datasets and Heterojunction Transistors
Wu, Xingfu, Oli, Tupendra, Qian, ustin H., Taylor, Valerie, Hersam, Mark C., Sangwan, Vinod K.
Support Vector Machine (SVM) is a state-of-the-art classification method widely used in science and engineering due to its high accuracy, its ability to deal with high dimensional data, and its flexibility in modeling diverse sources of data. In this paper, we propose an autotuning-based optimization framework to quantify the ranges of hyperparameters in SVMs to identify their optimal choices, and apply the framework to two SVMs with the mixed-kernel between Sigmoid and Gaussian kernels for smart pixel datasets in high energy physics (HEP) and mixed-kernel heterojunction transistors (MKH). Our experimental results show that the optimal selection of hyperparameters in the SVMs and the kernels greatly varies for different applications and datasets, and choosing their optimal choices is critical for a high classification accuracy of the mixed kernel SVMs. Uninformed choices of hyperparameters C and coef0 in the mixed-kernel SVMs result in severely low accuracy, and the proposed framework effectively quantifies the proper ranges for the hyperparameters in the SVMs to identify their optimal choices to achieve the highest accuracy 94.6\% for the HEP application and the highest average accuracy 97.2\% with far less tuning time for the MKH application.
A GPU-Accelerated Bi-linear ADMM Algorithm for Distributed Sparse Machine Learning
Olama, Alireza, Lundell, Andreas, Kronqvist, Jan, Ahmadi, Elham, Camponogara, Eduardo
This paper introduces the Bi-linear consensus Alternating Direction Method of Multipliers (Bi-cADMM), aimed at solving large-scale regularized Sparse Machine Learning (SML) problems defined over a network of computational nodes. Mathematically, these are stated as minimization problems with convex local loss functions over a global decision vector, subject to an explicit $\ell_0$ norm constraint to enforce the desired sparsity. The considered SML problem generalizes different sparse regression and classification models, such as sparse linear and logistic regression, sparse softmax regression, and sparse support vector machines. Bi-cADMM leverages a bi-linear consensus reformulation of the original non-convex SML problem and a hierarchical decomposition strategy that divides the problem into smaller sub-problems amenable to parallel computing. In Bi-cADMM, this decomposition strategy is based on a two-phase approach. Initially, it performs a sample decomposition of the data and distributes local datasets across computational nodes. Subsequently, a delayed feature decomposition of the data is conducted on Graphics Processing Units (GPUs) available to each node. This methodology allows Bi-cADMM to undertake computationally intensive data-centric computations on GPUs, while CPUs handle more cost-effective computations. The proposed algorithm is implemented within an open-source Python package called Parallel Sparse Fitting Toolbox (PsFiT), which is publicly available. Finally, computational experiments demonstrate the efficiency and scalability of our algorithm through numerical benchmarks across various SML problems featuring distributed datasets.