Goto

Collaborating Authors

 Support Vector Machines


Quantum-Hybrid Support Vector Machines for Anomaly Detection in Industrial Control Systems

arXiv.org Artificial Intelligence

Sensitive data captured by Industrial Control Systems (ICS) play a large role in the safety and integrity of many critical infrastructures. Detection of anomalous or malicious data, or Anomaly Detection (AD), with machine learning is one of many vital components of cyberphysical security. Quantum kernel-based machine learning methods have shown promise in identifying complex anomalous behavior by leveraging the highly expressive and efficient feature spaces of quantum computing. This study focuses on the parameterization of Quantum Hybrid Support Vector Machines (QSVMs) using three popular datasets from Cyber-Physical Systems (CPS). The results demonstrate that QSVMs outperform traditional classical kernel methods, achieving 13.3% higher F1 scores. Additionally, this research investigates noise using simulations based on real IBMQ hardware, revealing a maximum error of only 0.98% in the QSVM kernels. This error results in an average reduction of 1.57% in classification metrics. Furthermore, the study found that QSVMs show a 91.023% improvement in kernel-target alignment compared to classical methods, indicating a potential "quantum advantage" in anomaly detection for critical infrastructures. This effort suggests that QSVMs can provide a substantial advantage in anomaly detection for ICS, ultimately enhancing the security and integrity of critical infrastructures.


PuckTrick: A Library for Making Synthetic Data More Realistic

arXiv.org Artificial Intelligence

The increasing reliance on machine learning (ML) models for decision-making requires high-quality training data. However, access to real-world datasets is often restricted due to privacy concerns, proprietary restrictions, and incomplete data availability. As a result, synthetic data generation (SDG) has emerged as a viable alternative, enabling the creation of artificial datasets that preserve the statistical properties of real data while ensuring privacy compliance. Despite its advantages, synthetic data is often overly clean and lacks real-world imperfections, such as missing values, noise, outliers, and misclassified labels, which can significantly impact model generalization and robustness. To address this limitation, we introduce Pucktrick, a Python library designed to systematically contaminate synthetic datasets by introducing controlled errors. The library supports multiple error types, including missing data, noisy values, outliers, label misclassification, duplication, and class imbalance, offering a structured approach to evaluating ML model resilience under real-world data imperfections. Pucktrick provides two contamination modes: one for injecting errors into clean datasets and another for further corrupting already contaminated datasets. Through extensive experiments on real-world financial datasets, we evaluate the impact of systematic data contamination on model performance. Our findings demonstrate that ML models trained on contaminated synthetic data outperform those trained on purely synthetic, error-free data, particularly for tree-based and linear models such as SVMs and Extra Trees.


Predicting Mild Cognitive Impairment Using Naturalistic Driving and Trip Destination Modeling

arXiv.org Artificial Intelligence

Understanding the relationship between mild cognitive impairment (MCI) and driving behavior is essential for enhancing road safety, particularly among older adults. This study introduces a novel approach by incorporating specific trip destinations-such as home, work, medical appointments, social activities, and errands-using geohashing to analyze the driving habits of older drivers in Nebraska. We employed a two-fold methodology that combines data visualization with advanced machine learning models, including C5.0, Random Forest, and Support Vector Machines, to assess the effectiveness of these location-based variables in predicting cognitive impairment. Notably, the C5.0 model showed a robust and stable performance, achieving a median recall of 0.68, which indicates that our methodology accurately identifies cognitive impairment in drivers 68\% of the time. This emphasizes our model's capacity to reduce false negatives, a crucial factor given the profound implications of failing to identify impaired drivers. Our findings underscore the innovative use of life-space variables in understanding and predicting cognitive decline, offering avenues for early intervention and tailored support for affected individuals.


Boltzmann Classifier: A Thermodynamic-Inspired Approach to Supervised Learning

arXiv.org Artificial Intelligence

We present the Boltzmann classifier, a novel distance based probabilistic classification algorithm inspired by the Boltzmann distribution. Unlike traditional classifiers that produce hard decisions or uncalibrated probabilities, the Boltzmann classifier assigns class probabilities based on the average distance to the nearest neighbors within each class, providing interpretable, physically meaningful outputs. We evaluate the performance of the method across three application domains: molecular activity prediction, oxidation state classification of transition metal complexes, and breast cancer diagnosis. In the molecular activity task, the classifier achieved the highest accuracy in predicting active compounds against two protein targets, with strong correlations observed between the predicted probabilities and experimental pIC50 values. For metal complexes, the classifier accurately distinguished between oxidation states II and III for Fe, Mn, and Co, using only metal-ligand bond lengths extracted from crystallographic data, and demonstrated high consistency with known chemical trends. In the breast cancer dataset, the classifier achieved 97% accuracy, with low confidence predictions concentrated in inherently ambiguous cases. Across all tasks, the Boltzmann classifier performed competitively or better than standard models such as logistic regression, support vector machines, random forests, and k-nearest neighbors. Its probabilistic outputs were found to correlate with continuous physical or biological properties, highlighting its potential utility in both classification and regression contexts. The results suggest that the Boltzmann classifier is a robust and interpretable alternative to conventional machine learning approaches, particularly in scientific domains where underlying structure property relationships are important.


Determinaรงรฃo Automรกtica de Limiar de Detecรงรฃo de Ataques em Redes de Computadores Utilizando Autoencoders

arXiv.org Artificial Intelligence

Currently, digital security mechanisms like Anomaly Detection Systems using Autoencoders (AE) show great potential for bypassing problems intrinsic to the data, such as data imbalance. Because AE use a non-trivial and nonstandardized separation threshold to classify the extracted reconstruction error, the definition of this threshold directly impacts the performance of the detection process. Thus, this work proposes the automatic definition of this threshold using some machine learning algorithms. For this, three algorithms were evaluated: the K-Nearst Neighbors, the K-Means and the Support Vector Machine.


Generalized Reference Kernel With Negative Samples For Support Vector One-class Classification

arXiv.org Artificial Intelligence

This paper focuses on small-scale one-class classification with some negative samples available. We propose Generalized Reference Kernel with Negative Samples (GRKneg) for One-class Support Vector Machine (OC-SVM). We study different ways to select/generate the reference vectors and recommend an approach for the problem at hand. It is worth noting that the proposed method does not use any labels in the model optimization but uses the original OC-SVM implementation. Only the kernel used in the process is improved using the negative data. We compare our method with the standard OC-SVM and with the binary Support Vector Machine (SVM) using different amounts of negative samples. Our approach consistently outperforms the standard OC-SVM using Radial Basis Function kernel. When there are plenty of negative samples, the binary SVM outperforms the one-class approaches as expected, but we show that for the lowest numbers of negative samples the proposed approach clearly outperforms the binary SVM.


Predicting Anthropometric Body Composition Variables Using 3D Optical Imaging and Machine Learning

arXiv.org Artificial Intelligence

Accurate prediction of anthropometric body composition variables, such as Appendicular Lean Mass (ALM), Body Fat Percentage (BFP), and Bone Mineral Density (BMD), is essential for early diagnosis of several chronic diseases. Currently, researchers rely on Dual-Energy X-ray Absorptiometry (DXA) scans to measure these metrics; however, DXA scans are costly and time-consuming. This work proposes an alternative to DXA scans by applying statistical and machine learning models on biomarkers (height, volume, left calf circumference, etc) obtained from 3D optical images. The dataset consists of 847 patients and was sourced from Pennington Biomedical Research Center. Extracting patients' data in healthcare faces many technical challenges and legal restrictions. However, most supervised machine learning algorithms are inherently data-intensive, requiring a large amount of training data. To overcome these limitations, we implemented a semi-supervised model, the $p$-Laplacian regression model. This paper is the first to demonstrate the application of a $p$-Laplacian model for regression. Our $p$-Laplacian model yielded errors of $\sim13\%$ for ALM, $\sim10\%$ for BMD, and $\sim20\%$ for BFP when the training data accounted for 10 percent of all data. Among the supervised algorithms we implemented, Support Vector Regression (SVR) performed the best for ALM and BMD, yielding errors of $\sim 8\%$ for both, while Least Squares SVR performed the best for BFP with $\sim 11\%$ error when trained on 80 percent of the data. Our findings position the $p$-Laplacian model as a promising tool for healthcare applications, particularly in a data-constrained environment.


Busting the Paper Ballot: Voting Meets Adversarial Machine Learning

arXiv.org Artificial Intelligence

We show the security risk associated with using machine learning classifiers in United States election tabulators. The central classification task in election tabulation is deciding whether a mark does or does not appear on a bubble associated to an alternative in a contest on the ballot. Barretto et al. (E-Vote-ID 2021) reported that convolutional neural networks are a viable option in this field, as they outperform simple feature-based classifiers. Our contributions to election security can be divided into four parts. To demonstrate and analyze the hypothetical vulnerability of machine learning models on election tabulators, we first introduce four new ballot datasets. Second, we train and test a variety of different models on our new datasets. These models include support vector machines, convolutional neural networks (a basic CNN, VGG and ResNet), and vision transformers (Twins and CaiT). Third, using our new datasets and trained models, we demonstrate that traditional white box attacks are ineffective in the voting domain due to gradient masking. Our analyses further reveal that gradient masking is a product of numerical instability. We use a modified difference of logits ratio loss to overcome this issue (Croce and Hein, ICML 2020). Fourth, in the physical world, we conduct attacks with the adversarial examples generated using our new methods. In traditional adversarial machine learning, a high (50% or greater) attack success rate is ideal. However, for certain elections, even a 5% attack success rate can flip the outcome of a race. We show such an impact is possible in the physical domain. We thoroughly discuss attack realism, and the challenges and practicality associated with printing and scanning ballot adversarial examples.


Node Splitting SVMs for Survival Trees Based on an L2-Regularized Dipole Splitting Criteria

arXiv.org Machine Learning

This paper proposes a novel, node-splitting support vector machine (SVM) for creating survival trees. This approach is capable of non-linearly partitioning survival data which includes continuous, right-censored outcomes. Our method improves on an existing non-parametric method, which uses at most oblique splits to induce survival regression trees. In the prior work, these oblique splits were created via a non-SVM approach, by minimizing a piece-wise linear objective, called a dipole splitting criterion, constructed from pairs of covariates and their associated survival information. We extend this method by enabling splits from a general class of non-linear surfaces. We achieve this by ridge regularizing the dipole-splitting criterion to enable application of kernel methods in a manner analogous to classical SVMs. The ridge regularization provides robustness and can be tuned. Using various kernels, we induce both linear and non-linear survival trees to compare their sizes and predictive powers on real and simulated data sets. We compare traditional univariate log-rank splits, oblique splits using the original dipole-splitting criterion and a variety of non-linear splits enabled by our method. In these tests, trees created by non-linear splits, using polynomial and Gaussian kernels show similar predictive power while often being of smaller sizes compared to trees created by univariate and oblique splits. This approach provides a novel and flexible array of survival trees that can be applied to diverse survival data sets.


Predicting Patient Survival with Airway Biomarkers using nn-Unet/Radiomics

arXiv.org Artificial Intelligence

The primary objective of the AIIB 2023 competition is to evaluate the predictive significance of airway-related imaging biomarkers in determining the survival outcomes of patients with lung fibrosis.This study introduces a comprehensive three-stage approach. Initially, a segmentation network, namely nn-Unet, is employed to delineate the airway's structural boundaries. Subsequently, key features are extracted from the radiomic images centered around the trachea and an enclosing bounding box around the airway. This step is motivated by the potential presence of critical survival-related insights within the tracheal region as well as pertinent information encoded in the structure and dimensions of the airway. Lastly, radiomic features obtained from the segmented areas are integrated into an SVM classifier. We could obtain an overall-score of 0.8601 for the segmentation in Task 1 while 0.7346 for the classification in Task 2.