Goto

Collaborating Authors

 Support Vector Machines


Application of machine learning models to predict the relationship between air pollution, ecosystem degradation, and health disparities and lung cancer in Vietnam

arXiv.org Artificial Intelligence

Lung cancer is one of the major causes of death worldwide, and Vietnam is not an exception. This disease is the second most common type of cancer globally and the second most common cause of death in Vietnam, just after liver cancer, with 23,797 fatal cases and 26,262 new cases, or 14.4% of the disease in 2020. Recently, with rising disease rates in Vietnam causing a huge public health burden, lung cancer continues to hold the top position in attention and care. Especially together with climate change, under a variety of types of pollution, deforestation, and modern lifestyles, lung cancer risks are on red alert, particularly in Vietnam. To understand more about the severe disease sources in Vietnam from a diversity of key factors, including environmental features and the current health state, with a particular emphasis on Vietnam's distinct socioeconomic and ecological context, we utilize large datasets such as patient health records and environmental indicators containing necessary information, such as deforestation rate, green cover rate, air pollution, and lung cancer risks, that is collected from well-known governmental sharing websites. Then, we process and connect them and apply analytical methods (heatmap, information gain, p-value, spearman correlation) to determine causal correlations influencing lung cancer risks. Moreover, we deploy machine learning (ML) models (Decision Tree, Random Forest, Support Vector Machine, K-mean clustering) to discover cancer risk patterns. Our experimental results, leveraged by the aforementioned ML models to identify the disease patterns, are promising, particularly, the models as Random Forest, SVM, and PCA are working well on the datasets and give high accuracy (99%), however, the K means clustering has very low accuracy (10%) and does not fit the datasets.


Lexical Bundle Frequency as a Construct-Relevant Candidate Feature in Automated Scoring of L2 Academic Writing

arXiv.org Artificial Intelligence

Automated scoring (AS) systems are increasingly used for evaluating L2 writing, but require ongoing refinement for construct validity. While prior work suggested lexical bundles (LBs) - recurrent multi-word sequences satisfying certain frequency criteria - could inform assessment, their empirical integration into AS models needs further investigation. This study tested the impact of incorporating LB frequency features into an AS model for TOEFL independent writing tasks. Analyzing a sampled subcorpus (N=1,225 essays, 9 L1s) from the TOEFL11 corpus, scored by ETS-trained raters (Low, Medium, High), 3- to 9-word LBs were extracted, distinguishing prompt-specific from non-prompt types. A baseline Support Vector Machine (SVM) scoring model using established linguistic features (e.g., mechanics, cohesion, sophistication) was compared against an extended model including three aggregate LB frequency features (total prompt, total non-prompt, overall total). Results revealed significant, though generally small-effect, relationships between LB frequency (especially non-prompt bundles) and proficiency (p < .05). Mean frequencies suggested lower proficiency essays used more LBs overall. Critically, the LB-enhanced model improved agreement with human raters (Quadratic Cohen's Kappa +2.05%, overall Cohen's Kappa +5.63%), with notable gains for low (+10.1% exact agreement) and medium (+14.3% Cohen's Kappa) proficiency essays. These findings demonstrate that integrating aggregate LB frequency offers potential for developing more linguistically informed and accurate AS systems, particularly for differentiating developing L2 writers.


A Novel Cholesky Kernel based Support Vector Classifier

arXiv.org Machine Learning

Support Vector Machine (SVM) is a popular supervised classification model that works by first finding the margin boundaries for the training data classes and then calculating the decision boundary, which is then used to classify the test data. This study demonstrates limitations of traditional support vector classification which uses cartesian coordinate geometry to find the margin and decision boundaries in an input space using only a few support vectors, without considering data variance and correlation. Subsequently, the study proposes a new Cholesky Kernel that adjusts for the effects of variance-covariance structure of the data in the decision boundary equation and margin calculations. The study demonstrates that SVM model is valid only in the Euclidean space, and the Cholesky kernel obtained by decomposing covariance matrix acts as a transformation matrix, which when applied on the original data transforms the data from the input space to the Euclidean space. The effectiveness of the Cholesky kernel based SVM classifier is demonstrated by classifying the Wisconsin Breast Cancer (Diagnostic) Dataset and comparing with traditional SVM approaches. The Cholesky kernel based SVM model shows marked improvement in the precision, recall and F1 scores compared to linear and other kernel SVMs.


Sentiment Classification of Thai Central Bank Press Releases Using Supervised Learning

arXiv.org Artificial Intelligence

Central bank communication plays a critical role in shaping economic expectations and monetary policy effectiveness. This study applies supervised machine learning techniques to classify the sentiment of press releases from the Bank of Thailand, addressing gaps in research that primarily focus on lexicon-based approaches. My findings show that supervised learning can be an effective method, even with smaller datasets, and serves as a starting point for further automation. However, achieving higher accuracy and better generalization requires a substantial amount of labeled data, which is time-consuming and demands expertise. Using models such as Na\"ive Bayes, Random Forest and SVM, this study demonstrates the applicability of machine learning for central bank sentiment analysis, with English-language communications from the Thai Central Bank as a case study.


A Data Balancing and Ensemble Learning Approach for Credit Card Fraud Detection

arXiv.org Artificial Intelligence

This research introduces an innovative method for identifying credit card fraud by combining the SMOTE-KMEANS technique with an ensemble machine learning model. The proposed model was benchmarked against traditional models such as logistic regression, decision trees, random forests, and support vector machines. Performance was evaluated using metrics, including accuracy, recall, and area under the curve (AUC). The results demonstrated that the proposed model achieved superior performance, with an AUC of 0.96 when combined with the SMOTE-KMEANS algorithm. This indicates a significant improvement in detecting fraudulent transactions while maintaining high precision and recall. The study also explores the application of different oversampling techniques to enhance the performance of various classifiers. The findings suggest that the proposed method is robust and effective for classification tasks on balanced datasets. Future research directions include further optimization of the SMOTE-KMEANS approach and its integration into existing fraud detection systems to enhance financial security and consumer protection.


A Systematic Review of EEG-based Machine Intelligence Algorithms for Depression Diagnosis, and Monitoring

arXiv.org Artificial Intelligence

Depression disorder is a serious health condition that has affected the lives of millions of people around the world. Diagnosis of depression is a challenging practice that relies heavily on subjective studies and, in most cases, suffers from late findings. Electroencephalography (EEG) biomarkers have been suggested and investigated in recent years as a potential transformative objective practice. In this article, for the first time, a detailed systematic review of EEG-based depression diagnosis approaches is conducted using advanced machine learning techniques and statistical analyses. For this, 938 potentially relevant articles (since 1985) were initially detected and filtered into 139 relevant articles based on the review scheme 'preferred reporting items for systematic reviews and meta-analyses (PRISMA).' This article compares and discusses the selected articles and categorizes them according to the type of machine learning techniques and statistical analyses. Algorithms, preprocessing techniques, extracted features, and data acquisition systems are discussed and summarized. This review paper explains the existing challenges of the current algorithms and sheds light on the future direction of the field. This systematic review outlines the issues and challenges in machine intelligence for the diagnosis of EEG depression that can be addressed in future studies and possibly in future wearable technologies.


ML-Based Bidding Price Prediction for Pay-As-Bid Ancillary Services Markets: A Use Case in the German Control Reserve Market

arXiv.org Machine Learning

The increasing integration of renewable energy sources has led to greater volatility and unpredictability in electricity generation, posing challenges to grid stability. Ancillary service markets, such as the German control reserve market, allow industrial consumers and producers to offer flexibility in their power consumption or generation, contributing to grid stability while earning additional income. However, many participants use simple bidding strategies that may not maximize their revenues. This paper presents a methodology for forecasting bidding prices in pay-as-bid ancillary service markets, focusing on the German control reserve market. We evaluate various machine learning models, including Support Vector Regression, Decision Trees, and k-Nearest Neighbors, and compare their performance against benchmark models. To address the asymmetry in the revenue function of pay-as-bid markets, we introduce an offset adjustment technique that enhances the practical applicability of the forecasting models. Our analysis demonstrates that the proposed approach improves potential revenues by 27.43 % to 37.31 % compared to baseline models. When analyzing the relationship between the model forecasting errors and the revenue, a negative correlation is measured for three markets; according to the results, a reduction of 1 EUR/MW model price forecasting error (MAE) statistically leads to a yearly revenue increase between 483 EUR/MW and 3,631 EUR/MW. The proposed methodology enables industrial participants to optimize their bidding strategies, leading to increased earnings and contributing to the efficiency and stability of the electrical grid.


Probabilistic Quantum SVM Training on Ising Machine

arXiv.org Artificial Intelligence

Quantum computing holds significant potential to accelerate machine learning algorithms, especially in solving optimization problems like those encountered in Support Vector Machine (SVM) training. However, current QUBO-based Quantum SVM (QSVM) methods rely solely on binary optimal solutions, limiting their ability to identify fuzzy boundaries in data. Additionally, the limited qubit count in contemporary quantum devices constrains training on larger datasets. In this paper, we propose a probabilistic quantum SVM training framework suitable for Coherent Ising Machines (CIMs). By formulating the SVM training problem as a QUBO model, we leverage CIMs' energy minimization capabilities and introduce a Boltzmann distribution-based probabilistic approach to better approximate optimal SVM solutions, enhancing robustness. To address qubit limitations, we employ batch processing and multi-batch ensemble strategies, enabling small-scale quantum devices to train SVMs on larger datasets and support multi-class classification tasks via a one-vs-one approach. Our method is validated through simulations and real-machine experiments on binary and multi-class datasets. On the banknote binary classification dataset, our CIM-based QSVM, utilizing an energy-based probabilistic approach, achieved up to 20% higher accuracy compared to the original QSVM, while training up to $10^4$ times faster than simulated annealing methods. Compared with classical SVM, our approach either matched or reduced training time. On the IRIS three-class dataset, our improved QSVM outperformed existing QSVM models in all key metrics. As quantum technology advances, increased qubit counts are expected to further enhance QSVM performance relative to classical SVM.


Machine Learning Techniques for Multifactor Analysis of National Carbon Dioxide Emissions

arXiv.org Artificial Intelligence

This paper presents a comprehensive study leveraging Support Vector Machine (SVM) regression and Principal Component Regression (PCR) to analyze carbon dioxide emissions in a global dataset of 62 countries and their dependence on idiosyncratic, country-specific parameters. The objective is to understand the factors contributing to carbon dioxide emissions and identify the most predictive elements. The analysis provides country-specific emission estimates, highlighting diverse national trajectories and pinpointing areas for targeted interventions in climate change mitigation, sustainable development, and the growing carbon credit markets and green finance sector. The study aims to support policymaking with accurate representations of carbon dioxide emissions, offering nuanced information for formulating effective strategies to address climate change while informing initiatives related to carbon trading and environmentally sustainable investments.


Robust Support Vector Machines for Imbalanced and Noisy Data via Benders Decomposition

arXiv.org Artificial Intelligence

This study introduces a novel formulation to enhance Support Vector Machines (SVMs) in handling class imbalance and noise. Unlike the conventional Soft Margin SVM, which penalizes the magnitude of constraint violations, the proposed model quantifies the number of violations and aims to minimize their frequency. To achieve this, a binary variable is incorporated into the objective function of the primal SVM formulation, replacing the traditional slack variable. Furthermore, each misclassified sample is assigned a priority and an associated constraint. The resulting formulation is a mixed-integer programming model, efficiently solved using Benders decomposition. The proposed model's performance was benchmarked against existing models, including Soft Margin SVM, weighted SVM, and NuSVC. Two primary hypotheses were examined: 1) The proposed model improves the F1-score for the minority class in imbalanced classification tasks. 2) The proposed model enhances classification accuracy in noisy datasets. These hypotheses were evaluated using a Wilcoxon test across multiple publicly available datasets from the OpenML repository. The results supported both hypotheses (\( p < 0.05 \)). In addition, the proposed model exhibited several interesting properties, such as improved robustness to noise, a decision boundary shift favoring the minority class, a reduced number of support vectors, and decreased prediction time. The open-source Python implementation of the proposed SVM model is available.