Goto

Collaborating Authors

 Decision Tree Learning


SARF: Enhancing Stock Market Prediction with Sentiment-Augmented Random Forest

arXiv.org Artificial Intelligence

Stock trend forecasting, a challenging problem in the financial domain, involves ex-tensive data and related indicators. Relying solely on empirical analysis often yields unsustainable and ineffective results. Machine learning researchers have demonstrated that the application of random forest algorithm can enhance predictions in this context, playing a crucial auxiliary role in forecasting stock trends. This study introduces a new approach to stock market prediction by integrating sentiment analysis using FinGPT generative AI model with the traditional Random Forest model. The proposed technique aims to optimize the accuracy of stock price forecasts by leveraging the nuanced understanding of financial sentiments provided by FinGPT. We present a new methodology called "Sentiment-Augmented Random Forest" (SARF), which in-corporates sentiment features into the Random Forest framework. Our experiments demonstrate that SARF outperforms conventional Random Forest and LSTM models with an average accuracy improvement of 9.23% and lower prediction errors in pre-dicting stock market movements.


PyGRF: An improved Python Geographical Random Forest model and case studies in public health and natural disasters

arXiv.org Artificial Intelligence

Geographical random forest (GRF) is a recently developed and spatially explicit machine learning model. With the ability to provide more accurate predictions and local interpretations, GRF has already been used in many studies. The current GRF model, however, has limitations in its determination of the local model weight and bandwidth hyperparameters, potentially insufficient numbers of local training samples, and sometimes high local prediction errors. Also, implemented as an R package, GRF currently does not have a Python version which limits its adoption among machine learning practitioners who prefer Python. This work addresses these limitations by introducing theory-informed hyperparameter determination, local training sample expansion, and spatially-weighted local prediction. We also develop a Python-based GRF model and package, PyGRF, to facilitate the use of the model. We evaluate the performance of PyGRF on an example dataset and further demonstrate its use in two case studies in public health and natural disasters.


Optimal or Greedy Decision Trees? Revisiting their Objectives, Tuning, and Performance

arXiv.org Artificial Intelligence

Decision trees are traditionally trained using greedy heuristics that locally optimize an impurity or information metric. Recently there has been a surge of interest in optimal decision tree (ODT) methods that globally optimize accuracy directly. We identify two relatively unexplored aspects of ODTs: the objective function used in training trees and tuning techniques. Additionally, the value of optimal methods is not well understood yet, as the literature provides conflicting results, with some demonstrating superior out-of-sample performance of ODTs over greedy approaches, while others show the exact opposite. In this paper, we address these three questions: what objective to optimize in ODTs; how to tune ODTs; and how do optimal and greedy methods compare? Our experimental evaluation examines 13 objective functions, including four novel objectives resulting from our analysis, seven tuning methods, and six claims from the literature on optimal and greedy methods on 165 real and synthetic data sets. Through our analysis, both conceptually and experimentally, we discover new non-concave objectives, highlight the importance of proper tuning, support and refute several claims from the literature, and provide clear recommendations for researchers and practitioners on the usage of greedy and optimal methods, and code for future comparisons.


Combustion Condition Identification using a Decision Tree based Machine Learning Algorithm Applied to a Model Can Combustor with High Shear Swirl Injector

arXiv.org Artificial Intelligence

Combustion is the primary process in gas turbine engines, where there is a need for efficient air-fuel mixing to enhance performance. High-shear swirl injectors are commonly used to improve fuel atomization and mixing, which are key factors in determining combustion efficiency and emissions. However, under certain conditions, combustors can experience thermoacoustic instability. In this study, a decision tree-based machine learning algorithm is used to classify combustion conditions by analyzing acoustic pressure and high-speed flame imaging from a counter-rotating high-shear swirl injector of a single can combustor fueled by methane. With a constant Reynolds number and varying equivalence ratios, the combustor exhibits both stable and unstable states. Characteristic features are extracted from the data using time series analysis, providing insight into combustion dynamics. The trained supervised machine learning model accurately classifies stable and unstable operations, demonstrating effective prediction of combustion conditions within the studied parameter range.


Performance of Cross-Validated Targeted Maximum Likelihood Estimation

arXiv.org Machine Learning

Background: Advanced methods for causal inference, such as targeted maximum likelihood estimation (TMLE), require certain conditions for statistical inference. However, in situations where there is not differentiability due to data sparsity or near-positivity violations, the Donsker class condition is violated. In such situations, TMLE variance can suffer from inflation of the type I error and poor coverage, leading to conservative confidence intervals. Cross-validation of the TMLE algorithm (CVTMLE) has been suggested to improve on performance compared to TMLE in settings of positivity or Donsker class violations. We aim to investigate the performance of CVTMLE compared to TMLE in various settings. Methods: We utilised the data-generating mechanism as described in Leger et al. (2022) to run a Monte Carlo experiment under different Donsker class violations. Then, we evaluated the respective statistical performances of TMLE and CVTMLE with different super learner libraries, with and without regression tree methods. Results: We found that CVTMLE vastly improves confidence interval coverage without adversely affecting bias, particularly in settings with small sample sizes and near-positivity violations. Furthermore, incorporating regression trees using standard TMLE with ensemble super learner-based initial estimates increases bias and variance leading to invalid statistical inference. Conclusions: It has been shown that when using CVTMLE the Donsker class condition is no longer necessary to obtain valid statistical inference when using regression trees and under either data sparsity or near-positivity violations. We show through simulations that CVTMLE is much less sensitive to the choice of the super learner library and thereby provides better estimation and inference in cases where the super learner library uses more flexible candidates and is prone to overfitting.


Incremental and Data-Efficient Concept Formation to Support Masked Word Prediction

arXiv.org Artificial Intelligence

This paper introduces Cobweb4L, a novel approach for efficient language model learning that supports masked word prediction. The approach builds on Cobweb, an incremental system that learns a hierarchy of probabilistic concepts. Each concept stores the frequencies of words that appear in instances tagged with that concept label. The system utilizes an attribute value representation to encode words and their surrounding context into instances. Cobweb4L uses the information theoretic variant of category utility and a new performance mechanism that leverages multiple concepts to generate predictions. We demonstrate that with these extensions it significantly outperforms prior Cobweb performance mechanisms that use only a single node to generate predictions. Further, we demonstrate that Cobweb4L learns rapidly and achieves performance comparable to and even superior to Word2Vec. Next, we show that Cobweb4L and Word2Vec outperform BERT in the same task with less training data. Finally, we discuss future work to make our conclusions more robust and inclusive.


OATH: Efficient and Flexible Zero-Knowledge Proofs of End-to-End ML Fairness

arXiv.org Artificial Intelligence

Though there is much interest in fair AI systems, the problem of fairness noncompliance -- which concerns whether fair models are used in practice -- has received lesser attention. Zero-Knowledge Proofs of Fairness (ZKPoF) address fairness noncompliance by allowing a service provider to verify to external parties that their model serves diverse demographics equitably, with guaranteed confidentiality over proprietary model parameters and data. They have great potential for building public trust and effective AI regulation, but no previous techniques for ZKPoF are fit for real-world deployment. We present OATH, the first ZKPoF framework that is (i) deployably efficient with client-facing communication comparable to in-the-clear ML as a Service query answering, and an offline audit phase that verifies an asymptotically constant quantity of answered queries, (ii) deployably flexible with modularity for any score-based classifier given a zero-knowledge proof of correct inference, (iii) deployably secure with an end-to-end security model that guarantees confidentiality and fairness across training, inference, and audits. We show that OATH obtains strong robustness against malicious adversaries at concretely efficient parameter settings. Notably, OATH provides a 1343x improvement to runtime over previous work for neural network ZKPoF, and scales up to much larger models -- even DNNs with tens of millions of parameters.


Mitigating Sex Bias in Audio Data-driven COPD and COVID-19 Breathing Pattern Detection Models

arXiv.org Artificial Intelligence

In the healthcare industry, researchers have been developing machine learning models to automate diagnosing patients with respiratory illnesses based on their breathing patterns. However, these models do not consider the demographic biases, particularly sex bias, that often occur when models are trained with a skewed patient dataset. Hence, it is essential in such an important industry to reduce this bias so that models can make fair diagnoses. In this work, we examine the bias in models used to detect breathing patterns of two major respiratory diseases, i.e., chronic obstructive pulmonary disease (COPD) and COVID-19. Using decision tree models trained with audio recordings of breathing patterns obtained from two open-source datasets consisting of 29 COPD and 680 COVID-19-positive patients, we analyze the effect of sex bias on the models. With a threshold optimizer and two constraints (demographic parity and equalized odds) to mitigate the bias, we witness 81.43% (demographic parity difference) and 71.81% (equalized odds difference) improvements. These findings are statistically significant.


Efficient Milling Quality Prediction with Explainable Machine Learning

arXiv.org Artificial Intelligence

This paper presents an explainable machine learning (ML) approach for predicting surface roughness in milling. Utilizing a dataset from milling aluminum alloy 2017A, the study employs random forest regression models and feature importance techniques. The key contributions include developing ML models that accurately predict various roughness values and identifying redundant sensors, particularly those for measuring normal cutting force. Our experiments show that removing certain sensors can reduce costs without sacrificing predictive accuracy, highlighting the potential of explainable machine learning to improve cost-effectiveness in machining.


Evaluating the Efficacy of Instance Incremental vs. Batch Learning in Delayed Label Environments: An Empirical Study on Tabular Data Streaming for Fraud Detection

arXiv.org Artificial Intelligence

Real-world tabular learning production scenarios typically involve evolving data streams, where data arrives continuously and its distribution may change over time. In such a setting, most studies in the literature regarding supervised learning favor the use of instance incremental algorithms due to their ability to adapt to changes in the data distribution. Another significant reason for choosing these algorithms is \textit{avoid storing observations in memory} as commonly done in batch incremental settings. However, the design of instance incremental algorithms often assumes immediate availability of labels, which is an optimistic assumption. In many real-world scenarios, such as fraud detection or credit scoring, labels may be delayed. Consequently, batch incremental algorithms are widely used in many real-world tasks. This raises an important question: "In delayed settings, is instance incremental learning the best option regarding predictive performance and computational efficiency?" Unfortunately, this question has not been studied in depth, probably due to the scarcity of real datasets containing delayed information. In this study, we conduct a comprehensive empirical evaluation and analysis of this question using a real-world fraud detection problem and commonly used generated datasets. Our findings indicate that instance incremental learning is not the superior option, considering on one side state-of-the-art models such as Adaptive Random Forest (ARF) and other side batch learning models such as XGBoost. Additionally, when considering the interpretability of the learning systems, batch incremental solutions tend to be favored. Code: \url{https://github.com/anselmeamekoe/DelayedLabelStream}