Decision Tree Learning
Biomimetic Machine Learning approach for prediction of mechanical properties of Additive Friction Stir Deposited Aluminum alloys based walled structures
This study presents a novel approach to predicting mechanical properties of Additive Friction Stir Deposited (AFSD) aluminum alloy walled structures using biomimetic machine learning. The research combines numerical modeling of the AFSD process with genetic algorithm-optimized machine learning models to predict von Mises stress and logarithmic strain. Finite element analysis was employed to simulate the AFSD process for five aluminum alloys: AA2024, AA5083, AA5086, AA7075, and AA6061, capturing complex thermal and mechanical interactions. A dataset of 200 samples was generated from these simulations. Subsequently, Decision Tree (DT) and Random Forest (RF) regression models, optimized using genetic algorithms, were developed to predict key mechanical properties. The GA-RF model demonstrated superior performance in predicting both von Mises stress (R² = 0.9676) and logarithmic strain (R² = 0.7201). This innovative approach provides a powerful tool for understanding and optimizing the AFSD process across multiple aluminum alloys, offering insights into material behavior under various process parameters.
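The R-squared values reported above are coefficients of determination. As a minimal sketch of how such a score is computed (the function name and the toy numbers are illustrative assumptions, not data from the study):

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    # Residual sum of squares: error left after the model's predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Toy values for illustration only (not from the paper).
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.2, 3.8]
score = r_squared(y_true, y_pred)
```

A score of 1 means the predictions match the targets exactly, while a score near 0 means the model does no better than predicting the mean.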
Tracking Emotional Dynamics in Chat Conversations: A Hybrid Approach using DistilBERT and Emoji Sentiment Analysis
Igali, Ayan, Abdrakhman, Abdulkhak, Torekhan, Yerdaut, Shamoi, Pakizar
Computer-mediated communication has become more important than face-to-face communication in many contexts. Tracking emotional dynamics in chat conversations can enhance communication, improve services, and support well-being in various contexts. This paper explores a hybrid approach to tracking emotional dynamics in chat conversations by combining DistilBERT-based text emotion detection and emoji sentiment analysis. A Twitter dataset was analyzed using various machine learning algorithms, including SVM, Random Forest, and AdaBoost. We contrasted their performance with DistilBERT. Results reveal DistilBERT's superior performance in emotion recognition. Our approach accounts for emotive expressions conveyed through emojis to better understand participants' emotions during chats. We demonstrate how this approach can effectively capture and analyze emotional shifts in real-time conversations. Our findings show that integrating text and emoji analysis is an effective way of tracking chat emotion, with possible applications in customer service, work chats, and social media interactions.
Distilling interpretable causal trees from causal forests
Machine learning methods for estimating treatment effect heterogeneity promise greater flexibility than existing methods that test a few pre-specified hypotheses. However, one problem these methods can have is that it can be challenging to extract insights from complicated machine learning models. A high-dimensional distribution of conditional average treatment effects may give accurate, individual-level estimates, but it can be hard to understand the underlying patterns; hard to know what the implications of the analysis are. This paper proposes the Distilled Causal Tree, a method for distilling a single, interpretable causal tree from a causal forest. This compares well to existing methods of extracting a single tree, particularly in noisy data or high-dimensional data where there are many correlated features. Here it even outperforms the base causal forest in most simulations. Its estimates are doubly robust and asymptotically normal just as those of the causal forest are.
Optimal Mixed Integer Linear Optimization Trained Multivariate Classification Trees
Alston, Brandon, Hicks, Illya V.
Multivariate decision trees are powerful machine learning tools for classification and regression that attract many researchers and industry professionals. An optimal binary tree has two types of vertices, (i) branching vertices which have exactly two children and where datapoints are assessed on a set of discrete features and (ii) leaf vertices at which datapoints are given a prediction, and can be obtained by solving a biobjective optimization problem that seeks to (i) maximize the number of correctly classified datapoints and (ii) minimize the number of branching vertices. Branching vertices are linear combinations of training features and therefore can be thought of as hyperplanes. In this paper, we propose two cut-based mixed integer linear optimization (MILO) formulations for designing optimal binary classification trees (leaf vertices assign discrete classes). Our models leverage on-the-fly identification of minimal infeasible subsystems (MISs) from which we derive cutting planes that hold the form of packing constraints. We show theoretical improvements on the strongest flow-based MILO formulation currently in the literature and conduct experiments on publicly available datasets to show our models' ability to scale, strength against traditional branch and bound approaches, and robustness in out-of-sample test performance. Our code and data are available on GitHub.
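One common way to read the biobjective problem described above is as a single weighted objective. The following is a sketch under assumed notation, not the authors' formulation: $z_i \in \{0,1\}$ indicates that datapoint $i \in I$ is correctly classified, $b_v \in \{0,1\}$ indicates that vertex $v \in V$ is a branching vertex, and $\lambda \ge 0$ is a hypothetical trade-off weight.

```latex
\max_{z,\, b} \quad \sum_{i \in I} z_i \;-\; \lambda \sum_{v \in V} b_v
```

Larger values of $\lambda$ favor smaller trees at the cost of training accuracy.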
Data-Driven Machine Learning Approaches for Predicting In-Hospital Sepsis Mortality
Shumilov, Arseniy, Zhu, Yueting, Ashrafi, Negin, Lian, Gaojie, Ren, Shilong, Pishgar, Maryam
Background: Sepsis is a severe condition responsible for many deaths worldwide. Accurate prediction of sepsis outcomes is crucial for timely and effective treatment. Although previous studies have used ML to forecast outcomes, they faced limitations in feature selection and model comprehensibility, resulting in less effective predictions. Thus, this research aims to develop an interpretable and accurate ML model to help clinical professionals predict in-hospital mortality. Methods: We analyzed ICU patient records from the MIMIC-III database based on specific criteria and extracted relevant data. Our feature selection process included a literature review, clinical input refinement, and using Random Forest to select the top 35 features. We performed data preprocessing, including cleaning, imputation, standardization, and applied SMOTE for oversampling to address imbalance, resulting in 4,683 patients, with admission counts of 17,429. We compared the performance of Random Forest, Gradient Boosting, Logistic Regression, SVM, and KNN models. Results: The Random Forest model was the most effective in predicting sepsis-related in-hospital mortality. It outperformed other models, achieving an accuracy of 0.90 and an AUROC of 0.97, significantly better than the existing literature. Our meticulous feature selection contributed to the model's precision and identified critical determinants of sepsis mortality. These results underscore the pivotal role of data-driven ML in healthcare, especially for predicting in-hospital mortality due to sepsis. Conclusion: This study represents a significant advancement in predicting in-hospital sepsis mortality, highlighting the potential of ML in healthcare. The implications are profound, offering a data-driven approach that enhances decision-making in patient care and reduces in-hospital mortality.
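SMOTE, mentioned in the Methods above, creates synthetic minority-class samples by interpolating between a minority sample and one of its nearest minority neighbours. The sketch below illustrates that idea in plain Python. The function name, the toy points, and the choice of k are assumptions for illustration, and this is not the study's preprocessing code.

```python
import math
import random

def smote_sample(minority, k=2, rng=None):
    """Return one synthetic point lying between a random minority sample
    and one of its k nearest minority neighbours."""
    if rng is None:
        rng = random.Random(0)
    base = rng.choice(minority)
    # Rank the other minority points by Euclidean distance to the base point.
    others = [p for p in minority if p is not base]
    others.sort(key=lambda p: math.dist(base, p))
    neighbour = rng.choice(others[:k])
    gap = rng.random()  # interpolation factor in [0, 1)
    return [b + gap * (n - b) for b, n in zip(base, neighbour)]

# Toy minority-class points (hypothetical, two features each).
minority_points = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
synthetic = smote_sample(minority_points, k=2, rng=random.Random(42))
```

Because each synthetic point lies on a segment between two real minority samples, it stays inside the region the minority class already occupies.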
Open Set Recognition for Random Forest
Feng, Guanchao, Desai, Dhruv, Pasquali, Stefano, Mehta, Dhagash
In many real-world classification or recognition tasks, it is often difficult to collect training examples that exhaust all possible classes due to, for example, incomplete knowledge during training or ever changing regimes. Therefore, samples from unknown/novel classes may be encountered in testing/deployment. In such scenarios, the classifiers should be able to i) perform classification on known classes, and at the same time, ii) identify samples from unknown classes. This is known as open-set recognition. Although random forest has been an extremely successful framework as a general-purpose classification (and regression) method, in practice, it usually operates under the closed-set assumption and is not able to identify samples from new classes when run out of the box. In this work, we
In the open-set settings, classifiers are required to not only accurately classify new instances of known classes (whose samples are observed during training) but also effectively recognize the samples from unknown classes. In a nutshell, open-set classifiers are capable of making the "none of the above" decision with respect to known classes. This is known as open-set recognition (OSR) [38] and has received significant attention in recent years [11, 47]. Since many learning tasks in finance are naturally classification tasks, for instance, company classifications using Global Industry Classification Standard (GICS), fund categorization, risk profiling, economic scenario classifications, etc., where often a new company, fund or economic scenario may not belong to any of the existing categories, casting these recognition tasks as OSR instead of traditional closed-set classification tasks is more appropriate.
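A simple baseline for the "none of the above" decision described above is to reject an input when the forest's most-voted class holds too small a share of the tree votes. The sketch below shows only that generic thresholding idea. It is not the method proposed in the paper, and the function name, the 0.6 threshold, and the example votes are assumptions.

```python
from collections import Counter

def open_set_predict(tree_votes, threshold=0.6, unknown_label="unknown"):
    """Return the majority class, or unknown_label if its vote share
    falls below the threshold."""
    counts = Counter(tree_votes)
    label, top = counts.most_common(1)[0]
    return label if top / len(tree_votes) >= threshold else unknown_label

# Hypothetical per-tree votes for two inputs.
confident = open_set_predict(["bond", "bond", "bond", "equity", "bond"])
uncertain = open_set_predict(["bond", "equity", "cash", "equity", "bond"])
```

Here the first input is accepted as "bond" with a vote share of 0.8, while the second is rejected because no class reaches the threshold.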
A data balancing approach towards design of an expert system for Heart Disease Prediction
Karmakar, Rahul, Ghosh, Udita, Pal, Arpita, Dey, Sattwiki, Malik, Debraj, Sain, Priyabrata
Heart disease is a serious global health issue that claims millions of lives every year. Early detection and precise prediction are critical to the prevention and successful treatment of heart-related issues. Much research uses machine learning (ML) models to forecast cardiac disease and enable early detection. To perform predictive analysis on the "Heart disease health indicators" dataset, we employed five machine learning methods in this paper: Decision Tree (DT), Random Forest (RF), Linear Discriminant Analysis, Extra Tree Classifier, and AdaBoost. The model is further examined using various feature selection (FS) techniques. To enhance the baseline model, we separately applied four FS techniques: Sequential Forward FS, Sequential Backward FS, Correlation Matrix, and Chi2. Lastly, K-means SMOTE oversampling is applied to the models to enable additional analysis. The findings show that, for predicting heart disease, ensemble approaches, in particular random forests, performed better than individual classifiers. Smoking, blood pressure, cholesterol, and physical inactivity were among the major predictors found. The accuracy of the Random Forest and Decision Tree models was 99.83%. This paper demonstrates how machine learning models can improve the accuracy of heart disease prediction, especially when using ensemble methodologies. The models provide a more accurate risk assessment than traditional methods since they incorporate a large number of factors and complex algorithms.
An Interpretable Rule Creation Method for Black-Box Models based on Surrogate Trees -- SRules
Verdasco, Mario Parrón, García-Cuesta, Esteban
As artificial intelligence (AI) systems become increasingly integrated into critical decision-making processes, the need for transparent and interpretable models has become paramount. In this article we present a new ruleset creation method based on surrogate decision trees (SRules), designed to improve the interpretability of black-box machine learning models. SRules balances the accuracy, coverage, and interpretability of machine learning models by recursively creating surrogate interpretable decision tree models that approximate the decision boundaries of a complex model. We propose a systematic framework for generating concise and meaningful rules from these surrogate models, allowing stakeholders to understand and trust the AI system's decision-making process. Our approach not only provides interpretable rules, but also quantifies the confidence and coverage of these rules. The proposed method lets users adjust its parameters to counteract the lack of interpretability through precision and coverage, allowing a near-perfect fit and high interpretability for some parts of the model. The results show that SRules improves on other state-of-the-art techniques and introduces the possibility of creating highly interpretable specific rules for specific sub-parts of the model.
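The core idea of a surrogate tree can be illustrated with a depth-one tree (a stump) fitted to a black-box model's predictions rather than to the true labels. This is a minimal sketch, not the SRules algorithm: `black_box`, `fit_stump`, and the toy data are hypothetical names and values introduced for illustration.

```python
def black_box(x):
    """Stand-in for an opaque model (hypothetical): predicts 1 when 3*x0 + x1 > 2."""
    return 1 if 3 * x[0] + x[1] > 2 else 0

def fit_stump(points, labels):
    """Find the single-feature threshold rule that best reproduces the labels."""
    best = None
    for feature in range(len(points[0])):
        for threshold in sorted({p[feature] for p in points}):
            preds = [1 if p[feature] > threshold else 0 for p in points]
            # Fidelity: share of points where the stump agrees with the labels.
            fidelity = sum(a == b for a, b in zip(preds, labels)) / len(labels)
            if best is None or fidelity > best[2]:
                best = (feature, threshold, fidelity)
    return best

# Grid of toy inputs; the surrogate is trained on the black box's outputs.
points = [(i / 4, j / 4) for i in range(5) for j in range(5)]
bb_labels = [black_box(p) for p in points]
feature, threshold, fidelity = fit_stump(points, bb_labels)
rule = f"IF x{feature} > {threshold} THEN 1 ELSE 0"
```

On this toy grid the stump recovers the rule "IF x0 > 0.5 THEN 1 ELSE 0" and agrees with the black box on 23 of 25 points, which shows the trade-off between a short rule and exact fidelity.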
A collaborative ensemble construction method for federated random forest
Lim, Penjan Antonio Eng, Park, Cheong Hee
Random forests are considered a cornerstone in machine learning for their robustness and versatility. Despite these strengths, their conventional centralized training is ill-suited for the modern landscape of data that is often distributed, sensitive, and subject to privacy concerns. Federated learning (FL) provides a compelling solution to this problem, enabling models to be trained across a group of clients while maintaining the privacy of each client's data. However, adapting tree-based methods like random forests to federated settings introduces significant challenges, particularly when it comes to non-identically distributed (non-IID) data across clients, which is a common scenario in real-world applications. This paper presents a federated random forest approach that employs a novel ensemble construction method aimed at improving performance under non-IID data. Instead of growing trees independently in each client, our approach ensures each decision tree in the ensemble is iteratively and collectively grown across clients. To preserve the privacy of the client's data, we confine the information stored in the leaf nodes to the majority class label identified from the samples of the client's local data that reach each node. This limited disclosure preserves the confidentiality of the underlying data distribution of clients, thereby enhancing the privacy of the federated learning process. Furthermore, our collaborative ensemble construction strategy allows the ensemble to better reflect the data's heterogeneity across different clients, enhancing its performance on non-IID data, as our experimental results confirm.
Utilising Explainable Techniques for Quality Prediction in a Complex Textiles Manufacturing Use Case
Forsberg, Briony, Williams, Dr Henry, MacDonald, Prof Bruce, Chen, Tracy, Hamzeh, Dr Reza, Hulse, Dr Kirstine
This paper develops an approach to classify instances of product failure in a complex textiles manufacturing dataset using explainable techniques. The dataset used in this study was obtained from a New Zealand manufacturer of woollen carpets and rugs. In investigating the trade-off between accuracy and explainability, three different tree-based classification algorithms were evaluated: a Decision Tree and two ensemble methods, Random Forest and XGBoost. Additionally, three feature selection methods were also evaluated: the SelectKBest method, using chi-squared as the scoring function, the Pearson Correlation Coefficient, and the Boruta algorithm. Not surprisingly, the ensemble methods typically produced better results than the Decision Tree model. The Random Forest model yielded the best results overall when combined with the Boruta feature selection technique. Finally, a tree ensemble explaining technique was used to extract rule lists to capture necessary and sufficient conditions for classification by a trained model that could be easily interpreted by a human. Notably, several features that were in the extracted rule lists were statistical features and calculated features that were added to the original dataset. This demonstrates the influence that bringing in additional information during the data preprocessing stages can have on the ultimate model performance.
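Chi-squared feature selection of the kind mentioned above can be sketched as follows. The score compares each feature's per-class totals with the totals expected if the feature were independent of the class, and the k highest-scoring features are kept. This is an illustrative plain-Python sketch, not the paper's code or a library implementation. The function names and the toy data are assumptions, and features are assumed non-negative with non-zero totals.

```python
def chi2_scores(X, y):
    """Chi-squared score per feature from observed vs. expected per-class totals."""
    classes = sorted(set(y))
    n = len(y)
    scores = []
    for j in range(len(X[0])):
        total = sum(row[j] for row in X)
        score = 0.0
        for c in classes:
            observed = sum(row[j] for row, label in zip(X, y) if label == c)
            # Expected total if the feature were independent of the class.
            expected = total * sum(1 for label in y if label == c) / n
            score += (observed - expected) ** 2 / expected
        scores.append(score)
    return scores

def select_k_best(X, y, k):
    """Return the sorted indices of the k features with the highest scores."""
    scores = chi2_scores(X, y)
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])

# Toy binary features and labels (hypothetical): feature 0 tracks the class.
X = [[1, 0, 1], [1, 1, 0], [1, 0, 1], [0, 1, 1], [0, 0, 0], [0, 1, 1]]
y = [1, 1, 1, 0, 0, 0]
selected = select_k_best(X, y, k=1)
```

In this toy data, feature 0 is present only in class 1 and so receives the highest score, while feature 2 is spread evenly across both classes and scores zero.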