Goto

Collaborating Authors

 Decision Tree Learning


Superfast Selection for Decision Tree Algorithms

arXiv.org Artificial Intelligence

We present a novel and systematic method, called Superfast Selection, for selecting the "optimal split" for decision tree and feature selection algorithms over tabular data. The method speeds up split selection on a single feature by lowering the time complexity, from O(MN) (using the standard selection methods) to O(M), where M represents the number of input examples and N the number of unique values. Additionally, the need for pre-encoding, such as one-hot or integer encoding, for feature value heterogeneity is eliminated. To demonstrate the efficiency of Superfast Selection, we empower the CART algorithm by integrating Superfast Selection into it, creating what we call Ultrafast Decision Tree (UDT). This enhancement enables UDT to complete the training process with a time complexity O(KM$^2$) (K is the number of features). Additionally, the Training Only Once Tuning enables UDT to avoid the repetitive training process required to find the optimal hyper-parameter. Experiments show that the UDT can finish a single training on KDD99-10% dataset (494K examples with 41 features) within 1 second and tuning with 214.8 sets of hyper-parameters within 0.25 second on a laptop.


TabMDA: Tabular Manifold Data Augmentation for Any Classifier using Transformers with In-context Subsetting

arXiv.org Artificial Intelligence

Tabular data is prevalent in many critical domains, yet it is often challenging to acquire in large quantities. This scarcity usually results in poor performance of machine learning models on such data. Data augmentation, a common strategy for performance improvement in vision and language tasks, typically underperforms for tabular data due to the lack of explicit symmetries in the input space. To overcome this challenge, we introduce TabMDA, a novel method for manifold data augmentation on tabular data. This method utilises a pre-trained in-context model, such as TabPFN, to map the data into a manifold space. TabMDA performs label-invariant transformations by encoding the data multiple times with varied contexts. This process explores the manifold of the underlying in-context models, thereby enlarging the training dataset. TabMDA is a training-free method, making it applicable to any classifier. We evaluate TabMDA on five standard classifiers and observe significant performance improvements across various tabular datasets. Our results demonstrate that TabMDA provides an effective way to leverage information from pre-trained in-context models to enhance the performance of downstream classifiers.


Learning Decision Trees and Forests with Algorithmic Recourse

arXiv.org Machine Learning

This paper proposes a new algorithm for learning accurate tree-based models while ensuring the existence of recourse actions. Algorithmic Recourse (AR) aims to provide a recourse action for altering the undesired prediction result given by a model. Typical AR methods provide a reasonable action by solving an optimization task of minimizing the required effort among executable actions. In practice, however, such actions do not always exist for models optimized only for predictive performance. To alleviate this issue, we formulate the task of learning an accurate classification tree under the constraint of ensuring the existence of reasonable actions for as many instances as possible. Then, we propose an efficient top-down greedy algorithm by leveraging the adversarial training techniques. We also show that our proposed algorithm can be applied to the random forest, which is known as a popular framework for learning tree ensembles. Experimental results demonstrated that our method successfully provided reasonable actions to more instances than the baselines without significantly degrading accuracy and computational efficiency.


A Random Forest-based Prediction Model for Turning Points in Antagonistic Event-Group Competitions

arXiv.org Artificial Intelligence

At present, most of the prediction studies related to antagonistic event-group competitions focus on the prediction of competition results, and less on the prediction of the competition process, which can not provide real-time feedback of the athletes' state information in the actual competition, and thus can not analyze the changes of the competition situation. In order to solve this problem, this paper proposes a prediction model based on Random Forest for the turning point of the antagonistic event-group. Firstly, the quantitative equation of competitive potential energy is proposed; Secondly, the quantitative value of competitive potential energy is obtained by using the dynamic combination of weights method, and the turning point of the competition situation of the antagonistic event-group is marked according to the quantitative time series graph; Finally, the random forest prediction model based on the optimisation of the KM-SMOTE algorithm and the grid search method is established. The experimental analysis shows that: The quantitative equation of competitive potential energy can effectively reflect the dynamic situation of the competition; The model can effectively predict the turning point of the competition situation of the antagonistic event-group, and the recall rate of the model in the test set is 86.13%; The model has certain significance for the future study of the competition situation of the antagonistic event-group.


Federated Random Forest for Partially Overlapping Clinical Data

arXiv.org Artificial Intelligence

In the healthcare sector, a consciousness surrounding data privacy and corresponding data protection regulations, as well as heterogeneous and non-harmonized data, pose huge challenges to large-scale data analysis. Moreover, clinical data often involves partially overlapping features, as some observations may be missing due to various reasons, such as differences in procedures, diagnostic tests, or other recorded patient history information across hospitals or institutes. To address the challenges posed by partially overlapping features and incomplete data in clinical datasets, a comprehensive approach is required. Particularly in the domain of medical data, promising outcomes are achieved by federated random forests whenever features align. However, for most standard algorithms, like random forest, it is essential that all data sets have identical parameters. Therefore, in this work the concept of federated random forest is adapted to a setting with partially overlapping features. Moreover, our research assesses the effectiveness of the newly developed federated random forest models for partially overlapping clinical data. For aggregating the federated, globally optimized model, only features available locally at each site can be used. We tackled two issues in federation: (i) the quantity of involved parties, (ii) the varying overlap of features. This evaluation was conducted across three clinical datasets. The federated random forest model even in cases where only a subset of features overlaps consistently demonstrates superior performance compared to its local counterpart. This holds true across various scenarios, including datasets with imbalanced classes. Consequently, federated random forests for partially overlapped data offer a promising solution to transcend barriers in collaborative research and corporate cooperation.


Counterfactual Metarules for Local and Global Recourse

arXiv.org Artificial Intelligence

We introduce T-CREx, a novel model-agnostic method for local and global counterfactual explanation (CE), which summarises recourse options for both individuals and groups in the form of human-readable rules. It leverages tree-based surrogate models to learn the counterfactual rules, alongside 'metarules' denoting their regions of optimality, providing both a global analysis of model behaviour and diverse recourse options for users. Experiments indicate that T-CREx achieves superior aggregate performance over existing rule-based baselines on a range of CE desiderata, while being orders of magnitude faster to run.


Unlocking Futures: A Natural Language Driven Career Prediction System for Computer Science and Software Engineering Students

arXiv.org Artificial Intelligence

A career is a crucial aspect for any person to fulfill their desires through hard work. During their studies, students cannot find the best career suggestions unless they receive meaningful guidance tailored to their skills. Therefore, we developed an AI-assisted model for early prediction to provide better career suggestions. Although the task is difficult, proper guidance can make it easier. Effective career guidance requires understanding a student's academic skills, interests, and skill-related activities. In this research, we collected essential information from Computer Science (CS) and Software Engineering (SWE) students to train a machine learning (ML) model that predicts career paths based on students' career-related information. To adequately train the models, we applied Natural Language Processing (NLP) techniques and completed dataset pre-processing. For comparative analysis, we utilized multiple classification ML algorithms and deep learning (DL) algorithms. This study contributes valuable insights to educational advising by providing specific career suggestions based on the unique features of CS and SWE students. Additionally, the research helps individual CS and SWE students find suitable jobs that match their skills, interests, and skill-related activities.


Exploring Loss Design Techniques For Decision Tree Robustness To Label Noise

arXiv.org Machine Learning

In the real world, data is often noisy, affecting not only the quality of features but also the accuracy of labels. Current research on mitigating label errors stems primarily from advances in deep learning, and a gap exists in exploring interpretable models, particularly those rooted in decision trees. In this study, we investigate whether ideas from deep learning loss design can be applied to improve the robustness of decision trees. In particular, we show that loss correction and symmetric losses, both standard approaches, are not effective. We argue that other directions need to be explored to improve the robustness of decision trees to label noise.


Machine learning in business process management: A systematic literature review

arXiv.org Artificial Intelligence

Machine learning (ML) provides algorithms to create computer programs based on data without explicitly programming them. In business process management (BPM), ML applications are used to analyse and improve processes efficiently. Three frequent examples of using ML are providing decision support through predictions, discovering accurate process models, and improving resource allocation. This paper organises the body of knowledge on ML in BPM. We extract BPM tasks from different literature streams, summarise them under the phases of a process`s lifecycle, explain how ML helps perform these tasks and identify technical commonalities in ML implementations across tasks. This study is the first exhaustive review of how ML has been used in BPM. We hope that it can open the door for a new era of cumulative research by helping researchers to identify relevant preliminary work and then combine and further develop existing approaches in a focused fashion. Our paper helps managers and consultants to find ML applications that are relevant in the current project phase of a BPM initiative, like redesigning a business process. We also offer - as a synthesis of our review - a research agenda that spreads ten avenues for future research, including applying novel ML concepts like federated learning, addressing less regarded BPM lifecycle phases like process identification, and delivering ML applications with a focus on end-users.


Exploring Nutritional Impact on Alzheimer's Mortality: An Explainable AI Approach

arXiv.org Artificial Intelligence

This article uses machine learning (ML) and explainable artificial intelligence (XAI) techniques to investigate the relationship between nutritional status and mortality rates associated with Alzheimers disease (AD). The Third National Health and Nutrition Examination Survey (NHANES III) database is employed for analysis. The random forest model is selected as the base model for XAI analysis, and the Shapley Additive Explanations (SHAP) method is used to assess feature importance. The results highlight significant nutritional factors such as serum vitamin B12 and glycated hemoglobin. The study demonstrates the effectiveness of random forests in predicting AD mortality compared to other diseases. This research provides insights into the impact of nutrition on AD and contributes to a deeper understanding of disease progression.