AITopics | Decision Tree Learning

Collaborating Authors

Decision Tree Learning

Learning to Classify with Branching Tests: "A decision tree takes as input an object or situation described by a set of properties, and outputs a yes/no decision. Decision trees therefore represent Boolean functions. Functions with a larger range of outputs can also be represented...."
– Artificial Intelligence: A Modern Approach. By Stuart Russell & Peter Norvig. 2002. Section 18.3; page 531.

News Overviews Instructional Materials AI-Alerts Classics

Bridging Data Barriers among Participants: Assessing the Potential of Geoenergy through Federated Learning

Peng, Weike, Gao, Jiaxin, Chen, Yuntian, Wang, Shengwei

arXiv.org Artificial IntelligenceApr-29-2024

Machine learning algorithms emerge as a promising approach in energy fields, but its practical is hindered by data barriers, stemming from high collection costs and privacy concerns. This study introduces a novel federated learning (FL) framework based on XGBoost models, enabling safe collaborative modeling with accessible yet concealed data from multiple parties. Hyperparameter tuning of the models is achieved through Bayesian Optimization. To ascertain the merits of the proposed FL-XGBoost method, a comparative analysis is conducted between separate and centralized models to address a classical binary classification problem in geoenergy sector. The results reveal that the proposed FL framework strikes an optimal balance between privacy and accuracy. FL models demonstrate superior accuracy and generalization capabilities compared to separate models, particularly for participants with limited data or low correlation features and offers significant privacy benefits compared to centralized model. The aggregated optimization approach within the FL agreement proves effective in tuning hyperparameters. This study opens new avenues for assessing unconventional reservoirs through collaborative and privacy-preserving FL techniques.

artificial intelligence, machine learning, xgboost model, (18 more...)

arXiv.org Artificial Intelligence

2404.18527

Country:

North America > United States (1.00)
Asia > China (1.00)

Genre: Research Report > New Finding (0.46)

Industry:

Information Technology > Security & Privacy (1.00)
Energy > Oil & Gas > Upstream (1.00)
Materials > Chemicals (0.93)
(2 more...)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)

Add feedback

A model-free subdata selection method for classification

Singh, Rakhi

arXiv.org Machine LearningApr-29-2024

Subdata selection is a study of methods that select a small representative sample of the big data, the analysis of which is fast and statistically efficient. The existing subdata selection methods assume that the big data can be reasonably modeled using an underlying model, such as a (multinomial) logistic regression for classification problems. These methods work extremely well when the underlying modeling assumption is correct but often yield poor results otherwise. In this paper, we propose a model-free subdata selection method for classification problems, and the resulting subdata is called PED subdata. The PED subdata uses decision trees to find a partition of the data, followed by selecting an appropriate sample from each component of the partition. Random forests are used for analyzing the selected subdata. Our method can be employed for a general number of classes in the response and for both categorical and continuous predictors. We show analytically that the PED subdata results in a smaller Gini than a uniform subdata. Further, we demonstrate that the PED subdata has higher classification accuracy than other competing methods through extensive simulated and real datasets.

dataset, subdata, subdata selection method, (16 more...)

arXiv.org Machine Learning

2404.19127

Country:

North America > United States > New York > Broome County > Binghamton (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Asia > Middle East > Jordan (0.04)

Genre:

Research Report > Experimental Study (0.48)
Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (0.90)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.49)

Add feedback

DTization: A New Method for Supervised Feature Scaling

Islam, Niful

arXiv.org Artificial IntelligenceApr-27-2024

Artificial intelligence is currently a dominant force in shaping various aspects of the world. Machine learning is a sub-field in artificial intelligence. Feature scaling is one of the data pre-processing techniques that improves the performance of machine learning algorithms. The traditional feature scaling techniques are unsupervised where they do not have influence of the dependent variable in the scaling process. In this paper, we have presented a novel feature scaling technique named DTization that employs decision tree and robust scaler for supervised feature scaling. The proposed method utilizes decision tree to measure the feature importance and based on the importance, different features get scaled differently with the robust scaler algorithm. The proposed method has been extensively evaluated on ten classification and regression datasets on various evaluation matrices and the results show a noteworthy performance improvement compared to the traditional feature scaling methods.

artificial intelligence, decision tree learning, machine learning, (16 more...)

arXiv.org Artificial Intelligence

2404.17937

Genre: Research Report > New Finding (0.67)

Industry:

Health & Medicine > Therapeutic Area (1.00)
Information Technology (0.70)
Energy > Oil & Gas > Upstream (0.36)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (1.00)

Add feedback

Feature graphs for interpretable unsupervised tree ensembles: centrality, interaction, and application in disease subtyping

Sirocchi, Christel, Urschler, Martin, Pfeifer, Bastian

arXiv.org Artificial IntelligenceApr-27-2024

Interpretable machine learning has emerged as central in leveraging artificial intelligence within high-stakes domains such as healthcare, where understanding the rationale behind model predictions is as critical as achieving high predictive accuracy. In this context, feature selection assumes a pivotal role in enhancing model interpretability by identifying the most important input features in black-box models. While random forests are frequently used in biomedicine for their remarkable performance on tabular datasets, the accuracy gained from aggregating decision trees comes at the expense of interpretability. Consequently, feature selection for enhancing interpretability in random forests has been extensively explored in supervised settings. However, its investigation in the unsupervised regime remains notably limited. To address this gap, the study introduces novel methods to construct feature graphs from unsupervised random forests and feature selection strategies to derive effective feature combinations from these graphs. Feature graphs are constructed for the entire dataset as well as individual clusters leveraging the parent-child node splits within the trees, such that feature centrality captures their relevance to the clustering task, while edge weights reflect the discriminating power of feature pairs. Graph-based feature selection methods are extensively evaluated on synthetic and benchmark datasets both in terms of their ability to reduce dimensionality while improving clustering performance, as well as to enhance model interpretability. An application on omics data for disease subtyping identifies the top features for each cluster, showcasing the potential of the proposed approach to enhance interpretability in clustering analyses and its utility in a real-world biomedical application.

feature graph, graph, random forest, (14 more...)

arXiv.org Artificial Intelligence

2404.17886

Country:

Europe > Austria > Vienna (0.14)
Europe > Italy (0.04)
Europe > Austria > Styria > Graz (0.04)

Genre: Research Report (1.00)

Industry:

Health & Medicine > Therapeutic Area > Oncology (1.00)
Health & Medicine > Pharmaceuticals & Biotechnology (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (0.99)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Add feedback

SynthEval: A Framework for Detailed Utility and Privacy Evaluation of Tabular Synthetic Data

Lautrup, Anton Danholt, Hyrup, Tobias, Zimek, Arthur, Schneider-Kamp, Peter

arXiv.org Artificial IntelligenceApr-24-2024

With the growing demand for synthetic data to address contemporary issues in machine learning, such as data scarcity, data fairness, and data privacy, having robust tools for assessing the utility and potential privacy risks of such data becomes crucial. SynthEval, a novel open-source evaluation framework distinguishes itself from existing tools by treating categorical and numerical attributes with equal care, without assuming any special kind of preprocessing steps. This~makes it applicable to virtually any synthetic dataset of tabular records. Our tool leverages statistical and machine learning techniques to comprehensively evaluate synthetic data fidelity and privacy-preserving integrity. SynthEval integrates a wide selection of metrics that can be used independently or in highly customisable benchmark configurations, and can easily be extended with additional metrics. In this paper, we describe SynthEval and illustrate its versatility with examples. The framework facilitates better benchmarking and more consistent comparisons of model capabilities.

dataset, synthetic data, syntheval, (16 more...)

arXiv.org Artificial Intelligence

2404.15821

Country:

North America > United States > New York > New York County > New York City (0.04)
North America > United States > Illinois > Cook County > Chicago (0.04)
North America > United States > Colorado > El Paso County > Colorado Springs (0.04)
(5 more...)

Genre: Research Report > Experimental Study (1.00)

Industry:

Information Technology > Security & Privacy (1.00)
Health & Medicine > Therapeutic Area (1.00)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
(3 more...)

Add feedback

Na\"ive Bayes and Random Forest for Crop Yield Prediction

Maazallahi, Abbas, Thota, Sreehari, Kondaboina, Naga Prasad, Muktineni, Vineetha, Annem, Deepthi, Rokkam, Abhi Stephen, Amini, Mohammad Hossein, Salari, Mohammad Amir, Norouzzadeh, Payam, Snir, Eli, Rahmani, Bahareh

arXiv.org Artificial IntelligenceApr-23-2024

This study analyzes crop yield prediction in India from 1997 to 2020, focusing on various crops and key environmental factors. It aims to predict agricultural yields by utilizing advanced machine learning techniques like Linear Regression, Decision Tree, KNN, Na\"ive Bayes, K-Mean Clustering, and Random Forest. The models, particularly Na\"ive Bayes and Random Forest, demonstrate high effectiveness, as shown through data visualizations. The research concludes that integrating these analytical methods significantly enhances the accuracy and reliability of crop yield predictions, offering vital contributions to agricultural data science.

dataset, prediction, random forest, (16 more...)

arXiv.org Artificial Intelligence

2404.15392

Country:

North America > United States > Missouri > St. Louis County > St. Louis (0.04)
Asia > India > Tamil Nadu > Chennai (0.04)
Asia > India > Maharashtra (0.04)

Genre: Research Report > Experimental Study (0.46)

Industry: Food & Agriculture > Agriculture (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.99)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (0.93)
(2 more...)

Add feedback

PEACH: Pretrained-embedding Explanation Across Contextual and Hierarchical Structure

Cao, Feiqi, Han, Caren, Chung, Hyunsuk

arXiv.org Artificial IntelligenceApr-21-2024

In this work, we propose a novel tree-based explanation technique, PEACH (Pretrained-embedding Explanation Across Contextual and Hierarchical Structure), that can explain how text-based documents are classified by using any pretrained contextual embeddings in a tree-based human-interpretable manner. Note that PEACH can adopt any contextual embeddings of the PLMs as a training input for the decision tree. Using the proposed PEACH, we perform a comprehensive analysis of several contextual embeddings on nine different NLP text classification benchmarks. This analysis demonstrates the flexibility of the model by applying several PLM contextual embeddings, its attribute selections, scaling, and clustering methods. Furthermore, we show the utility of explanations by visualising the feature selection and important trend of text classification via human-interpretable word-cloud-based trees, which clearly identify model mistakes and assist in dataset debugging. Besides interpretability, PEACH outperforms or is similar to those from pretrained models.

dataset, decision tree, explanation decision tree, (14 more...)

arXiv.org Artificial Intelligence

2404.13645

Country:

North America > United States > Oregon > Multnomah County > Portland (0.04)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.04)
North America > United States > Michigan > Washtenaw County > Ann Arbor (0.04)
(6 more...)

Genre: Research Report (0.64)

Industry: Information Technology (0.48)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.87)
Information Technology > Artificial Intelligence > Natural Language > Explanation & Argumentation (0.87)
(4 more...)

Add feedback

Explainable Machine Learning System for Predicting Chronic Kidney Disease in High-Risk Cardiovascular Patients

Nguycharoen, Nantika

arXiv.org Artificial IntelligenceApr-17-2024

As the global population ages, the incidence of Chronic Kidney Disease (CKD) is rising. CKD often remains asymptomatic until advanced stages, which significantly burdens both the healthcare system and patient quality of life. This research developed an explainable machine learning system for predicting CKD in patients with cardiovascular risks, utilizing medical history and laboratory data. The Random Forest model achieved the highest sensitivity of 88.2%. The study introduces a comprehensive explainability framework that extends beyond traditional feature importance methods, incorporating global and local interpretations, bias inspection, biomedical relevance, and safety assessments. Key predictive features identified in global interpretation were the use of diabetic and ACEI/ARB medications, and initial eGFR values. Local interpretation provided model insights through counterfactual explanations, which aligned with other system parts. After conducting a bias inspection, it was found that the initial eGFR values and CKD predictions exhibited some bias, but no significant gender bias was identified. The model's logic, extracted by scoped rules, was confirmed to align with existing medical literature. The safety assessment tested potentially dangerous cases and confirmed that the model behaved safely. This system enhances the explainability, reliability, and accountability of the model, promoting its potential integration into healthcare settings and compliance with upcoming regulatory standards, and showing promise for broader applications in healthcare machine learning.

ckd, explainable machine learning system, prediction, (12 more...)

arXiv.org Artificial Intelligence

2404.11148

Country:

North America > United States (0.14)
Asia > Middle East > UAE (0.14)
Europe > United Kingdom > England > Merseyside > Liverpool (0.14)

Genre: Research Report > Experimental Study (0.46)

Industry:

Health & Medicine > Therapeutic Area > Nephrology (1.00)
Health & Medicine > Therapeutic Area > Cardiology/Vascular Diseases (1.00)
Health & Medicine > Therapeutic Area > Endocrinology > Diabetes (0.92)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (0.35)
Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (0.35)
Information Technology > Artificial Intelligence > Natural Language > Explanation & Argumentation (0.35)

Add feedback

Finding Decision Tree Splits in Streaming and Massively Parallel Models

Pham, Huy, Ta, Hoang, Vu, Hoa T.

arXiv.org Artificial IntelligenceApr-17-2024

In this work, we provide data stream algorithms that compute optimal splits in decision tree learning. In particular, given a data stream of observations $x_i$ and their labels $y_i$, the goal is to find the optimal split point $j$ that divides the data into two sets such that the mean squared error (for regression) or misclassification rate (for classification) is minimized. We provide various fast streaming algorithms that use sublinear space and a small number of passes for these problems. These algorithms can also be extended to the massively parallel computation model. Our work, while not directly comparable, complements the seminal work of Domingos and Hulten (KDD 2000).

algorithm, misclassification rate, optimal split, (16 more...)

arXiv.org Artificial Intelligence

2403.19867

Country:

North America > United States > California > San Diego County > San Diego (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
Asia > Vietnam > Hanoi > Hanoi (0.04)
Asia > Singapore > Central Region > Singapore (0.04)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (1.00)

Add feedback

Exploring Social Media Posts for Depression Identification: A Study on Reddit Dataset

Harshit, Nandigramam Sai, Sahu, Nilesh Kumar, Lone, Haroon R.

arXiv.org Artificial IntelligenceApr-16-2024

Depression is one of the most common mental disorders affecting an individual's personal and professional life. In this work, we investigated the possibility of utilizing social media posts to identify depression in individuals. To achieve this goal, we conducted a preliminary study where we extracted and analyzed the top Reddit posts made in 2022 from depression-related forums. The collected data were labeled as depressive and non-depressive using UMLS Metathesaurus. Further, the pre-processed data were fed to classical machine learning models, where we achieved an accuracy of 92.28\% in predicting the depressive and non-depressive posts.

depression, depression identification, exploring social media post, (11 more...)

arXiv.org Artificial Intelligence

2405.06656

Country:

Asia > India > Uttarakhand > Dehradun (0.07)
North America > United States > New York > New York County > New York City (0.05)
Asia > India > Madhya Pradesh > Bhopal (0.05)

Genre: Research Report (0.66)

Industry: Health & Medicine > Therapeutic Area > Psychiatry/Psychology > Mental Health (0.96)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Ontologies (0.38)
Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (0.31)

Add feedback