AITopics | Decision Tree Learning

Collaborating Authors

Decision Tree Learning

Learning to Classify with Branching Tests: "A decision tree takes as input an object or situation described by a set of properties, and outputs a yes/no decision. Decision trees therefore represent Boolean functions. Functions with a larger range of outputs can also be represented...."
– Artificial Intelligence: A Modern Approach. By Stuart Russell & Peter Norvig. 2002. Section 18.3; page 531.

News Overviews Instructional Materials AI-Alerts Classics

Systematic Bias in Sample Inference and its Effect on Machine Learning

O'Neill, Owen, Costello, Fintan

arXiv.org Artificial IntelligenceJul-3-2023

A commonly observed pattern in machine learning models is an underprediction of the target feature, with the model's predicted target rate for members of a given category typically being lower than the actual target rate for members of that category in the training set. This underprediction is usually larger for members of minority groups; while income level is underpredicted for both men and women in the 'adult' dataset, for example, the degree of underprediction is significantly higher for women (a minority in that dataset). We propose that this pattern of underprediction for minorities arises as a predictable consequence of statistical inference on small samples. When presented with a new individual for classification, an ML model performs inference not on the entire training set, but on a subset that is in some way similar to the new individual, with sizes of these subsets typically following a power law distribution so that most are small (and with these subsets being necessarily smaller for the minority group). We show that such inference on small samples is subject to systematic and directional statistical bias, and that this bias produces the observed patterns of underprediction seen in ML models. Analysing a standard sklearn decision tree model's predictions on a set of over 70 subsets of the 'adult' and COMPAS datasets, we found that a bias prediction measure based on small-sample inference had a significant positive correlations (0.56 and 0.85) with the observed underprediction rate for these subsets.

artificial intelligence, decision tree learning, machine learning, (16 more...)

arXiv.org Artificial Intelligence

2307.01384

Country: North America > United States > New York > New York County > New York City (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.46)

Add feedback

Supervised Manifold Learning via Random Forest Geometry-Preserving Proximities

Rhodes, Jake S.

arXiv.org Artificial IntelligenceJul-3-2023

Manifold learning approaches seek the intrinsic, low-dimensional data structure within a high-dimensional space. Mainstream manifold learning algorithms, such as Isomap, UMAP, $t$-SNE, Diffusion Map, and Laplacian Eigenmaps do not use data labels and are thus considered unsupervised. Existing supervised extensions of these methods are limited to classification problems and fall short of uncovering meaningful embeddings due to their construction using order non-preserving, class-conditional distances. In this paper, we show the weaknesses of class-conditional manifold learning quantitatively and visually and propose an alternate choice of kernel for supervised dimensionality reduction using a data-geometry-preserving variant of random forest proximities as an initialization for manifold learning methods. We show that local structure preservation using these proximities is near universal across manifold learning approaches and global structure is properly maintained using diffusion-based algorithms.

artificial intelligence, decision tree learning, machine learning, (17 more...)

arXiv.org Artificial Intelligence

2307.01077

Country:

North America > United States > New York > New York County > New York City (0.05)
North America > United States > Utah > Utah County > Provo (0.04)

Genre: Research Report (0.40)

Industry:

Education (1.00)
Health & Medicine > Therapeutic Area (0.47)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (0.78)

Add feedback

JoinBoost: Grow Trees Over Normalized Data Using Only SQL

Huang, Zezhou, Sen, Rathijit, Liu, Jiaxiang, Wu, Eugene

arXiv.org Artificial IntelligenceJul-1-2023

Although dominant for tabular data, ML libraries that train tree models over normalized databases (e.g., LightGBM, XGBoost) require the data to be denormalized as a single table, materialized, and exported. This process is not scalable, slow, and poses security risks. In-DB ML aims to train models within DBMSes to avoid data movement and provide data governance. Rather than modify a DBMS to support In-DB ML, is it possible to offer competitive tree training performance to specialized ML libraries...with only SQL? We present JoinBoost, a Python library that rewrites tree training algorithms over normalized databases into pure SQL. It is portable to any DBMS, offers performance competitive with specialized ML libraries, and scales with the underlying DBMS capabilities. JoinBoost extends prior work from both algorithmic and systems perspectives. Algorithmically, we support factorized gradient boosting, by updating the $Y$ variable to the residual in the non-materialized join result. Although this view update problem is generally ambiguous, we identify addition-to-multiplication preserving, the key property of variance semi-ring to support rmse, the most widely used criterion. System-wise, we identify residual updates as a performance bottleneck. Such overhead can be natively minimized on columnar DBMSes by creating a new column of residual values and adding it as a projection. We validate this with two implementations on DuckDB, with no or minimal modifications to its internals for portability. Our experiment shows that JoinBoost is 3x (1.1x) faster for random forests (gradient boosting) compared to LightGBM, and over an order magnitude faster than state-of-the-art In-DB ML systems. Further, JoinBoost scales well beyond LightGBM in terms of the # features, DB size (TPC-DS SF=1000), and join graph complexity (galaxy schemas).

artificial intelligence, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2307.00422

Country:

Oceania > New Zealand > North Island > Waikato (0.04)
North America > United States > Oregon > Multnomah County > Portland (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report (0.50)

Industry: Information Technology > Security & Privacy (0.67)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval > Query Processing (0.68)

Add feedback

A Comparative Study of Machine Learning Algorithms for Anomaly Detection in Industrial Environments: Performance and Environmental Impact

Huertas-García, Álvaro, Martí-González, Carlos, Maezo, Rubén García, Rey, Alejandro Echeverría

arXiv.org Artificial IntelligenceJul-1-2023

In the context of Industry 4.0, the use of artificial intelligence (AI) and machine learning for anomaly detection is being hampered by high computational requirements and associated environmental effects. This study seeks to address the demands of high-performance machine learning models with environmental sustainability, contributing to the emerging discourse on 'Green AI.' An extensive variety of machine learning algorithms, coupled with various Multilayer Perceptron (MLP) configurations, were meticulously evaluated. Our investigation encapsulated a comprehensive suite of evaluation metrics, comprising Accuracy, Area Under the Curve (AUC), Recall, Precision, F1 Score, Kappa Statistic, Matthews Correlation Coefficient (MCC), and F1 Macro. Simultaneously, the environmental footprint of these models was gauged through considerations of time duration, CO2 equivalent, and energy consumption during the training, cross-validation, and inference phases. Traditional machine learning algorithms, such as Decision Trees and Random Forests, demonstrate robust efficiency and performance. However, superior outcomes were obtained with optimised MLP configurations, albeit with a commensurate increase in resource consumption. The study incorporated a multi-objective optimisation approach, invoking Pareto optimality principles, to highlight the trade-offs between a model's performance and its environmental impact. The insights derived underscore the imperative of striking a balance between model performance, complexity, and environmental implications, thus offering valuable directions for future work in the development of environmentally conscious machine learning models for industrial applications.

artificial intelligence, data mining, machine learning, (19 more...)

arXiv.org Artificial Intelligence

2307.00361

Country:

North America > United States (0.04)
Europe > Spain > Galicia > Madrid (0.04)

Genre: Research Report > New Finding (0.69)

Industry:

Energy (1.00)
Law > Environmental Law (0.63)

Technology:

Information Technology > Data Science > Data Mining > Anomaly Detection (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
(4 more...)

Add feedback

Medoid splits for efficient random forests in metric spaces

Bulté, Matthieu, Sørensen, Helle

arXiv.org Machine LearningJun-29-2023

This paper revisits an adaptation of the random forest algorithm for Fr\'echet regression, addressing the challenge of regression in the context of random objects in metric spaces. Recognizing the limitations of previous approaches, we introduce a new splitting rule that circumvents the computationally expensive operation of Fr\'echet means by substituting with a medoid-based approach. We validate this approach by demonstrating its asymptotic equivalence to Fr\'echet mean-based procedures and establish the consistency of the associated regression estimator. The paper provides a sound theoretical framework and a more efficient computational approach to Fr\'echet regression, broadening its application to non-standard data types and complex use cases.

artificial intelligence, decision tree learning, machine learning, (18 more...)

arXiv.org Machine Learning

2306.17031

Country:

North America > United States > New York > New York County > New York City (0.14)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
North America > United States > California (0.04)
Europe > Denmark > Capital Region > Copenhagen (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (0.89)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (0.75)
Information Technology > Artificial Intelligence > Machine Learning > Supervised Learning > Representation Of Examples (0.73)

Add feedback

A Meta-analytical Comparison of Naive Bayes and Random Forest for Software Defect Prediction

Awais, Ch Muhammad, Gu, Wei, Dlamini, Gcinizwe, Kholmatova, Zamira, Succi, Giancarlo

arXiv.org Artificial IntelligenceJun-27-2023

Is there a statistical difference between Naive Bayes and Random Forest in terms of recall, f-measure, and precision for predicting software defects? By utilizing systematic literature review and meta-analysis, we are answering this question. We conducted a systematic literature review by establishing criteria to search and choose papers, resulting in five studies. After that, using the meta-data and forest-plots of five chosen papers, we conducted a meta-analysis to compare the two models. The results have shown that there is no significant statistical evidence that Naive Bayes perform differently from Random Forest in terms of recall, f-measure, and precision.

artificial intelligence, machine learning, software defect prediction, (11 more...)

arXiv.org Artificial Intelligence

2306.15369

Country:

North America > United States > California (0.04)
Europe > Russia (0.04)
Europe > Italy > Emilia-Romagna > Metropolitan City of Bologna > Bologna (0.04)
Asia > Russia (0.04)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (0.84)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.83)

Add feedback

Exceedance Probability Forecasting via Regression for Significant Wave Height Prediction

Cerqueira, Vitor, Torgo, Luis

arXiv.org Artificial IntelligenceJun-26-2023

Significant wave height forecasting is a key problem in ocean data analytics. Predicting the significant wave height is crucial for estimating the energy production from waves. Moreover, the timely prediction of large waves is important to ensure the safety of maritime operations, e.g. passage of vessels. We frame the task of predicting extreme values of significant wave height as an exceedance probability forecasting problem. Accordingly, we aim at estimating the probability that the significant wave height will exceed a predefined threshold. This task is usually solved using a probabilistic binary classification model. Instead, we propose a novel approach based on a forecasting model. The method leverages the forecasts for the upcoming observations to estimate the exceedance probability according to the cumulative distribution function. We carried out experiments using data from a buoy placed in the coast of Halifax, Canada. The results suggest that the proposed methodology is better than state-of-the-art approaches for exceedance probability forecasting.

artificial intelligence, data mining, machine learning, (18 more...)

arXiv.org Artificial Intelligence

2206.09821

Country:

North America > Canada > Nova Scotia > Halifax Regional Municipality > Halifax (0.34)
Oceania > Australia (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
(3 more...)

Genre:

Research Report > New Finding (0.88)
Research Report > Promising Solution (0.54)
Overview > Innovation (0.54)

Industry:

Banking & Finance (1.00)
Energy > Renewable > Ocean Energy (0.68)

Technology:

Information Technology > Modeling & Simulation (1.00)
Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
(4 more...)

Add feedback

flexBART: Flexible Bayesian regression trees with categorical predictors

Deshpande, Sameer K.

arXiv.org Machine LearningJun-21-2023

Most implementations of Bayesian additive regression trees (BART) one-hot encode categorical predictors, replacing each one with several binary indicators, one for every level or category. Regression trees built with these indicators partition the discrete set of categorical levels by repeatedly removing one level at a time. Unfortunately, the vast majority of partitions cannot be built with this strategy, severely limiting BART's ability to partially pool data across groups of levels. Motivated by analyses of baseball data and neighborhood-level crime dynamics, we overcame this limitation by re-implementing BART with regression trees that can assign multiple levels to both branches of a decision tree node. To model spatial data aggregated into small regions, we further proposed a new decision rule prior that creates spatially contiguous regions by deleting a random edge from a random spanning tree of a suitably defined network. Our re-implementation, which is available in the flexBART package, often yields improved out-of-sample predictive performance and scales better to larger datasets than existing implementations of BART.

artificial intelligence, decision tree learning, machine learning, (19 more...)

arXiv.org Machine Learning

2211.04459

Country: North America > United States > Wisconsin > Dane County > Madison (0.04)

Genre: Research Report (1.00)

Industry: Leisure & Entertainment > Sports > Baseball (0.93)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.48)

Add feedback

Classification Tree Pruning Under Covariate Shift

Galbraith, Nicholas, Kpotufe, Samory

arXiv.org Artificial IntelligenceJun-21-2023

We consider the problem of \emph{pruning} a classification tree, that is, selecting a suitable subtree that balances bias and variance, in common situations with inhomogeneous training data. Namely, assuming access to mostly data from a distribution $P_{X, Y}$, but little data from a desired distribution $Q_{X, Y}$ with different $X$-marginals, we present the first efficient procedure for optimal pruning in such situations, when cross-validation and other penalized variants are grossly inadequate. Optimality is derived with respect to a notion of \emph{average discrepancy} $P_{X} \to Q_{X}$ (averaged over $X$ space) which significantly relaxes a recent notion -- termed \emph{transfer-exponent} -- shown to tightly capture the limits of classification under such a distribution shift. Our relaxed notion can be viewed as a measure of \emph{relative dimension} between distributions, as it relates to existing notions of information such as the Minkowski and Renyi dimensions.

artificial intelligence, exponent, machine learning, (20 more...)

arXiv.org Artificial Intelligence

2305.04335

Country:

North America > United States > New York (0.04)
North America > United States > Illinois > Cook County > Chicago (0.04)
North America > Canada > Quebec > Montreal (0.04)
(3 more...)

Genre: Research Report > New Finding (0.46)

Industry: Energy (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (0.93)

Add feedback

Accelerating Generalized Random Forests with Fixed-Point Trees

Fleischer, David, Stephens, David A., Yang, Archer

arXiv.org Artificial IntelligenceJun-20-2023

Generalized random forests arXiv:1610.01271 build upon the well-established success of conventional forests (Breiman, 2001) to offer a flexible and powerful non-parametric method for estimating local solutions of heterogeneous estimating equations. Estimators are constructed by leveraging random forests as an adaptive kernel weighting algorithm and implemented through a gradient-based tree-growing procedure. By expressing this gradient-based approximation as being induced from a single Newton-Raphson root-finding iteration, and drawing upon the connection between estimating equations and fixed-point problems arXiv:2110.11074, we propose a new tree-growing rule for generalized random forests induced from a fixed-point iteration type of approximation, enabling gradient-free optimization, and yielding substantial time savings for tasks involving even modest dimensionality of the target quantity (e.g. multiple/multi-level treatment effects). We develop an asymptotic theory for estimators obtained from forests whose trees are grown through the fixed-point splitting rule, and provide numerical simulations demonstrating that the estimators obtained from such forests are comparable to those obtained from the more costly gradient-based rule.

artificial intelligence, machine learning, tibshirani, (19 more...)

arXiv.org Artificial Intelligence

2306.11908

Country: North America > Canada > Quebec > Montreal (0.14)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (1.00)

Add feedback