Goto

Collaborating Authors

 Decision Tree Learning


Talking Trees: Reasoning-Assisted Induction of Decision Trees for Tabular Data

arXiv.org Artificial Intelligence

Tabular foundation models are becoming increasingly popular for low-resource tabular problems. These models make up for small training datasets by pretraining on large volumes of synthetic data. The prior knowledge obtained via pretraining provides the exceptional performance, but the resulting model becomes a black box that is difficult to interpret and costly to inference. In this work, we explore an alternative strategy: using reasoning-capable LLMs to induce decision trees for small tabular datasets in agentic setup. We design a minimal set of tools for constructing, analyzing and manipulating decision trees. By using these tools, LLMs combine their prior knowledge with learning from data to create a lightweight decision tree that outperforms traditional CART on low-resource tabular problems. While a single decision tree does not outperform state-of-the-art black box models, it comes with a human-readable reasoning trace that can be checked for biases and data leaks. Furthermore, the reasoning-based LLM's creation process allows for additional human input: correcting biases or incorporating domain-specific intuition that is not captured in the data.


Actively Learning Halfspaces without Synthetic Data

arXiv.org Artificial Intelligence

In the classic point location problem, one is given an arbitrary dataset $X \subset \mathbb{R}^d$ of $n$ points with query access to an unknown halfspace $f : \mathbb{R}^d \to \{0,1\}$, and the goal is to learn the label of every point in $X$. This problem is extremely well-studied and a nearly-optimal $\widetilde{O}(d \log n)$ query algorithm is known due to Hopkins-Kane-Lovett-Mahajan (FOCS 2020). However, their algorithm is granted the power to query arbitrary points outside of $X$ (point synthesis), and in fact without this power there is an $ฮฉ(n)$ query lower bound due to Dasgupta (NeurIPS 2004). In this work our goal is to design efficient algorithms for learning halfspaces without point synthesis. To circumvent the $ฮฉ(n)$ lower bound, we consider learning halfspaces whose normal vectors come from a set of size $D$, and show tight bounds of $ฮ˜(D + \log n)$. As a corollary, we obtain an optimal $O(d + \log n)$ query deterministic learner for axis-aligned halfspaces, closing a previous gap of $O(d \log n)$ vs. $ฮฉ(d + \log n)$. In fact, our algorithm solves the more general problem of learning a Boolean function $f$ over $n$ elements which is monotone under at least one of $D$ provided orderings. Our technical insight is to exploit the structure in these orderings to perform a binary search in parallel rather than considering each ordering sequentially, and we believe our approach may be of broader interest. Furthermore, we use our exact learning algorithm to obtain nearly optimal algorithms for PAC-learning. We show that $O(\min(D + \log(1/\varepsilon), 1/\varepsilon) \cdot \log D)$ queries suffice to learn $f$ within error $\varepsilon$, even in a setting when $f$ can be adversarially corrupted on a $c\varepsilon$-fraction of points, for a sufficiently small constant $c$. This bound is optimal up to a $\log D$ factor, including in the realizable setting.


Hierarchical Bayesian Operator-induced Symbolic Regression Trees for Structural Learning of Scientific Expressions

arXiv.org Machine Learning

The advent of Scientific Machine Learning has heralded a transformative era in scientific discovery, driving progress across diverse domains. Central to this progress is uncovering scientific laws from experimental data through symbolic regression. However, existing approaches are dominated by heuristic algorithms or data-hungry black-box methods, which often demand low-noise settings and lack principled uncertainty quantification. Motivated by interpretable Statistical Artificial Intelligence, we develop a hierarchical Bayesian framework for symbolic regression that represents scientific laws as ensembles of tree-structured symbolic expressions endowed with a regularized tree prior. This coherent probabilistic formulation enables full posterior inference via an efficient Markov chain Monte Carlo algorithm, yielding a balance between predictive accuracy and structural parsimony. To guide symbolic model selection, we develop a marginal posterior-based criterion adhering to the Occam's window principle and further quantify structural fidelity to ground truth through a tailored expression-distance metric. On the theoretical front, we establish near-minimax rate of Bayesian posterior concentration, providing the first rigorous guarantee in context of symbolic regression. Empirical evaluation demonstrates robust performance of our proposed methodology against state-of-the-art competing modules on a simulated example, a suite of canonical Feynman equations, and single-atom catalysis dataset.


TABFAIRGDT: A Fast Fair Tabular Data Generator using Autoregressive Decision Trees

arXiv.org Artificial Intelligence

Ensuring fairness in machine learning remains a significant challenge, as models often inherit biases from their training data. Generative models have recently emerged as a promising approach to mitigate bias at the data level while preserving utility. However, many rely on deep architectures, despite evidence that simpler models can be highly effective for tabular data. In this work, we introduce TABFAIRGDT, a novel method for generating fair synthetic tabular data using autoregressive decision trees. To enforce fairness, we propose a soft leaf resampling technique that adjusts decision tree outputs to reduce bias while preserving predictive performance. Our approach is non-parametric, effectively capturing complex relationships between mixed feature types, without relying on assumptions about the underlying data distributions. We evaluate TABFAIRGDT on benchmark fairness datasets and demonstrate that it outperforms state-of-the-art (SOTA) deep generative models, achieving better fairness-utility trade-off for downstream tasks, as well as higher synthetic data quality. Moreover, our method is lightweight, highly efficient, and CPU-compatible, requiring no data pre-processing. Remarkably, TABFAIRGDT achieves a 72% average speedup over the fastest SOTA baseline across various dataset sizes, and can generate fair synthetic data for medium-sized datasets (10 features, 10K samples) in just one second on a standard CPU, making it an ideal solution for real-world fairness-sensitive applications.


Medical priority fusion: achieving dual optimization of sensitivity and interpretability in nipt anomaly detection

arXiv.org Artificial Intelligence

Clinical machine learning faces a critical dilemma in high-stakes medical applications: algorithms achieving optimal diagnostic performance typically sacrifice the interpretability essential for physician decision-making, while interpretable methods compromise sensitivity in complex scenarios. This paradox becomes particularly acute in non-invasive prenatal testing (NIPT), where missed chromosomal abnormalities carry profound clinical consequences yet regulatory frameworks mandate explainable AI systems. We introduce Medical Priority Fusion (MPF), a constrained multi-objective optimization framework that resolves this fundamental trade-off by systematically integrating Naive Bayes probabilistic reasoning with Decision Tree rule-based logic through mathematically-principled weighted fusion under explicit medical constraints. Rigorous validation on 1,687 real-world NIPT samples characterized by extreme class imbalance (43.4:1 normal-to-abnormal ratio) employed stratified 5-fold cross-validation with comprehensive ablation studies and statistical hypothesis testing using McNemar's paired comparisons. MPF achieved simultaneous optimization of dual objectives: 89.3% sensitivity (95% CI: 83.9-94.7%) with 80% interpretability score, significantly outperforming individual algorithms (McNemar's test, p < 0.001). The optimal fusion configuration achieved Grade A clinical deployment criteria with large effect size (d = 1.24), establishing the first clinically-deployable solution that maintains both diagnostic accuracy and decision transparency essential for prenatal care. This work demonstrates that medical-constrained algorithm fusion can resolve the interpretability-performance trade-off, providing a mathematical framework for developing high-stakes medical decision support systems that meet both clinical efficacy and explainability requirements.


Interpretable Network-assisted Random Forest+

arXiv.org Machine Learning

Machine learning algorithms often assume that training samples are independent. When data points are connected by a network, the induced dependency between samples is both a challenge, reducing effective sample size, and an opportunity to improve prediction by leveraging information from network neighbors. Multiple methods taking advantage of this opportunity are now available, but many, including graph neural networks, are not easily interpretable, limiting their usefulness for understanding how a model makes its predictions. Others, such as network-assisted linear regression, are interpretable but often yield substantially worse prediction performance. We bridge this gap by proposing a family of flexible network-assisted models built upon a generalization of random forests (RF+), which achieves highly-competitive prediction accuracy and can be interpreted through feature importance measures. In particular, we develop a suite of interpretation tools that enable practitioners to not only identify important features that drive model predictions, but also quantify the importance of the network contribution to prediction. Importantly, we provide both global and local importance measures as well as sample influence measures to assess the impact of a given observation. This suite of tools broadens the scope and applicability of network-assisted machine learning for high-impact problems where interpretability and transparency are essential.


Next-Depth Lookahead Tree

arXiv.org Machine Learning

This paper proposes the Next-Depth Lookahead Tree (NDLT), a single-tree model designed to improve performance by evaluating node splits not only at the node being optimized but also by evaluating the quality of the next depth level. Conventional decision trees (DTs) rely on greedy node-by-node partitioning, which often fails to ensure global optimality of the tree and is prone to local optima when early splits are suboptimal. To overcome this limitation, NDLT employs a next-depth lookahead strategy that jointly considers the immediate impurity reduction at the parent node and the expected impurity reduction at its child nodes. Empirical evaluation on diverse and complex datasets, including high-dimensional and imbalanced cases, demonstrates that NDLT achieves performance comparable to or better than classical DTs and ensemble models such as Random Forests, XGBoost, and Light-GBM. These results show that NDLT preserves the interpretability of a single tree while delivering robust predictive accuracy, making it an effective approach for real-world applications where both transparency and performance are required. These authors contributed equally to this work. Introduction Decision Tree (DT) represents one of the earliest and most established algorithms in the field of machine learning, valued for its simplicity, inter-pretability, and broad applicability [1].


Evaluating Supervised Learning Models for Fraud Detection: A Comparative Study of Classical and Deep Architectures on Imbalanced Transaction Data

arXiv.org Artificial Intelligence

Fraud detection remains a critical task in high-stakes domains such as finance and e-commerce, where undetected fraudulent transactions can lead to significant economic losses. In this study, we systematically compare the performance of four supervised learning models - Logistic Regression, Random Forest, Light Gradient Boosting Machine (LightGBM), and a Gated Recurrent Unit (GRU) network - on a large-scale, highly imbalanced online transaction dataset. While ensemble methods such as Random Forest and LightGBM demonstrated superior performance in both overall and class-specific metrics, Logistic Regression offered a reliable and interpretable baseline. The GRU model showed strong recall for the minority fraud class, though at the cost of precision, highlighting a trade-off relevant for real-world deployment. Our evaluation emphasizes not only weighted averages but also per-class precision, recall, and F1-scores, providing a nuanced view of each model's effectiveness in detecting rare but consequential fraudulent activity. The findings underscore the importance of choosing models based on the specific risk tolerance and operational needs of fraud detection systems.


ZTree: A Subgroup Identification Based Decision Tree Learning Framework

arXiv.org Artificial Intelligence

Decision trees are a commonly used class of machine learning models valued for their interpretability and versatility, capable of both classification and regression. We propose ZTree, a novel decision tree learning framework that replaces CART's traditional purity based splitting with statistically principled subgroup identification. At each node, ZTree applies hypothesis testing (e.g., z-tests, t-tests, Mann-Whitney U, log-rank) to assess whether a candidate subgroup differs meaningfully from the complement. To adjust for the complication of multiple testing, we employ a cross-validation-based approach to determine if further node splitting is needed. This robust stopping criterion eliminates the need for post-pruning and makes the test threshold (z-threshold) the only parameter for controlling tree complexity. Because of the simplicity of the tree growing procedure, once a detailed tree is learned using the most lenient z-threshold, all simpler trees can be derived by simply removing nodes that do not meet the larger z-thresholds. This makes parameter tuning intuitive and efficient. Furthermore, this z-threshold is essentially a p-value, allowing users to easily plug in appropriate statistical tests into our framework without adjusting the range of parameter search. Empirical evaluation on five large-scale UCI datasets demonstrates that ZTree consistently delivers strong performance, especially at low data regimes. Compared to CART, ZTree also tends to grow simpler trees without sacrificing performance. ZTree introduces a statistically grounded alternative to traditional decision tree splitting by leveraging hypothesis testing and a cross-validation approach to multiple testing correction, resulting in an efficient and flexible framework.


The Honest Truth About Causal Trees: Accuracy Limits for Heterogeneous Treatment Effect Estimation

arXiv.org Machine Learning

Recursive decision trees have emerged as a leading methodology for heterogeneous causal treatment effect estimation and inference in experimental and observational settings. These procedures are fitted using the celebrated CART (Classification And Regression Tree) algorithm [Breiman et al., 1984], or custom variants thereof, and hence are believed to be "adaptive" to high-dimensional data, sparsity, or other specific features of the underlying data generating process. Athey and Imbens [2016] proposed several "honest" causal decision tree estimators, which have become the standard in both academia and industry. We study their estimators, and variants thereof, and establish lower bounds on their estimation error. We demonstrate that these popular heterogeneous treatment effect estimators cannot achieve a polynomial-in-$n$ convergence rate under basic conditions, where $n$ denotes the sample size. Contrary to common belief, honesty does not resolve these limitations and at best delivers negligible logarithmic improvements in sample size or dimension. As a result, these commonly used estimators can exhibit poor performance in practice, and even be inconsistent in some settings. Our theoretical insights are empirically validated through simulations.