AITopics

2509.21465

Genre: Research Report > New Finding (1.00)

Industry:

Information Technology > Security & Privacy (1.00)
Health & Medicine (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Diagnosis (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)

arXiv.org Artificial IntelligenceSep-26-2025

Actively Learning Halfspaces without Synthetic Data

Black, Hadley, Larsen, Kasper Green, Mazumdar, Arya, Saha, Barna, So, Geelon

In the classic point location problem, one is given an arbitrary dataset $X \subset \mathbb{R}^d$ of $n$ points with query access to an unknown halfspace $f : \mathbb{R}^d \to \{0,1\}$, and the goal is to learn the label of every point in $X$. This problem is extremely well-studied and a nearly-optimal $\widetilde{O}(d \log n)$ query algorithm is known due to Hopkins-Kane-Lovett-Mahajan (FOCS 2020). However, their algorithm is granted the power to query arbitrary points outside of $X$ (point synthesis), and in fact without this power there is an $Ω(n)$ query lower bound due to Dasgupta (NeurIPS 2004). In this work our goal is to design efficient algorithms for learning halfspaces without point synthesis. To circumvent the $Ω(n)$ lower bound, we consider learning halfspaces whose normal vectors come from a set of size $D$, and show tight bounds of $Θ(D + \log n)$. As a corollary, we obtain an optimal $O(d + \log n)$ query deterministic learner for axis-aligned halfspaces, closing a previous gap of $O(d \log n)$ vs. $Ω(d + \log n)$. In fact, our algorithm solves the more general problem of learning a Boolean function $f$ over $n$ elements which is monotone under at least one of $D$ provided orderings. Our technical insight is to exploit the structure in these orderings to perform a binary search in parallel rather than considering each ordering sequentially, and we believe our approach may be of broader interest. Furthermore, we use our exact learning algorithm to obtain nearly optimal algorithms for PAC-learning. We show that $O(\min(D + \log(1/\varepsilon), 1/\varepsilon) \cdot \log D)$ queries suffice to learn $f$ within error $\varepsilon$, even in a setting when $f$ can be adversarially corrupted on a $c\varepsilon$-fraction of points, for a sufficiently small constant $c$. This bound is optimal up to a $\log D$ factor, including in the realizable setting.

algorithm, artificial intelligence, machine learning, (17 more...)

2509.20848

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (0.95)
Information Technology > Artificial Intelligence > Machine Learning > Computational Learning Theory (0.66)

arXiv.org Machine LearningSep-25-2025

Hierarchical Bayesian Operator-induced Symbolic Regression Trees for Structural Learning of Scientific Expressions

Roy, Somjit, Dey, Pritam, Pati, Debdeep, Mallick, Bani K.

The advent of Scientific Machine Learning has heralded a transformative era in scientific discovery, driving progress across diverse domains. Central to this progress is uncovering scientific laws from experimental data through symbolic regression. However, existing approaches are dominated by heuristic algorithms or data-hungry black-box methods, which often demand low-noise settings and lack principled uncertainty quantification. Motivated by interpretable Statistical Artificial Intelligence, we develop a hierarchical Bayesian framework for symbolic regression that represents scientific laws as ensembles of tree-structured symbolic expressions endowed with a regularized tree prior. This coherent probabilistic formulation enables full posterior inference via an efficient Markov chain Monte Carlo algorithm, yielding a balance between predictive accuracy and structural parsimony. To guide symbolic model selection, we develop a marginal posterior-based criterion adhering to the Occam's window principle and further quantify structural fidelity to ground truth through a tailored expression-distance metric. On the theoretical front, we establish near-minimax rate of Bayesian posterior concentration, providing the first rigorous guarantee in context of symbolic regression. Empirical evaluation demonstrates robust performance of our proposed methodology against state-of-the-art competing modules on a simulated example, a suite of canonical Feynman equations, and single-atom catalysis dataset.

expression, hierbosss, main manuscript, (14 more...)

2509.1971

Country:

North America > United States > Wisconsin > Dane County > Madison (0.14)
North America > United States > Texas > Brazos County > College Station (0.14)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
North America > United States > New York (0.04)

Genre: Research Report (0.81)

Industry: Health & Medicine (0.92)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.87)
(3 more...)

Panagiotou, Emmanouil, Ronval, Benoît, Roy, Arjun, Bothmann, Ludwig, Bischl, Bernd, Nijssen, Siegfried, Ntoutsi, Eirini

TABFAIRGDT: A Fast Fair Tabular Data Generator using Autoregressive Decision Trees

arXiv.org Artificial IntelligenceSep-25-2025

Ensuring fairness in machine learning remains a significant challenge, as models often inherit biases from their training data. Generative models have recently emerged as a promising approach to mitigate bias at the data level while preserving utility. However, many rely on deep architectures, despite evidence that simpler models can be highly effective for tabular data. In this work, we introduce TABFAIRGDT, a novel method for generating fair synthetic tabular data using autoregressive decision trees. To enforce fairness, we propose a soft leaf resampling technique that adjusts decision tree outputs to reduce bias while preserving predictive performance. Our approach is non-parametric, effectively capturing complex relationships between mixed feature types, without relying on assumptions about the underlying data distributions. We evaluate TABFAIRGDT on benchmark fairness datasets and demonstrate that it outperforms state-of-the-art (SOTA) deep generative models, achieving better fairness-utility trade-off for downstream tasks, as well as higher synthetic data quality. Moreover, our method is lightweight, highly efficient, and CPU-compatible, requiring no data pre-processing. Remarkably, TABFAIRGDT achieves a 72% average speedup over the fastest SOTA baseline across various dataset sizes, and can generate fair synthetic data for medium-sized datasets (10 features, 10K samples) in just one second on a standard CPU, making it an ideal solution for real-world fairness-sensitive applications.

artificial intelligence, fairness, machine learning, (17 more...)

2509.19927

Country:

North America > United States (0.68)
Europe > Germany (0.46)

Genre:

Research Report > New Finding (0.68)
Research Report > Promising Solution (0.54)

Industry: Government > Regional Government (0.67)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.34)

arXiv.org Artificial IntelligenceSep-23-2025

Medical priority fusion: achieving dual optimization of sensitivity and interpretability in nipt anomaly detection

Ge, Xiuqi, Yao, Zhibo, Du, Yaosong

Clinical machine learning faces a critical dilemma in high-stakes medical applications: algorithms achieving optimal diagnostic performance typically sacrifice the interpretability essential for physician decision-making, while interpretable methods compromise sensitivity in complex scenarios. This paradox becomes particularly acute in non-invasive prenatal testing (NIPT), where missed chromosomal abnormalities carry profound clinical consequences yet regulatory frameworks mandate explainable AI systems. We introduce Medical Priority Fusion (MPF), a constrained multi-objective optimization framework that resolves this fundamental trade-off by systematically integrating Naive Bayes probabilistic reasoning with Decision Tree rule-based logic through mathematically-principled weighted fusion under explicit medical constraints. Rigorous validation on 1,687 real-world NIPT samples characterized by extreme class imbalance (43.4:1 normal-to-abnormal ratio) employed stratified 5-fold cross-validation with comprehensive ablation studies and statistical hypothesis testing using McNemar's paired comparisons. MPF achieved simultaneous optimization of dual objectives: 89.3% sensitivity (95% CI: 83.9-94.7%) with 80% interpretability score, significantly outperforming individual algorithms (McNemar's test, p < 0.001). The optimal fusion configuration achieved Grade A clinical deployment criteria with large effect size (d = 1.24), establishing the first clinically-deployable solution that maintains both diagnostic accuracy and decision transparency essential for prenatal care. This work demonstrates that medical-constrained algorithm fusion can resolve the interpretability-performance trade-off, providing a mathematical framework for developing high-stakes medical decision support systems that meet both clinical efficacy and explainability requirements.

artificial intelligence, machine learning, medical priority fusion, (18 more...)

2509.17924

Country: North America > United States (0.28)

Genre:

Research Report > Experimental Study (1.00)
Research Report > New Finding (0.87)

Industry:

Health & Medicine > Therapeutic Area > Obstetrics/Gynecology (1.00)
Health & Medicine > Diagnostic Medicine (1.00)
Health & Medicine > Pharmaceuticals & Biotechnology (0.93)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
(2 more...)

Tang, Tiffany M., Levina, Elizaveta, Zhu, Ji

Interpretable Network-assisted Random Forest+

arXiv.org Machine LearningSep-22-2025

Machine learning algorithms often assume that training samples are independent. When data points are connected by a network, the induced dependency between samples is both a challenge, reducing effective sample size, and an opportunity to improve prediction by leveraging information from network neighbors. Multiple methods taking advantage of this opportunity are now available, but many, including graph neural networks, are not easily interpretable, limiting their usefulness for understanding how a model makes its predictions. Others, such as network-assisted linear regression, are interpretable but often yield substantially worse prediction performance. We bridge this gap by proposing a family of flexible network-assisted models built upon a generalization of random forests (RF+), which achieves highly-competitive prediction accuracy and can be interpreted through feature importance measures. In particular, we develop a suite of interpretation tools that enable practitioners to not only identify important features that drive model predictions, but also quantify the importance of the network contribution to prediction. Importantly, we provide both global and local importance measures as well as sample influence measures to assess the impact of a given observation. This suite of tools broadens the scope and applicability of network-assisted machine learning for high-impact problems where interpretability and transparency are essential.

feature importance, penalty, prediction, (15 more...)

2509.15611

Country:

North America > United States > Michigan (0.04)
Asia > Singapore (0.04)

Genre: Research Report > New Finding (0.67)

Industry:

Health & Medicine (0.93)
Education (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.49)

Lee, Jaeho, Kim, Kangjin, Lee, Gyeong Taek

Next-Depth Lookahead Tree

arXiv.org Machine LearningSep-19-2025

This paper proposes the Next-Depth Lookahead Tree (NDLT), a single-tree model designed to improve performance by evaluating node splits not only at the node being optimized but also by evaluating the quality of the next depth level. Conventional decision trees (DTs) rely on greedy node-by-node partitioning, which often fails to ensure global optimality of the tree and is prone to local optima when early splits are suboptimal. To overcome this limitation, NDLT employs a next-depth lookahead strategy that jointly considers the immediate impurity reduction at the parent node and the expected impurity reduction at its child nodes. Empirical evaluation on diverse and complex datasets, including high-dimensional and imbalanced cases, demonstrates that NDLT achieves performance comparable to or better than classical DTs and ensemble models such as Random Forests, XGBoost, and Light-GBM. These results show that NDLT preserves the interpretability of a single tree while delivering robust predictive accuracy, making it an effective approach for real-world applications where both transparency and performance are required. These authors contributed equally to this work. Introduction Decision Tree (DT) represents one of the earliest and most established algorithms in the field of machine learning, valued for its simplicity, inter-pretability, and broad applicability [1].

dataset, ndlt, node, (15 more...)

2509.15143

Country:

Europe > Germany > Bavaria > Regensburg (0.05)
North America > United States > Wisconsin (0.04)

Genre: Research Report > New Finding (0.34)

Industry: Health & Medicine > Therapeutic Area (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.34)

arXiv.org Artificial IntelligenceSep-19-2025

Evaluating Supervised Learning Models for Fraud Detection: A Comparative Study of Classical and Deep Architectures on Imbalanced Transaction Data

Wang, Chao, Nie, Chuanhao, Liu, Yunbo

Fraud detection remains a critical task in high-stakes domains such as finance and e-commerce, where undetected fraudulent transactions can lead to significant economic losses. In this study, we systematically compare the performance of four supervised learning models - Logistic Regression, Random Forest, Light Gradient Boosting Machine (LightGBM), and a Gated Recurrent Unit (GRU) network - on a large-scale, highly imbalanced online transaction dataset. While ensemble methods such as Random Forest and LightGBM demonstrated superior performance in both overall and class-specific metrics, Logistic Regression offered a reliable and interpretable baseline. The GRU model showed strong recall for the minority fraud class, though at the cost of precision, highlighting a trade-off relevant for real-world deployment. Our evaluation emphasizes not only weighted averages but also per-class precision, recall, and F1-scores, providing a nuanced view of each model's effectiveness in detecting rare but consequential fraudulent activity. The findings underscore the importance of choosing models based on the specific risk tolerance and operational needs of fraud detection systems.

artificial intelligence, logistic regression, machine learning, (17 more...)

2505.22521

Country: North America > United States (0.68)

Genre: Research Report > New Finding (1.00)

Industry:

Law Enforcement & Public Safety > Fraud (1.00)
Information Technology > Services > e-Commerce Services (0.34)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.97)
Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (0.91)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.91)

arXiv.org Artificial IntelligenceSep-17-2025

ZTree: A Subgroup Identification Based Decision Tree Learning Framework

Cheng, Eric, Cheng, Jie

Decision trees are a commonly used class of machine learning models valued for their interpretability and versatility, capable of both classification and regression. We propose ZTree, a novel decision tree learning framework that replaces CART's traditional purity based splitting with statistically principled subgroup identification. At each node, ZTree applies hypothesis testing (e.g., z-tests, t-tests, Mann-Whitney U, log-rank) to assess whether a candidate subgroup differs meaningfully from the complement. To adjust for the complication of multiple testing, we employ a cross-validation-based approach to determine if further node splitting is needed. This robust stopping criterion eliminates the need for post-pruning and makes the test threshold (z-threshold) the only parameter for controlling tree complexity. Because of the simplicity of the tree growing procedure, once a detailed tree is learned using the most lenient z-threshold, all simpler trees can be derived by simply removing nodes that do not meet the larger z-thresholds. This makes parameter tuning intuitive and efficient. Furthermore, this z-threshold is essentially a p-value, allowing users to easily plug in appropriate statistical tests into our framework without adjusting the range of parameter search. Empirical evaluation on five large-scale UCI datasets demonstrates that ZTree consistently delivers strong performance, especially at low data regimes. Compared to CART, ZTree also tends to grow simpler trees without sacrificing performance. ZTree introduces a statistically grounded alternative to traditional decision tree splitting by leveraging hypothesis testing and a cross-validation approach to multiple testing correction, resulting in an efficient and flexible framework.

artificial intelligence, machine learning, threshold, (16 more...)

2509.12688

Genre: Research Report > Experimental Study (1.00)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (0.47)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (1.00)

Cattaneo, Matias D., Klusowski, Jason M., Yu, Ruiqi Rae

The Honest Truth About Causal Trees: Accuracy Limits for Heterogeneous Treatment Effect Estimation

arXiv.org Machine LearningSep-16-2025

Recursive decision trees have emerged as a leading methodology for heterogeneous causal treatment effect estimation and inference in experimental and observational settings. These procedures are fitted using the celebrated CART (Classification And Regression Tree) algorithm [Breiman et al., 1984], or custom variants thereof, and hence are believed to be "adaptive" to high-dimensional data, sparsity, or other specific features of the underlying data generating process. Athey and Imbens [2016] proposed several "honest" causal decision tree estimators, which have become the standard in both academia and industry. We study their estimators, and variants thereof, and establish lower bounds on their estimation error. We demonstrate that these popular heterogeneous treatment effect estimators cannot achieve a polynomial-in-$n$ convergence rate under basic conditions, where $n$ denotes the sample size. Contrary to common belief, honesty does not resolve these limitations and at best delivers negligible logarithmic improvements in sample size or dimension. As a result, these commonly used estimators can exhibit poor performance in practice, and even be inconsistent in some settings. Our theoretical insights are empirically validated through simulations.

dim, estimator, log log, (14 more...)

2509.11381

Country:

North America > United States > Florida > Palm Beach County > Boca Raton (0.04)
Asia > Middle East > Israel (0.04)

Genre: Research Report > Experimental Study (0.45)

Industry: Government > Regional Government > North America Government > United States Government (0.45)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.46)