Goto

Collaborating Authors

 classification tree



Decision Tree Embedding by Leaf-Means

Shen, Cencheng, Dong, Yuexiao, Priebe, Carey E.

arXiv.org Machine Learning

Decision trees and random forest remain highly competitive for classification on medium-sized, standard datasets due to their robustness, minimal preprocessing requirements, and interpretability. However, a single tree suffers from high estimation variance, while large ensembles reduce this variance at the cost of substantial computational overhead and diminished interpretability. In this paper, we propose Decision Tree Embedding (DTE), a fast and effective method that leverages the leaf partitions of a trained classification tree to construct an interpretable feature representation. By using the sample means within each leaf region as anchor points, DTE maps inputs into an embedding space defined by the tree's partition structure, effectively circumventing the high variance inherent in decision-tree splitting rules. We further introduce an ensemble extension based on additional bootstrap trees, and pair the resulting embedding with linear discriminant analysis for classification. We establish several population-level theoretical properties of DTE, including its preservation of conditional density under mild conditions and a characterization of the resulting classification error. Empirical studies on synthetic and real datasets demonstrate that DTE strikes a strong balance between accuracy and computational efficiency, outperforming or matching random forest and shallow neural networks while requiring only a fraction of their training time in most cases. Overall, the proposed DTE method can be viewed either as a scalable decision tree classifier that improves upon standard split rules, or as a neural network model whose weights are learned from tree-derived anchor points, achieving an intriguing integration of both paradigms.


A Unified Optimization Framework for Multiclass Classification with Structured Hyperplane Arrangements

Blanco, Víctor, Kothari, Harshit, Luedtke, James

arXiv.org Artificial Intelligence

In this paper, we propose a new mathematical optimization model for multiclass classification based on arrangements of hyperplanes. Our approach preserves the core support vector machine (SVM) paradigm of maximizing class separation while minimizing misclassification errors, and it is computationally more efficient than a previous formulation. We present a kernel-based extension that allows it to construct nonlinear decision boundaries. Furthermore, we show how the framework can naturally incorporate alternative geometric structures, including classification trees, $\ell_p$-SVMs, and models with discrete feature selection. To address large-scale instances, we develop a dynamic clustering matheuristic that leverages the proposed MIP formulation. Extensive computational experiments demonstrate the efficiency of the proposed model and dynamic clustering heuristic, and we report competitive classification performance on both synthetic datasets and real-world benchmarks from the UCI Machine Learning Repository, comparing our method with state-of-the-art implementations available in scikit-learn.




Ensemble of Precision-Recall Curve (PRC) Classification Trees with Autoencoders

Miao, Jiaju, Zhu, Wei

arXiv.org Machine Learning

Anomaly detection underpins critical applications--from network security and intrusion detection to fraud prevention--where recognizing aberrant patterns rapidly is indispensable. Progress in this area is routinely impeded by two obstacles: extreme class imbalance and the curse of dimensionality. To combat the former, we previously introduced Precision-Recall Curve (PRC) classification trees and their ensemble extension, the PRC Random Forest (PRC-RF). Building on that foundation, we now propose a hybrid framework that integrates PRC-RF with autoencoders--unsupervised machine learning methods that learn compact latent representations--to confront both challenges simultaneously. Extensive experiments across diverse benchmark datasets demonstrate that the resulting Autoencoder-PRC-RF model achieves superior accuracy, scalability, and in-terpretability relative to prior methods, affirming its potential for high-stakes anomaly-detection tasks.


A portable diagnosis model for Keratoconus using a smartphone

Li, Yifan, Ho, Peter, Chong, Jo Woon

arXiv.org Artificial Intelligence

Keratoconus (KC) is a corneal disorder that results in blurry and distorted vision. Traditional diagnostic tools, while effective, are often bulky, costly, and require professional operation. In this paper, we present a portable and innovative methodology for diagnosing. Our proposed approach first captures the image reflected on the eye's cornea when a smartphone screen-generated Placido disc sheds its light on an eye, then utilizes a two-stage diagnosis for identifying the KC cornea and pinpointing the location of the KC on the cornea. The first stage estimates the height and width of the Placido disc extracted from the captured image to identify whether it has KC. In this KC identification, k-means clustering is implemented to discern statistical characteristics, such as height and width values of extracted Placido discs, from non-KC (control) and KC-affected groups. The second stage involves the creation of a distance matrix, providing a precise localization of KC on the cornea, which is critical for efficient treatment planning. The analysis of these distance matrices, paired with a logistic regression model and robust statistical analysis, reveals a clear distinction between control and KC groups. The logistic regression model, which classifies small areas on the cornea as either control or KC-affected based on the corresponding inter-disc distances in the distance matrix, reported a classification accuracy of 96.94%, which indicates that we can effectively pinpoint the protrusion caused by KC. This comprehensive, smartphone-based method is expected to detect KC and streamline timely treatment.


Experiments with Optimal Model Trees

Roselli, Sabino Francesco, Frank, Eibe

arXiv.org Artificial Intelligence

Model trees provide an appealing way to perform interpretable machine learning for both classification and regression problems. In contrast to ``classic'' decision trees with constant values in their leaves, model trees can use linear combinations of predictor variables in their leaf nodes to form predictions, which can help achieve higher accuracy and smaller trees. Typical algorithms for learning model trees from training data work in a greedy fashion, growing the tree in a top-down manner by recursively splitting the data into smaller and smaller subsets. Crucially, the selected splits are only locally optimal, potentially rendering the tree overly complex and less accurate than a tree whose structure is globally optimal for the training data. In this paper, we empirically investigate the effect of constructing globally optimal model trees for classification and regression with linear support vector machines at the leaf nodes. To this end, we present mixed-integer linear programming formulations to learn optimal trees, compute such trees for a large collection of benchmark data sets, and compare their performance against greedily grown model trees in terms of interpretability and accuracy. We also compare to classic optimal and greedily grown decision trees, random forests, and support vector machines. Our results show that optimal model trees can achieve competitive accuracy with very small trees. We also investigate the effect on the accuracy of replacing axis-parallel splits with multivariate ones, foregoing interpretability while potentially obtaining greater accuracy.


Modifying Final Splits of Classification Tree for Fine-tuning Subpopulation Target in Policy Making

Wang, Lei Bill, Jiao, Zhenbang, Wang, Fangyi

arXiv.org Machine Learning

Policymakers often use Classification and Regression Trees (CART) to partition populations based on binary outcomes and target subpopulations whose probability of the binary event exceeds a threshold. However, classic CART and knowledge distillation method whose student model is a CART (referred to as KD-CART) do not minimize the misclassification risk associated with classifying the latent probabilities of these binary events. To reduce the misclassification risk, we propose two methods, Penalized Final Split (PFS) and Maximizing Distance Final Split (MDFS). PFS incorporates a tunable penalty into the standard CART splitting criterion function. MDFS maximizes a weighted sum of distances between node means and the threshold. It can point-identify the optimal split under the unique intersect latent probability assumption. In addition, we develop theoretical result for MDFS splitting rule estimation, which has zero asymptotic risk. Through extensive simulation studies, we demonstrate that these methods predominately outperform classic CART and KD-CART in terms of misclassification error. Furthermore, in our empirical evaluations, these methods provide deeper insights than the two baseline methods.


Identifying Dealbreakers and Robust Policies for the Energy Transition Amid Unexpected Events

Coppitters, Diederik, Wiest, Gabriel, Göke, Leonard, Contino, Francesco, Bardow, André, Moret, Stefano

arXiv.org Artificial Intelligence

Disruptions in energy imports, backlash in social acceptance, and novel technologies failing to develop are unexpected events that are often overlooked in energy planning, despite their ability to jeopardize the energy transition. We propose a method to explore unexpected events and assess their impact on the transition pathway of a large-scale whole-energy system. First, we evaluate unexpected events assuming "perfect foresight", where decision-makers can anticipate such events in advance. This allows us to identify dealbreakers, i.e., conditions that make the transition infeasible. Then, we assess the events under "limited foresight" to evaluate the robustness of early-stage decisions against unforeseen unexpected events and the costs associated with managing them. A case study for Belgium demonstrates that a lack of electrofuel imports in 2050 is the main dealbreaker, while accelerating the deployment of renewables is the most robust policy. Our transferable method can help policymakers identify key dealbreakers and devise robust energy transition policies.