Decision Tree Learning
Are Trees Really Green? A Detection Approach of IoT Malware Attacks
Sanna, Silvia Lucia, Soi, Diego, Maiorca, Davide, Giacinto, Giorgio
Nowadays, the Internet of Things (IoT) is widely employed, and its usage is growing exponentially because it facilitates remote monitoring, predictive maintenance, and data-driven decision making, especially in the healthcare and industrial sectors. However, IoT devices remain vulnerable due to their resource constraints and difficulty in applying security patches. Consequently, various cybersecurity attacks are reported daily, such as Denial of Service, particularly in IoT-driven solutions. Most attack detection methodologies are based on Machine Learning (ML) techniques, which can detect attack patterns. However, the focus is more on identification rather than considering the impact of ML algorithms on computational resources. This paper proposes a green methodology to identify IoT malware networking attacks based on flow privacy-preserving statistical features. In particular, the hyperparameters of three tree-based models -- Decision Trees, Random Forest and Extra-Trees -- are optimized based on energy consumption and test-time performance in terms of Matthew's Correlation Coefficient. Our results show that models maintain high performance and detection accuracy while consistently reducing power usage in terms of watt-hours (Wh). This suggests that on-premise ML-based Intrusion Detection Systems are suitable for IoT and other resource-constrained devices.
Even Faster Hyperbolic Random Forests: A Beltrami-Klein Wrapper Approach
Chlenski, Philippe, Pe'er, Itsik
Decision trees and models that use them as primitives are workhorses of machine learning in Euclidean spaces. Recent work has further extended these models to the Lorentz model of hyperbolic space by replacing axis-parallel hyperplanes with homogeneous hyperplanes when partitioning the input space. In this paper, we show how the hyperDT algorithm can be elegantly reexpressed in the Beltrami-Klein model of hyperbolic spaces. This preserves the thresholding operation used in Euclidean decision trees, enabling us to further rewrite hyperDT as simple pre- and post-processing steps that form a wrapper around existing tree-based models designed for Euclidean spaces. The wrapper approach unlocks many optimizations already available in Euclidean space models, improving flexibility, speed, and accuracy while offering a simpler, more maintainable, and extensible codebase. Our implementation is available at https://github.com/pchlenski/hyperdt.
Interpretable reinforcement learning for heat pump control through asymmetric differentiable decision trees
Van Puyvelde, Toon, Zareh, Mehran, Develder, Chris
In recent years, deep reinforcement learning (DRL) algorithms have gained traction in home energy management systems. However, their adoption by energy management companies remains limited due to the black-box nature of DRL, which fails to provide transparent decision-making feedback. To address this, explainable reinforcement learning (XRL) techniques have emerged, aiming to make DRL decisions more transparent. Among these, soft differential decision tree (DDT) distillation provides a promising approach due to the clear decision rules they are based on, which can be efficiently computed. However, achieving high performance often requires deep, and completely full, trees, which reduces interpretability. To overcome this, we propose a novel asymmetric soft DDT construction method. Unlike traditional soft DDTs, our approach adaptively constructs trees by expanding nodes only when necessary. This improves the efficient use of decision nodes, which require a predetermined depth to construct full symmetric trees, enhancing both interpretability and performance. We demonstrate the potential of asymmetric DDTs to provide transparent, efficient, and high-performing decision-making in home energy management systems.
TIME: TabPFN-Integrated Multimodal Engine for Robust Tabular-Image Learning
Luo, Jiaqi, Yuan, Yuan, Xu, Shixin
Tabular-image multimodal learning, which integrates structured tabular data with imaging data, holds great promise for a variety of tasks, especially in medical applications. Yet, two key challenges remain: (1) the lack of a standardized, pretrained representation for tabular data, as is commonly available in vision and language domains; and (2) the difficulty of handling missing values in the tabular modality, which are common in real-world medical datasets. To address these issues, we propose the T abPFN-Integrated M ultimodal E ngine ( TIME), a novel multimodal framework that builds on the recently introduced tabular foundation model, TabPFN. TIME leverages TabPFN as a frozen tabular encoder to generate robust, strong embeddings that are naturally resilient to missing data, and combines them with image features from pretrained vision backbones. Extensive experiments demonstrate that TIME consistently outperforms competitive baselines across both complete and incomplete tabular inputs, underscoring its practical value in real-world mul-timodal learning scenarios. Keywords: Multimodal Learning, Tabular-Image, Pretrained Model, TabPFN 1. Introduction Multimodal learning has emerged as a powerful paradigm for integrating diverse data sources to enhance learning and decision-making across a wide range of domains [1]. Among the many forms of multimodal integration, tabular-image multimodal learning plays a uniquely important role, especially in clinical and biomedical applications [2, 3]. In such settings, structured tabular data such as laboratory test results often coexist with unstructured imaging data like X-rays.
Review for NeurIPS paper: Model Class Reliance for Random Forests
Weaknesses: The main concern I have with the paper is in the argument that the estimator does in fact converge to MCR and MCR- for random forests. Section 4.1 provides an argument that, as the number of trees goes to infinity, each tree will be replaced with one from its Rashomon set that is maximally dependent on X1 (when doing the MCR procedure). In finite samples, and with a finite number of trees, there are reasons to doubt whether this method provides consistent estimation of MCR for the random forest as a whole. The favorable generalization properties of random forests are known to be derived from the diversity of trees in the ensemble: a well known result of Breiman is that the generalization error decreases as the correlation of the residuals from the trees decreases. While the predictions of a tree and its surrogate may be identical for a given dataset, replacing a tree with the surrogate seems that it may decrease the expected generalization error of the tree as a whole.
Review for NeurIPS paper: Decision trees as partitioning machines to characterize their generalization properties
Weaknesses: * From the theoretical analysis, the main weakness might be the analysis on pure continuous features. Nowadays, it is very unlikely to have this scenario in the most challenging machine learning problems. Thus, the theoretical implications can be limited due to this. From the empirical evaluations, I am curious to know why the method was only compared to a very old algorithm such as CART. Is it the only reasonable algorithm out there for decision trees that can be comparable to the method proposed?
LLMPR: A Novel LLM-Driven Transfer Learning based Petition Ranking Model
Gayen, Avijit, Chakraborty, Somyajit, Sen, Mainak, Paul, Soham, Jana, Angshuman
The persistent accumulation of unresolved legal cases, especially within the Indian judiciary, significantly hampers the timely delivery of justice. Manual methods of prioritizing petitions are often prone to inefficiencies and subjective biases further exacerbating delays. To address this issue, we propose LLMPR (Large Language Model-based Petition Ranking), an automated framework that utilizes transfer learning and machine learning to assign priority rankings to legal petitions based on their contextual urgency. Leveraging the ILDC dataset comprising 7,593 annotated petitions, we process unstructured legal text and extract features through various embedding techniques, including DistilBERT, LegalBERT, and MiniLM. These textual embeddings are combined with quantitative indicators such as gap days, rank scores, and word counts to train multiple machine learning models, including Random Forest, Decision Tree, XGBoost, LightGBM, and CatBoost. Our experiments demonstrate that Random Forest and Decision Tree models yield superior performance, with accuracy exceeding 99% and a Spearman rank correlation of 0.99. Notably, models using only numerical features achieve nearly optimal ranking results (R2 = 0.988, \r{ho} = 0.998), while LLM-based embeddings offer only marginal gains. These findings suggest that automated petition ranking can effectively streamline judicial workflows, reduce case backlog, and improve fairness in legal prioritization.
Autoencoding Random Forests
Vu, Binh Duc, Kapar, Jan, Wright, Marvin, Watson, David S.
We propose a principled method for autoencoding with random forests. Our strategy builds on foundational results from nonparametric statistics and spectral graph theory to learn a low-dimensional embedding of the model that optimally represents relationships in the data. We provide exact and approximate solutions to the decoding problem via constrained optimization, split relabeling, and nearest neighbors regression. These methods effectively invert the compression pipeline, establishing a map from the embedding space back to the input space using splits learned by the ensemble's constituent trees. The resulting decoders are universally consistent under common regularity assumptions. The procedure works with supervised or unsupervised models, providing a window into conditional or joint distributions. We demonstrate various applications of this autoencoder, including powerful new tools for visualization, compression, clustering, and denoising. Experiments illustrate the ease and utility of our method in a wide range of settings, including tabular, image, and genomic data.
Interpretable DNFs
Cooper, Martin C., Bousdira, Imane, Carbonnel, Clément
A classifier is considered interpretable if each of its decisions has an explanation which is small enough to be easily understood by a human user. A DNF formula can be seen as a binary classifier $κ$ over boolean domains. The size of an explanation of a positive decision taken by a DNF $κ$ is bounded by the size of the terms in $κ$, since we can explain a positive decision by giving a term of $κ$ that evaluates to true. Since both positive and negative decisions must be explained, we consider that interpretable DNFs are those $κ$ for which both $κ$ and $\overlineκ$ can be expressed as DNFs composed of terms of bounded size. In this paper, we study the family of $k$-DNFs whose complements can also be expressed as $k$-DNFs. We compare two such families, namely depth-$k$ decision trees and nested $k$-DNFs, a novel family of models. Experiments indicate that nested $k$-DNFs are an interesting alternative to decision trees in terms of interpretability and accuracy.