Learning to Classify with Branching Tests: "A decision tree takes as input an object or situation described by a set of properties, and outputs a yes/no decision. Decision trees therefore represent Boolean functions. Functions with a larger range of outputs can also be represented...."
– Artificial Intelligence: A Modern Approach, by Stuart Russell & Peter Norvig (2002), Section 18.3, page 531.
Industrial NTL detection systems are still largely based on expert knowledge when deciding whether to carry out costly on-site inspections of customers. Electricity providers are reluctant to move to large-scale deployments of machine learning systems, as these may suggest a large number of unnecessary inspections. Providers therefore want to understand why a specific customer was, or was not, predicted to commit electricity theft. As a consequence, the models used should be interpretable, for example decision tree models rather than black-box models such as deep learning. We have also recently proposed a method for visualizing prediction results at various granularity levels in a spatial hologram.
When it comes to chatbots, businesses want to know one thing. The million-dollar question for a market that will be worth billions within a few years is: can my virtual agent answer my customers' questions? Assuming your chatbot has robust natural language processing (NLP) technology, the most effective way to discover the intent behind your customers' questions and provide the right answer is by using a decision tree. What are decision trees, and how do they work?
By Alejandro Correa Bahnsen. Almost everyone has heard the words "Machine Learning," but most people don't fully understand what they mean. Machine Learning isn't a single formula that is simply applied to a problem. There are many algorithms to choose from, each of which can be used to achieve different goals. This is the first in a series of articles that will introduce Machine Learning algorithms to help you understand how they work and when to use each one.
Decision Tree classifiers are intuitive, interpretable, and one of my favorite supervised learning algorithms. In this episode, I'll walk you through writing a Decision Tree classifier from scratch, in pure Python. I'll introduce concepts including Decision Tree Learning, Gini Impurity, and Information Gain. Then, we'll code it all up. Understanding how to accomplish this was helpful to me when I studied Machine Learning for the first time, and I hope it will prove useful to you as well.
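As a minimal pure-Python sketch of the two quantities mentioned above (illustrative names, not the episode's actual code): Gini impurity measures how mixed a set of labels is, and information gain is measured here as the drop in weighted Gini impurity after a split (the CART-style variant, rather than entropy-based gain).

```python
from collections import Counter

def gini_impurity(labels):
    """Chance of mislabeling a random item if we guessed labels
    according to this set's label distribution."""
    n = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def info_gain(parent, left, right):
    """Impurity of the parent set minus the size-weighted impurity
    of the two child sets produced by a candidate split."""
    n = len(left) + len(right)
    weighted = (len(left) / n) * gini_impurity(left) \
             + (len(right) / n) * gini_impurity(right)
    return gini_impurity(parent) - weighted

labels = ["apple", "apple", "grape", "grape"]
print(gini_impurity(labels))                      # 0.5 (maximally mixed two-class set)
print(info_gain(labels, ["apple", "apple"], ["grape", "grape"]))  # 0.5 (a perfect split)
```

A from-scratch tree learner repeatedly tries candidate splits and keeps the one with the highest gain, recursing until impurity reaches zero.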
Summary: Unless you're involved in anomaly detection, you may never have heard of unsupervised decision trees. It's a very interesting approach to decision trees that on the surface doesn't sound possible, but in practice it is the backbone of modern intrusion detection. I was at a presentation recently that focused on stream processing, but the use case presented was anomaly detection. When the speakers started talking about unsupervised decision trees, my antennae went up. What do you mean, unsupervised decision trees?
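One way the idea becomes concrete is the isolation tree (the building block of Isolation Forest, a popular tree-based anomaly detector): splits are chosen at random with no labels at all, and a point that ends up isolated after only a few random splits is likely an anomaly. The sketch below is a simplified 1-D illustration with made-up names, not the method from the presentation.

```python
import random

def isolation_depth(x, data, depth=0, max_depth=10):
    """Number of random splits needed to separate point x from the
    rest of the (unlabeled) data; anomalies need fewer splits."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    lo, hi = min(data), max(data)
    if lo == hi:
        return depth
    split = random.uniform(lo, hi)
    # keep only the points on the same side of the split as x
    side = [v for v in data if (v < split) == (x < split)]
    return isolation_depth(x, side, depth + 1, max_depth)

random.seed(0)
points = [10.0, 10.5, 9.8, 10.2, 9.9, 10.1, 50.0]  # 50.0 is the outlier
avg = {p: sum(isolation_depth(p, points) for _ in range(200)) / 200
       for p in points}
# 50.0 is typically isolated in about one split; the clustered
# points take several, so a short average path flags an anomaly.
```

No labels are ever consulted, which is what makes the tree "unsupervised": structure alone, averaged over many random trees, separates the unusual from the usual.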
We all use the decision tree technique on a daily basis to plan our lives; we just don't give a fancy name to that decision-making process. Businesses use supervised machine learning techniques like decision trees to make better decisions and earn more profit. Decision trees have been around for a long time and are known to suffer from bias and variance: you will have large bias with simple trees and large variance with complex trees. Ensemble methods combine several decision trees to produce better predictive performance than a single decision tree.
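The simplest ensemble of trees is bagging, which can be sketched as follows (illustrative names, 1-D toy data): train many one-level trees ("decision stumps") on bootstrap resamples and let them vote. Each stump alone is high-bias; the majority vote over resamples reduces variance.

```python
import random

def fit_stump(points):
    """Threshold t maximizing accuracy of the rule "predict 1 if x > t"."""
    candidates = sorted(set(x for x, _ in points))
    def accuracy(t):
        return sum((x > t) == (y == 1) for x, y in points) / len(points)
    return max(candidates, key=accuracy)

def bagged_predict(thresholds, x):
    """Majority vote over all the stumps in the bag."""
    votes = sum(x > t for t in thresholds)
    return int(votes > len(thresholds) / 2)

random.seed(1)
# toy data: class 0 below 1.0, class 1 at 1.0 and above
data = [(i / 10, 0) for i in range(10)] + [(i / 10, 1) for i in range(10, 20)]
# each stump sees a different bootstrap resample of the data
stumps = [fit_stump([random.choice(data) for _ in data]) for _ in range(25)]
print(bagged_predict(stumps, 0.2))  # 0
print(bagged_predict(stumps, 1.5))  # 1
```

Random forests extend this idea by also randomizing the features considered at each split, decorrelating the trees further.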
A decision tree is a tree-shaped diagram that shows statistical probabilities or determines a course of action. A decision tree has three main parts: a root node, leaf nodes, and branches. The root and internal nodes hold the test criteria, the branches represent the possible outcomes of each test, and the leaf nodes contain the final decisions. AnswerMiner makes the exploration of data much faster and easier and builds decision trees in a second.
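The three parts can be sketched with a tiny hypothetical tree (the classic "play tennis" style example, with made-up attribute names): internal nodes hold a test, each branch is one outcome of that test, and leaves hold the final decision.

```python
# Internal nodes are dicts with a "test" attribute and "branches";
# leaves are plain strings holding the decision.
tree = {
    "test": "outlook",  # root node: the first attribute tested
    "branches": {
        "sunny": {"test": "humidity",
                  "branches": {"high": "no", "normal": "yes"}},
        "overcast": "yes",  # leaf: a final decision
        "rainy": {"test": "windy",
                  "branches": {True: "no", False: "yes"}},
    },
}

def decide(node, example):
    """Follow branches from the root until a leaf (a string) is reached."""
    while isinstance(node, dict):
        node = node["branches"][example[node["test"]]]
    return node

print(decide(tree, {"outlook": "sunny", "humidity": "high"}))  # no
print(decide(tree, {"outlook": "overcast"}))                   # yes
```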
As the data that is fed to it becomes larger, the decision tree tends to grow deeper. In such cases, noise and corrupt or incorrect data can have a detrimental impact on the decision tree. This results in the decision tree overfitting the dataset: the tree performs satisfactorily on the training data but fails to produce an appropriate approximation of the target concept when it encounters unseen data. Overfitting can also occur when insufficient data is provided to build the decision tree (like, perhaps, our previous example with only 6 rows).
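The effect can be demonstrated in a short pure-Python sketch (illustrative names, 1-D data, not any particular library): a greedy Gini-based tree grown to full depth memorizes two deliberately flipped training labels, while a depth-limited tree recovers the underlying rule "y = 1 if x > 0.5" and generalizes better.

```python
def gini(labels):
    p = labels.count(1) / len(labels)
    return 2 * p * (1 - p)

def grow(points, depth, max_depth):
    labels = [y for _, y in points]
    majority = max(set(labels), key=labels.count)
    xs = sorted(set(x for x, _ in points))[:-1]  # keep right child nonempty
    if depth >= max_depth or gini(labels) == 0 or not xs:
        return majority  # leaf: predict the majority class
    def split_impurity(t):
        left = [y for x, y in points if x <= t]
        right = [y for x, y in points if x > t]
        return (len(left) * gini(left) + len(right) * gini(right)) / len(points)
    t = min(xs, key=split_impurity)  # greedy Gini-minimizing split
    return (t,
            grow([p for p in points if p[0] <= t], depth + 1, max_depth),
            grow([p for p in points if p[0] > t], depth + 1, max_depth))

def predict(node, x):
    while isinstance(node, tuple):
        t, left, right = node
        node = left if x <= t else right
    return node

def accuracy(tree, data):
    return sum(predict(tree, x) == y for x, y in data) / len(data)

# True concept: y = 1 if x > 0.5; two training labels are flipped (noise).
train = [(i / 20, int(i / 20 > 0.5)) for i in range(20)]
train[3] = (train[3][0], 1)    # noisy label
train[15] = (train[15][0], 0)  # noisy label
test = [(i / 20, int(i / 20 > 0.5)) for i in range(20)]  # clean labels

full = grow(train, 0, max_depth=99)   # memorizes the noise: perfect on train
pruned = grow(train, 0, max_depth=1)  # a single split: imperfect on train,
                                      # but better on the clean test set
```

Limiting depth (or pruning after the fact) is the standard remedy: the tree trades a little training accuracy for a better approximation of the target concept.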
Unlike many supervised learning algorithms, the decision tree algorithm can be used for solving both regression and classification problems. Decision trees follow a Sum of Products (SOP) representation: for a given class, every branch from the root of the tree to a leaf node of that class is a conjunction (product) of attribute values, and the different branches ending in that class form a disjunction (sum). The model is considered to be overfitting when the algorithm continues to grow the tree deeper and deeper to reduce the training-set error but ends up increasing the test-set error, i.e., the prediction accuracy of our model goes down.
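The SOP view can be made concrete by walking a small hypothetical tree (made-up attributes) and collecting one conjunction per root-to-leaf path; all paths ending in the same class together form that class's disjunction.

```python
# Internal nodes are (attribute, branches) tuples; leaves are class strings.
tree = ("outlook", {
    "sunny": ("humidity", {"high": "no", "normal": "yes"}),
    "overcast": "yes",
    "rainy": ("windy", {True: "no", False: "yes"}),
})

def rules(node, path=()):
    """Collect (conjunction, class) pairs, one per root-to-leaf path."""
    if isinstance(node, str):
        return [(path, node)]
    attr, branches = node
    out = []
    for value, child in branches.items():
        out += rules(child, path + ((attr, value),))
    return out

sop = {}  # class -> list of conjunctions, i.e. the disjunction for that class
for conjunction, cls in rules(tree):
    sop.setdefault(cls, []).append(conjunction)
# "yes" is reached by three paths, e.g. (outlook=sunny AND humidity=normal)
# OR (outlook=overcast) OR (outlook=rainy AND windy=False).
```

Each tuple in `sop["yes"]` is one product term; the list as a whole is the sum, which is exactly the SOP formula the paragraph describes.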