NRGBoost: Energy-Based Generative Boosted Trees

Bravo, João

arXiv.org Artificial Intelligence 

Despite the rise to dominance of deep learning in unstructured data domains, tree-based methods such as Random Forests (RF) and Gradient Boosted Decision Trees (GBDT) are still the workhorses for handling discriminative tasks on tabular data. We explore generative extensions of these popular algorithms with a focus on explicitly modeling the data density (up to a normalization constant), thus enabling other applications besides sampling. As our main contribution, we propose an energy-based generative boosting algorithm that is analogous to the second-order boosting implemented in popular packages like XGBoost. We show that, despite producing a generative model capable of handling inference tasks over any input variable, our proposed algorithm can achieve similar discriminative performance to GBDT on a number of real-world tabular datasets, outperforming alternative generative approaches. At the same time, we show that it is also competitive with neural-network-based models for sampling.

Generative models have achieved tremendous success in computer vision and natural language processing, where the ability to generate synthetic data guided by user prompts opens up many exciting possibilities. While generating synthetic table records does not necessarily enjoy the same wide appeal, this problem has still received considerable attention as a potential avenue for bypassing privacy concerns when sharing data. Estimating the data density, p(x), is another typical application of generative models, one that enables a host of different use cases that can be particularly interesting for tabular data. Unlike discriminative models, which are trained to perform inference over a single target variable, density models can be used more flexibly for inference over different variables or for out-of-distribution detection. They can also handle inference with missing data in a principled way by marginalizing over unobserved variables.
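To make the last two points concrete, the following is a minimal sketch of how a density known only up to a normalization constant can answer conditional queries over any variable, with unobserved variables summed out. It is a toy illustration and not the algorithm proposed in this paper: the function name `conditional`, the random energy table, and the three binary variables are all assumptions made for the example.

```python
import numpy as np

def conditional(unnorm, target, observed):
    """Return p(x_target | observed) from an unnormalized joint table.

    unnorm:   array with one axis per discrete variable, entries proportional to p(x).
    target:   axis index of the variable to infer (must not be in `observed`).
    observed: dict mapping axis index -> observed value; all other axes are summed out.
    """
    # Slice the table at the observed values, keeping every other axis whole.
    index = tuple(observed.get(axis, slice(None)) for axis in range(unnorm.ndim))
    sliced = unnorm[index]
    # Axes that survive the slicing, in their original order.
    remaining = [axis for axis in range(unnorm.ndim) if axis not in observed]
    # Marginalize over every unobserved axis except the target.
    sum_axes = tuple(i for i, axis in enumerate(remaining) if axis != target)
    scores = sliced.sum(axis=sum_axes)
    # The unknown normalization constant cancels in this ratio.
    return scores / scores.sum()

# Hypothetical unnormalized density exp(-energy) over three binary variables.
rng = np.random.default_rng(0)
energy = rng.normal(size=(2, 2, 2))
unnorm = np.exp(-energy)

# Infer x2 given x0 = 1, with x1 missing (marginalized out).
p_x2_given_x0 = conditional(unnorm, target=2, observed={0: 1})
# The same table answers a query about a different variable.
p_x0_given_x1_x2 = conditional(unnorm, target=0, observed={1: 0, 2: 1})
```

Because the result is a ratio of sums of the same unnormalized table, rescaling `unnorm` by any positive constant leaves both conditionals unchanged, which is why modeling the density only up to a constant suffices for this kind of inference. Exhaustive summation is only feasible for tiny discrete domains such as this one.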
The development of generative models for tabular data has mirrored its progression in computer vision, with many of its Deep Learning (DL) approaches being adapted to the tabular domain (Jordon et al., 2018; Xu et al., 2019; Fan et al., 2020; Engelmann & Lessmann, 2021; Zhao et al., 2021; Kotelnikov et al., 2023). Unfortunately, these methods are only useful for sampling, as they either don't model the density explicitly or can't evaluate it due to intractable marginalization over high-dimensional latent variable spaces.