Decision Tree Learning
A guide to feature engineering in time series with Tsfresh
Feature engineering plays a crucial role in many data modelling tasks. It is the process of deriving informative features from the data so that a model can perform better. In time series modelling, feature engineering works differently because the data is sequential: it is formed by the changes in a value over time. In this article, we discuss feature engineering for time series and cover an implementation using a package called tsfresh. The major points to be discussed in the article are listed below.
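As a rough illustration of what feature engineering for a time series involves, the sketch below turns each raw sequence into a handful of summary features. It is hand-rolled with the Python standard library and is not the tsfresh API; the function name `summarize_series`, the feature names, and the sample data are assumptions made for this example.

```python
from statistics import mean, pstdev


def summarize_series(values):
    """Collapse one time series (a list of floats) into summary features."""
    # Differences between consecutive observations capture how the value moves over time.
    diffs = [b - a for a, b in zip(values, values[1:])]
    return {
        "mean": mean(values),
        "std": pstdev(values),
        "minimum": min(values),
        "maximum": max(values),
        "mean_abs_change": mean([abs(d) for d in diffs]) if diffs else 0.0,
    }


# Hypothetical input: one list of observations per series id, already sorted by time.
series_by_id = {
    "sensor_a": [1.0, 2.0, 4.0, 7.0],
    "sensor_b": [5.0, 5.0, 5.0, 5.0],
}

# One row of features per series, ready to feed into a downstream model.
feature_table = {sid: summarize_series(vals) for sid, vals in series_by_id.items()}

for sid, features in feature_table.items():
    print(sid, features)
```

A library such as tsfresh automates this idea at scale by computing many such features per series, so a sketch like this is mainly useful for seeing what a single feature row represents.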
Deep Dynamic Boosted Forest
Wang, Haixin, Ren, Xingzhang, Sun, Jinan, Ye, Wei, Chen, Long, Yu, Muzhi, Zhang, Shikun
Random forest is widely exploited as an ensemble learning method. In many practical applications, however, learning from imbalanced data remains a significant challenge. To alleviate this limitation, we propose a deep dynamic boosted forest (DDBF), a novel ensemble algorithm that incorporates the notion of hard example mining into random forest. Specifically, we propose to measure the quality of each leaf node of every decision tree in the random forest to determine hard examples. By iteratively training and then removing easy examples from training data, we evolve the random forest to focus on hard examples dynamically so as to balance the proportion of samples and learn decision boundaries better. Data can be cascaded through these random forests learned in each iteration in sequence to generate more accurate predictions. Our DDBF outperforms random forest on 5 UCI datasets, MNIST and SATIMAGE, and achieves state-of-the-art results compared to other deep models. Moreover, we show that DDBF is also a new way of sampling and can be very useful and efficient when learning from imbalanced data.
SurvSet: An open-source time-to-event dataset repository
Time-to-event (T2E) analysis is a branch of statistics that models the duration of time it takes for an event to occur. Such events can include outcomes like death, unemployment, or product failure. Most modern machine learning (ML) algorithms, like decision trees and kernel methods, are supported for T2E modelling in data science software (Python and R). To complement these developments, SurvSet is the first open-source T2E dataset repository designed for rapid benchmarking of ML algorithms and statistical methods. The data in SurvSet have been consistently formatted so that a single preprocessing method will work for all datasets. SurvSet currently has 76 datasets which vary in dimensionality, time dependency, and background (the majority of which come from biomedicine). SurvSet is available on PyPI and can be installed with pip install SurvSet. R users can download the data directly from the corresponding git repository.
Fuzzy Forests For Feature Selection in High-Dimensional Survey Data: An Application to the 2020 U.S. Presidential Election
Dey, Sreemanti, Alvarez, R. Michael
An increasingly common methodological issue in the field of social science is high-dimensional and highly correlated datasets that are unamenable to the traditional deductive framework of study. Analysis of candidate choice in the 2020 Presidential Election is one area in which this issue presents itself: in order to test the many theories explaining the outcome of the election, it is necessary to use data such as the 2020 Cooperative Election Study Common Content, with hundreds of highly correlated features. We present the Fuzzy Forests algorithm, a variant of the popular Random Forests ensemble method, as an efficient way to reduce the feature space in such cases with minimal bias, while also maintaining predictive performance on par with common algorithms like Random Forests and logit. Using Fuzzy Forests, we isolate the top correlates of candidate choice and find that partisan polarization was the strongest factor driving the 2020 presidential election. Social science research today often encounters a difficult methodological situation -- ever larger datasets with high-dimensional, highly correlated features [7]. Quite literally, as in the application we discuss in our paper (the 2020 U.S. Presidential election), to test the many different theories and potential explanations for why voters decided to remove then President Trump from office, researchers need to use methodologies that can quickly and efficiently reduce the feature space from hundreds of possible features to a smaller set that can then be the focus of further study. In our paper we present a variant of the popular Random Forest, Fuzzy Forests, which we argue is well suited for exactly this type of applied machine learning problem [6]. Fuzzy Forests are ideal for feature selection in large and high-dimensional datasets, where the features are highly correlated.
Random Forest Regression
A few weeks ago, I wrote an article demonstrating random forest classification models. In this article, we will demonstrate the regression case of random forest using sklearn's RandomForestRegressor() model. As in my last article, I will begin by highlighting some definitions and terms that form the backbone of random forest machine learning. The goal of this article is to describe the random forest model and demonstrate how it can be applied using the sklearn package. Our goal will not be to find the optimal solution, as this is just a basic guide.
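A minimal sketch of the regression case might look like the following. The synthetic sine-wave data, the 150/50 split, and the hyperparameter values are assumptions chosen for illustration and are not taken from the article; only `RandomForestRegressor` and its `fit`, `predict`, and `score` methods come from sklearn.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (an assumption for this sketch): y = sin(x) plus a little noise.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 6.0, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

# Simple hold-out split: first 150 rows for training, last 50 for evaluation.
X_train, y_train = X[:150], y[:150]
X_test, y_test = X[150:], y[150:]

# Each tree is fit on a bootstrap sample; the forest averages the trees' predictions.
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

predictions = model.predict(X_test)
r2 = model.score(X_test, y_test)  # coefficient of determination on held-out data
print(f"R^2 on held-out data: {r2:.3f}")
```

Because each prediction is an average of training targets, the forest cannot predict values outside the range seen during training, which is worth keeping in mind when the target has a trend.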
Making use of supercomputers in financial machine learning
Cotte, Philippe, Lagier, Pierre, Margot, Vincent, Geissler, Christophe
This article is the result of a collaboration between Fujitsu and Advestis. This collaboration aims at refactoring and running an algorithm based on systematic exploration producing investment recommendations on a high-performance computer of the Fugaku type [11], to see whether a very high number of cores could allow for a deeper exploration of the data compared to a cloud machine, hopefully resulting in better predictions. We found that an increase in the number of explored rules results in a net increase in the predictive performance of the final ruleset. Also, in the particular case of this study, we found that using more than around 40 cores does not bring a significant computation time gain. However, the origin of this limitation is explained by a threshold-based search heuristic used to prune the search space. We have evidence that for similar data sets with less restrictive thresholds, the number of cores actually used could very well be much higher, allowing parallelization to have a much greater effect.
KL Divergence Estimation with Multi-group Attribution
Gopalan, Parikshit, Narodytska, Nina, Reingold, Omer, Sharan, Vatsal, Wieder, Udi
Estimating the Kullback-Leibler (KL) divergence between two distributions given samples from them is well-studied in machine learning and information theory. Motivated by considerations of multi-group fairness, we seek KL divergence estimates that accurately reflect the contributions of sub-populations to the overall divergence. We model the sub-populations coming from a rich (possibly infinite) family $\mathcal{C}$ of overlapping subsets of the domain. We propose the notion of multi-group attribution for $\mathcal{C}$, which requires that the estimated divergence conditioned on every sub-population in $\mathcal{C}$ satisfies some natural accuracy and fairness desiderata, such as ensuring that sub-populations where the model predicts significant divergence do diverge significantly in the two distributions. Our main technical contribution is to show that multi-group attribution can be derived from the recently introduced notion of multi-calibration for importance weights [HKRR18, GRSW21]. We provide experimental evidence to support our theoretical results, and show that multi-group attribution provides better KL divergence estimates when conditioned on sub-populations than other popular algorithms.
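For reference, the quantity being estimated can be written out. The block below states the standard definition of the KL divergence for discrete distributions $P$ and $Q$ and its rewriting in terms of an importance weight; the symbols $P$, $Q$, and $w$ are notation chosen here and may differ from the paper's.

```latex
D_{\mathrm{KL}}(P \,\|\, Q)
  = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}
  = \mathbb{E}_{x \sim P}\left[ \log w(x) \right],
\qquad w(x) = \frac{P(x)}{Q(x)}.
```

Read this way, an estimate of the importance weights $w$ yields an estimate of the divergence, which is one plausible way to see why calibration guarantees on importance weights are relevant to the problem.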
Visualizing Decision Trees with Pybaobabdt
Data visualization is the language of decision-making. Good charts effectively convey information. Decision trees can be visualized in multiple ways. Take, for instance, the indentation nodes where every internal and leaf node is depicted as text, while the parent-child relationship is shown by indenting the child with respect to the parent. Then there is the node-link diagram. It is one of the most commonly used methods to visualize decision trees where the nodes are represented via glyphs, and parent and child nodes are connected through links.
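The indentation style described above can be sketched in a few lines. This is a generic illustration and not the Pybaobabdt API; the nested-dictionary tree format, the function name `render_indented`, and the example split rules are all assumptions made for this example.

```python
def render_indented(node, depth=0, indent="  "):
    """Return one text line per node, with children indented under their parent."""
    lines = [indent * depth + node["label"]]
    for child in node.get("children", []):
        lines.extend(render_indented(child, depth + 1, indent))
    return lines


# Hypothetical tree: internal nodes hold a split rule, leaves hold a class label.
tree = {
    "label": "petal_length <= 2.5",
    "children": [
        {"label": "class: setosa"},
        {
            "label": "petal_width <= 1.7",
            "children": [
                {"label": "class: versicolor"},
                {"label": "class: virginica"},
            ],
        },
    ],
}

lines = render_indented(tree)
print("\n".join(lines))
```

A node-link diagram carries the same parent-child information but draws it as glyphs connected by links rather than as indentation depth.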
Top resources to learn decision trees in 2022
Decision trees are a supervised learning method used to build a model that predicts the value of a target variable by learning simple decision rules from the data features. DTs are used for both classification and regression and are simple to understand and interpret. Below, we have listed the top online courses, YouTube videos and guides for enthusiasts to master decision trees. The course by Codecademy focuses on teaching developers how to build and use decision trees and random forests. The course looks at two methods in detail: Gini impurity and Information Gain.
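The two split criteria mentioned above can be computed directly from class labels. The sketch below uses the standard textbook definitions (Gini impurity as one minus the sum of squared class proportions, information gain as the drop in entropy after a split); the function names and the toy labels are assumptions for illustration and are not taken from the course.

```python
from collections import Counter
from math import log2


def gini_impurity(labels):
    """1 - sum(p_k^2) over the class proportions p_k; 0 means a pure node."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits of the class distribution in a node."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((count / n) * log2(count / n) for count in Counter(labels).values())


def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted


# Toy split (hypothetical data): a balanced parent separated into two pure children.
parent = ["yes", "yes", "no", "no"]
left, right = ["yes", "yes"], ["no", "no"]

print("Gini of parent:", gini_impurity(parent))
print("Information gain of split:", information_gain(parent, left, right))
```

A tree learner evaluates candidate splits with a criterion like these and keeps the split that leaves the children purest.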
On Uncertainty Estimation by Tree-based Surrogate Models in Sequential Model-based Optimization
Sequential model-based optimization sequentially selects a candidate point by constructing a surrogate model with the history of evaluations, to solve a black-box optimization problem. Gaussian process (GP) regression is a popular choice as a surrogate model, because of its capability of calculating prediction uncertainty analytically. On the other hand, an ensemble of randomized trees is another option and has practical merits over GPs due to its scalability and ease of handling continuous/discrete mixed variables. In this paper we revisit various ensembles of randomized trees to investigate their behavior from the perspective of prediction uncertainty estimation. Then, we propose a new way of constructing an ensemble of randomized trees, referred to as BwO forest, where bagging with oversampling is employed to construct bootstrapped samples that are used to build randomized trees with random splitting. Experimental results demonstrate the validity and good performance of BwO forest over existing tree-based models in various circumstances.