AITopics

Technology: Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (0.30)

arXiv.org Machine Learning, Mar-6-2022

Deep Dynamic Boosted Forest

Wang, Haixin, Ren, Xingzhang, Sun, Jinan, Ye, Wei, Chen, Long, Yu, Muzhi, Zhang, Shikun

Random forest is widely used as an ensemble learning method. In many practical applications, however, learning from imbalanced data remains a significant challenge. To alleviate this limitation, we propose deep dynamic boosted forest (DDBF), a novel ensemble algorithm that incorporates the notion of hard example mining into random forest. Specifically, we propose to measure the quality of each leaf node of every decision tree in the random forest to determine hard examples. By iteratively training and then removing easy examples from the training data, we evolve the random forest to focus on hard examples dynamically so as to balance the proportion of samples and learn decision boundaries better. Data can then be cascaded in sequence through the random forests learned in each iteration to generate more accurate predictions. Our DDBF outperforms random forest on 5 UCI datasets, MNIST and SATIMAGE, and achieves state-of-the-art results compared to other deep models. Moreover, we show that DDBF is also a new way of sampling and can be very useful and efficient when learning from imbalanced data.
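The train-then-remove-easy-examples loop described in the abstract can be illustrated with a small sketch. This is not the authors' DDBF implementation: it assumes 1-D inputs, uses bagged decision stumps in place of full random forests, and substitutes the ensemble vote fraction for the paper's leaf-node quality measure. The names train_cascade, cascade_predict and easy_margin are hypothetical.

```python
import random

def fit_stump(points):
    """Fit a 1-D threshold classifier by exhaustive search over thresholds."""
    best = None
    for x, _ in points:
        for sign in (1, -1):
            errors = sum(1 for xi, yi in points
                         if (1 if sign * (xi - x) >= 0 else 0) != yi)
            if best is None or errors < best[0]:
                best = (errors, x, sign)
    return best[1], best[2]

def stump_predict(stump, x):
    threshold, sign = stump
    return 1 if sign * (x - threshold) >= 0 else 0

def fit_forest(points, n_trees, rng):
    """Bagging: every stump is trained on a bootstrap resample."""
    return [fit_stump([rng.choice(points) for _ in points])
            for _ in range(n_trees)]

def vote_fraction(forest, x):
    """Fraction of stumps voting for class 1."""
    return sum(stump_predict(s, x) for s in forest) / len(forest)

def train_cascade(points, n_stages=3, n_trees=15, easy_margin=0.9, seed=0):
    """Train one ensemble per stage, dropping easy examples between stages."""
    rng = random.Random(seed)
    stages, sizes = [], []
    remaining = list(points)
    for _ in range(n_stages):
        if len(remaining) < 2:
            break
        forest = fit_forest(remaining, n_trees, rng)
        stages.append(forest)
        sizes.append(len(remaining))
        # Assumption: an example is "hard" when the ensemble's confidence
        # in its true label falls below easy_margin.
        hard = []
        for x, y in remaining:
            p = vote_fraction(forest, x)
            if (p if y == 1 else 1 - p) < easy_margin:
                hard.append((x, y))
        if not hard or len(hard) == len(remaining):
            break
        remaining = hard
    return stages, sizes

def cascade_predict(stages, x, easy_margin=0.9):
    """Pass x through the stages and stop at the first confident ensemble."""
    p = 0.5
    for forest in stages:
        p = vote_fraction(forest, x)
        if max(p, 1 - p) >= easy_margin:
            break
    return 1 if p >= 0.5 else 0

# Synthetic imbalanced data: 45 negatives, 8 positives, overlapping near x = 5..6.
data_rng = random.Random(1)
data = ([(data_rng.uniform(0, 6), 0) for _ in range(45)]
        + [(data_rng.uniform(5, 8), 1) for _ in range(8)])
stages, sizes = train_cascade(data)
print("training-set size per stage:", sizes)
```

Each later stage sees only the examples the previous ensemble was unsure about, which is what shifts the class proportions toward the hard region near the decision boundary.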

decision tree, hard example, iteration, (15 more...)

1804.0727

Country:

North America > United States > California > San Francisco County > San Francisco (0.14)
Asia > China > Beijing > Beijing (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

arXiv.org Machine Learning, Mar-6-2022

SurvSet: An open-source time-to-event dataset repository

Drysdale, Erik

Time-to-event (T2E) analysis is a branch of statistics that models the time it takes for an event to occur. Such events can include outcomes like death, unemployment, or product failure. Most modern machine learning (ML) algorithms, like decision trees and kernel methods, are supported for T2E modelling with data science software (Python and R). To complement these developments, SurvSet is the first open-source T2E dataset repository designed for rapid benchmarking of ML algorithms and statistical methods. The data in SurvSet have been consistently formatted so that a single preprocessing method will work for all datasets. SurvSet currently has 76 datasets which vary in dimensionality, time dependency, and background (the majority of which come from biomedicine). SurvSet is available on PyPI and can be installed with pip install SurvSet. R users can download the data directly from the corresponding git repository.
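As a small illustration of what T2E modelling involves, the sketch below computes a Kaplan-Meier survival curve from (duration, event) pairs in pure Python. It is not part of SurvSet and does not use its API; the function name and the toy data are assumptions made for the example. An event flag of 1 means the event was observed, and 0 means the observation was censored.

```python
def kaplan_meier(durations, events):
    """Return [(time, survival_probability)] at each distinct event time."""
    samples = sorted(zip(durations, events))
    at_risk = len(samples)
    survival = 1.0
    curve = []
    i = 0
    while i < len(samples):
        t = samples[i][0]
        observed = 0   # events that occurred at time t
        leaving = 0    # subjects leaving the risk set at t (events + censored)
        while i < len(samples) and samples[i][0] == t:
            observed += samples[i][1]
            leaving += 1
            i += 1
        if observed:
            survival *= 1 - observed / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve

# Toy data: five subjects, two of them censored (event flag 0).
durations = [2, 3, 3, 5, 8]
events = [1, 1, 0, 1, 0]
for t, s in kaplan_meier(durations, events):
    print(f"t={t}: S(t)={s:.2f}")
```

Censored subjects still count toward the risk set until they drop out, which is the main way T2E methods differ from ordinary regression on durations.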

cran, dataset, package version 1, (10 more...)

2203.03094

Country:

North America > United States > Pennsylvania > Philadelphia County > Philadelphia (0.04)
North America > United States > California > Los Angeles County > Los Angeles (0.04)
Europe > Sweden > Västerbotten County > Umeå (0.04)
(2 more...)

Genre:

Research Report > Experimental Study (1.00)
Research Report > Strength High (0.94)

Industry:

Health & Medicine > Therapeutic Area > Oncology (1.00)
Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (1.00)
Health & Medicine > Therapeutic Area > Immunology (1.00)
(4 more...)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.50)

Dey, Sreemanti, Alvarez, R. Michael

Fuzzy Forests For Feature Selection in High-Dimensional Survey Data: An Application to the 2020 U.S. Presidential Election

arXiv.org Machine Learning, Mar-5-2022

An increasingly common methodological issue in the field of social science is high-dimensional and highly correlated datasets that are unamenable to the traditional deductive framework of study. Analysis of candidate choice in the 2020 Presidential Election is one area in which this issue presents itself: in order to test the many theories explaining the outcome of the election, it is necessary to use data such as the 2020 Cooperative Election Study Common Content, with hundreds of highly correlated features. We present the Fuzzy Forests algorithm, a variant of the popular Random Forests ensemble method, as an efficient way to reduce the feature space in such cases with minimal bias, while also maintaining predictive performance on par with common algorithms like Random Forests and logit. Using Fuzzy Forests, we isolate the top correlates of candidate choice and find that partisan polarization was the strongest factor driving the 2020 presidential election. Social science research today often encounters a difficult methodological situation -- larger and larger datasets containing high-dimensional, highly correlated features [7]. Quite literally, as in the application we discuss in our paper (the 2020 U.S. presidential election), to test the many different theories and potential explanations for why voters decided to remove then-President Trump from office, researchers need to use methodologies that can quickly and efficiently reduce the feature space from hundreds of possible features to a smaller set that can then be the focus of further study. In our paper we present a variant of the popular Random Forest, Fuzzy Forests, which we argue is well suited for exactly this type of applied machine learning problem [6]. Fuzzy Forests are ideal for feature selection in large and high-dimensional datasets, where the features are highly correlated.

candidate choice, dataset, fuzzy forest, (15 more...)

2203.02818

Country:

North America > United States > California (0.04)
Asia > Middle East > Iran (0.04)
North America > United States > Illinois > Cook County > Chicago (0.04)
(3 more...)

Genre: Research Report > New Finding (0.68)

Industry:

Government > Voting & Elections (1.00)
Government > Regional Government > North America Government > United States Government (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (0.78)

#artificialintelligence, Mar-2-2022, 17:38:14 GMT

Random Forest Regression

A few weeks ago, I wrote an article demonstrating random forest classification models. In this article, we will demonstrate the regression case of random forest using sklearn's RandomForestRegressor() model. As in my last article, I will begin by highlighting some definitions and terms that form the backbone of random forest machine learning. The goal of this article is to describe the random forest model and demonstrate how it can be applied using the sklearn package. Our goal is not to find the optimal solution, as this is just a basic guide.

dataset, decision tree, random forest model, (9 more...)

Country: North America > United States (0.05)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (1.00)

Cotte, Philippe, Lagier, Pierre, Margot, Vincent, Geissler, Christophe

Making use of supercomputers in financial machine learning

arXiv.org Machine Learning, Mar-1-2022

This article is the result of a collaboration between Fujitsu and Advestis. This collaboration aims at refactoring and running an algorithm based on systematic exploration producing investment recommendations on a high-performance computer of the Fugaku type [11], to see whether a very high number of cores could allow for a deeper exploration of the data compared to a cloud machine, hopefully resulting in better predictions. We found that an increase in the number of explored rules results in a net increase in the predictive performance of the final ruleset. Also, in the particular case of this study, we found that using more than around 40 cores does not bring a significant computation time gain. However, the origin of this limitation is explained by a threshold-based search heuristic used to prune the search space. We have evidence that for similar data sets with less restrictive thresholds, the number of cores actually used could very well be much higher, allowing parallelization to have a much greater effect.

algorithm, length 1, ruleset, (16 more...)

2203.00427

Country:

Europe > France > Île-de-France > Paris > Paris (0.04)
Asia > Singapore (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Rule-Based Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (0.68)
Information Technology > Artificial Intelligence > Representation & Reasoning > Search (0.68)

Gopalan, Parikshit, Narodytska, Nina, Reingold, Omer, Sharan, Vatsal, Wieder, Udi

KL Divergence Estimation with Multi-group Attribution

arXiv.org Machine Learning, Feb-28-2022

Estimating the Kullback-Leibler (KL) divergence between two distributions given samples from them is well-studied in machine learning and information theory. Motivated by considerations of multi-group fairness, we seek KL divergence estimates that accurately reflect the contributions of sub-populations to the overall divergence. We model the sub-populations as coming from a rich (possibly infinite) family $\mathcal{C}$ of overlapping subsets of the domain. We propose the notion of multi-group attribution for $\mathcal{C}$, which requires that the estimated divergence conditioned on every sub-population in $\mathcal{C}$ satisfies some natural accuracy and fairness desiderata, such as ensuring that sub-populations where the model predicts significant divergence do diverge significantly in the two distributions. Our main technical contribution is to show that multi-group attribution can be derived from the recently introduced notion of multi-calibration for importance weights [HKRR18, GRSW21]. We provide experimental evidence to support our theoretical results, and show that multi-group attribution provides better KL divergence estimates when conditioned on sub-populations than other popular algorithms.
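To make the quantities concrete, the sketch below computes the KL divergence between two small discrete distributions and the divergence conditioned on overlapping sub-populations. It works with exactly known probabilities, so it illustrates the target quantities only; it is not the paper's sample-based estimator and does not implement multi-calibration. The distributions, the group names and the helper condition_on are made up for the example.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in nats for discrete distributions given as dicts."""
    return sum(px * math.log(px / q[x]) for x, px in p.items() if px > 0)

def condition_on(dist, subset):
    """Restrict a distribution to a subset of outcomes and renormalize."""
    mass = sum(dist[x] for x in subset)
    return {x: dist[x] / mass for x in subset}

p = {"a": 0.5, "b": 0.25, "c": 0.25}
q = {"a": 0.25, "b": 0.25, "c": 0.5}

# Overlapping sub-populations: outcome "b" belongs to both groups.
groups = {"group_1": ["a", "b"], "group_2": ["b", "c"]}

overall = kl_divergence(p, q)
conditional = {name: kl_divergence(condition_on(p, members),
                                   condition_on(q, members))
               for name, members in groups.items()}

print(f"overall KL: {overall:.4f}")
for name, value in conditional.items():
    print(f"KL conditioned on {name}: {value:.4f}")
```

A single overall estimate can hide how much each group contributes, which is why the abstract asks for accuracy on every sub-population in the family rather than only on the whole domain.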

algorithm, divergence, multi-group attribution, (16 more...)

2202.13576

Country:

Asia > Middle East > Jordan (0.04)
North America > United States > Michigan (0.04)
North America > United States > California > Santa Clara County > Palo Alto (0.04)
(2 more...)

Genre: Research Report > New Finding (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (0.46)

#artificialintelligence, Feb-24-2022, 11:05:49 GMT

Visualizing Decision Trees with Pybaobabdt

Data visualization is the language of decision-making. Good charts effectively convey information. Decision trees can be visualized in multiple ways. Take, for instance, the indentation nodes where every internal and leaf node is depicted as text, while the parent-child relationship is shown by indenting the child with respect to the parent. Then there is the node-link diagram. It is one of the most commonly used methods to visualize decision trees where the nodes are represented via glyphs, and parent and child nodes are connected through links.
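The indentation-based layout mentioned above can be sketched in a few lines of Python. This is not Pybaobabdt code; the nested-dict tree format and the function render_indented are assumptions made purely for illustration.

```python
def render_indented(node, depth=0):
    """Return text lines where each child is indented under its parent."""
    indent = "  " * depth
    if "label" in node:                      # leaf node
        return [f"{indent}predict {node['label']}"]
    lines = [f"{indent}{node['feature']} <= {node['threshold']}?"]
    lines += render_indented(node["yes"], depth + 1)
    lines += render_indented(node["no"], depth + 1)
    return lines

# Hypothetical two-level tree.
tree = {
    "feature": "petal_length", "threshold": 2.5,
    "yes": {"label": "setosa"},
    "no": {
        "feature": "petal_width", "threshold": 1.7,
        "yes": {"label": "versicolor"},
        "no": {"label": "virginica"},
    },
}

print("\n".join(render_indented(tree)))
```

A node-link diagram draws the same parent-child structure with glyphs and connecting lines instead of indentation.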

node, pybaobabdt, visualizing decision tree

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Diagnosis (0.90)
Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (0.90)

#artificialintelligence, Feb-21-2022, 08:15:54 GMT

Top resources to learn decision trees in 2022

Decision trees (DTs) are a supervised learning method used to build a model that predicts the value of a target variable by learning simple decision rules from the data features. DTs are used for both classification and regression and are simple to understand and interpret. Below, we have listed the top online courses, YouTube videos and guides for enthusiasts to master decision trees. The course by Codecademy focuses on teaching developers how to build and use decision trees and random forests. The course looks at two methods in detail: Gini impurity and Information Gain.
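For readers new to the two splitting criteria, here is a minimal pure-Python sketch of Gini impurity and entropy-based information gain for a binary split. It is not taken from any of the listed courses; the function names and the toy labels are illustrative.

```python
import math
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def entropy(labels):
    """Shannon entropy of the class distribution, in bits."""
    n = len(labels)
    return -sum((count / n) * math.log2(count / n)
                for count in Counter(labels).values())

def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

parent = ["yes", "yes", "no", "no"]
print(gini(parent))                                            # evenly mixed node
print(information_gain(parent, ["yes", "yes"], ["no", "no"]))  # perfect split
print(information_gain(parent, ["yes", "no"], ["yes", "no"]))  # uninformative split
```

A tree learner evaluates candidate splits with one of these scores and keeps the split that reduces impurity the most.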

decision tree, decision tree algorithm, university, (13 more...)

Country: North America > United States > Pennsylvania (0.05)

Genre: Instructional Material > Course Syllabus & Notes (1.00)

Industry:

Education > Educational Setting > Online (0.93)
Education > Educational Technology > Educational Software > Computer Based Training (0.36)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Diagnosis (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (1.00)

Kim, Jungtaek, Choi, Seungjin

On Uncertainty Estimation by Tree-based Surrogate Models in Sequential Model-based Optimization

arXiv.org Machine Learning, Feb-21-2022

Sequential model-based optimization sequentially selects a candidate point by constructing a surrogate model with the history of evaluations, to solve a black-box optimization problem. Gaussian process (GP) regression is a popular choice as a surrogate model, because of its capability of calculating prediction uncertainty analytically. On the other hand, an ensemble of randomized trees is another option and has practical merits over GPs due to its scalability and ease of handling continuous/discrete mixed variables. In this paper we revisit various ensembles of randomized trees to investigate their behavior from the perspective of prediction uncertainty estimation. Then, we propose a new way of constructing an ensemble of randomized trees, referred to as BwO forest, where bagging with oversampling is employed to construct bootstrapped samples that are used to build randomized trees with random splitting. Experimental results demonstrate the validity and good performance of BwO forest over existing tree-based models in various circumstances.
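The general idea of reading prediction uncertainty off a tree ensemble can be sketched as follows. This is a loose illustration, not the authors' BwO forest: it assumes 1-D inputs, uses depth-one regression stumps with uniformly random thresholds, and treats the spread of per-tree predictions as the uncertainty. The oversample factor and all function names are hypothetical.

```python
import random
import statistics

def fit_random_stump(sample, rng):
    """Regression stump with a threshold drawn uniformly at random."""
    xs = [x for x, _ in sample]
    threshold = rng.uniform(min(xs), max(xs))
    left = [y for x, y in sample if x <= threshold]
    right = [y for x, y in sample if x > threshold]
    overall = statistics.fmean(y for _, y in sample)
    left_mean = statistics.fmean(left) if left else overall
    right_mean = statistics.fmean(right) if right else overall
    return threshold, left_mean, right_mean

def fit_ensemble(data, n_trees=50, oversample=4, seed=0):
    """Bagging with oversampling: each bootstrap sample has oversample * n points."""
    rng = random.Random(seed)
    size = oversample * len(data)
    return [fit_random_stump([rng.choice(data) for _ in range(size)], rng)
            for _ in range(n_trees)]

def predict_with_uncertainty(ensemble, x):
    """Mean and variance of the per-tree predictions at x."""
    preds = [left if x <= threshold else right
             for threshold, left, right in ensemble]
    return statistics.fmean(preds), statistics.pvariance(preds)

# Toy history of evaluations: y = x^2 on a grid over [0, 2].
data = [(i / 10, (i / 10) ** 2) for i in range(21)]
ensemble = fit_ensemble(data)
mean, variance = predict_with_uncertainty(ensemble, 1.0)
print(f"prediction at x=1.0: mean={mean:.3f}, variance={variance:.3f}")
```

In sequential model-based optimization, the mean and variance would feed an acquisition function that decides which candidate point to evaluate next.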

bwo forest, mondrian forest, surrogate model, (12 more...)

2202.10669

Country:

North America > Canada > Quebec > Montreal (0.04)
North America > United States > Nevada (0.04)
North America > United States > Hawaii > Honolulu County > Honolulu (0.04)
(10 more...)

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (0.71)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.68)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.66)
(2 more...)