AITopics | Decision Tree Learning

Decision Tree Learning

Learning to Classify with Branching Tests: "A decision tree takes as input an object or situation described by a set of properties, and outputs a yes/no decision. Decision trees therefore represent Boolean functions. Functions with a larger range of outputs can also be represented...."
– Artificial Intelligence: A Modern Approach. By Stuart Russell & Peter Norvig. 2002. Section 18.3; page 531.
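
To make the quoted idea concrete, here is a minimal sketch of a decision tree that represents a Boolean function. The tuple-based tree layout, the `decide` helper, and the property names `raining` and `has_umbrella` are all assumptions chosen for illustration; they do not come from the book.

```python
# Hypothetical example: a hand-built decision tree over Boolean properties.
# An internal node is (property_name, subtree_if_false, subtree_if_true);
# a leaf is a plain bool holding the yes/no decision.

def decide(tree, example):
    """Walk the tree from the root, testing one property per node."""
    while not isinstance(tree, bool):
        prop, if_false, if_true = tree
        tree = if_true if example[prop] else if_false
    return tree

# Tree for the Boolean function: raining AND NOT has_umbrella
stay_home = ("raining", False, ("has_umbrella", True, False))

print(decide(stay_home, {"raining": True, "has_umbrella": False}))
print(decide(stay_home, {"raining": True, "has_umbrella": True}))
```

Each root-to-leaf path corresponds to a conjunction of property tests, so the tree as a whole encodes one Boolean function of its inputs.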

Evolutionary algorithms for constructing an ensemble of decision trees

Dolotov, Evgeny, Zolotykh, Nikolai

arXiv.org Machine Learning, Feb-3-2020

Most decision tree induction algorithms are based on a greedy top-down recursive partitioning strategy for tree growth. In this paper, we propose several methods for induction of decision trees and their ensembles based on evolutionary algorithms. The main difference in our approach is the use of a real-valued vector representation of a decision tree, which allows a large number of different optimization algorithms to be applied and lets the whole tree or ensemble be optimized at once to avoid local optima. Differential evolution and evolution strategies were chosen as optimization algorithms, as they have shown good results in reinforcement learning problems. We test the predictive performance of these methods using several public UCI data sets, and the proposed methods show better quality than classical methods.
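
As a rough illustration of what a real-valued vector representation of a tree could look like, the sketch below decodes a flat list of floats into a complete binary tree and scores it on labeled data. The encoding (per-node feature weights plus a threshold, followed by leaf scores) and the names `decode`, `predict`, and `accuracy` are assumptions made for this example; the abstract does not specify the encoding the authors use.

```python
# Assumed encoding, NOT taken from the paper:
# for each internal node: n_features weights (largest weight picks the
# feature to test) followed by one threshold; then one score per leaf
# (score > 0 means the positive class).

def decode(vec, n_features, depth):
    n_internal = 2 ** depth - 1
    n_leaves = 2 ** depth
    stride = n_features + 1
    assert len(vec) == n_internal * stride + n_leaves
    nodes = []
    for i in range(n_internal):
        chunk = vec[i * stride:(i + 1) * stride]
        weights, threshold = chunk[:-1], chunk[-1]
        feature = max(range(n_features), key=lambda j: weights[j])
        nodes.append((feature, threshold))
    leaves = vec[n_internal * stride:]
    return nodes, leaves

def predict(vec, x, depth):
    nodes, leaves = decode(vec, len(x), depth)
    i = 0  # heap-style indexing: children of node i are 2i+1 and 2i+2
    for _ in range(depth):
        feature, threshold = nodes[i]
        i = 2 * i + (2 if x[feature] > threshold else 1)
    return leaves[i - len(nodes)] > 0.0

def accuracy(vec, X, y, depth):
    """Fitness that an evolutionary optimizer could maximize over vec."""
    hits = sum(predict(vec, x, depth) == label for x, label in zip(X, y))
    return hits / len(y)

# Depth-1 tree on two features: test feature 0 against threshold 0.5.
vec = [1.0, 0.0, 0.5, -1.0, 1.0]
X = [[0.7, 0.0], [0.2, 9.0]]
y = [True, False]
print(accuracy(vec, X, y, depth=1))
```

Because every candidate tree is just a fixed-length list of floats, any black-box optimizer over real vectors can search the whole tree jointly instead of growing it split by split.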

algorithm, decision tree, ensemble, (15 more...)

arXiv.org Machine Learning

2002.00721

Country:

Asia > Russia (0.05)
Europe > Russia > Central Federal District > Moscow Oblast > Moscow (0.04)
Europe > Germany > Baden-Württemberg > Stuttgart Region > Stuttgart (0.04)

Genre: Research Report (0.40)

Industry: Education (0.34)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Evolutionary Systems (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (1.00)

Boosted and Differentially Private Ensembles of Decision Trees

Nock, Richard, Henecka, Wilko

arXiv.org Machine Learning, Feb-3-2020

Boosted ensembles of decision tree (DT) classifiers are extremely popular in international competitions, yet to our knowledge nothing is formally known on how to make them also differentially private (DP), up to the point that random forests currently reign supreme in the DP stage. Our paper starts with the proof that the privacy vs boosting picture for DT involves a notable and general technical tradeoff: the sensitivity tends to increase with the boosting rate of the loss, for any proper loss. DT induction algorithms being fundamentally iterative, our finding implies non-trivial choices to select or tune the loss to balance noise against utility to split nodes. To address this, we craft a new parameterized proper loss, called the M$\alpha$-loss, which, as we show, allows the tradeoff to be finely tuned across the complete spectrum of sensitivity vs boosting guarantees. We then introduce objective calibration as a method to adaptively tune the tradeoff during DT induction to limit the privacy budget spent while formally being able to keep boosting-compliant convergence on limited-depth nodes with high probability. Extensive experiments on 19 UCI domains reveal that objective calibration is highly competitive, even in the DP-free setting. Our approach tends to very significantly beat random forests, in particular in high-DP regimes ($\varepsilon \leq 0.1$) and even with boosted ensembles containing ten times fewer trees, which could be crucial to keep a key feature of DT models under differential privacy: interpretability.

budget, kearn & mansour, mechanism, (15 more...)

arXiv.org Machine Learning

2001.09384

Country:

North America > United States > Pennsylvania (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report > New Finding (1.00)

Industry: Information Technology > Security & Privacy (0.45)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (1.00)

A random forest based approach for predicting spreads in the primary catastrophe bond market

Makariou, Despoina, Barrieu, Pauline, Chen, Yining

arXiv.org Machine Learning, Jan-28-2020

We introduce a random forest approach to predict spreads in the primary catastrophe bond market. We investigate whether all information provided to investors in the offering circular prior to a new issuance is equally important in predicting its spread. The whole population of non-life catastrophe bonds issued from December 2009 to May 2018 is used. The random forest shows impressive predictive power on unseen primary catastrophe bond data, explaining 93% of the total variability. For comparison, linear regression, our benchmark model, has inferior predictive performance, explaining only 47% of the total variability. All details provided in the offering circular are predictive of spread, but to varying degrees. The stability of the results is studied. The use of random forests can speed up investment decisions in the catastrophe bond industry.

catastrophe bond, predictor, random forest, (15 more...)

arXiv.org Machine Learning

2001.10393

Country:

Europe (0.04)
South America (0.04)
North America > United States > Rhode Island > Providence County > Providence (0.04)

Genre: Research Report > New Finding (0.93)

Industry:

Banking & Finance > Trading (1.00)
Banking & Finance > Insurance (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.35)

WISDoM: a framework for the Analysis of Wishart distributed matrices

Mengucci, Carlo, Remondini, Daniel, Giampieri, Enrico

arXiv.org Machine Learning, Jan-28-2020

Appendix A. Visualizing the Wishart Distribution. The Wishart distribution is a generalization to multiple dimensions of the chi-squared distribution or, in the case of non-integer degrees of freedom, of the gamma distribution. We show in Fig. 5 that for a one-dimensional scale matrix Σ equal to 1, the Wishart distribution $W_1(n, 1)$ is equivalent to the $\chi^2(n)$ distribution. (Figure 5: Comparison of the one-dimensional Wishart distribution and the $\chi^2(n)$ distribution.) Apart from this simple case, since the Wishart is a distribution over matrices, it is generally hard to visualize as a density function. Samples can, however, be drawn from it, and the eigenvectors and eigenvalues of each sampled matrix can be used to define an ellipse. An example of this technique is shown in Fig. 6.
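
The one-dimensional case can be checked numerically with a small simulation. The sketch below relies on the standard construction of a Wishart draw with integer degrees of freedom as a sum of outer products of Gaussian vectors, which in one dimension with unit scale reduces to a sum of n squared standard normal draws. The function name `sample_wishart_1d`, the seed, and the sample size are assumptions for illustration and are not taken from the paper.

```python
import random
import statistics

def sample_wishart_1d(n, rng):
    """Draw from W_1(n, 1): the sum of n squared standard normal draws."""
    return sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(n))

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 5
draws = [sample_wishart_1d(n, rng) for _ in range(20000)]

mean = statistics.fmean(draws)
var = statistics.variance(draws)

# A chi-squared(n) variable has mean n and variance 2n,
# so both printed ratios should be close to 1.
print(mean / n, var / (2 * n))
```

Matching the first two moments is only a sanity check, not a proof of equality in distribution; comparing a histogram of `draws` with the chi-squared density would be a stronger visual check.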

classification, matrix, wishart distribution, (14 more...)

arXiv.org Machine Learning

2001.10342

Country:

Europe > Italy > Emilia-Romagna > Metropolitan City of Bologna > Bologna (0.05)
Asia > Taiwan (0.04)

Genre: Research Report (0.82)

Industry: Health & Medicine > Therapeutic Area > Neurology (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (0.94)

Survey of Network Intrusion Detection Methods from the Perspective of the Knowledge Discovery in Databases Process

Molina-Coronado, Borja, Mori, Usue, Mendiburu, Alexander, Miguel-Alonso, José

arXiv.org Artificial Intelligence, Jan-27-2020

The identification of cyberattacks which target information and communication systems has been a focus of the research community for years. Network intrusion detection is a complex problem which presents a diverse set of challenges. Many attacks currently remain undetected, while newer ones emerge due to the proliferation of connected devices and the evolution of communication technology. In this survey, we review the methods that have been applied to network data with the purpose of developing an intrusion detector, but contrary to previous reviews in the area, we analyze them from the perspective of the Knowledge Discovery in Databases (KDD) process. As such, we discuss the techniques used for the capture, preparation and transformation of the data, as well as the data mining and evaluation methods. In addition, we present the characteristics and motivations behind the use of each of these techniques and propose more adequate and up-to-date taxonomies and definitions for intrusion detectors based on the terminology used in the area of data mining and KDD. Special importance is given to the evaluation procedures followed to assess the different detectors, discussing their applicability in current real networks. Finally, as a result of this literature review, we investigate some open issues which will need to be considered for further research in the area of network security.

data mining, detection, machine learning, (20 more...)

arXiv.org Artificial Intelligence

doi: 10.1109/TNSM.2020.3016246

2001.09697

Country:

North America > United States > California > San Francisco County > San Francisco (0.14)
Europe > Spain > Basque Country (0.04)
Oceania > New Zealand > North Island > Waikato (0.04)

Genre: Overview (1.00)

Industry:

Information Technology > Security & Privacy (1.00)
Information Technology > Networks (1.00)
Government > Military > Cyberwarfare (0.34)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Estimating heterogeneous treatment effects with right-censored data via causal survival forests

Cui, Yifan, Kosorok, Michael R., Wager, Stefan, Zhu, Ruoqing

arXiv.org Machine Learning, Jan-27-2020

There is a fast-growing literature on estimating heterogeneous treatment effects via random forests in observational studies. However, there are few approaches available for right-censored survival data. In clinical trials, right-censored survival data are frequently encountered. Quantifying the causal relationship between a treatment and the survival outcome is of great interest. Random forests provide a robust, nonparametric approach to statistical estimation. In addition, recent developments allow forest-based methods to quantify the uncertainty of the estimated heterogeneous treatment effects. We propose causal survival forests that directly target the estimation of the treatment effect from an observational study. We establish consistency and asymptotic normality of the proposed estimators and provide an estimator of the asymptotic variance that enables valid confidence intervals for the estimated treatment effect. The performance of our approach is demonstrated via extensive simulations and data from an HIV study.

heterogeneous treatment effect, survival forest, treatment effect, (15 more...)

arXiv.org Machine Learning

2001.09887

Country:

North America > United States > Pennsylvania > Philadelphia County > Philadelphia (0.04)
North America > United States > North Carolina > Orange County > Chapel Hill (0.04)
North America > United States > New York (0.04)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)
Research Report > Strength High (0.94)

Industry:

Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (0.68)
Health & Medicine > Therapeutic Area > Immunology (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.68)

Estimation of high frequency nutrient concentrations from water quality surrogates using machine learning methods

Castrillo, María, García, Álvaro López

arXiv.org Machine Learning, Jan-27-2020

Abstract: Continuous high-frequency water quality monitoring is becoming a critical task to support water management. Despite the advancements in sensor technologies, certain variables cannot be easily and/or economically monitored in situ and in real time. In these cases, surrogate measures can be used to make estimations by means of data-driven models. In this work, variables that are commonly measured in situ are used as surrogates to estimate the concentrations of nutrients in a rural catchment and in an urban one, making use of machine learning models, specifically Random Forests. The results are compared with those of linear modelling using the same number of surrogates, obtaining a reduction in the Root Mean Squared Error (RMSE) of up to 60.1%. The profit from including up to seven surrogate sensors was computed, concluding that adding more than 4 and 5 sensors in each of the catchments respectively was not worthwhile in terms of error improvement. Keywords: water monitoring, water quality, surrogate parameters, random forests, soft-sensors, machine learning. 1. Introduction: Waterbodies must maintain a good ecological and chemical status in order to protect human health, preserve water supply and safeguard natural ecosystems and biodiversity. The assessment of the ecological status of these waterbodies in a coherent and comprehensive way would benefit from improving water quality monitoring programmes (Voulvoulis et al., 2017). To date, many substances like major nutrients (nitrogen (N) and phosphorus (P)) are mostly monitored by means of analytical discrete campaigns with low sampling frequency. Nutrient monitoring is of great importance to reduce the risk of eutrophication, a water quality problem that leads to numerous negative impacts like public health issues, fish mortality and unhealthy ecosystems, among others.
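
For readers unfamiliar with the metric, the following sketch shows how an RMSE and a relative RMSE reduction between a baseline and a second model can be computed. The observations and predictions are made-up numbers chosen only so the arithmetic is easy to follow; they are not data from the paper, and the helper names are assumptions.

```python
import math

def rmse(y_true, y_pred):
    """Root Mean Squared Error between observations and predictions."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

def rmse_reduction_percent(rmse_baseline, rmse_model):
    """Relative improvement of a model over a baseline, in percent."""
    return 100.0 * (rmse_baseline - rmse_model) / rmse_baseline

# Illustrative values only (hypothetical nutrient concentrations).
observed = [1.0, 2.0, 3.0, 4.0]
linear_pred = [1.5, 2.5, 2.5, 4.5]   # every error has magnitude 0.5
forest_pred = [1.2, 2.2, 2.8, 4.2]   # every error has magnitude 0.2

baseline = rmse(observed, linear_pred)
improved = rmse(observed, forest_pred)
print(round(rmse_reduction_percent(baseline, improved), 1))
```

A reduction "of up to 60.1%" in this sense means the second model's RMSE is roughly 40% of the baseline's in the best case reported.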

predictor, river enborne, sensor, (16 more...)

arXiv.org Machine Learning

doi: 10.1016/j.watres.2020.115490

2001.09695

Country:

Europe > United Kingdom > England (0.04)
North America > United States > Iowa (0.04)
North America > Costa Rica > Heredia Province > Heredia (0.04)

Genre:

Research Report (0.64)
Overview (0.46)

Industry: Water & Waste Management > Water Management > Water Supplies & Services (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)

Investigating Classification Techniques with Feature Selection For Intention Mining From Twitter Feed

Mishael, Qadri, Ayesh, Aladdin

arXiv.org Artificial Intelligence, Jan-22-2020

In the last decade, social networks became the most popular medium for communication and interaction. As an example, the micro-blogging service Twitter has more than 200 million registered users who exchange more than 65 million posts per day. Users express their thoughts, ideas, and even their intentions through these tweets. Most tweets are written informally and often in slang that contains misspelt and abbreviated words. This paper investigates the problem of selecting features that affect extracting a user's intention from Twitter feeds based on text mining techniques. It starts by presenting the method we used to construct our own dataset from extracted Twitter feeds. Following that, we present two techniques of feature selection followed by classification. In the first technique, we use Information Gain as a one-phase feature selection, followed by supervised classification algorithms. In the second technique, we use a hybrid approach based on a forward feature selection algorithm in which two feature selection techniques are employed, followed by classification algorithms. We examine these two techniques with four classification algorithms. We evaluate them using our own dataset, and we critically review the results.
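
Information Gain, as used for feature selection, measures how much knowing a feature's value reduces the entropy of the class labels. The sketch below computes it for discrete features; the toy labels and the two hypothetical word-presence features are invented for illustration and do not come from the paper's dataset.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """Entropy of the labels minus their expected entropy after
    splitting the examples by the feature's value."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Hypothetical tweets labeled by intention, with two binary features.
labels = ["buy", "buy", "none", "none"]
contains_want = [1, 1, 0, 0]   # perfectly separates the classes
contains_the = [1, 0, 1, 0]    # carries no information about the class

print(information_gain(contains_want, labels))
print(information_gain(contains_the, labels))
```

Ranking features by this score and keeping the top ones is one common way to apply Information Gain as a one-phase filter before training a classifier.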

algorithm, feature selection, intention, (13 more...)

arXiv.org Artificial Intelligence

2001.1038

Country:

Oceania > New Zealand > North Island > Waikato (0.04)
North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)

Genre: Research Report (1.00)

Industry: Information Technology > Services (1.00)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.68)

Time series forecasting with random forest

#artificialintelligence, Jan-21-2020, 19:02:31 GMT

Benjamin Franklin said that only two things are certain in life: death and taxes. That explains why my colleagues at STATWORX were less than excited when they told me about their plans for the weekend a few weeks back: doing their income tax declaration. Man, I thought, that sucks, I'd rather spend this time outdoors. And then an idea was born. What could taxes and the outdoors possibly have in common?

forecasting, random forest, time series, (11 more...)

#artificialintelligence

Country:

North America > Trinidad and Tobago > Trinidad > Arima > Arima (0.06)
Europe > Switzerland > Zürich > Zürich (0.05)
Europe > Austria > Vienna (0.05)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (0.60)
Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (0.60)

Improving Label Ranking Ensembles using Boosting Techniques

Dery, Lihi, Shmueli, Erez

arXiv.org Machine Learning, Jan-21-2020

Label ranking is a prediction task which deals with learning a mapping between an instance and a ranking (i.e., order) of labels from a finite set, representing their relevance to the instance. Boosting is a well-known and reliable ensemble technique that was shown to often outperform other learning algorithms. While boosting algorithms were developed for a multitude of machine learning tasks, label ranking tasks were overlooked. In this paper, we propose a boosting algorithm which was specifically designed for label ranking tasks. Extensive evaluation of the proposed algorithm on 24 semi-synthetic and real-world label ranking datasets shows that it significantly outperforms existing state-of-the-art label ranking algorithms.

adaboost, algorithm, dataset, (11 more...)

arXiv.org Machine Learning

2001.07744

Country:

Europe > Germany (0.04)
Asia > Middle East > Israel > Tel Aviv District > Tel Aviv (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Inductive Learning (0.52)
Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (0.48)
