 Decision Tree Learning


Evolutionary algorithms for constructing an ensemble of decision trees

arXiv.org Machine Learning

Most decision tree induction algorithms are based on a greedy top-down recursive partitioning strategy for tree growth. In this paper, we propose several methods for the induction of decision trees and their ensembles based on evolutionary algorithms. The main difference of our approach is the use of a real-valued vector representation of a decision tree, which allows a large number of different optimization algorithms to be applied and the whole tree or ensemble to be optimized at once in order to avoid local optima. Differential evolution and evolution strategies were chosen as optimization algorithms, as they have shown good results in reinforcement learning problems. We test the predictive performance of these methods on several public UCI data sets, and the proposed methods show better quality than classical methods.


Boosted and Differentially Private Ensembles of Decision Trees

arXiv.org Machine Learning

Boosted ensembles of decision tree (DT) classifiers are extremely popular in international competitions, yet to our knowledge nothing is formally known on how to make them also differentially private (DP), up to the point that random forests currently reign supreme in the DP stage. Our paper starts with the proof that the privacy vs boosting picture for DT involves a notable and general technical tradeoff: the sensitivity tends to increase with the boosting rate of the loss, for any proper loss. DT induction algorithms being fundamentally iterative, our finding implies non-trivial choices to select or tune the loss to balance noise against utility to split nodes. To address this, we craft a new parameterized proper loss, called the Mα-loss, which, as we show, allows the tradeoff to be finely tuned across the complete spectrum of sensitivity vs boosting guarantees. We then introduce objective calibration as a method to adaptively tune the tradeoff during DT induction to limit the privacy budget spent while formally being able to keep boosting-compliant convergence on limited-depth nodes with high probability. Extensive experiments on 19 UCI domains reveal that objective calibration is highly competitive, even in the DP-free setting. Our approach tends to very significantly beat random forests, in particular on high DP regimes (ε ≤ 0.1) and even with boosted ensembles containing ten times fewer trees, which could be crucial to keep a key feature of DT models under differential privacy: interpretability.


A random forest based approach for predicting spreads in the primary catastrophe bond market

arXiv.org Machine Learning

We introduce a random forest approach to predicting spreads in the primary catastrophe bond market. We investigate whether all information provided to investors in the offering circular prior to a new issuance is equally important in predicting its spread. The whole population of non-life catastrophe bonds issued from December 2009 to May 2018 is used. The random forest shows impressive predictive power on unseen primary catastrophe bond data, explaining 93% of the total variability. For comparison, linear regression, our benchmark model, has inferior predictive performance, explaining only 47% of the total variability. All details provided in the offering circular are predictive of spread, but to varying degrees. The stability of the results is studied. The use of random forests can speed up investment decisions in the catastrophe bond industry.


WISDoM: a framework for the Analysis of Wishart distributed matrices

arXiv.org Machine Learning

APPENDIX A. Visualizing the Wishart Distribution

The Wishart distribution is a generalization to multiple dimensions of the chi-squared distribution or, in the case of non-integer degrees of freedom, of the gamma distribution. We show in fig. 5 that for a 1-dimensional scale matrix Σ equal to 1, the Wishart distribution W_1(n, 1) is equivalent to the χ²(n) distribution.

Figure 5: Monodimensional Wishart distribution and χ²(n) distribution comparison

Save for this simple case, because the Wishart is a distribution over matrices, it is generally hard to visualize it as a density function. Samples can, however, be drawn from it, and the eigenvectors and eigenvalues of the resulting sampled matrix can be used to define an ellipse. An example of this technique is shown in fig. 6.
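The one-dimensional equivalence stated in Appendix A can be checked numerically. The sketch below is illustrative and not taken from the WISDoM paper: the function names, the seed, and the choice n = 5 are assumptions. It draws W_1(n, 1) samples as sums of n squared standard normals and compares their sample mean and variance with draws from χ²(n), generated here as a gamma variate with shape n/2 and scale 2.

```python
import random
import statistics


def sample_wishart_1d(n, rng):
    """One draw from W_1(n, 1): the sum of n squared standard normals."""
    return sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(n))


def sample_chi2(n, rng):
    """One draw from chi-squared(n), expressed as Gamma(shape=n/2, scale=2)."""
    return rng.gammavariate(n / 2.0, 2.0)


# Hypothetical settings chosen only for this illustration.
rng = random.Random(0)
n = 5
num_draws = 20000

wishart_draws = [sample_wishart_1d(n, rng) for _ in range(num_draws)]
chi2_draws = [sample_chi2(n, rng) for _ in range(num_draws)]

# Both samples should have mean close to n and variance close to 2n.
wishart_mean = statistics.fmean(wishart_draws)
chi2_mean = statistics.fmean(chi2_draws)
wishart_var = statistics.variance(wishart_draws)
chi2_var = statistics.variance(chi2_draws)
```

With enough draws, both means should sit near n and both variances near 2n, which is what the equivalence predicts. This only compares two moments; a histogram overlay like the one in Figure 5 would compare the full densities.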


Survey of Network Intrusion Detection Methods from the Perspective of the Knowledge Discovery in Databases Process

arXiv.org Artificial Intelligence

The identification of cyberattacks which target information and communication systems has been a focus of the research community for years. Network intrusion detection is a complex problem which presents a diverse range of challenges. Many attacks currently remain undetected, while newer ones emerge due to the proliferation of connected devices and the evolution of communication technology. In this survey, we review the methods that have been applied to network data with the purpose of developing an intrusion detector, but contrary to previous reviews in the area, we analyze them from the perspective of the Knowledge Discovery in Databases (KDD) process. As such, we discuss the techniques used for the capture, preparation and transformation of the data, as well as the data mining and evaluation methods. In addition, we present the characteristics and motivations behind the use of each of these techniques and propose more adequate and up-to-date taxonomies and definitions for intrusion detectors based on the terminology used in the area of data mining and KDD. Special importance is given to the evaluation procedures followed to assess the different detectors, discussing their applicability in current real networks. Finally, as a result of this literature review, we investigate some open issues which will need to be considered for further research in the area of network security.


Estimating heterogeneous treatment effects with right-censored data via causal survival forests

arXiv.org Machine Learning

There is a fast-growing literature on estimating heterogeneous treatment effects via random forests in observational studies. However, there are few approaches available for right-censored survival data. In clinical trials, right-censored survival data are frequently encountered. Quantifying the causal relationship between a treatment and the survival outcome is of great interest. Random forests provide a robust, nonparametric approach to statistical estimation. In addition, recent developments allow forest-based methods to quantify the uncertainty of the estimated heterogeneous treatment effects. We propose causal survival forests that directly target the estimation of the treatment effect from an observational study. We establish consistency and asymptotic normality of the proposed estimators and provide an estimator of the asymptotic variance that enables valid confidence intervals for the estimated treatment effect. The performance of our approach is demonstrated via extensive simulations and data from an HIV study.


Estimation of high frequency nutrient concentrations from water quality surrogates using machine learning methods

arXiv.org Machine Learning

Abstract

Continuous high frequency water quality monitoring is becoming a critical task to support water management. Despite the advancements in sensor technologies, certain variables cannot be easily and/or economically monitored in-situ and in real time. In these cases, surrogate measures can be used to make estimations by means of data-driven models. In this work, variables that are commonly measured in-situ are used as surrogates to estimate the concentrations of nutrients in a rural catchment and in an urban one, making use of machine learning models, specifically Random Forests. The results are compared with those of linear modelling using the same number of surrogates, obtaining a reduction in the Root Mean Squared Error (RMSE) of up to 60.1%. The benefit of including up to seven surrogate sensors was computed, concluding that adding more than 4 and 5 sensors in each of the catchments, respectively, was not worthwhile in terms of error improvement.

Keywords: water monitoring, water quality, surrogate parameters, random forests, soft-sensors, machine learning

1. Introduction

Waterbodies must maintain a good ecological and chemical status in order to protect human health, preserve water supply and safeguard natural ecosystems and biodiversity. The assessment of the ecological status of these waterbodies in a coherent and comprehensive way would benefit from improving water quality monitoring programmes (Voulvoulis et al., 2017). To date, many substances like major nutrients (nitrogen (N) and phosphorus (P)) are mostly monitored by means of analytical discrete campaigns with low sampling frequency. Nutrient monitoring is of great importance to reduce the risk of eutrophication, a water quality problem that leads to numerous negative impacts like public health issues, fish mortality and unhealthy ecosystems, among others.
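The abstract reports model quality as a percentage reduction in RMSE relative to linear modelling. A minimal sketch of how such a figure is typically computed follows; the helper names and the numbers in the usage lines are made up for illustration and are not the paper's data or code.

```python
import math


def rmse(observed, predicted):
    """Root Mean Squared Error between two equal-length sequences."""
    squared_errors = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def rmse_reduction_percent(rmse_baseline, rmse_model):
    """Relative RMSE reduction of a model with respect to a baseline, in percent."""
    return 100.0 * (rmse_baseline - rmse_model) / rmse_baseline


# Hypothetical nutrient concentrations and predictions (illustrative values only).
observed = [1.0, 2.0, 3.0, 4.0]
linear_predictions = [2.0, 3.0, 4.0, 5.0]   # stand-in for the linear baseline
forest_predictions = [1.5, 2.5, 3.5, 4.5]   # stand-in for the random forest

baseline_rmse = rmse(observed, linear_predictions)
forest_rmse = rmse(observed, forest_predictions)
reduction = rmse_reduction_percent(baseline_rmse, forest_rmse)
```

In this toy case every baseline error is 1.0 and every forest error is 0.5, so the reduction is 50%. The same calculation, repeated while adding surrogate sensors one at a time, is one way to see where the error improvement levels off.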


Investigating Classification Techniques with Feature Selection For Intention Mining From Twitter Feed

arXiv.org Artificial Intelligence

In the last decade, social networks became the most popular medium for communication and interaction. As an example, the micro-blogging service Twitter has more than 200 million registered users who exchange more than 65 million posts per day. Users express their thoughts, ideas, and even their intentions through these tweets. Most tweets are written informally and often in slang that contains misspelt and abbreviated words. This paper investigates the problem of selecting features that affect the extraction of a user's intention from Twitter feeds based on text mining techniques. It starts by presenting the method we used to construct our own dataset from extracted Twitter feeds. Following that, we present two techniques of feature selection followed by classification. In the first technique, we use Information Gain as a one-phase feature selection, followed by supervised classification algorithms. In the second technique, we use a hybrid approach based on a forward feature selection algorithm, in which two feature selection techniques are employed, followed by classification algorithms. We examine these two techniques with four classification algorithms. We evaluate them using our own dataset, and we critically review the results.
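Information Gain, the one-phase feature selection criterion named in the abstract, scores a feature by how much knowing its value reduces the entropy of the class label. The sketch below is a generic textbook formulation, not the authors' implementation; the function names and the tiny tweet-like dataset are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy of the labels minus their entropy conditioned on the feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


# Hypothetical data: does a tweet contain a given word (1/0), and its intent label.
labels = ["buy", "buy", "none", "none"]
contains_want = [1, 1, 0, 0]   # perfectly separates the two classes
contains_the = [1, 0, 1, 0]    # carries no information about the class

gain_want = information_gain(contains_want, labels)
gain_the = information_gain(contains_the, labels)
```

Here the first feature recovers the full one bit of label entropy while the second recovers none, so a ranking by Information Gain would keep the first and discard the second before any classifier is trained.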


Time series forecasting with random forest

#artificialintelligence

Benjamin Franklin said that only two things are certain in life: death and taxes. That explains why my colleagues at STATWORX were less than excited when they told me about their plans for the weekend a few weeks back: doing their income tax declaration. Man, I thought, that sucks, I'd rather spend this time outdoors. And then an idea was born. What could taxes and the outdoors possibly have in common?


Improving Label Ranking Ensembles using Boosting Techniques

arXiv.org Machine Learning

Label ranking is a prediction task which deals with learning a mapping between an instance and a ranking (i.e., order) of labels from a finite set, representing their relevance to the instance. Boosting is a well-known and reliable ensemble technique that was shown to often outperform other learning algorithms. While boosting algorithms were developed for a multitude of machine learning tasks, label ranking tasks were overlooked. In this paper, we propose a boosting algorithm which was specifically designed for label ranking tasks. Extensive evaluation of the proposed algorithm on 24 semi-synthetic and real-world label ranking datasets shows that it significantly outperforms existing state-of-the-art label ranking algorithms.