 Decision Tree Learning


Early Forecasting of Text Classification Accuracy and F-Measure with Active Learning

arXiv.org Machine Learning

When creating text classification systems, one of the major bottlenecks is the annotation of training data. Active learning has been proposed to address this bottleneck using stopping methods to minimize the cost of data annotation. An important capability for improving the utility of stopping methods is to effectively forecast the performance of the text classification models. Forecasting can be done through the use of logarithmic models regressed on some portion of the data as learning is progressing. A critical unexplored question is what portion of the data is needed for accurate forecasting. There is a tension, where it is desirable to use less data so that the forecast can be made earlier, which is more useful, versus it being desirable to use more data, so that the forecast can be more accurate. We find that when using active learning it is even more important to generate forecasts earlier so as to make them more useful and not waste annotation effort. We investigate the difference in forecasting difficulty when using accuracy and F-measure as the text classification system performance metrics and we find that F-measure is more difficult to forecast. We conduct experiments on seven text classification datasets in different semantic domains with different characteristics and with three different base machine learning algorithms. We find that forecasting is easiest for decision tree learning, moderate for Support Vector Machines, and most difficult for neural networks.
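The idea of forecasting from a logarithmic model can be illustrated with a minimal sketch. The abstract does not give the exact functional form, so the model `score = a + b * ln(n)`, the function names `fit_log_curve` and `forecast`, and the learning-curve points below are all assumptions made for illustration, not the authors' method or data.

```python
import math

def fit_log_curve(sizes, scores):
    """Ordinary least squares fit of score = a + b * ln(size) (assumed form)."""
    xs = [math.log(n) for n in sizes]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(scores) / len(scores)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

def forecast(a, b, size):
    """Extrapolate the fitted curve to a larger training-set size."""
    return a + b * math.log(size)

# Hypothetical early learning-curve points: (labelled examples, accuracy).
sizes = [100, 200, 400, 800]
scores = [0.60, 0.65, 0.70, 0.75]

a, b = fit_log_curve(sizes, scores)
predicted = forecast(a, b, 3200)
print(f"forecast accuracy at 3200 examples: {predicted:.3f}")
```

Fitting on fewer early points gives an earlier but typically noisier forecast, which is the tension the abstract describes.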


A meta-algorithm for classification using random recursive tree ensembles: A high energy physics application

arXiv.org Machine Learning

The aim of this work is to propose a meta-algorithm for automatic classification in the presence of discrete binary classes. Classifier learning in the presence of overlapping class distributions is a challenging problem in machine learning. Overlapping classes are described by the presence of ambiguous areas in the feature space with a high density of points belonging to both classes. This often occurs in real-world datasets; one such example is numeric data denoting properties of particle decays derived from high-energy accelerators like the Large Hadron Collider (LHC). A significant body of research targeting the class overlap problem uses ensemble classifiers to boost the performance of algorithms, either by applying them iteratively in multiple stages or by training multiple copies of the same model on different subsets of the input training data. The former is called boosting and the latter is called bagging. The algorithm proposed in this thesis targets a challenging classification problem in high energy physics: improving the statistical significance of the Higgs discovery. The underlying dataset used to train the algorithm is experimental data built from the official ATLAS full-detector simulation, with Higgs events (signal) mixed with different background events (background) that closely mimic the statistical properties of the signal, generating class overlap. The algorithm proposed is a variant of the classical boosted decision tree, which is known to be one of the most successful analysis techniques in experimental physics. The algorithm utilizes a unified framework that combines two meta-learning techniques, bagging and boosting. The results show that this combination only works in the presence of a randomization trick in the base learners.


From local explanations to global understanding with explainable AI for trees

#artificialintelligence

Tree-based machine learning models such as random forests, decision trees and gradient boosted trees are popular nonlinear predictive models, yet comparatively little attention has been paid to explaining their predictions. Here we improve the interpretability of tree-based models through three main contributions. We apply these tools to three medical machine learning problems and show how combining many high-quality local explanations allows us to represent global structure while retaining local faithfulness to the original model. These tools enable us to (1) identify high-magnitude but low-frequency nonlinear mortality risk factors in the US population, (2) highlight distinct population subgroups with shared risk characteristics, (3) identify nonlinear interaction effects among risk factors for chronic kidney disease and (4) monitor a machine learning model deployed in a hospital by identifying which features are degrading the model's performance over time. Given the popularity of tree-based machine learning models, these improvements to their interpretability have implications across a broad set of domains.


A survey on Machine Learning-based Performance Improvement of Wireless Networks: PHY, MAC and Network layer

arXiv.org Machine Learning

This paper provides a systematic and comprehensive survey that reviews the latest research efforts focused on machine learning (ML) based performance improvement of wireless networks, while considering all layers of the protocol stack (PHY, MAC and network). First, the related work and paper contributions are discussed, followed by providing the necessary background on data-driven approaches and machine learning for non-machine learning experts to understand all discussed techniques. Then, a comprehensive review is presented on works employing ML-based approaches to optimize the wireless communication parameters settings to achieve improved network quality-of-service (QoS) and quality-of-experience (QoE). We first categorize these works into: radio analysis, MAC analysis and network prediction approaches, followed by subcategories within each. Finally, open challenges and broader perspectives are discussed.


Decision Tree Algorithm, Explained - KDnuggets

#artificialintelligence

In machine learning, classification is a two-step process: a learning step and a prediction step. In the learning step, the model is developed based on given training data. In the prediction step, the model is used to predict the response for given data. The decision tree is one of the most popular classification algorithms and one of the easiest to understand and interpret. The decision tree algorithm belongs to the family of supervised learning algorithms.
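The two steps can be sketched with a one-level decision tree (a "decision stump"). This is a minimal illustration, not the article's code: the functions `learn_stump` and `predict`, the choice of a single feature, and the toy rows are all hypothetical.

```python
from collections import Counter

def learn_stump(rows, labels, feature):
    """Learning step: map each value of one feature to its majority class."""
    by_value = {}
    for row, label in zip(rows, labels):
        by_value.setdefault(row[feature], []).append(label)
    rules = {v: Counter(ls).most_common(1)[0][0] for v, ls in by_value.items()}
    # Fall back to the overall majority class for feature values never seen.
    default = Counter(labels).most_common(1)[0][0]
    return {"feature": feature, "rules": rules, "default": default}

def predict(model, row):
    """Prediction step: look up the class for the row's feature value."""
    return model["rules"].get(row[model["feature"]], model["default"])

# Hypothetical training data.
rows = [
    {"Humidity": "High"}, {"Humidity": "High"}, {"Humidity": "High"},
    {"Humidity": "Normal"}, {"Humidity": "Normal"},
]
labels = ["No", "No", "Yes", "Yes", "Yes"]

model = learn_stump(rows, labels, "Humidity")
print(predict(model, {"Humidity": "High"}))
print(predict(model, {"Humidity": "Normal"}))
```

A full decision tree repeats the learning step recursively on each group of rows, choosing a feature to split on at every node.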


Cyber Attack Detection thanks to Machine Learning Algorithms

arXiv.org Machine Learning

Cybersecurity attacks are growing both in frequency and sophistication over the years. This increasing sophistication and complexity call for more advancement and continuous innovation in defensive strategies. Traditional methods of intrusion detection and deep packet inspection, while still largely used and recommended, are no longer sufficient to meet the demands of growing security threats. As computing power increases and cost drops, Machine Learning is seen as an alternative method or an additional mechanism to defend against malware, botnets, and other attacks. This paper explores Machine Learning as a viable solution by examining its capabilities to classify malicious traffic in a network. First, a strong data analysis is performed, resulting in 22 extracted features from the initial NetFlow datasets. All these features are then compared with one another through a feature selection process. Then, our approach analyzes five different machine learning algorithms against a NetFlow dataset containing common botnets. The Random Forest Classifier succeeds in detecting more than 95% of the botnets in 8 out of 13 scenarios and more than 55% in the most difficult datasets. Finally, insight is given to improve and generalize the results, especially through a bootstrapping technique.


Extracting more from boosted decision trees: A high energy physics case study

arXiv.org Machine Learning

Particle identification is one of the core tasks in the data analysis pipeline at the Large Hadron Collider (LHC). Statistically, this entails the identification of rare signal events buried in immense backgrounds that mimic the properties of the former. In machine learning parlance, particle identification represents a classification problem characterized by overlapping and imbalanced classes. Boosted decision trees (BDTs) have had tremendous success in the particle identification domain but more recently have been overshadowed by deep learning (DNNs) approaches. This work proposes an algorithm to extract more out of standard boosted decision trees by targeting their main weakness, susceptibility to overfitting. This novel construction harnesses the meta-learning techniques of boosting and bagging simultaneously and performs remarkably well on the ATLAS Higgs (H) to tau-tau data set (ATLAS et al., 2014) which was the subject of the 2014 Higgs ML Challenge (Adam-Bourdarios et al., 2015). While the decay of Higgs to a pair of tau leptons was established in 2018 (CMS collaboration et al., 2017) at the 4.9$\sigma$ significance based on the 2016 data taking period, the 2014 public data set continues to serve as a benchmark data set to test the performance of supervised classification schemes. We show that the score achieved by the proposed algorithm is very close to the published winning score which leverages an ensemble of deep neural networks (DNNs). Although this paper focuses on a single application, it is expected that this simple and robust technique will find wider applications in high energy physics.


Coronary Artery Disease Diagnosis; Ranking the Significant Features Using Random Trees Model

arXiv.org Machine Learning

Since data collection and analysis are difficult, time-consuming and costly, we are always looking for ways to make optimal use of data in reaching correct decisions for the diagnosis and testing of diseases in healthcare organizations [3]. In addition, common methods such as angiography [5,6] for testing and diagnosing diseases are costly and have adverse effects for patients, so healthcare researchers are trying to use methods that avoid the high cost and adverse effects of previous methods; this can be done with computer-aided disease diagnosis methods, that is, machine learning. The data mining process, which draws on machine learning and database management knowledge [1], has become a robust tool for the analysis and management of health industry data, ultimately leading to knowledge extraction. It should be noted that, with the progress of technology in healthcare, especially healthcare industry 4.0, human life has progressed and become more comfortable [7]. In this new generation, with the development of new medical devices, equipment and tools, new knowledge can be gained in the field of disease diagnosis.


Machine learning for total cloud cover prediction

arXiv.org Machine Learning

Accurate and reliable forecasting of total cloud cover (TCC) is vital for many areas such as astronomy, energy demand and production, or agriculture. Most meteorological centres issue ensemble forecasts of TCC; however, these forecasts are often uncalibrated and exhibit worse forecast skill than ensemble forecasts of other weather variables. Hence, some form of post-processing is strongly required to improve predictive performance. As TCC observations are usually reported on a discrete scale taking just nine different values called oktas, statistical calibration of TCC ensemble forecasts can be considered a classification problem with outputs given by the probabilities of the oktas. This is a classical area where machine learning methods are applied. We investigate the performance of post-processing using multilayer perceptron (MLP) neural networks, gradient boosting machines (GBM) and random forest (RF) methods. Based on the European Centre for Medium-Range Weather Forecasts global TCC ensemble forecasts for 2002-2014 we compare these approaches with the proportional odds logistic regression (POLR) and multiclass logistic regression (MLR) models, as well as the raw TCC ensemble forecasts. We further assess whether improvements in forecast skill can be obtained by incorporating ensemble forecasts of precipitation as an additional predictor. Compared to the raw ensemble, all calibration methods result in a significant improvement in forecast skill. RF models provide the smallest increase in predictive performance, while MLP, POLR and GBM approaches perform best.

Key words: ensemble calibration; gradient boosting machine; logistic regression; multilayer perceptron; random forest; total cloud cover

1 Introduction

Reliable and accurate prediction of total cloud cover (TCC) has a principal importance in observational astronomy (Ye and Chen, 2013) and in the prediction of photovoltaic energy production, as it is the main cause of variation in solar-radiation energy supply (Matuszko, 2012; McEvoy et al., 2012), but it is also of great relevance in agriculture, tourism and in some other fields of economy.


Understanding Decision Tree Classification with Scikit-Learn

#artificialintelligence

Gini Impurity is named after the Italian statistician Corrado Gini. Gini impurity can be understood as a criterion to minimize the probability of misclassification. To understand the definition (as shown in the figure) and exactly how we can build up a decision tree, let's get started with a very simple data-set, where depending on various weather conditions, we decide whether to play an outdoor game or not. From the definition, a data-set containing only one class will have 0 Gini Impurity. In building up the decision tree, our idea is to choose the feature with the least Gini Impurity as the root node, and so on for the nodes below it. Let's get started with the simple data-set. Here we see that, depending on 4 features (Outlook, Temperature, Humidity, Wind), a decision is made on whether to play tennis or not.
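The root-node choice described above can be sketched in a few lines of plain Python, using the standard definition of Gini impurity, 1 minus the sum of squared class proportions. The helper names (`gini_impurity`, `weighted_gini`, `best_root_feature`) and the six toy rows are hypothetical; they are not the article's code or its data-set, and only two of the four features are included to keep the sketch short.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity 1 - sum(p_k^2) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def weighted_gini(rows, feature, target):
    """Size-weighted Gini impurity of the groups produced by one feature."""
    groups = {}
    for row in rows:
        groups.setdefault(row[feature], []).append(row[target])
    total = len(rows)
    return sum(len(g) / total * gini_impurity(g) for g in groups.values())

def best_root_feature(rows, features, target):
    """Pick the feature whose split has the lowest weighted Gini impurity."""
    return min(features, key=lambda f: weighted_gini(rows, f, target))

# Hypothetical toy rows (assumed for illustration only).
rows = [
    {"Outlook": "Sunny", "Wind": "Weak", "Play": "No"},
    {"Outlook": "Sunny", "Wind": "Strong", "Play": "No"},
    {"Outlook": "Overcast", "Wind": "Weak", "Play": "Yes"},
    {"Outlook": "Overcast", "Wind": "Strong", "Play": "Yes"},
    {"Outlook": "Rain", "Wind": "Weak", "Play": "Yes"},
    {"Outlook": "Rain", "Wind": "Strong", "Play": "No"},
]

for feature in ("Outlook", "Wind"):
    print(feature, round(weighted_gini(rows, feature, "Play"), 3))
print("root:", best_root_feature(rows, ["Outlook", "Wind"], "Play"))
```

On these toy rows, splitting on Outlook leaves two pure groups and one mixed group, so it scores lower than Wind and would be chosen as the root.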