Goto

Collaborating Authors

 Decision Tree Learning


8 Tactics to Combat Imbalanced Classes in Your Machine Learning Dataset

#artificialintelligence

Has this happened to you? You are working on your dataset. You create a classification model and get 90% accuracy immediately. You dive a little deeper and discover that 90% of the data belongs to one class. This is an example of an imbalanced dataset and the frustrating results it can cause.


On the usage of the probability integral transform to reduce the complexity of multi-way fuzzy decision trees in Big Data classification problems

arXiv.org Machine Learning

We present a new distributed fuzzy partitioning method to reduce the complexity of multi-way fuzzy decision trees in Big Data classification problems. The proposed algorithm builds a fixed number of fuzzy sets for all variables and adjusts their shape and position to the real distribution of training data. A two-step process is applied : 1) transformation of the original distribution into a standard uniform distribution by means of the probability integral transform. Since the original distribution is generally unknown, the cumulative distribution function is approximated by computing the q-quantiles of the training set; 2) construction of a Ruspini strong fuzzy partition in the transformed attribute space using a fixed number of equally distributed triangular membership functions. Despite the aforementioned transformation, the definition of every fuzzy set in the original space can be recovered by applying the inverse cumulative distribution function (also known as quantile function). The experimental results reveal that the proposed methodology allows the state-of-the-art multi-way fuzzy decision tree (FMDT) induction algorithm to maintain classification accuracy with up to 6 million fewer leaves.


Improving fraud prediction with incremental data balancing technique for massive data streams

arXiv.org Machine Learning

The performance of classification algorithms with a massive and highly imbalanced data stream depends upon efficient balancing strategy. Some techniques of balancing strategy have been applied in the past with Batch data to resolve the class imbalance problem. This paper proposes a new incremental data balancing framework which can work with massive imbalanced data streams. In this paper, we choose Racing Algorithm as an automated data balancing technique which optimizes the balancing techniques. We applied Random Forest classification algorithm which can deal with the massive data stream. We investigated the suitability of Racing Algorithm and Random Forest in the proposed framework. Applying new technique in the proposed framework on the European Credit Card dataset, provided better results than the Batch mode. The proposed framework is more scalable to handle online massive data streams.


Robust Decision Trees Against Adversarial Examples

arXiv.org Machine Learning

Although adversarial examples and model robustness have been extensively studied in the context of linear models and neural networks, research on this issue in tree-based models and how to make tree-based models robust against adversarial examples is still limited. In this paper, we show that tree based models are also vulnerable to adversarial examples and develop a novel algorithm to learn robust trees. At its core, our method aims to optimize the performance under the worst-case perturbation of input features, which leads to a max-min saddle point problem. Incorporating this saddle point objective into the decision tree building procedure is non-trivial due to the discrete nature of trees --- a naive approach to finding the best split according to this saddle point objective will take exponential time. To make our approach practical and scalable, we propose efficient tree building algorithms by approximating the inner minimizer in this saddle point problem, and present efficient implementations for classical information gain based trees as well as state-of-the-art tree boosting models such as XGBoost. Experimental results on real world datasets demonstrate that the proposed algorithms can substantially improve the robustness of tree-based models against adversarial examples.


Neural Packet Classification

arXiv.org Artificial Intelligence

Packet classification is a fundamental problem in computer networking. This problem exposes a hard tradeoff between the computation and state complexity, which makes it particularly challenging. To navigate this tradeoff, existing solutions rely on complex hand-tuned heuristics, which are brittle and hard to optimize. In this paper, we propose a deep reinforcement learning (RL) approach to solve the packet classification problem. There are several characteristics that make this problem a good fit for Deep RL. First, many of the existing solutions are iteratively building a decision tree by splitting nodes in the tree. Second, the effects of these actions (e.g., splitting nodes) can only be evaluated once we are done with building the tree. These two characteristics are naturally captured by the ability of RL to take actions that have sparse and delayed rewards. Third, it is computationally efficient to generate data traces and evaluate decision trees, which alleviate the notoriously high sample complexity problem of Deep RL algorithms. Our solution, NeuroCuts, uses succinct representations to encode state and action space, and efficiently explore candidate decision trees to optimize for a global objective. It produces compact decision trees optimized for a specific set of rules and a given performance metric, such as classification time, memory footprint, or a combination of the two. Evaluation on ClassBench shows that NeuroCuts outperforms existing hand-crafted algorithms in classification time by 18% at the median, and reduces both time and memory footprint by up to 3x.


Entity Personalized Talent Search Models with Tree Interaction Features

arXiv.org Artificial Intelligence

Talent Search systems aim to recommend potential candidates who are a good match to the hiring needs of a recruiter expressed in terms of the recruiter's search query or job posting. Past work in this domain has focused on linear and nonlinear models which lack preference personalization in the user-level due to being trained only with globally collected recruiter activity data. In this paper, we propose an entity-personalized Talent Search model which utilizes a combination of generalized linear mixed (GLMix) models and gradient boosted decision tree (GBDT) models, and provides personalized talent recommendations using nonlinear tree interaction features generated by the GBDT. We also present the offline and online system architecture for the productionization of this hybrid model approach in our Talent Search systems. Finally, we provide offline and online experiment results benchmarking our entity-personalized model with tree interaction features, which demonstrate significant improvements in our precision metrics compared to globally trained non-personalized models.



A Quick look at ML in algorithmic trading strategies Packt Hub

#artificialintelligence

Algorithmic trading relies on computer programs that execute algorithms to automate some, or all, elements of a trading strategy. Algorithms are a sequence of steps or rules to achieve a goal and can take many forms. In the case of machine learning (ML), algorithms pursue the objective of learning other algorithms, namely rules, to achieve a target based on data, such as minimizing a prediction error. In this article, we have a look at use cases of ML and how it is used in algorithmic trading strategies. These algorithms encode various activities of a portfolio manager who observes market transactions and analyzes relevant data to decide on placing buy or sell orders.


Diversity of Ensembles for Data Stream Classification

arXiv.org Machine Learning

When constructing a classifier ensemble, diversity among the base classifiers is one of the important characteristics. Several studies have been made in the context of standard static data, in particular, when analyzing the relationship between a high ensemble predictive performance and the diversity of its components. Besides, ensembles of learning machines have been performed to learn in the presence of concept drift and adapt to it. However, diversity measures have not received much research interest in evolving data streams. Only a few researchers directly consider promoting diversity while constructing an ensemble or rebuilding them in the moment of detecting drifts. In this paper, we present a theoretical analysis of different diversity measures and relate them to the success of ensemble learning algorithms for streaming data. The analysis provides a deeper understanding of the concept of diversity and its impact on online ensemble Learning in the presence of concept drift. More precisely, we are interested in answering the following research question; Which commonly used diversity measures are used in the context of static-data ensembles and how far are they applicable in the context of streaming data ensembles?


Random Forest Algorithm in Machine Learning

#artificialintelligence

Random forest algorithm is a one of the most popular and most powerful supervised Machine Learning algorithm in Machine Learning that is capable of performing both regression and classification tasks. As the name suggest, this algorithm creates the forest with a number of decision trees. Random Forest Algorithm in Machine Learning: Machine learning is a scientific discipline that explores the construction and study of algorithms that can learn from data. Such algorithms operate by building a model from example inputs and using that to make predictions or decisions, rather than following strictly static program instructions. Machine learning is closely related to and often overlaps with computational statistics; a discipline that also specializes in prediction-making.