
RADE: Resource-Efficient Supervised Anomaly Detection Using Decision Tree-Based Ensemble Methods

arXiv.org Machine Learning

Decision-tree-based ensemble classification methods (DTEMs) are a prevalent tool for supervised anomaly detection. However, as datasets continue to grow, DTEMs suffer from increasing drawbacks: larger memory footprints, longer training times, and higher classification latency at lower throughput. In this paper, we design and evaluate RADE, a DTEM-based anomaly detection framework that augments standard DTEM classifiers and alleviates these drawbacks by relying on two observations: (1) a small (coarse-grained) DTEM model is sufficient to classify the majority of classification queries correctly, provided a classification is accepted only when its confidence level is greater than or equal to a predetermined classification confidence threshold; (2) in the fewer, harder cases where the coarse-grained DTEM model yields insufficient classification confidence, we can improve the result by forwarding the classification query to one of several expert (fine-grained) DTEM models, each explicitly trained for that particular case. We implement RADE in Python based on scikit-learn and evaluate it over different DTEM methods (RF, XGBoost, AdaBoost, GBDT, and LightGBM) and over three publicly available datasets. Our evaluation over both a strong AWS EC2 instance and a Raspberry Pi 3 device indicates that RADE offers competitive and often superior anomaly detection capabilities compared to standard DTEM methods, while significantly improving memory footprint (by up to 5.46x), training time (by up to 17.2x), and classification latency (by up to 31.2x). A minimal sketch of the coarse/expert cascade follows.
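
The sketch below illustrates the cascade idea described in the abstract; it is not the authors' RADE implementation. The class name, the confidence threshold value, the model sizes, and the per-predicted-class expert routing are all illustrative assumptions.

```python
# A minimal sketch of a coarse/expert DTEM cascade, assuming a
# per-predicted-class expert routing rule (NOT the authors' exact design).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

class CoarseExpertCascade:  # hypothetical name, for illustration only
    def __init__(self, confidence_threshold=0.9):
        self.threshold = confidence_threshold
        # Small (coarse-grained) model: few, shallow trees.
        self.coarse = RandomForestClassifier(n_estimators=10, max_depth=6)
        # One expert (fine-grained) model per coarse-predicted class.
        self.experts = {}

    def fit(self, X, y):
        self.coarse.fit(X, y)
        proba = self.coarse.predict_proba(X)
        low_conf = proba.max(axis=1) < self.threshold
        # Train each expert on the low-confidence region it will serve.
        for cls_idx in range(len(self.coarse.classes_)):
            mask = low_conf & (proba.argmax(axis=1) == cls_idx)
            if mask.sum() > 1 and len(np.unique(y[mask])) > 1:
                expert = RandomForestClassifier(n_estimators=100)
                expert.fit(X[mask], y[mask])
                self.experts[cls_idx] = expert
        return self

    def predict(self, X):
        proba = self.coarse.predict_proba(X)
        pred = self.coarse.classes_[proba.argmax(axis=1)]
        low_conf = proba.max(axis=1) < self.threshold
        # Forward only the low-confidence queries to the matching expert.
        for i in np.where(low_conf)[0]:
            expert = self.experts.get(proba[i].argmax())
            if expert is not None:
                pred[i] = expert.predict(X[i:i + 1])[0]
        return pred
```

Because most queries stop at the small coarse model, memory, training time, and latency savings of the kind the abstract reports are plausible under this design, with only the hard residue paying for the larger expert models.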


Stacked Generalizations in Imbalanced Fraud Data Sets using Resampling Methods

arXiv.org Machine Learning

This study uses stacked generalization, a two-step process for combining machine learning methods. In step one, the individual algorithms are trained and each one's error rate on the learning set is minimized to reduce its bias; in step two, their results are fed into a meta (or super) learner, whose blended, stacked output demonstrates improved performance, with even the weakest algorithms learning better. The method is essentially an enhanced cross-validation strategy. Although the process demands considerable computational resources, the resulting performance metrics on resampled fraud data show that the increased system cost can be justified. A fundamental property of fraud data is that it is inherently unsystematic, and the optimal resampling methodology has yet to be identified. Building a test harness that accounts for all permutations of algorithm-sample-set pairs ensures that the complex, intrinsic data structures are thoroughly tested. A comparative analysis applying stacked generalization to fraud data provides the insight needed to find the optimal mathematical formulation for imbalanced fraud data sets. A minimal sketch of the two-step pipeline appears below.
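
The following sketch shows stacked generalization on a resampled, imbalanced data set using scikit-learn and imbalanced-learn. The synthetic data, the choice of base learners and meta learner, and the use of SMOTE for resampling are illustrative assumptions, not the study's exact test harness.

```python
# A minimal sketch of stacked generalization on resampled fraud-like data,
# assuming scikit-learn and imbalanced-learn are installed.
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced "fraud" data: roughly 2% positive class.
X, y = make_classification(n_samples=5000, weights=[0.98], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

# Step one: resample the learning set so base learners see balanced classes.
X_res, y_res = SMOTE(random_state=0).fit_resample(X_train, y_train)

# Step two: the base learners' out-of-fold predictions feed the meta learner.
stack = StackingClassifier(
    estimators=[("tree", DecisionTreeClassifier(max_depth=5)),
                ("rf", RandomForestClassifier(n_estimators=100))],
    final_estimator=LogisticRegression(),  # the meta (super) learner
    cv=5,  # the enhanced cross-validation the abstract refers to
)
stack.fit(X_res, y_res)
print(stack.score(X_test, y_test))
```

Note that `StackingClassifier` trains the meta learner on cross-validated predictions of the base learners, which is exactly why the abstract characterizes the method as an enhanced cross-validation strategy.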


Comparing Different Classification Machine Learning Models for an imbalanced dataset

#artificialintelligence

A data set is called imbalanced if it contains many more samples from one class than from the rest. Data sets are imbalanced when at least one class is represented by only a small number of training examples (the minority class) while the other classes make up the majority. In this scenario, classifiers can achieve good accuracy on the majority class but very poor accuracy on the minority class(es) because of the influence of the larger majority class. A common example is credit card fraud detection, where data points labeled fraud (1) are usually far fewer than those labeled non-fraud (0). There are many reasons a dataset might be imbalanced: the category being targeted might be very rare in the population, or the data might simply be difficult to collect. Let's tackle the problem by working on one such dataset, as sketched below.
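
The sketch below shows one common remedy, cost-sensitive learning via class weights, on a synthetic fraud-like data set; the data and parameter values are illustrative assumptions rather than the article's actual dataset.

```python
# A minimal sketch of handling class imbalance with class weights,
# on synthetic data mimicking credit card fraud (~1% positives).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=10000, weights=[0.99], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# class_weight="balanced" penalizes mistakes on the rare class more
# heavily, instead of resampling the data itself.
clf = RandomForestClassifier(n_estimators=200, class_weight="balanced")
clf.fit(X_tr, y_tr)

# Overall accuracy is misleading on imbalanced data; per-class
# precision and recall expose how the minority class actually fares.
print(classification_report(y_te, clf.predict(X_te)))
```

Resampling (oversampling the minority class or undersampling the majority) is the other standard family of remedies; which works best is dataset-dependent, which is why the article advocates experimenting on a concrete dataset.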


On Education Decision Trees, Random Forests, AdaBoost & XGBoost in Python - all courses

#artificialintelligence

What you will learn:
- Get a solid understanding of decision trees
- Understand the business scenarios where decision trees are applicable
- Tune a machine learning model's hyperparameters and evaluate its performance
- Use Pandas DataFrames to manipulate data and make statistical computations
- Use decision trees to make predictions
- Learn the advantages and disadvantages of the different algorithms

Students will need to install Python and Anaconda software, but we have a separate lecture to help you install them. You're looking for a complete decision tree course that teaches you everything you need to create a Decision Tree / Random Forest / XGBoost model in Python, right? You've found the right Decision Trees and tree-based advanced techniques course! After completing this course you will be able to identify the business problems which can be solved using Decision Tree / Random Forest / XGBoost machine learning models.


Understanding Boosted Trees Models

#artificialintelligence

In the previous post, we learned about tree-based learning methods: the basics of tree-based models and the use of bagging to reduce variance. We also looked at one of the most famous learning algorithms based on the idea of bagging: random forests. In this post, we will look into the details of yet another type of tree-based learning algorithm: boosted trees. Boosting, like bagging, is a general class of learning algorithms in which a set of weak learners is combined to produce a strong learner. For classification problems, a weak learner is defined to be a classifier that is only slightly correlated with the true classification (it can label examples better than random guessing). In contrast, a strong learner is a classifier that is arbitrarily well-correlated with the true classification. The short sketch below contrasts the two approaches.
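
The following sketch contrasts bagging and boosting over the same weak learner, a depth-1 "decision stump". The data and parameter values are illustrative assumptions, and the `estimator=` keyword assumes scikit-learn 1.2 or newer.

```python
# A minimal sketch contrasting bagging and boosting of decision stumps,
# assuming scikit-learn >= 1.2 (where `estimator=` replaced `base_estimator=`).
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, random_state=0)
stump = DecisionTreeClassifier(max_depth=1)  # a weak learner on its own

# Bagging averages independently trained stumps to reduce variance ...
bagged = BaggingClassifier(estimator=stump, n_estimators=200)
# ... while boosting trains stumps sequentially, re-weighting the examples
# the previous stumps got wrong, combining weak learners into a strong one.
boosted = AdaBoostClassifier(estimator=stump, n_estimators=200)

for name, model in [("bagging", bagged), ("boosting", boosted)]:
    print(name, cross_val_score(model, X, y, cv=5).mean())
```

On most datasets the boosted ensemble of stumps will clearly outperform the bagged one, since stumps have high bias and boosting, unlike bagging, reduces bias as well as variance.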