Goto

Collaborating Authors

 Decision Tree Learning


Classifying the Unstructured IT Service Desk Tickets Using Ensemble of Classifiers

arXiv.org Artificial Intelligence

Manual classification of IT service desk tickets may result in routing of the tickets to the wrong resolution group. Incorrect assignment of IT service desk tickets leads to reassignment of tickets, unnecessary resource utilization and delays the resolution time. Traditional machine learning algorithms can be used to automatically classify the IT service desk tickets. Service desk ticket classifier models can be trained by mining the historical unstructured ticket description and the corresponding label. The model can then be used to classify the new service desk ticket based on the ticket description. The performance of the traditional classifier systems can be further improved by using various ensemble of classification techniques. This paper brings out the three most popular ensemble methods ie, Bagging, Boosting and Voting ensemble for combining the predictions from different models to further improve the accuracy of the ticket classifier system. The performance of the ensemble classifier system is checked against the individual base classifiers using various performance metrics. Ensemble of classifiers performed well in comparison with the corresponding base classifiers. The advantages of building such an automated ticket classifier systems are simplified user interface, faster resolution time, improved productivity, customer satisfaction and growth in business. The real world service desk ticket data from a large enterprise IT infrastructure is used for our research purpose.


Trees, Forests, Chickens, and Eggs: When and Why to Prune Trees in a Random Forest

arXiv.org Machine Learning

Due to their long-standing reputation as excellent off-the-shelf predictors, random forests continue remain a go-to model of choice for applied statisticians and data scientists. Despite their widespread use, however, until recently, little was known about their inner-workings and about which aspects of the procedure were driving their success. Very recently, two competing hypotheses have emerged -- one based on interpolation and the other based on regularization. This work argues in favor of the latter by utilizing the regularization framework to reexamine the decades-old question of whether individual trees in an ensemble ought to be pruned. Despite the fact that default constructions of random forests use near full depth trees in most popular software packages, here we provide strong evidence that tree depth should be seen as a natural form of regularization across the entire procedure. In particular, our work suggests that random forests with shallow trees are advantageous when the signal-to-noise ratio in the data is low. In building up this argument, we also critique the newly popular notion of "double descent" in random forests by drawing parallels to U-statistics and arguing that the noticeable jumps in random forest accuracy are the result of simple averaging rather than interpolation.


How to select Best Split in Decision Trees using Chi-Square

#artificialintelligence

Let's see how we can calculate the expected values. If you recall this is how the split on "Performance in class" looks like- There is a total of 20 students and out of those 10 play cricket and 10 do not. So, of course, the percent of students who do play cricket will be 50%. Now if we consider the "Above average" node here, there are 14 students in it, as the percentage of students who play cricket is 50% in the parent node as we discussed, the expected number of students who play cricket will of course be 7 and if you look at the actual value it is 8. So now we have both the values expected values and actual values.


Building a Rock Paper Scissors AI

#artificialintelligence

In this article, I'll walk you through my process of building a full stack python Flask artificial intelligence project capable of beating the human user over 60% of the time using a custom scoring system to ensemble six models (naรฏve logic-based, decision tree, neural network) trained on both game-level and stored historical data in AWS RDS Cloud SQL database. Rock Paper Scissors caught my attention for an AI project because, on the surface, it seems impossible to get an edge in the game. These days, it is easy to assume that a computer can beat you in chess, because it can harness all of its computing power to see all possible outcomes and choose the ones that benefit it. Rock Paper Scissors, on the other hand, is commonly used in place of a coin toss to solve disputes because the winner seems random. My theory though, was that humans can't actually make random decisions, and that if an AI could learn to understand the ways in which humans make their choices over the course of a series of matches, even if the human was trying to behave randomly, then the AI would be able to significantly exceed 33% accuracy in guessing the player's decisions.


Mining GIS Data to Predict Urban Sprawl

arXiv.org Artificial Intelligence

This paper addresses the interesting problem of processing and analyzing data in geographic information systems (GIS) to achieve a clear perspective on urban sprawl. The term urban sprawl refers to overgrowth and expansion of low-density areas with issues such as car dependency and segregation between residential versus commercial use. Sprawl has impacts on the environment and public health. In our work, spatiotemporal features related to real GIS data on urban sprawl such as population growth and demographics are mined to discover knowledge for decision support. We adapt data mining algorithms, Apriori for association rule mining and J4.8 for decision tree classification to geospatial analysis, deploying the ArcGIS tool for mapping. Knowledge discovered by mining this spatiotemporal data is used to implement a prototype spatial decision support system (SDSS). This SDSS predicts whether urban sprawl is likely to occur. Further, it estimates the values of pertinent variables to understand how the variables impact each other. The SDSS can help decision-makers identify problems and create solutions for avoiding future sprawl occurrence and conducting urban planning where sprawl already occurs, thus aiding sustainable development. This work falls in the broad realm of geospatial intelligence and sets the stage for designing a large scale SDSS to process big data in complex environments, which constitutes part of our future work.


Interpretable Machine Learning: Fundamental Principles and 10 Grand Challenges

arXiv.org Machine Learning

Interpretability in machine learning (ML) is crucial for high stakes decisions and troubleshooting. In this work, we provide fundamental principles for interpretable ML, and dispel common misunderstandings that dilute the importance of this crucial topic. We also identify 10 technical challenge areas in interpretable machine learning and provide history and background on each problem. Some of these problems are classically important, and some are recent problems that have arisen in the last few years. These problems are: (1) Optimizing sparse logical models such as decision trees; (2) Optimization of scoring systems; (3) Placing constraints into generalized additive models to encourage sparsity and better interpretability; (4) Modern case-based reasoning, including neural networks and matching for causal inference; (5) Complete supervised disentanglement of neural networks; (6) Complete or even partial unsupervised disentanglement of neural networks; (7) Dimensionality reduction for data visualization; (8) Machine learning models that can incorporate physics and other generative or causal constraints; (9) Characterization of the "Rashomon set" of good models; and (10) Interpretable reinforcement learning. This survey is suitable as a starting point for statisticians and computer scientists interested in working in interpretable machine learning.


Decision Trees and Random Forest

#artificialintelligence

A very popular algorithm, in Machine Learning, is the Decision Tree Classifier. In this article, the Banknote dataset will be used to illustrate the capabilities of this model. A decision tree is a basic machine learning algorithm that can be used for classification problems. From a high level, a decision tree starts with a basic statement at the top of the tree, and then based on if that statement is True or False, it will then move down a different path to the next condition. This will then continue throughout the duration of the model.


Fake News Detection Using Python

#artificialintelligence

This is my first data analysis related video. In this video, I have solved the Fake news detection problem using four machine learning classification algorithms. Hi everyone, This is my first data analysis related video. In this video, I have solved the Fake news detection problem using four machine learning classification algorithms. From this video, you will learn how you can apply Linear regression, Decision Tree classification, Gradient boost classification, and random forest classification model.


Decision Theoretic Bootstrapping

arXiv.org Machine Learning

The design and testing of supervised machine learning models combine two fundamental distributions: (1) the training data distribution (2) the testing data distribution. Although these two distributions are identical and identifiable when the data set is infinite; they are imperfectly known (and possibly distinct) when the data is finite (and possibly corrupted) and this uncertainty must be taken into account for robust Uncertainty Quantification (UQ). We present a general decision-theoretic bootstrapping solution to this problem: (1) partition the available data into a training subset and a UQ subset (2) take $m$ subsampled subsets of the training set and train $m$ models (3) partition the UQ set into $n$ sorted subsets and take a random fraction of them to define $n$ corresponding empirical distributions $\mu_{j}$ (4) consider the adversarial game where Player I selects a model $i\in\left\{ 1,\ldots,m\right\} $, Player II selects the UQ distribution $\mu_{j}$ and Player I receives a loss defined by evaluating the model $i$ against data points sampled from $\mu_{j}$ (5) identify optimal mixed strategies (probability distributions over models and UQ distributions) for both players. These randomized optimal mixed strategies provide optimal model mixtures and UQ estimates given the adversarial uncertainty of the training and testing distributions represented by the game. The proposed approach provides (1) some degree of robustness to distributional shift in both the distribution of training data and that of the testing data (2) conditional probability distributions on the output space forming aleatory representations of the uncertainty on the output as a function of the input variable.


High number of unique values and tree based models

#artificialintelligence

Having columns of data with high cardinality can adversely affect the performance of your models. The idea of this article stemmed from my personal experience of employing tree based solutions in various projects. In this article I will attempt to show the effects of this on a couple of datasets using the simple decision tree. Cardinality can be defined as the uniqueness of data in the machine learning context. Examples of fields with a high number of unique values include cities, countries, medical diagnosis codes, movie categories on Netflix, flavours of ice cream, etc.