 Decision Tree Learning


2018 World Cup Predictions using decision trees

#artificialintelligence

In this study, we predict the outcome of the football matches in the FIFA World Cup 2018, to be held in Russia this summer. We do this using classification models over a dataset of historic football results that includes attributes of the playing teams, rating them on attack, midfield, defence, aggression, pressure, chance creation and building ability. The final training data was the result of merging international match results with EA Games ratings of the teams, aligning each match with the team statistics from the corresponding point in the timeline. Final predictions show France, Brazil, Spain and Germany as the four countries most likely to reach the semifinals, with Spain as the winner. The objective of this study is to build a predictive model that will allow us to make good predictions for the coming World Cup 2018. For this purpose we looked for a dataset with historic match results and chose one from Kaggle with data on almost 40,000 international matches played between 1872 and 2018.


Anatomy of Online Hate: Developing a Taxonomy and Machine Learning Models for Identifying and Classifying Hate in Online News Media

AAAI Conferences

Online social media platforms generally attempt to mitigate hateful expressions, as these comments can be detrimental to the health of the community. However, automatically identifying hateful comments can be challenging. We manually label 5,143 hateful expressions posted to YouTube and Facebook videos among a dataset of 137,098 comments from an online news media. We then create a granular taxonomy of different types and targets of online hate and train machine learning models to automatically detect and classify the hateful comments in the full dataset. Our contribution is twofold: 1) creating a granular taxonomy for hateful online comments that includes both types and targets of hateful comments, and 2) experimenting with machine learning, including Logistic Regression, Decision Tree, Random Forest, Adaboost, and Linear SVM, to generate a multiclass, multilabel classification model that automatically detects and categorizes hateful comments in the context of online news media. We find that the best performing model is Linear SVM, with an average F1 score of 0.79 using TF-IDF features. We validate the model by testing its predictive ability, and, relatedly, provide insights on distinct types of hate speech taking place on social media.


Instance-Level Explanations for Fraud Detection: A Case Study

arXiv.org Artificial Intelligence

Fraud detection is a difficult problem that can benefit from predictive modeling. However, the verification of a prediction is challenging; for a single insurance policy, the model only provides a prediction score. We present a case study where we reflect on different instance-level model explanation techniques to aid a fraud detection team in their work. To this end, we designed two novel dashboards combining various state-of-the-art explanation techniques. These enable the domain expert to analyze and understand predictions, dramatically speeding up the process of filtering potential fraud cases. Finally, we discuss the lessons learned and outline open research issues.


Deep Neural Decision Trees

arXiv.org Machine Learning

Deep neural networks have been proven powerful at processing perceptual data, such as images and audio. However, for tabular data, tree-based models are more popular. A nice property of tree-based models is their natural interpretability. In this work, we present Deep Neural Decision Trees (DNDT) -- tree models realised by neural networks. A DNDT is intrinsically interpretable, as it is a tree. Yet as it is also a neural network (NN), it can be easily implemented in NN toolkits, and trained with gradient descent rather than greedy splitting. We evaluate DNDT on several tabular datasets, verify its efficacy, and investigate similarities and differences between DNDT and vanilla decision trees. Interestingly, DNDT self-prunes at both the split level and the feature level.


Comparison-Based Random Forests

arXiv.org Machine Learning

Assume we are given a set of items from a general metric space, but we neither have access to the representation of the data nor to the distances between data points. Instead, suppose that we can actively choose a triplet of items (A,B,C) and ask an oracle whether item A is closer to item B or to item C. In this paper, we propose a novel random forest algorithm for regression and classification that relies only on such triplet comparisons. In the theory part of this paper, we establish sufficient conditions for the consistency of such a forest. In a set of comprehensive experiments, we then demonstrate that the proposed random forest is efficient both for classification and regression. In particular, it is even competitive with other methods that have direct access to the metric representation of the data.
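
The triplet setting described above can be illustrated with a minimal sketch. It assumes a hypothetical oracle that hides the coordinates and only answers "is item A closer to B than to C?", and it splits a set of items around two pivot items using nothing but those answers. The names make_oracle and split_by_pivots and the pivot-based split are illustrative assumptions, not necessarily the construction used in the paper.

```python
def make_oracle(hidden_points):
    """Build a triplet oracle. The caller only ever sees yes/no answers,
    never the coordinates or the distances themselves."""
    def sq_dist(i, j):
        return sum((p - q) ** 2 for p, q in zip(hidden_points[i], hidden_points[j]))

    def closer_to_b(a, b, c):
        # True if item a is at least as close to item b as to item c.
        return sq_dist(a, b) <= sq_dist(a, c)

    return closer_to_b


def split_by_pivots(items, pivot_b, pivot_c, oracle):
    """Partition item indices by which of two pivot items they are closer to,
    using only triplet comparisons (assumed split rule, for illustration)."""
    left, right = [], []
    for a in items:
        (left if oracle(a, pivot_b, pivot_c) else right).append(a)
    return left, right


# Hypothetical data: two well-separated groups on a line.
oracle = make_oracle([(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)])
left, right = split_by_pivots([0, 1, 2, 3], pivot_b=0, pivot_c=3, oracle=oracle)
```

Applying such a split recursively to each side would give one tree, and repeating with different random pivots would give a forest.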


Machine learning predicts World Cup winner

#artificialintelligence

The random-forest technique has emerged in recent years as a powerful way to analyze large data sets while avoiding some of the pitfalls of other data-mining methods. It is based on the idea that some future event can be determined by a decision tree in which an outcome is calculated at each branch by reference to a set of training data. However, decision trees suffer from a well-known problem. In the latter stages of the branching process, decisions can become severely distorted by training data that is sparse and prone to huge variation at this kind of resolution, a problem known as overfitting. The random-forest approach is different.
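
The excerpt stops before describing the approach. As a rough sketch of the commonly described idea (many simple trees, each trained on a bootstrap resample and a randomly chosen feature, combined by majority vote), the following uses one-split "stumps" instead of full trees to stay short. The function names, the toy data and the use of stumps are assumptions made for illustration.

```python
import random
from collections import Counter


def fit_stump(rows, labels, feature):
    """Fit a one-split tree on a single feature; labels are 0 or 1."""
    majority = Counter(labels).most_common(1)[0][0]
    # Baseline: always predict the majority class.
    best_errors = sum(label != majority for label in labels)
    best_threshold = None
    values = sorted({row[feature] for row in rows})
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2  # midpoint between neighbouring values
        errors = sum((row[feature] > threshold) != bool(label)
                     for row, label in zip(rows, labels))
        if errors < best_errors:
            best_errors, best_threshold = errors, threshold
    if best_threshold is None:
        return lambda row: majority
    return lambda row: int(row[feature] > best_threshold)


def fit_forest(rows, labels, n_trees=25, seed=0):
    """Train each stump on a bootstrap sample and a randomly chosen feature."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]  # sample with replacement
        feature = rng.randrange(len(rows[0]))
        forest.append(fit_stump([rows[i] for i in idx],
                                [labels[i] for i in idx], feature))
    return forest


def predict_forest(forest, row):
    """Majority vote over all stumps."""
    return Counter(tree(row) for tree in forest).most_common(1)[0][0]


# Hypothetical toy data: class 0 has small values, class 1 has large values.
rows = [(i / 10, i / 10) for i in (0, 1, 2, 3, 7, 8, 9, 10)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
forest = fit_forest(rows, labels)
```

Because each stump sees a different resample, no single noisy subset of the training data decides the final answer, which is the intuition behind the reduced overfitting.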


A Taxonomy and Survey of Intrusion Detection System Design Techniques, Network Threats and Datasets

arXiv.org Artificial Intelligence

With the world moving towards being increasingly dependent on computers and automation, one of the main challenges in the current decade has been to build secure applications, systems and networks. Alongside these challenges, the number of threats is rising exponentially due to the attack surface increasing through numerous interfaces offered for each service. To alleviate the impact of these threats, researchers have proposed numerous solutions; however, current tools often fail to adapt to ever-changing architectures, associated threats and 0-days. This manuscript aims to provide researchers with a taxonomy and survey of current dataset composition and current Intrusion Detection Systems (IDS) capabilities and assets. These taxonomies and surveys aim to improve both the efficiency of IDS and the creation of datasets to build the next generation IDS as well as to reflect network threats more accurately in future datasets. To this end, this manuscript also provides a taxonomy and survey of network threats and associated tools. The manuscript highlights that current IDS only cover 25% of our threat taxonomy, while current datasets demonstrate a clear lack of real-network threats and attack representation, but rather include a large number of deprecated threats, hence limiting the accuracy of current machine learning IDS. Moreover, the taxonomies are open-sourced to allow public contributions through a Github repository.


Machine learning detects lymphedema in breast cancer survivors

#artificialintelligence

A new study led by NYU Rory Meyers College of Nursing shows that machine learning--combined with the collection of real-time symptom reports using an mHealth system--can provide early detection and help patients to receive timely intervention to effectively manage lymphedema. Lymphedema, which has no cure and comes with lifelong risk, is the build-up of lymph fluid that causes swelling in the arms or legs of patients. In the study of 355 women from 45 states who had undergone treatment for breast cancer, the performance of five machine learning algorithms was evaluated--artificial neural network (ANN), Decision Tree of C4.5, Decision Tree of C5.0, gradient boosting model and support vector machine. According to results published in the journal mHealth, all five machine learning approaches outperformed the conventional statistical approach. However, of the five, the ANN achieved the best performance for detecting lymphedema, with an accuracy of 93.75 percent, a sensitivity of 95.65 percent and a specificity of 91.03 percent.
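
For readers unfamiliar with the three figures quoted above, here is a minimal sketch of how accuracy, sensitivity and specificity are derived from the four cells of a binary confusion matrix. The counts in the example are made up for illustration and are not taken from the study.

```python
def screening_metrics(tp, fn, tn, fp):
    """Compute standard binary-classification metrics from confusion counts.

    tp: cases correctly flagged        fn: cases missed
    tn: non-cases correctly cleared    fp: non-cases wrongly flagged
    """
    total = tp + fn + tn + fp
    return {
        "accuracy": (tp + tn) / total,      # share of all predictions that are right
        "sensitivity": tp / (tp + fn),      # share of true cases detected
        "specificity": tn / (tn + fp),      # share of non-cases correctly cleared
    }


# Hypothetical counts, chosen only to show the calculation.
example = screening_metrics(tp=8, fn=2, tn=9, fp=1)
```

High sensitivity matters most when a missed case is costly, while high specificity limits false alarms.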


Native scoring in SQL Server 2017 using R

#artificialintelligence

Native scoring is a much-overlooked feature in SQL Server 2017 (available only on Windows and only on-prem) that provides scoring and prediction with pre-built, stored machine learning models in near real time. What counts as real time depends on your line of business, so I will not go into a definition here, but we can certainly talk about scoring 10,000 rows in a second from a mediocre client computer (similar to mine). Native scoring in SQL Server 2017 comes with a couple of limitations, but also with a lot of benefits. Overall, if you are looking for faster predictions in your enterprise and would love faster code and solution deployment, especially integration with other applications or building an API in your ecosystem, native scoring with the PREDICT function will surely be an advantage to you. Although not all predictions/scores are supported, the majority of predictions can be done using regression models or decision tree models (it is estimated that both types of algorithms, together with derivatives of regression models and ensemble methods, are used in 85% of predictive analytics).


Decision Trees for Classification: A Machine Learning Algorithm Xoriant Blog

#artificialintelligence

Decision Trees are a type of Supervised Machine Learning (that is, you specify what the input is and what the corresponding output is in the training data) where the data is continuously split according to a certain parameter. The tree can be explained by two entities, namely decision nodes and leaves. The leaves are the decisions or the final outcomes, and the decision nodes are where the data is split. An example of a decision tree can be explained using a binary tree.
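
The two entities can be sketched as a small data structure. This is a minimal illustration, assuming numeric features and a "less than or equal to threshold goes left" rule. The feature names and outcomes in the example tree are hypothetical and do not come from the original post.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Leaf:
    outcome: str  # the final decision


@dataclass
class DecisionNode:
    feature: str      # which attribute the data is split on
    threshold: float  # split point for that attribute
    left: "Tree"      # followed when sample[feature] <= threshold
    right: "Tree"     # followed otherwise


Tree = Union[Leaf, DecisionNode]


def predict(tree, sample):
    """Walk from the root through decision nodes until a leaf is reached."""
    node = tree
    while isinstance(node, DecisionNode):
        node = node.left if sample[node.feature] <= node.threshold else node.right
    return node.outcome


# Hypothetical binary tree with two decision nodes and three leaves.
example_tree = DecisionNode(
    "age", 30.0,
    Leaf("fit"),
    DecisionNode("exercise_hours", 2.0, Leaf("unfit"), Leaf("fit")),
)
```

Calling predict(example_tree, {"age": 40, "exercise_hours": 1}) follows the right branch at the root and the left branch at the second node, ending at a leaf.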