 Diagnosis


Introduction to Outlier Detection Methods

#artificialintelligence

This post summarizes three posts about outlier detection methods. One of the challenges in data analysis in general, and predictive modeling in particular, is dealing with outliers. Many modeling techniques are resistant to outliers or reduce their impact, but detecting and understanding outliers can still lead to interesting findings. We generally define outliers as samples that are exceptionally far from the mainstream of the data. There is no rigid mathematical definition of what constitutes an outlier; determining whether an observation is an outlier is ultimately a subjective exercise. There are several approaches for detecting outliers.
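As an illustration of one common approach, the sketch below applies Tukey's interquartile-range (IQR) rule, which flags values lying more than k times the IQR outside the quartiles. The summary above does not say which methods the original posts cover, so this rule, the function name `iqr_outliers`, the conventional multiplier `k=1.5`, and the toy data are all assumptions made for the example.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # quantiles(n=4) returns the three quartile cut points Q1, Q2, Q3.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Toy data (hypothetical): one value sits far from the rest.
data = [10, 12, 11, 13, 12, 11, 14, 12, 95]
print(iqr_outliers(data))  # [95]
```

Because the cutoff depends on the choice of `k`, the rule reflects the subjectivity noted above: a larger `k` flags fewer points.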


How to Bin or Convert Numerical Variables to Categorical Variables with Decision Trees

@machinelearnbot

Why would you want to convert a numerical variable into a categorical one? Depending on the situation, binning can make the numerical variable easier to interpret, enable quick segmentation, or simply provide an additional feature for your predictive model. Binning is a popular feature engineering technique. Suppose your hypothesis is that the age of a customer is correlated with their tendency to interact with a mobile app. The age of the user is plotted on the x-axis and user interaction with the app is plotted on the y-axis.
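The idea of tree-based binning can be sketched with a small hand-rolled regression tree on a single feature: each split threshold the tree chooses becomes a bin edge. This is a minimal sketch and not the original post's code (which most likely relies on a library decision tree). The helper names `sum_sq`, `best_split`, and `tree_bin_edges`, the `depth` and `min_leaf` parameters, and the age and interaction values are all hypothetical.

```python
import bisect

def sum_sq(ys):
    """Sum of squared deviations from the mean."""
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)

def best_split(xs, ys, min_leaf):
    """Threshold on xs that most reduces the squared error of ys, or None."""
    pairs = sorted(zip(xs, ys))
    if len(pairs) < 2 * min_leaf:
        return None
    best_thr = None
    best_sse = sum_sq([y for _, y in pairs])  # a split must beat the unsplit error
    for i in range(min_leaf, len(pairs) - min_leaf + 1):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between identical feature values
        sse = sum_sq([y for _, y in pairs[:i]]) + sum_sq([y for _, y in pairs[i:]])
        if sse < best_sse:
            best_thr = (pairs[i - 1][0] + pairs[i][0]) / 2
            best_sse = sse
    return best_thr

def tree_bin_edges(xs, ys, depth=2, min_leaf=2):
    """Recursively split on xs; return the sorted thresholds as bin edges."""
    if depth == 0:
        return []
    thr = best_split(xs, ys, min_leaf)
    if thr is None:
        return []
    edges = [thr]
    left = [(x, y) for x, y in zip(xs, ys) if x <= thr]
    right = [(x, y) for x, y in zip(xs, ys) if x > thr]
    for part in (left, right):
        px, py = zip(*part)
        edges += tree_bin_edges(px, py, depth - 1, min_leaf)
    return sorted(edges)

# Hypothetical data: customer age and an app-interaction score.
ages = [18, 20, 22, 25, 31, 35, 38, 42, 55, 60, 65, 70]
interaction = [9, 8, 9, 8, 5, 5, 6, 5, 1, 2, 1, 2]

edges = tree_bin_edges(ages, interaction)
print(edges)  # [28.0, 48.5]
# Bin index = number of edges strictly below the age.
print([bisect.bisect_left(edges, a) for a in (23, 35, 60)])  # [0, 1, 2]
```

On this toy data the tree recovers three age bands (up to 28, 28 to 48.5, above 48.5), which could then be used as a categorical feature.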


Debugging Machine Learning Tasks

arXiv.org Machine Learning

Unlike traditional programs (such as operating systems or word processors), which have large amounts of code, machine learning tasks use programs with relatively small amounts of code (written in machine learning libraries) but voluminous amounts of data. Just as developers of traditional programs debug errors in their code, developers of machine learning tasks debug and fix errors in their data. However, algorithms and tools for debugging and fixing errors in data are less common than their counterparts for detecting and fixing errors in code. In this paper, we consider classification tasks where errors in training data lead to misclassifications in test points, and propose an automated method to find the root causes of such misclassifications. Our root cause analysis is based on Pearl's theory of causation, and uses Pearl's PS (Probability of Sufficiency) as a scoring metric. Our implementation, Psi, encodes the computation of PS as a probabilistic program, and uses recent work on probabilistic programs and transformations on probabilistic programs (along with gray-box models of machine learning algorithms) to efficiently compute PS. Psi is able to identify root causes of data errors in interesting data sets.
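For reference, Pearl's general definition of the probability of sufficiency for binary events can be written as below. The symbols x, x', y, y' follow Pearl's standard notation and are not taken from the abstract, which does not state how the paper instantiates PS for training-data errors.

```latex
% x, y: the events "X occurs" and "Y occurs"; x', y': their complements.
% Y_x: the value Y would take had X been set to x (a counterfactual).
\mathrm{PS} = P\left(Y_{x} = y \mid X = x',\; Y = y'\right)
```

In words, PS is the probability that setting X to x would have produced y, given that X and Y were in fact both absent.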


Annotated Decision Trees for Simple Moral Machines

AAAI Conferences

Autonomization often follows after the automization on which it is based. More and more machines have to make decisions with moral implications. Machine ethics, which can be seen as an equivalent of human ethics, analyses the chances and limits of moral machines. So far, decision trees have not been commonly used for modelling moral machines. This article proposes an approach for creating annotated decision trees, and specifies their central components. The focus is on simple moral machines. The chances of such models are illustrated with the example of a self-driving car that is friendly to humans and animals. Finally, the advantages and disadvantages are discussed and conclusions are drawn.


The Devil's Triangle: Ethical Considerations on Developing Bot Detection Methods

AAAI Conferences

Social media is increasingly populated with bots. To protect the authenticity of the user experience, machine learning algorithms are used to detect these bots. Ethical dimensions of these methods have not been thoroughly considered yet. Taking histogram analysis of Twitter users' profile images as an example, the paper demonstrates the trade-offs of accuracy, transparency, and robustness. Because there is no general optimum in ethical considerations, these dimensions form a "devil's triangle".


Data-Augmented Software Diagnosis

AAAI Conferences

Software fault prediction algorithms predict which software components are likely to contain faults using machine learning techniques. Software diagnosis algorithms identify the faulty software components that caused a failure using model-based or spectrum-based approaches. We show how software fault prediction algorithms can be used to improve software diagnosis. The resulting data-augmented diagnosis algorithm overcomes key problems in software diagnosis algorithms: ranking diagnoses and distinguishing between diagnoses with high probability and low probability. We demonstrate the efficiency of the proposed approach empirically on three open-source domains, showing a significant increase in the accuracy of diagnosis and the efficiency of troubleshooting. These encouraging results suggest broader use of data-driven methods to complement and improve existing model-based methods.


Optimally Pruning Decision Tree Ensembles With Feature Cost

arXiv.org Machine Learning

We consider the problem of learning decision rules for prediction under a feature budget constraint. In particular, we are interested in pruning an ensemble of decision trees to reduce expected feature cost while maintaining high prediction accuracy for any test example. We propose a novel 0-1 integer program formulation for ensemble pruning. Our pruning formulation is general: it takes any ensemble of decision trees as input. By explicitly accounting for feature-sharing across trees together with the accuracy/cost trade-off, our method is able to significantly reduce feature cost by pruning subtrees that introduce more loss in terms of feature cost than benefit in terms of prediction accuracy gain. Theoretically, we prove that a linear programming relaxation produces the exact solution of the original integer program. This allows us to use efficient convex optimization tools to obtain an optimally pruned ensemble for any given budget. Empirically, we see that our pruning algorithm significantly improves the performance of the state-of-the-art ensemble method BudgetRF.


Query-Answer Causality in Databases: Abductive Diagnosis and View-Updates

arXiv.org Artificial Intelligence

Causality has recently been introduced in databases to model, characterize, and possibly compute causes for query results (answers). Connections between query causality and consistency-based diagnosis and database repairs (with respect to integrity constraint violations) have been established in the literature. In this work we establish connections between query causality and abductive diagnosis and the view-update problem. The unveiled relationships allow us to obtain new complexity results for query causality (the main focus of our work) and also for the two other areas.


Causal Decision Trees

arXiv.org Artificial Intelligence

Uncovering causal relationships in data is a major objective of data analytics. Causal relationships are normally discovered with designed experiments, e.g. randomised controlled trials, which, however, are expensive or infeasible to conduct in many cases. Causal relationships can also be found using well-designed observational studies, but these require domain experts' knowledge and the process is normally time-consuming. Hence there is a need for scalable and automated methods for causal relationship exploration in data. Classification methods are fast and could be practical substitutes for finding causal signals in data. However, classification methods are not designed for causal discovery, and a classification method may find false causal signals and miss the true ones. In this paper, we develop a causal decision tree where nodes have causal interpretations. Our method follows a well-established causal inference framework and makes use of a classic statistical test. The method is practical for finding causal signals in large data sets.


Appropriate Causal Models and the Stability of Causation

arXiv.org Artificial Intelligence

Causal models defined in terms of structural equations have proved to be quite a powerful way of representing knowledge regarding causality. However, a number of authors have given examples that seem to show that the Halpern-Pearl (HP) definition of causality gives intuitively unreasonable answers. Here it is shown that, for each of these examples, we can give two stories consistent with the description in the example, such that intuitions regarding causality are quite different for each story. By adding additional variables, we can disambiguate the stories. Moreover, in the resulting causal models, the HP definition of causality gives the intuitively correct answer. It is also shown that, by adding extra variables, a modification to the original HP definition made to deal with an example of Hopkins and Pearl may not be necessary. Given how much can be done by adding extra variables, there might be a concern that the notion of causality is somewhat unstable. Can adding extra variables in a "conservative" way (i.e., maintaining all the relations between the variables in the original model) cause the answer to the question "Is X=x a cause of Y=y" to alternate between "yes" and "no"? It is shown that we can have such alternation infinitely often, but if we take normality into consideration, we cannot. Indeed, under appropriate normality assumptions, adding an extra variable can change the answer from "yes" to "no", but after that, it cannot change back to "yes".