Goto

Collaborating Authors

 Decision Tree Learning


Random Intersection Trees

arXiv.org Machine Learning

Finding interactions between variables in large and high-dimensional datasets is often a serious computational challenge. Most approaches build up interaction sets incrementally, adding variables in a greedy fashion. The drawback is that potentially informative high-order interactions may be overlooked. Here, we propose at an alternative approach for classification problems with binary predictor variables, called Random Intersection Trees. It works by starting with a maximal interaction that includes all variables, and then gradually removing variables if they fail to appear in randomly chosen observations of a class of interest. We show that informative interactions are retained with high probability, and the computational complexity of our procedure is of order $p^\kappa$ for a value of $\kappa$ that can reach values as low as 1 for very sparse data; in many more general settings, it will still beat the exponent $s$ obtained when using a brute force search constrained to order $s$ interactions. In addition, by using some new ideas based on min-wise hash schemes, we are able to further reduce the computational cost. Interactions found by our algorithm can be used for predictive modelling in various forms, but they are also often of interest in their own right as useful characterisations of what distinguishes a certain class from others.


Inverse Signal Classification for Financial Instruments

arXiv.org Machine Learning

The paper presents new machine learning methods: signal composition, which classifies time-series regardless of length, type, and quantity; and self-labeling, a supervised-learning enhancement. The paper describes further the implementation of the methods on a financial search engine system using a collection of 7,881 financial instruments traded during 2011 to identify inverse behavior among the time-series.


Bio-inspired data mining: Treating malware signatures as biosequences

arXiv.org Machine Learning

The application of machine learning to bioinformatics problems is well established. Less well understood is the application of bioinformatics techniques to machine learning and, in particular, the representation of non-biological data as biosequences. The aim of this paper is to explore the effects of giving amino acid representation to problematic machine learning data and to evaluate the benefits of supplementing traditional machine learning with bioinformatics tools and techniques. The signatures of 60 computer viruses and 60 computer worms were converted into amino acid representations and first multiply aligned separately to identify conserved regions across different families within each class (virus and worm). This was followed by a second alignment of all 120 aligned signatures together so that non-conserved regions were identified prior to input to a number of machine learning techniques. Differences in length between virus and worm signatures after the first alignment were resolved by the second alignment. Our first set of experiments indicates that representing computer malware signatures as amino acid sequences followed by alignment leads to greater classification and prediction accuracy. Our second set of experiments indicates that checking the results of data mining from artificial virus and worm data against known proteins can lead to generalizations being made from the domain of naturally occurring proteins to malware signatures. However, further work is needed to determine the advantages and disadvantages of different representations and sequence alignment methods for handling problematic machine learning data.


Using Temporal Data for Making Recommendations

arXiv.org Artificial Intelligence

We treat collaborative filtering as a univariate time series problem: given a user's previous votes, predict the next vote. We describe two families of methods for transforming data to encode time order in ways amenable to off-the-shelf classification and density estimation tools. Using a decision-tree learning tool and two real-world data sets, we compare the results of these approaches to the results of collaborative filtering without ordering information. The improvements in both predictive accuracy and in recommendation quality that we realize advocate the use of predictive algorithms exploiting the temporal order of data.


Towards Adapting Cars to their Drivers

AI Magazine

Traditionally, vehicles have been considered as machines that are controlled by humans for the purpose of transportation. A more modern view is to envision drivers and passengers as actively interacting with a complex automated system. Such interactive activity leads us to consider intelligent and advanced ways of interaction leading to cars that can adapt to their drivers.In this paper, we focus on the Adaptive Cruise Control (ACC) technology that allows a vehicle to automatically adjust its speed to maintain a preset distance from the vehicle in front of it based on the driver’s preferences. Although individual drivers have different driving styles and preferences, current systems do not distinguish among users. We introduce a method to combine machine learning algorithms with demographic information and expert advice into existing automated assistive systems. This method can reduce the interactions between drivers and automated systems by adjusting parameters relevant to the operation of these systems based on their specific drivers and context of drive. We also learn when users tend to engage and disengage the automated system. This method sheds light on the kinds of dynamics that users develop while interacting with automation and can teach us how to improve these systems for the benefit of their users. While generic packages such as Weka were successful in learning drivers’ behavior, we found that improved learning models could be developed by adding information on drivers’ demographics and a previously developed model about different driver types. We present the general methodology of our learning procedure and suggest applications of our approach to other domains as well.


Learning Partially Observable Models Using Temporally Abstract Decision Trees

Neural Information Processing Systems

This paper introduces timeline trees, which are partial models of partially observable environments. Timeline trees are given some specific predictions to make and learn a decision tree over history. The main idea of timeline trees is to use temporally abstract features to identify and split on features of key events, spread arbitrarily far apart in the past (whereas previous decision-tree-based methods have been limited to a finite suffix of history). Experiments demonstrate that timeline trees can learn to make high quality predictions in complex, partially observable environments with high-dimensional observations (e.g. an arcade game).


Making Early Predictions of the Accuracy of Machine Learning Applications

arXiv.org Artificial Intelligence

The accuracy of machine learning systems is a widely studied research topic. Established techniques such as cross-validation predict the accuracy on unseen data of the classifier produced by applying a given learning method to a given training data set. However, they do not predict whether incurring the cost of obtaining more data and undergoing further training will lead to higher accuracy. In this paper we investigate techniques for making such early predictions. We note that when a machine learning algorithm is presented with a training set the classifier produced, and hence its error, will depend on the characteristics of the algorithm, on training set's size, and also on its specific composition. In particular we hypothesise that if a number of classifiers are produced, and their observed error is decomposed into bias and variance terms, then although these components may behave differently, their behaviour may be predictable. We test our hypothesis by building models that, given a measurement taken from the classifier created from a limited number of samples, predict the values that would be measured from the classifier produced when the full data set is presented. We create separate models for bias, variance and total error. Our models are built from the results of applying ten different machine learning algorithms to a range of data sets, and tested with "unseen" algorithms and datasets. We analyse the results for various numbers of initial training samples, and total dataset sizes. Results show that our predictions are very highly correlated with the values observed after undertaking the extra training. Finally we consider the more complex case where an ensemble of heterogeneous classifiers is trained, and show how we can accurately estimate an upper bound on the accuracy achievable after further training.


Cost-sensitive C4.5 with post-pruning and competition

arXiv.org Artificial Intelligence

Decision tree is an effective classification approach in data mining and machine learning. In applications, test costs and misclassification costs should be considered while inducing decision trees. Recently, some cost-sensitive learning algorithms based on ID3 such as CS-ID3, IDX, \lambda-ID3 have been proposed to deal with the issue. These algorithms deal with only symbolic data. In this paper, we develop a decision tree algorithm inspired by C4.5 for numeric data. There are two major issues for our algorithm. First, we develop the test cost weighted information gain ratio as the heuristic information. According to this heuristic information, our algorithm is to pick the attribute that provides more gain ratio and costs less for each selection. Second, we design a post-pruning strategy through considering the tradeoff between test costs and misclassification costs of the generated decision tree. In this way, the total cost is reduced. Experimental results indicate that (1) our algorithm is stable and effective; (2) the post-pruning technique reduces the total cost significantly; (3) the competition strategy is effective to obtain a cost-sensitive decision tree with low cost.


Language Analysis of Speakers with Dementia of the Alzheimer’s Type

AAAI Conferences

This research is a discriminative analysis of conversational dialogs involving individuals suffering from dementia of Alzheimer’s type. Several metric analyses are applied to the transcripts of the Carolina Conversation Corpus (Pope and Davis 2011) in order to determine if there are significant statistical differences between individuals with and without Alzheimer’s disease. Results from the analysis indicate that go-ahead utterances, certain fluency measures, and paraphrasing provide defensible means of differentiating the linguistic characteristics of spontaneous speech between healthy individuals and those with Alzheimer’s disease. Several machine learning algorithms were used to classify the speech of individuals with and without dementia of the Alzheimer’s type.


Wikipedia Vandalism Detection Through Machine Learning: Feature Review and New Proposals: Lab Report for PAN at CLEF 2010

arXiv.org Artificial Intelligence

Wikipedia is an online encyclopedia built upon the collaborations of thousands of editors. Its collaboration model is simple: anyone can edit any article at any time. This has made possible the great success of Wikipedia, but it comes with its own problems, one of them being destructive edits. There are many ways in which an edit can be destructive for Wikipedia, such as lobbying, spam, vandalism, tests, etc. In PAN 2010 Lab's Task 2 we are focused on automatic detection of vandalism. The English Wikipedia defines vandalism as: [...] any addition, removal, or change of content made in a deliberate attempt to compromise the integrity of Wikipedia.