Decision Tree Learning
Ensemble Models with Trees and Rules
In this article, we have proposed several approaches for post-processing a large ensemble of prediction models or rules. The results from our simulations show that the post-processing methods we have considered here are promising. We have used the techniques developed here for estimation of quantitative traits from markers, on the benchmark "Boston Housing" data set, and in some simulations. In most cases, the produced models had better prediction performance than, for example, the ones produced by the random forest or the rulefit algorithms.
Performance Tuning Of J48 Algorithm For Prediction Of Soil Fertility
The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use ("Data mining", Wikipedia). A soil test is the analysis of a soil sample to determine nutrient content, composition and other characteristics. Tests are usually performed to measure fertility and indicate deficiencies that need to be remedied ("Soil Test", Wikipedia). In this research, a soil dataset containing soil test results has been used to apply various classification techniques in data mining. Soil fertility is a crucial attribute considered in land evaluation, and achieving and maintaining the necessary levels of fertility is important for crop production. Hence, this paper includes steps for building an efficient and accurate predictive model of soil fertility with the help of the J48 algorithm.
Learning Driver's Behavior to Improve the Acceptance of Adaptive Cruise Control
Rosenfeld, Avi (Jerusalem College of Technology) | Bareket, Zevi (University of Michigan) | Goldman, Claudia V. (General Motors Advanced Technical Center) | Kraus, Sarit (Bar-Ilan University) | LeBlanc, David J. (University of Michigan) | Tsimhoni, Omer (General Motors Advanced Technical Center)
Adaptive Cruise Control (ACC) is a technology that allows a vehicle to automatically adjust its speed to maintain a preset distance from the vehicle in front of it based on the driver's preferences. Individual drivers have different driving styles and preferences. Current systems do not distinguish among the users. We introduce a method to combine machine learning algorithms with demographic information and expert advice into existing automated assistive systems. This method can reduce the interactions needed between drivers and automated systems by adjusting parameters relevant to the operation of these systems based on their specific drivers and the driving context. We also learn when users tend to engage and disengage the automated system. This method sheds light on the kinds of dynamics that users develop while interacting with automation and can teach us how to improve these systems for the benefit of their users. While accepted packages such as Weka were successful in learning drivers' behavior, we found that improved learning models could be developed by adding information on drivers' demographics and a previously developed model about different driver types. We present the general methodology of our learning procedure and suggest applications of our approach to other domains as well.
Using a Critic to Promote Less Popular Candidates in a People-to-People Recommender System
Krzywicki, Alfred (University of New South Wales) | Wobcke, Wayne (University of New South Wales) | Cai, Xiongcai (University of New South Wales) | Bain, Michael (University of New South Wales) | Mahidadia, Ashesh (University of New South Wales) | Compton, Paul (University of New South Wales) | Kim, Yang Sok (University of New South Wales)
This paper shows how to improve the recommendations of an interaction-based collaborative filtering (IBCF) recommender used in online dating. Previous work has shown that IBCF works well in this domain, although it tends to rank popular candidates highly, which leads to these users receiving a large number of contacts. We address this problem by using a Decision Tree model as a "critic" to re-rank the candidates generated by IBCF, effectively promoting less popular candidates. This method was first evaluated on historical data from a large online dating site and then trialled live on the same site by providing recommendations to a large number of users throughout a 9-week period. The live trial confirmed the consistency of the analysis on historical data and the ability of the method to generate suitable candidates over an extended period. Our recommendations gave higher success rates than those made for a control group with a baseline recommender.
Table Header Detection and Classification
Fang, Jing (Peking University) | Mitra, Prasenjit (The Pennsylvania State University) | Tang, Zhi (Peking University) | Giles, C. Lee (The Pennsylvania State University)
In digital libraries, a table, as a specific document component as well as a condensed way to present structured and relational data, contains rich information and is often the only source of that information. In order to explore, retrieve, and reuse that data, tables should be identified and the data extracted. Table recognition is an old field of research. However, due to the diversity of table styles, the results are still far from satisfactory, and no single algorithm performs well on all types of tables. In this paper, we randomly take samples from CiteSeerX to investigate diverse table styles for automatic table extraction. We find that table headers are one of the main characteristics of complex table styles. We identify a set of features that can be used to separate headers from tabular data and build a classifier to detect table headers. Our empirical evaluation on PDF documents shows that using a Random Forest classifier achieves an accuracy of 92%.
Biogeography-Based Informative Gene Selection and Cancer Classification Using SVM and Random Forests
Nikumbh, Sarvesh, Ghosh, Shameek, Jayaraman, Valadi
Microarray cancer gene expression data are very high dimensional. Reducing the dimensions helps in improving the overall analysis and classification performance. We propose two hybrid techniques, Biogeography-based Optimization with Random Forests (BBO-RF) and BBO-SVM (Support Vector Machines), with gene ranking as a heuristic, for microarray gene expression analysis. This heuristic is obtained from an information gain filter ranking procedure. The BBO algorithm generates a population of candidate subsets of genes, as part of an ecosystem of habitats, and employs the migration and mutation processes across multiple generations of the population to improve the classification accuracy. The fitness of each gene subset is assessed by the classifiers, SVM and Random Forests. The performance of these hybrid techniques is evaluated on three cancer gene expression datasets retrieved from the Kent Ridge Biomedical datasets collection and the libSVM data repository. Our results demonstrate that genes selected by the proposed techniques yield classification accuracies comparable to previously reported algorithms.
Tracking Tetrahymena Pyriformis Cells using Decision Trees
Wang, Quan, Ou, Yan, Julius, A. Agung, Boyer, Kim L., Kim, Min Jun
Matching cells over time has long been the most difficult step in cell tracking. In this paper, we approach this problem by recasting it as a classification problem. We construct a feature set for each cell, and compute a feature difference vector between a cell in the current frame and a cell in a previous frame. Then we determine whether the two cells represent the same cell over time by training decision trees as our binary classifiers. With the output of decision trees, we are able to formulate an assignment problem for our cell association task and solve it using a modified version of the Hungarian algorithm.
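The pipeline described in this abstract (feature difference vectors, a per-pair score, then a one-to-one assignment) can be illustrated with a minimal sketch. Everything concrete here is an assumption for illustration: the `(x, y, area)` features, the `mismatch_score` function standing in for the output of a trained decision tree, and the exhaustive search standing in for the modified Hungarian algorithm, which the abstract does not specify.

```python
from itertools import permutations


def feature_difference(cell_a, cell_b):
    """Element-wise absolute difference between two cell feature vectors."""
    return [abs(a - b) for a, b in zip(cell_a, cell_b)]


def mismatch_score(diff):
    """Hypothetical stand-in for a trained classifier: lower means 'more likely the same cell'."""
    return sum(diff)


def best_assignment(cost):
    """Exhaustively find the minimum-cost one-to-one matching for a small square cost matrix.

    This brute-force search replaces the Hungarian algorithm for illustration only;
    it is practical only for a handful of cells.
    """
    n = len(cost)
    best_perm, best_total = None, float("inf")
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best_total:
            best_perm, best_total = perm, total
    return list(best_perm), best_total


# Hypothetical features per cell: (x, y, area).
previous_frame = [(10.0, 10.0, 50.0), (40.0, 42.0, 80.0), (70.0, 15.0, 65.0)]
current_frame = [(41.0, 44.0, 79.0), (72.0, 14.0, 66.0), (11.0, 12.0, 51.0)]

# cost[i][j] scores how unlikely it is that previous cell i and current cell j are the same cell.
cost = [
    [mismatch_score(feature_difference(p, c)) for c in current_frame]
    for p in previous_frame
]
assignment, total_cost = best_assignment(cost)
print(assignment, total_cost)
```

Here `assignment[i]` is the index of the cell in the current frame matched to cell `i` of the previous frame. In a real tracker the cost would come from the classifier output, and a polynomial-time assignment solver would replace the exhaustive search.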
MOB-ESP and other Improvements in Probability Estimation
A key prerequisite to optimal reasoning under uncertainty in intelligent systems is to start with good class probability estimates. This paper improves on the current best probability estimation trees (Bagged-PETs) and also presents a new ensemble-based algorithm (MOB-ESP). Comparisons are made using several benchmark datasets and multiple metrics. These experiments show that MOB-ESP outputs significantly more accurate class probabilities than either the baseline B-PETs algorithm or the enhanced version presented here (EB-PETs). These results are based on metrics closely associated with the average accuracy of the predictions. MOB-ESP also provides much better probability rankings than B-PETs. The paper further suggests how these estimation techniques can be applied in concert with a broader category of classifiers.
Obtaining Calibrated Probabilities from Boosting
Niculescu-Mizil, Alexandru, Caruana, Richard A.
Boosted decision trees typically yield good accuracy, precision, and ROC area. However, because the outputs from boosting are not well calibrated posterior probabilities, boosting yields poor squared error and cross-entropy. We empirically demonstrate why AdaBoost predicts distorted probabilities and examine three calibration methods for correcting this distortion: Platt Scaling, Isotonic Regression, and Logistic Correction. We also experiment with boosting using log-loss instead of the usual exponential loss. Experiments show that Logistic Correction and boosting with log-loss work well when boosting weak models such as decision stumps, but yield poor performance when boosting more complex models such as full decision trees. Platt Scaling and Isotonic Regression, however, significantly improve the probabilities predicted by
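As a rough illustration of one of the calibration methods named above, isotonic regression can be fitted with the Pool Adjacent Violators procedure. This is a minimal sketch under stated assumptions, not the authors' implementation: the scores and labels are made up, and the function name `isotonic_regression` is hypothetical.

```python
def isotonic_regression(values):
    """Pool Adjacent Violators: nondecreasing least-squares fit to `values`."""
    blocks = []  # each block holds [sum of values, number of values]
    for v in values:
        blocks.append([float(v), 1])
        # Merge backwards while the previous block's mean exceeds the last block's mean.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fitted = []
    for s, c in blocks:
        fitted.extend([s / c] * c)
    return fitted


# Made-up classifier scores and binary labels for a small calibration set.
scores = [0.2, 0.9, 0.4, 0.6, 0.1, 0.8]
labels = [0, 1, 1, 0, 0, 1]

# Sort the labels by score, then fit a nondecreasing map from score rank to probability.
order = sorted(range(len(scores)), key=lambda i: scores[i])
sorted_labels = [labels[i] for i in order]
calibrated_sorted = isotonic_regression(sorted_labels)
print(calibrated_sorted)
```

Each entry of `calibrated_sorted` is the calibrated probability for the example at that score rank. To calibrate a new score, one would typically look up or interpolate between the fitted values of the neighbouring training scores.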
Improved Information Gain Estimates for Decision Tree Induction
Ensembles of classification and regression trees remain popular machine learning methods because they define flexible non-parametric models that predict well and are computationally efficient both during training and testing. During induction of decision trees one aims to find predicates that are maximally informative about the prediction target. To select good predicates most approaches estimate an information-theoretic scoring function, the information gain, both for classification and regression problems. We point out that the common estimation procedures are biased and show that by replacing them with improved estimators of the discrete and the differential entropy we can obtain better decision trees. In effect our modifications yield improved predictive performance and are simple to implement in any decision tree code.
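To make the estimation issue concrete, the sketch below computes information gain for a candidate split with the usual plug-in entropy estimate and with the Miller-Madow bias correction. The Miller-Madow correction is used here only as one well-known example of a bias-corrected discrete entropy estimator; it is an assumption for illustration and not necessarily the estimator proposed in the paper. The function names are hypothetical.

```python
import math
from collections import Counter


def plugin_entropy(labels):
    """Plug-in (maximum likelihood) entropy in nats; tends to underestimate on small samples."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def miller_madow_entropy(labels):
    """Plug-in entropy plus the Miller-Madow correction (K - 1) / (2N), K = observed classes."""
    n = len(labels)
    k = len(set(labels))
    return plugin_entropy(labels) + (k - 1) / (2 * n)


def information_gain(parent, left, right, entropy=plugin_entropy):
    """Entropy of the parent node minus the size-weighted entropy of its two children."""
    n = len(parent)
    children = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - children


parent = ["a", "a", "b", "b"]
left, right = ["a", "a"], ["b", "b"]

print(information_gain(parent, left, right))
print(information_gain(parent, left, right, entropy=miller_madow_entropy))
```

Passing the entropy estimator as a parameter keeps the split-scoring code unchanged when the estimator is swapped, which matches the abstract's point that such modifications are simple to add to existing decision tree code.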