Decision Tree Learning
Log-based Evaluation of Label Splits for Process Models
Tax, Niek, Sidorova, Natalia, Haakma, Reinder, van der Aalst, Wil M. P.
Process mining techniques aim to extract insights into processes from event logs. One of the challenges in process mining is identifying interesting and meaningful event labels that contribute to a better understanding of the process. Our application area is mining data from smart homes for the elderly, where the ultimate goal is to signal deviations from usual behavior and provide timely recommendations in order to extend the period of independent living. Extracting individual process models showing user behavior is an important instrument in achieving this goal. However, the interpretation of sensor data at an appropriate abstraction level is not straightforward. For example, a motion sensor in a bedroom can be triggered by tossing and turning in bed or by getting up. We try to derive the actual activity depending on the context (time, previous events, etc.). In this paper we introduce the notion of label refinements, which links more abstract event descriptions with their more refined counterparts. We present a statistical evaluation method to determine the usefulness of a label refinement for a given event log from a process perspective. Based on data from smart homes, we show how our statistical evaluation method for label refinements can be used in practice. Our method was able to select two label refinements out of a set of candidate label refinements that both had a positive effect on model precision.
Interpretable Machine Learning Models for the Digital Clock Drawing Test
Souillard-Mandar, William, Davis, Randall, Rudin, Cynthia, Au, Rhoda, Penney, Dana
The Clock Drawing Test (CDT) is a rapid, inexpensive, and popular neuropsychological screening tool for cognitive conditions. The Digital Clock Drawing Test (dCDT) uses novel software to analyze data from a digitizing ballpoint pen that reports its position with considerable spatial and temporal precision, making possible the analysis of both the drawing process and final product. We developed methodology to analyze pen stroke data from these drawings, and computed a large collection of features which were then analyzed with a variety of machine learning techniques. The resulting scoring systems were designed to be more accurate than the systems currently used by clinicians, but just as interpretable and easy to use. The systems also allow us to quantify the tradeoff between accuracy and interpretability. We created automated versions of the CDT scoring systems currently used by clinicians, allowing us to benchmark our models, which indicated that our machine learning models substantially outperformed the existing scoring systems.
An effective approach for classification of advanced malware with high accuracy
Sharma, Ashu, Sahay, Sanjay K.
Combating malware is very important for software/systems security, but protecting software/systems from advanced malware, viz. metamorphic malware, is a challenging task, as it changes its structure/code after each infection. Therefore, in this paper we present a novel approach to detect advanced malware with high accuracy by grouping the executables and analyzing the occurrence of opcodes (features). These groups are based on our earlier finding [1] that the sizes of any two malware samples generated by the popular advanced malware kits, viz. PS-MPC, G2 and NGVCK, are within 5 KB of each other. Using the promising features obtained, we studied the performance of thirteen classifiers using N-fold cross-validation available in the machine learning tool WEKA. Among these thirteen classifiers, we studied the top five (Random forest, LMT, NBT, J48 and FT) in depth and obtained more than 96.28% accuracy for the detection of unknown malware, which is better than the maximum detection accuracy (95.9%) reported by Santos et al. (2013). Among these top five classifiers, our approach obtained its highest detection accuracy, 97.95%, with Random forest.
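The N-fold cross-validation mentioned in the abstract above can be illustrated with a minimal sketch. This is not WEKA's implementation: the function name `k_fold_indices` is hypothetical, and the sketch leaves out any shuffling or stratification a real tool may apply. It only shows how samples are partitioned so that each one is used for testing exactly once.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train, test) index lists; every sample lands in exactly one test fold."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [
        n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
        for i in range(n_folds)
    ]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


# Illustrative use: 10 samples split into 3 folds.
folds = list(k_fold_indices(10, 3))
```

A classifier would be trained on each `train` list and scored on the matching `test` list, and the reported accuracy would be the average over the folds.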
S0264837716305440
In this work, we particularly focus on the complex relationship between land-use and transport offering an innovative approach to the problem by using land-use features at two differing levels of granularity (the more general land-use sector types and the more granular amenity structures) to evaluate their impact on public transit ridership in both time and space. To quantify the interdependencies, we explored three machine learning models and demonstrate that the decision tree model performs best in terms of overall performance--good predictive accuracy, generality, computational efficiency, and "interpretability". We then demonstrate how the developed framework can be applied to urban planning for transit-oriented development by exploring practicable scenarios based on Singapore's urban plan toward 2030, which includes the development of "regional centers" (RCs) across the city-state. This trend, on the other hand, eventually reverses (particularly during peak hours) with continued strategic increase in amenities; a tipping point at 55% increase is identified where ridership begins to decline and at 110%, the predicted ridership begins to fall below current levels.
Impacts of land use and amenities on public transport use, urban planning and design
Various land-use configurations are known to have wide-ranging effects on the dynamics of and within other city components including the transportation system. In this work, we particularly focus on the complex relationship between land-use and transport offering an innovative approach to the problem by using land-use features at two differing levels of granularity (the more general land-use sector types and the more granular amenity structures) to evaluate their impact on public transit ridership in both time and space. To quantify the interdependencies, we explored three machine learning models and demonstrate that the decision tree model performs best in terms of overall performance--good predictive accuracy, generality, computational efficiency, and "interpretability". Results also reveal that amenity-related features are better predictors than the more general ones, which suggests that high-resolution geo-information can provide more insights into the dependence of transit ridership on land-use. We then demonstrate how the developed framework can be applied to urban planning for transit-oriented development by exploring practicable scenarios based on Singapore's urban plan toward 2030, which includes the development of "regional centers" (RCs) across the city-state.
Rise Of Automated Trading: Machines Trading S&P 500
Putting it all together, the following example shows the equity curve representing cumulative returns of the model strategy, with all values expressed in dollars. To increase the precision of forecasted values, instead of a standard probability of 0.5 (50 percent) we choose a higher threshold value, to be more confident that the model predicts an Up day. As we can see in the chart above, the equity curve is much better than before (Sharpe is 6.5 instead of 3.5), even with fewer round turns. From this point on, all subsequent models will use a threshold higher than the standard value. We can apply our research, as we did previously with the decision tree, to a Logistic Classifier model.
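The effect of raising the probability threshold can be sketched as follows. The probabilities, the outcomes, and the 0.6 threshold are made-up illustrative values, not figures from the article, and the helper names `predict_up` and `precision` are hypothetical.

```python
def predict_up(probabilities, threshold=0.5):
    """Flag a day as Up only when the predicted probability reaches the threshold."""
    return [p >= threshold for p in probabilities]


def precision(predictions, actual_up):
    """Share of predicted Up days that really were Up (None if nothing was predicted Up)."""
    hits = [actual for pred, actual in zip(predictions, actual_up) if pred]
    return sum(hits) / len(hits) if hits else None


# Hypothetical model outputs and realized outcomes, for illustration only.
probs = [0.52, 0.71, 0.48, 0.66, 0.55, 0.80]
actual = [False, True, False, True, False, True]

baseline = predict_up(probs, threshold=0.5)  # standard 50 percent cut-off
strict = predict_up(probs, threshold=0.6)    # assumed higher threshold

print(sum(baseline), precision(baseline, actual))
print(sum(strict), precision(strict, actual))
```

On this toy data the stricter threshold signals fewer Up days, and a larger share of those signals is correct. That is the trade-off described above: fewer round turns in exchange for more confident predictions.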
The Effect of Heteroscedasticity on Regression Trees
Regression trees are becoming increasingly popular as omnibus predicting tools and as the basis of numerous modern statistical learning ensembles. Part of their popularity is their ability to create a regression prediction without ever specifying a structure for the mean model. However, the method implicitly assumes homogeneous variance across the entire explanatory-variable space. It is unknown how the algorithm behaves when faced with heteroscedastic data. In this study, we assess the performance of the most popular regression-tree algorithm in a single-variable setting under a very simple step-function model for heteroscedasticity. We use simulation to show that the locations of splits, and hence the ability to accurately predict means, are both adversely influenced by the change in variance. We identify the pruning algorithm as the main concern, although the effects on the splitting algorithm may be meaningful in some applications.
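A minimal sketch of the kind of experiment described above, under stated assumptions: it uses a least-squares split search of the kind used in CART-style regression trees (the abstract does not name the algorithm), a single explanatory variable, and a noise standard deviation that steps up at x = 0.5. The mean step, the two standard deviations, and the function names are all hypothetical choices and are not taken from the paper. Pruning, which the abstract identifies as the main concern, is not modeled here.

```python
import random


def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(x, y):
    """Return the threshold on x that minimizes the total within-node SSE."""
    pairs = sorted(zip(x, y))
    best_threshold, best_cost = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate equal x values
        left = [p[1] for p in pairs[:i]]
        right = [p[1] for p in pairs[i:]]
        cost = sse(left) + sse(right)
        if cost < best_cost:
            best_cost = cost
            best_threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
    return best_threshold


def simulate(n=200, mean_step=0.5, sd_low=0.5, sd_high=2.0, seed=0):
    """Draw x uniformly on [0, 1); both the mean and the noise SD step at x = 0.5."""
    rng = random.Random(seed)
    x = [rng.random() for _ in range(n)]
    y = [
        (mean_step if xi >= 0.5 else 0.0)
        + rng.gauss(0.0, sd_high if xi >= 0.5 else sd_low)
        for xi in x
    ]
    return x, y


# Noise-free check: the split lands midway between the last low and first high point.
clean_split = best_split(list(range(10)), [0] * 5 + [1] * 5)

# Heteroscedastic case: repeat over seeds to see how far the split drifts from 0.5.
splits = [best_split(*simulate(seed=s)) for s in range(20)]
```

Comparing `splits` against the true change point at 0.5, with and without the variance step, gives a rough picture of how the split location responds to the change in variance.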
ACDC: $\alpha$-Carving Decision Chain for Risk Stratification
Park, Yubin, Ho, Joyce, Ghosh, Joydeep
In many healthcare settings, intuitive decision rules for risk stratification can help effective hospital resource allocation. This paper introduces a novel variant of decision tree algorithms that produces a chain of decisions, not a general tree. Our algorithm, $\alpha$-Carving Decision Chain (ACDC), sequentially carves out "pure" subsets of the majority class examples. The resulting chain of decision rules yields a pure subset of the minority class examples. Our approach is particularly effective in exploring large and class-imbalanced health datasets. Moreover, ACDC provides an interactive interpretation in conjunction with visual performance metrics such as Receiver Operating Characteristics curve and Lift chart.
Making Tree Ensembles Interpretable
Tree ensembles, such as random forests and boosted trees, are renowned for their high prediction performance, whereas their interpretability is critically limited. In this paper, we propose a post-processing method that improves the model interpretability of tree ensembles. After learning a complex tree ensemble in a standard way, we approximate it by a simpler model that is interpretable to humans. To obtain the simpler model, we derive an EM algorithm that minimizes the KL divergence from the complex ensemble. A synthetic experiment showed that a complicated tree ensemble could be approximated reasonably well by an interpretable model.
Pruning Random Forests for Prediction on a Budget
Nan, Feng, Wang, Joseph, Saligrama, Venkatesh
We propose to prune a random forest (RF) for resource-constrained prediction. We first construct an RF and then prune it to optimize expected feature cost & accuracy. We pose pruning RFs as a novel 0-1 integer program with linear constraints that encourages feature re-use. We establish total unimodularity of the constraint set to prove that the corresponding LP relaxation solves the original integer program. We then exploit connections to combinatorial optimization and develop an efficient primal-dual algorithm, scalable to large datasets. In contrast to our bottom-up approach, which benefits from good RF initialization, conventional methods are top-down, acquiring features based on their utility value, and are generally intractable, requiring heuristics. Empirically, our pruning algorithm outperforms existing state-of-the-art resource-constrained algorithms.