Goto

Collaborating Authors

 Decision Tree Learning


Differentially Private Median Forests for Regression and Classification

arXiv.org Machine Learning

Random forests are a popular method for classification and regression due to their versatility. However, this flexibility can come at the cost of user privacy, since training random forests requires multiple data queries, often on small, identifiable subsets of the training data. Differentially private approaches based on extremely random trees reduce the number of queries, but can lead to low-occupancy leaf nodes which require the addition of large amounts of noise. In this paper, we propose DiPriMe forests, a novel tree-based ensemble method for regression and classification problems, that ensures differential privacy while maintaining high utility. We construct trees based on a privatized version of the median value of attributes, obtained via the exponential mechanism. The use of the noisy median encourages balanced leaf nodes, ensuring that the noise added to the parameter estimate at each leaf is not overly large. The resulting algorithm, which is appropriate for real or categorical covariates, exhibits high utility while ensuring differential privacy.


A Survey of Machine Learning Methods and Challenges for Windows Malware Classification

arXiv.org Machine Learning

Malware classification is a difficult problem, to which machine learning methods have been applied for decades. Yet progress has often been slow, in part due to a number of unique difficulties with the task that occur through all stages of the developing a machine learning system: data collection, labeling, feature creation and selection, model selection, and evaluation. In this survey we will review a number of the current methods and challenges related to malware classification, including data collection, feature extraction, and model construction, and evaluation. Our discussion will include thoughts on the constraints that must be considered for machine learning based solutions in this domain, and yet to be tackled problems for which machine learning could also provide a solution. This survey aims to be useful both to cybersecurity practitioners who wish to learn more about how machine learning can be applied to the malware problem, and to give data scientists the necessary background into the challenges in this uniquely complicated space.


Detection of Coincidentally Correct Test Cases through Random Forests

arXiv.org Artificial Intelligence

The performance of coverage-based fault localization greatly depends on the quality of test cases being executed. These test cases execute some lines of the given program and determine whether the underlying tests are passed or failed. In particular, some test cases may be well-behaved (i.e., passed) while executing faulty statements. These test cases, also known as coincidentally correct test cases, may negatively influence the performance of the spectra-based fault localization and thus be less helpful as a tool for the purpose of automated debugging. In other words, the involvement of these coincidentally correct test cases may introduce noises to the fault localization computation and thus cause in divergence of effectively localizing the location of possible bugs in the given code. In this paper, we propose a hybrid approach of ensemble learning combined with a supervised learning algorithm namely, Random Forests (RF) for the purpose of correctly identifying test cases that are mislabeled to be the passing test cases. A cost-effective analysis of flipping the test status or trimming (i.e., eliminating from the computation) the coincidental correct test cases is also reported. Many studies and research have been conducted on improving the performance of coverage-based fault localization in spotting faults' location in programs.


ResOT: Resource-Efficient Oblique Trees for Neural Signal Classification

arXiv.org Machine Learning

Classifiers that can be implemented on chip with minimal computational and memory resources are essential for edge computing in emerging applications such as medical and IoT devices. This paper introduces a machine learning model based on oblique decision trees to enable resource-efficient classification on a neural implant. By integrating model compression with probabilistic routing and implementing cost-aware learning, our proposed model could significantly reduce the memory and hardware cost compared to state-of-the-art models, while maintaining the classification accuracy. We trained the resource-efficient oblique tree with power-efficient regularization (ResOT-PE) on three neural classification tasks to evaluate the performance, memory, and hardware requirements. On seizure detection task, we were able to reduce the model size by 3.4X and the feature extraction cost by 14.6X compared to the ensemble of boosted trees, using the intracranial EEG from 10 epilepsy patients. In a second experiment, we tested the ResOT-PE model on tremor detection for Parkinson's disease, using the local field potentials from 12 patients implanted with a deep-brain stimulation (DBS) device. We achieved a comparable classification performance as the state-of-the-art boosted tree ensemble, while reducing the model size and feature extraction cost by 10.6X and 6.8X, respectively. We also tested on a 6-class finger movement detection task using ECoG recordings from 9 subjects, reducing the model size by 17.6X and feature computation cost by 5.1X. The proposed model can enable a low-power and memory-efficient implementation of classifiers for real-time neurological disease detection and motor decoding.


Generalizing Gain Penalization for Feature Selection in Tree-based Models

arXiv.org Machine Learning

We develop a new approach for feature selection via gain penalization in tree-based models. First, we show that previous methods do not perform sufficient regularization and often exhibit sub-optimal out-of-sample performance, especially when correlated features are present. Instead, we develop a new gain penalization idea that exhibits a general local-global regularization for tree-based models. The new method allows for more flexibility in the choice of feature-specific importance weights. We validate our method on both simulated and real data and implement itas an extension of the popular R package ranger.


The Backbone Method for Ultra-High Dimensional Sparse Machine Learning

arXiv.org Machine Learning

We present the backbone method, a generic framework that enables sparse and interpretable supervised machine learning methods to scale to ultra-high dimensional problems. We solve, in minutes, sparse regression problems with $p\sim10^7$ features and decision tree induction problems with $p\sim10^5$ features. The proposed method operates in two phases; we first determine the backbone set, that consists of potentially relevant features, by solving a number of tractable subproblems; then, we solve a reduced problem, considering only the backbone features. Numerical experiments demonstrate that our method competes with optimal solutions, when exact methods apply, and substantially outperforms baseline heuristics, when exact methods do not scale, both in terms of recovering the true relevant features and in its out-of-sample predictive performance.


Volkswagen will start delivering ID.3 EVs to Europe in September

Engadget

Volkswagen's ID.3 EV finally has an official release date. Consumers in most European countries will be able to put a deposit on the EV starting on June 17th. Those who do will have two delivery options. They'll be able to get the EV either as soon as possible or later in the year. The earliest deliveries will take place in September, with the latter ones scheduled for Q4 2020.


Interpretable Random Forests via Rule Extraction

arXiv.org Artificial Intelligence

We introduce SIRUS (Stable and Interpretable RUle Set) for regression, a stable rule learning algorithm which takes the form of a short and simple list of rules. State-of-the-art learning algorithms are often referred to as ''black boxes'' because of the high number of operations involved in their prediction process. Despite their powerful predictivity, this lack of interpretability may be highly restrictive for applications with critical decisions at stake. On the other hand, algorithms with a simple structure-typically decision trees, rule algorithms, or sparse linear models-are well known for their instability. This undesirable feature makes the conclusions of the data analysis unreliable and turns out to be a strong operational limitation. This motivates the design of SIRUS, which combines a simple structure with a remarkable stable behavior when data is perturbed. The algorithm is based on random forests, the predictive accuracy of which is preserved. We demonstrate the efficiency of the method both empirically (through experiments) and theoretically (with the proof of its asymptotic stability). Our R/C++ software implementation sirus is available from CRAN.


Propositionalization and Embeddings: Two Sides of the Same Coin

arXiv.org Machine Learning

Data preprocessing is an important component of machine learning pipelines, which requires ample time and resources. An integral part of preprocessing is data transformation into the format required by a given learning algorithm. This paper outlines some of the modern data processing techniques used in relational learning that enable data fusion from different input data types and formats into a single table data representation, focusing on the propositionalization and embedding data transformation approaches. While both approaches aim at transforming data into tabular data format, they use different terminology and task definitions, are perceived to address different goals, and are used in different contexts. This paper contributes a unifying framework that allows for improved understanding of these two data transformation techniques by presenting their unified definitions, and by explaining the similarities and differences between the two approaches as variants of a unified complex data transformation task. In addition to the unifying framework, the novelty of this paper is a unifying methodology combining propositionalization and embeddings, which benefits from the advantages of both in solving complex data transformation and learning tasks. We present two efficient implementations of the unifying methodology: an instance-based PropDRM approach, and a feature-based PropStar approach to data transformation and learning, together with their empirical evaluation on several relational problems. The results show that the new algorithms can outperform existing relational learners and can solve much larger problems.


Nonparametric Feature Impact and Importance

arXiv.org Machine Learning

Practitioners use feature importance to rank and eliminate weak predictors during model development in an effort to simplify models and improve generality. Unfortunately, they also routinely conflate such feature importance measures with feature impact, the isolated effect of an explanatory variable on the response variable. This can lead to real-world consequences when importance is inappropriately interpreted as impact for business or medical insight purposes. The dominant approach for computing importances is through interrogation of a fitted model, which works well for feature selection, but gives distorted measures of feature impact. The same method applied to the same data set can yield different feature importances, depending on the model, leading us to conclude that impact should be computed directly from the data. While there are nonparametric feature selection algorithms, they typically provide feature rankings, rather than measures of impact or importance. They also typically focus on single-variable associations with the response. In this paper, we give mathematical definitions of feature impact and importance, derived from partial dependence curves, that operate directly on the data. To assess quality, we show that features ranked by these definitions are competitive with existing feature selection techniques using three real data sets for predictive tasks.