Decision Tree Learning
MC2: Secure Collaborative Analytics for Machine Learning
Machine Learning (ML) has gained prominence in recent years because of its ability to be applied across scores of industries and solve complex problems effectively. Yet, research shows that nearly 90% of AI/ML models never actually make it into production or hit the market. The main challenge is that ML/AI models require huge volumes of high-quality, accurate, and timely data to be effective, but organizations have long been reluctant to share sensitive information due to security and privacy concerns. Personal data is becoming more pervasive, causing privacy concerns to grow. As a result, global data protection laws have become stricter, and organizations face increasingly higher noncompliance risks. Mitigating such concerns and taking AI/ML to the next level requires a new approach to collaboration -- secure collaborative learning.
Review on Classification Techniques used in Biophysiological Stress Monitoring
Iqbal, Talha, Elahi, Adnan, Shahzad, Atif, Wijns, William
Cardiovascular activities are directly related to the response of a body in a stressed condition. Stress, based on its intensity, can be divided into two types, i.e. acute stress (short-term stress) and chronic stress (long-term stress). Repeated acute stress and continuous chronic stress may play a vital role in inflammation in the circulatory system and thus lead to a heart attack or a stroke. In this study, we have reviewed commonly used machine learning classification techniques applied to different stress-indicating parameters used in stress monitoring devices. These parameters include Photoplethysmograph (PPG), Electrocardiograph (ECG), Electromyograph (EMG), Galvanic Skin Response (GSR), Heart Rate Variability (HRV), skin temperature, respiratory rate, Electroencephalograph (EEG) and salivary cortisol. This study also provides a discussion on choosing a classifier, which depends upon a number of factors other than accuracy, such as the number of subjects involved in an experiment, the type of signal processing and computational limitations.
[2210.14518v1] Which Factors Matter Most? Can Startup Valuation be Micro-Targeted?
While startup valuations are influenced by revenues, risks, age, and macroeconomic conditions, specific causality is traditionally a black box. Because valuations are not disclosed, roles played by other factors (industry, geography, and intellectual property) can often only be guessed at. VC valuation research indicates the importance of establishing a factor-hierarchy to better understand startup valuations and their dynamics, suggesting the wisdom of hiring data-scientists for this purpose. Bespoke understanding can be established via construction of hierarchical prediction models based on decision trees and random forests. These have the advantage of revealing which factors matter most. In combination with OLS, they also tell us the circumstances in which specific causalities apply. This study explores the deterministic role of categorical variables on the valuation of start-ups (i.e. the joint-combination geographic, urban, and sectoral denomination-variables), in order to be able to build a generalized valuation scorecard approach. Using a dataset of 1,091 venture-capital investments, containing 1,044 unique EU and EEA, this study examines microeconomic, sectoral, and local-level impacts on startup valuation. In principle, the study relies on fixed-effects and joint-fixed-effects regressions as well as the analysis and exploration of divergent micropopulations and fault-lines by means of non-parametric approaches combining econometric and machine-learning techniques.
Exploring the Whole Rashomon Set of Sparse Decision Trees
Xin, Rui, Zhong, Chudi, Chen, Zhi, Takagi, Takuya, Seltzer, Margo, Rudin, Cynthia
In any given machine learning problem, there might be many models that explain the data almost equally well. However, most learning algorithms return only one of these models, leaving practitioners with no practical way to explore alternative models that might have desirable properties beyond what could be expressed by a loss function. The Rashomon set is the set of all these almost-optimal models. Rashomon sets can be large in size and complicated in structure, particularly for highly nonlinear function classes that allow complex interaction terms, such as decision trees. We provide the first technique for completely enumerating the Rashomon set for sparse decision trees; in fact, our work provides the first complete enumeration of any Rashomon set for a non-trivial problem with a highly nonlinear discrete function class. This allows the user an unprecedented level of control over model choice among all models that are approximately equally good. We represent the Rashomon set in a specialized data structure that supports efficient querying and sampling. We show three applications of the Rashomon set: 1) it can be used to study variable importance for the set of almost-optimal trees (as opposed to a single tree), 2) the Rashomon set for accuracy enables enumeration of the Rashomon sets for balanced accuracy and F1-score, and 3) the Rashomon set for a full dataset can be used to produce Rashomon sets constructed with only subsets of the data set. Thus, we are able to examine Rashomon sets across problems with a new lens, enabling users to choose models rather than be at the mercy of an algorithm that produces only a single model.
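To make the definition of a Rashomon set concrete, the following toy sketch enumerates by brute force every depth-one tree (stump) on a tiny dataset and keeps those whose error is within epsilon of the best. It illustrates the definition only and is not the authors' enumeration technique or data structure; the dataset, the stump encoding `(feature, left_label, right_label)`, and the function names are all hypothetical, and the loss is plain misclassification error with no sparsity penalty.

```python
from itertools import product

# Hypothetical toy data: rows of binary features with binary labels.
X = [(0, 0), (0, 1), (1, 0), (1, 1), (1, 1), (0, 0)]
y = [0, 0, 1, 1, 1, 1]


def stump_predict(model, row):
    """A stump splits on one binary feature and predicts a label per side."""
    feature, left_label, right_label = model
    return right_label if row[feature] == 1 else left_label


def misclassification(model):
    """Fraction of training rows the stump labels incorrectly."""
    errors = sum(stump_predict(model, row) != target for row, target in zip(X, y))
    return errors / len(y)


def rashomon_set(epsilon):
    """All stumps whose error is within epsilon of the best stump's error."""
    models = list(product(range(len(X[0])), (0, 1), (0, 1)))
    losses = {m: misclassification(m) for m in models}
    best = min(losses.values())
    return sorted(m for m, loss in losses.items() if loss <= best + epsilon)


if __name__ == "__main__":
    for eps in (0.0, 0.2):
        print(eps, rashomon_set(eps))
```

Even in this tiny space, widening epsilon admits several distinct parameterisations, which hints at why exhaustive enumeration becomes hard for deeper trees.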
Fast Optimization of Weighted Sparse Decision Trees for use in Optimal Treatment Regimes and Optimal Policy Design
Behrouz, Ali, Lecuyer, Mathias, Rudin, Cynthia, Seltzer, Margo
Sparse decision trees are one of the most common forms of interpretable models. While recent advances have produced algorithms that fully optimize sparse decision trees for prediction, that work does not address policy design, because the algorithms cannot handle weighted data samples. Specifically, they rely on the discreteness of the loss function, which means that real-valued weights cannot be directly used. For example, none of the existing techniques produce policies that incorporate inverse propensity weighting on individual data points. We present three algorithms for efficient sparse weighted decision tree optimization. The first approach directly optimizes the weighted loss function; however, it tends to be computationally inefficient for large datasets. Our second approach, which scales more efficiently, transforms weights to integer values and uses data duplication to transform the weighted decision tree optimization problem into an unweighted (but larger) counterpart. Our third algorithm, which scales to much larger datasets, uses a randomized procedure that samples each data point with a probability proportional to its weight. We present theoretical bounds on the error of the two fast methods and show experimentally that these methods can be two orders of magnitude faster than the direct optimization of the weighted loss, without losing significant accuracy.
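The two faster ideas described above, duplicating points according to integer weights and sampling points in proportion to their weights, can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the `scale` factor, the use of simple rounding, and the function names are hypothetical choices made for the example.

```python
import random


def duplicate_by_weight(samples, weights, scale=10):
    """Turn a weighted dataset into an unweighted, larger one.

    Each weight is multiplied by `scale` (an assumed factor), rounded to an
    integer, and the sample is repeated that many times.
    """
    expanded = []
    for sample, weight in zip(samples, weights):
        copies = round(weight * scale)
        expanded.extend([sample] * copies)
    return expanded


def sample_by_weight(samples, weights, n, seed=0):
    """Draw n points with replacement, each with probability proportional to its weight."""
    rng = random.Random(seed)
    return rng.choices(samples, weights=weights, k=n)


if __name__ == "__main__":
    data = ["a", "b", "c"]
    w = [0.5, 1.0, 2.0]
    print(duplicate_by_weight(data, w, scale=2))
    print(sample_by_weight(data, w, n=5))
```

Either output can then be passed to an optimizer that only supports unweighted data. The trade-off is that duplication grows the dataset with the scale factor, while sampling keeps the size fixed at the cost of randomness.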
No imputation without representation
Lenz, Oliver Urs, Peralta, Daniel, Cornelis, Chris
Imputation allows datasets to be used with algorithms that cannot handle missing values by themselves. However, missing values may in principle contribute useful information that is lost through imputation. The missing-indicator approach can be used to preserve this information. There are several theoretical considerations why missing-indicators may or may not be beneficial, but there has not been any large-scale practical experiment on real-life datasets to test this question for machine learning predictions. We perform this experiment for three imputation strategies and a range of different classification algorithms, on the basis of twenty real-life datasets. We find that missing-indicators generally increase classification performance, and that nearest neighbour and iterative imputation do not lead to better performance than simple mean/mode imputation. Therefore, we recommend the use of missing-indicators with mean/mode imputation as a safe default, with the caveat that for decision trees, pruning is necessary to prevent overfitting.
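The recommended default, mean imputation combined with missing-indicators, can be sketched in a few lines. The version below is a simplified illustration for numeric features only (mode imputation for categorical features is omitted). The function name, the use of `None` to mark missing values, and the choice to append one indicator per column are assumptions made for the example.

```python
def impute_with_indicators(rows):
    """Mean-impute missing entries (None) column by column and append
    one 0/1 missing-indicator per original column to every row."""
    n_cols = len(rows[0])

    # Column means computed from observed values only.
    means = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        means.append(sum(observed) / len(observed) if observed else 0.0)

    result = []
    for row in rows:
        values = [means[j] if row[j] is None else row[j] for j in range(n_cols)]
        flags = [1 if row[j] is None else 0 for j in range(n_cols)]
        result.append(values + flags)
    return result


if __name__ == "__main__":
    data = [[1.0, None], [3.0, 4.0], [None, 8.0]]
    for row in impute_with_indicators(data):
        print(row)
```

The appended flags let a downstream classifier use the fact that a value was missing, information that imputation alone would discard.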
Improving Data Quality with Training Dynamics of Gradient Boosting Decision Trees
Ponti, Moacir Antonelli, Oliveira, Lucas de Angelis, Román, Juan Martín, Argerich, Luis
Real-world datasets contain incorrectly labeled instances that hamper the performance of the model and, in particular, its ability to generalize out of distribution. Also, each example might contribute differently to learning. This motivates studies that seek a better understanding of the role of data instances with respect to their contribution to good model metrics. In this paper we propose a method based on metrics computed from training dynamics of Gradient Boosting Decision Trees (GBDTs) to assess the behavior of each training example. We focus on datasets containing mostly tabular or structured data, for which Decision Tree ensembles are still the state of the art in terms of performance. We show results on detecting noisy labels in order to remove them, improving models' metrics in synthetic and real datasets, as well as a productive dataset. Our methods achieved the best results overall when compared with confident learning and heuristics.
Comparing Machine Learning Techniques for Alfalfa Biomass Yield Prediction
Vance, Jonathan, Rasheed, Khaled, Missaoui, Ali, Maier, Frederick, Adkins, Christian, Whitmire, Chris
The alfalfa crop is globally important as livestock feed, so highly efficient planting and harvesting could benefit many industries, especially as the global climate changes and traditional methods become less accurate. Recent work using machine learning (ML) to predict yields for alfalfa and other crops has shown promise. Previous efforts used remote sensing, weather, planting, and soil data to train machine learning models for yield prediction. However, while remote sensing works well, the models require large amounts of data and cannot make predictions until the harvesting season begins. Using weather and planting data from alfalfa variety trials in Kentucky and Georgia, our previous work compared feature selection techniques to find the best technique and best feature set. In this work, we trained a variety of machine learning models, using cross validation for hyperparameter optimization, to predict biomass yields, and we showed better accuracy than similar work that employed more complex techniques. Our best individual model was a random forest with a mean absolute error of 0.081 tons/acre and $R^2$ of 0.941. Next, we expanded this dataset to include Wisconsin and Mississippi, and we repeated our experiments, obtaining a higher best $R^2$ of 0.982 with a regression tree. We then isolated our testing datasets by state to explore this problem's eligibility for domain adaptation (DA), as we trained on multiple source states and tested on one target state. This Trivial DA (TDA) approach leaves plenty of room for improvement through exploring more complex DA techniques in forthcoming work.
Distributional Adaptive Soft Regression Trees
Umlauf, Nikolaus, Klein, Nadja
Random forests are an ensemble method relevant for many problems, such as regression or classification. They are popular due to their good predictive performance (compared to, e.g., decision trees) while requiring only minimal tuning of hyperparameters. They are built via aggregation of multiple regression trees during training and are usually calculated recursively using hard splitting rules. Recently, regression forests have been incorporated into the framework of distributional regression, a currently popular regression approach that aims at estimating complete conditional distributions rather than only relating the mean of an output variable to input features, as is done classically. This article proposes a new type of distributional regression tree using a multivariate soft split rule. One great advantage of the soft split is that smooth high-dimensional functions can be estimated with only one tree while the complexity of the function is controlled adaptively by information criteria. Moreover, the search for the optimal split variable is obsolete. We show by means of extensive simulation studies that the algorithm has excellent properties and outperforms various benchmark methods, especially in the presence of complex non-linear feature interactions. Finally, we illustrate the usefulness of our approach with an example on probabilistic forecasts for the Sun's activity.
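The difference between a hard split and a multivariate soft split can be illustrated with a single node. In the sketch below, a logistic function of a linear combination of all features decides how much weight the left and right leaf values receive, so the prediction varies smoothly with the inputs. The logistic gate, the parameter names, and the scalar leaf values are assumptions made for illustration; the article's exact split rule and its distributional leaf parameters may differ.

```python
import math


def soft_split_predict(x, weights, bias, left_value, right_value):
    """Blend two leaf values with a smooth, multivariate gate.

    A hard split would send x entirely left or right based on one feature.
    Here the gate uses all features and returns a weighted average.
    """
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    p_left = 1.0 / (1.0 + math.exp(-z))  # share of the left leaf, in (0, 1)
    return p_left * left_value + (1.0 - p_left) * right_value


if __name__ == "__main__":
    # On the decision boundary (z = 0) the two leaves contribute equally.
    print(soft_split_predict([1.0, 2.0], [2.0, -1.0], 0.0, 10.0, 20.0))
    # Far from the boundary the prediction approaches a single leaf value.
    print(soft_split_predict([10.0], [5.0], 0.0, 10.0, 20.0))
```

Because the gate is a function of every feature at once, no search over individual split variables is needed in this formulation.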
Robust Trees for Security
Tree models are widely used for security tasks, such as detecting malicious autonomous systems, social engineering, malware distribution, phishing emails, advertising resources for ad blockers, and online scams. Despite their popularity, the robustness of tree models has not been thoroughly studied in the context of security applications. In this post, I will show how to train robust trees to detect Twitter spam. Our most exciting result is that we can increase the feature manipulation cost for adaptive attackers to evade the robust tree ensemble by 10.6X. We used the dataset from Kwon et al. and re-extracted 25 features.