Decision Tree Learning
Locally Optimized Random Forests
Coleman, Tim, Kaufeld, Kimberly, Dorn, Mary Frances, Mentch, Lucas
Standard supervised learning procedures are validated against a test set that is assumed to have come from the same distribution as the training data. However, in many problems, the test data may have come from a different distribution. We consider the case of having many labeled observations from one distribution, $P_1$, and making predictions at unlabeled points that come from $P_2$. We combine the high predictive accuracy of random forests (Breiman, 2001) with an importance sampling scheme, where the splits and predictions of the base-trees are done in a weighted manner, which we call Locally Optimized Random Forests. These weights correspond to a non-parametric estimate of the likelihood ratio between the training and test distributions. To estimate these ratios with an unlabeled test set, we make the covariate shift assumption, where the differences in distribution are only a function of the training distributions (Shimodaira, 2000). This methodology is motivated by the problem of forecasting power outages during hurricanes. The extreme nature of the most devastating hurricanes means that typical validation setups will overly favor less extreme storms. Our method provides a data-driven means of adapting a machine learning method to deal with extreme events.
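The weighted-split idea in this abstract can be illustrated with a single regression stump. The following is a minimal sketch, not the authors' implementation: it assumes the training and test covariate densities are known 1-D Gaussians (the paper estimates the ratio non-parametrically), and the function names, toy data, and Gaussian parameters are all hypothetical.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a 1-D Gaussian at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def likelihood_ratio(x, train=(0.0, 1.0), test=(3.0, 1.0)):
    """Importance weight w(x) = p2(x) / p1(x); both densities are ASSUMED known here."""
    return gaussian_pdf(x, *test) / gaussian_pdf(x, *train)

def weighted_sse(ys, ws):
    """Weighted sum of squared errors around the weighted mean."""
    mean = sum(w * y for y, w in zip(ys, ws)) / sum(ws)
    return sum(w * (y - mean) ** 2 for y, w in zip(ys, ws))

def best_weighted_split(xs, ys, ws):
    """Threshold of the one-split stump that minimises the weighted squared error."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    best_t, best_loss = None, float("inf")
    for k in range(1, len(order)):
        lo, hi = order[:k], order[k:]
        if xs[lo[-1]] == xs[hi[0]]:
            continue  # no threshold can separate equal covariate values
        loss = (weighted_sse([ys[i] for i in lo], [ws[i] for i in lo])
                + weighted_sse([ys[i] for i in hi], [ws[i] for i in hi]))
        if loss < best_loss:
            best_t, best_loss = 0.5 * (xs[lo[-1]] + xs[hi[0]]), loss
    return best_t

# Toy data: a large jump near x = 0.5 and a smaller one near x = 2.5.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.0, 10.0, 10.0, 13.0]
uniform_split = best_weighted_split(xs, ys, [1.0] * len(xs))
shifted_split = best_weighted_split(xs, ys, [likelihood_ratio(x) for x in xs])
```

With uniform weights the stump splits at the large jump, where most of the training variance lies. With the importance weights it splits at the smaller jump in the region where the assumed test distribution concentrates, which is the kind of local adaptation the abstract describes.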
E-MIIM: An Ensemble Learning based Context-Aware Mobile Telephony Model for Intelligent Interruption Management
Sarker, Iqbal H., Kayes, A. S. M., Furhad, Md Hasan, Islam, Mohammad Mainul, Islam, Md Shohidul
Nowadays, mobile telephony interruptions in our daily activities are common because of inappropriate ringing notifications of incoming phone calls in different contexts. Such interruptions may affect the attention not only of mobile phone owners but also of the people around them. The decision tree is the most popular machine learning classification technique used in existing context-aware mobile intelligent interruption management (MIIM) models to address such issues. However, a context-aware model based on a single decision tree may overfit and thus decrease the prediction accuracy of the inferred model. Therefore, in this paper, we propose an ensemble machine learning based context-aware mobile telephony model for intelligent interruption management that takes multi-dimensional contexts into account, and name it "E-MIIM". The experimental results on individuals' real-life mobile telephony datasets show that our E-MIIM model is more effective than, and outperforms, the existing MIIM model for predicting and managing an individual's mobile telephony interruptions based on their relevant contextual information.
Investigation of wind pressures on tall building under interference effects using machine learning techniques
Hu, Gang, Liu, Lingbo, Tao, Dacheng, Song, Jie, Kwok, K. C. S.
Interference effects of tall buildings have attracted numerous studies due to the boom of clusters of tall buildings in megacities. Fully understanding the interference effects of buildings often requires a substantial number of wind tunnel tests. Limited wind tunnel tests that cover only part of the interference scenarios are unable to fully reveal the interference effects. This study used machine learning techniques to resolve the conflict between limited wind tunnel tests that produce unreliable results and a complete investigation of the interference effects that is costly and time-consuming. Four machine learning models, namely decision tree, random forest, XGBoost, and generative adversarial networks (GANs), were trained on 30% of a dataset to predict both mean and fluctuating pressure coefficients on the principal building. The GANs model exhibited the best performance in predicting these pressure coefficients. A number of GANs models were then trained on different portions of the dataset ranging from 10% to 90%. It was found that the GANs model trained on 30% of the dataset is capable of accurately predicting both mean and fluctuating pressure coefficients under unseen interference conditions. By using this GANs model, 70% of the wind tunnel test cases can be saved, largely alleviating the cost of this kind of wind tunnel testing study.
TabNet: Attentive Interpretable Tabular Learning
Arik, Sercan O., Pfister, Tomas
We propose a novel high-performance interpretable deep tabular data learning network, TabNet. TabNet utilizes a sequential attention mechanism to choose which features to reason from at each decision step and then aggregates the processed information towards the final decision. Explicit selection of sparse features enables more efficient learning as the model capacity at each decision step is fully utilized for the most relevant features, and also more interpretable decision making via visualization of selection masks. We demonstrate that TabNet outperforms other neural network and decision tree variants on a wide range of tabular data learning datasets while yielding interpretable feature attributions and insights into the global model behavior.
Random Forests for Store Forecasting at Walmart Scale
The SMART Forecasting team at Walmart Labs is tasked with providing demand forecasts for over 70 million store-item combinations every week! For example, just how much of every type of ginger needs to go to every Walmart store in the U.S., every week for the next 52 weeks, with the goal of improving in stocks and reducing food waste. Our algorithm strategy was to build a suite of machine learning models and deploy them at scale to generate bespoke solutions for (oh so many!) store-item-week combinations. Random Forests would be part of this suite. We went through the traditional model development workflow of data discovery, identifying demand drivers, feature engineering, training, cross validation and testing.
SIRUS: making random forests interpretable
Bénard, Clément, Biau, Gérard, da Veiga, Sébastien, Scornet, Erwan
State-of-the-art learning algorithms, such as random forests or neural networks, are often qualified as "black-boxes" because of the high number and complexity of operations involved in their prediction mechanism. This lack of interpretability is a strong limitation for applications involving critical decisions, typically the analysis of production processes in the manufacturing industry. In such critical contexts, models have to be interpretable, i.e., simple, stable, and predictive. To address this issue, we design SIRUS (Stable and Interpretable RUle Set), a new classification algorithm based on random forests, which takes the form of a short list of rules. While simple models are usually unstable with respect to data perturbation, SIRUS achieves a remarkable stability improvement over cutting-edge methods. Furthermore, SIRUS inherits a predictive accuracy close to random forests, combined with the simplicity of decision trees. These properties are assessed both from a theoretical and empirical point of view, through extensive numerical experiments based on our R/C++ software implementation sirus.
AI Predicts Independent Construction Safety Outcomes from Universal Attributes
Baker, Henrietta, Hallowell, Matthew R., Tixier, Antoine J. -P.
These programs rely on patterns and inference, rather than explicit instructions, to achieve their aims [5]. ML in construction has been developed significantly since 1991 when [6] first discussed the potential of neural networks in construction engineering and management. Early examples of ML in construction include applications such as [7] where the AQ15 algorithm was applied to automatically learn the mapping between constructability (poor, good, excellent) and 7 predictors from a collection of 31 training examples; and [8] who applied decision trees and neural networks to a construction management database to identify the causes of delays. Many subsequent prediction applications applied support vector machines (SVMs), owing to their consistently high accuracy. These applications include [9], who accurately forecasted contractor prequalification using input variables such as financial strength and current workload; [10], who estimated building cost and loss risk from ten input variables; and [11], who detected concrete structural components in color images from actual construction sites. In the last 5 years, use of ML in construction has become far more widespread and the methods and applications used are far more diverse.
Uplift Modeling for Multiple Treatments with Cost Optimization
Uplift modeling is an emerging machine learning approach for estimating the treatment effect at an individual or subgroup level. It can be used for optimizing the performance of interventions such as marketing campaigns and product designs. Uplift modeling can be used to estimate which users are likely to benefit from a treatment and then prioritize delivering or promoting the preferred experience to those users. An important but so far neglected use case for uplift modeling is an experiment with multiple treatment groups that have different costs, for example when different communication channels and promotion types are tested simultaneously. In this paper, we extend standard uplift models to support multiple treatment groups with different costs. We evaluate the performance of the proposed models using both synthetic and real data. We also describe a production implementation of the approach. Uplift modeling [1]-[8] is a technique to estimate and predict the individual-level or subgroup-level causal effects of different treatments in an experiment. This type of information is useful for designing and offering a personalized experience to improve user experience, satisfaction, and engagement. Uplift modeling is therefore commonly used in areas such as marketing, customer service, and product offering. It is helpful to think about uplift modeling in the context of randomized experiments (also known as A/B testing [9]-[11]). In a typical experiment, users are randomly assigned to each treatment group and causal effects are then estimated for the population.
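To make the multiple-treatment, cost-aware setting concrete, the sketch below works at the simplest level: one subgroup from a randomized experiment, where uplift is the difference between a treatment group's mean outcome and the control mean. It is an illustrative assumption, not the paper's model, that the quantity to maximise is uplift minus a per-user cost. The function names, group labels, and toy numbers are hypothetical.

```python
from collections import defaultdict

def net_uplift_by_treatment(records, costs, control="control"):
    """Mean outcome of each treatment minus the control mean, minus its per-user cost.

    records: iterable of (group_name, outcome_value) from a randomized experiment.
    costs:   dict mapping treatment name to an ASSUMED per-user cost, in outcome units.
    """
    totals, counts = defaultdict(float), defaultdict(int)
    for group, outcome in records:
        totals[group] += outcome
        counts[group] += 1
    control_mean = totals[control] / counts[control]
    return {
        group: totals[group] / counts[group] - control_mean - costs.get(group, 0.0)
        for group in counts
        if group != control
    }

def choose_treatment(records, costs, control="control"):
    """Pick the treatment with the largest positive net uplift, else fall back to control."""
    net = net_uplift_by_treatment(records, costs, control)
    if not net:
        return control
    best = max(net, key=net.get)
    return best if net[best] > 0 else control

# Toy experiment: the coupon lifts the outcome most but costs more than it returns.
records = [
    ("control", 1.0), ("control", 1.0),
    ("email", 2.0), ("email", 2.0),
    ("coupon", 4.0), ("coupon", 4.0),
]
costs = {"email": 0.1, "coupon": 3.5}
net = net_uplift_by_treatment(records, costs)
chosen = choose_treatment(records, costs)
```

In this toy data the treatment with the largest raw uplift is not the one selected once costs are subtracted. A full uplift model would condition these estimates on user features rather than use group averages.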
r/MachineLearning - [P] Updates to Incredicat, my attempt at a 20 questions style game powered by Cat AI
I posted this a few months ago and had some great feedback. I've put some work into the model and have just released the latest update. It uses a modified version of C4.5 decision trees and a load of other adjustments. Think it is working better now after some changes around the classification process.
Detecting Heterogeneous Treatment Effect with Instrumental Variables
Johnson, Michael, Cao, Jiongyi, Kang, Hyunseung
There is an increasing interest in estimating heterogeneity in causal effects in randomized and observational studies. However, little research has been conducted to understand heterogeneity in an instrumental variables study. In this work, we present a method to estimate heterogeneous causal effects using an instrumental variable approach. The method has two parts. The first part uses subject-matter knowledge and interpretable machine learning techniques, such as classification and regression trees, to discover potential effect modifiers. The second part uses closed testing to test for the statistical significance of the effect modifiers while strongly controlling the familywise error rate. We applied this method to the Oregon Health Insurance Experiment, estimating the effect of Medicaid on the number of days an individual's health does not impede their usual activities, and found evidence of heterogeneity in older men who prefer English and don't self-identify as Asian and younger individuals who have at most a high school diploma or GED and prefer English.
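The closed testing step mentioned in this abstract can be sketched generically. Under closed testing, an individual hypothesis is rejected only if every intersection of hypotheses that contains it is rejected by a local level-alpha test. The abstract does not say which local test the authors use; the sketch below assumes a Bonferroni local test, and the hypothesis names and p-values are made up for illustration rather than taken from the study.

```python
from itertools import combinations

def closed_testing(p_values, alpha=0.05):
    """Return the hypotheses rejected by closed testing with Bonferroni local tests.

    p_values: dict mapping a hypothesis name to its p-value.
    Enumerates all intersections, so the cost grows exponentially with the
    number of hypotheses; suitable only for small families.
    """
    names = list(p_values)

    def local_reject(subset):
        # ASSUMED local test: Bonferroni on the intersection hypothesis.
        return min(p_values[h] for h in subset) <= alpha / len(subset)

    rejected = set()
    for h in names:
        others = [g for g in names if g != h]
        if all(
            local_reject((h,) + extra)
            for k in range(len(others) + 1)
            for extra in combinations(others, k)
        ):
            rejected.add(h)
    return rejected

# Hypothetical effect-modifier hypotheses with illustrative p-values.
example_p = {"age": 0.01, "language": 0.02, "education": 0.30}
rejected = closed_testing(example_p, alpha=0.05)
```

With Bonferroni local tests this procedure coincides with Holm's step-down method, which gives a quick way to check the result by hand.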