Ensemble Learning
A generalized decision tree ensemble based on the NeuralNetworks architecture: Distributed Gradient Boosting Forest (DGBF)
Delgado-Panadero, Ángel, Benítez-Andrades, José Alberto, García-Ordás, María Teresa
Tree ensemble algorithms as RandomForest and GradientBoosting are currently the dominant methods for modeling discrete or tabular data, however, they are unable to perform a hierarchical representation learning from raw data as NeuralNetworks does thanks to its multi-layered structure, which is a key feature for DeepLearning problems and modeling unstructured data. This limitation is due to the fact that tree algorithms can not be trained with back-propagation because of their mathematical nature. However, in this work, we demonstrate that the mathematical formulation of bagging and boosting can be combined together to define a graph-structured-tree-ensemble algorithm with a distributed representation learning process between trees naturally (without using back-propagation). We call this novel approach Distributed Gradient Boosting Forest (DGBF) and we demonstrate that both RandomForest and GradientBoosting can be expressed as particular graph architectures of DGBT. Finally, we see that the distributed learning outperforms both RandomForest and GradientBoosting in 7 out of 9 datasets.
Exploring the Effects of Population and Employment Characteristics on Truck Flows: An Analysis of NextGen NHTS Origin-Destination Data
Uddin, Majbah, Liu, Yuandong, Lim, Hyeonsup
Truck transportation remains the dominant mode of US freight transportation because of its advantages, such as the flexibility of accessing pickup and drop-off points and faster delivery. Because of the massive freight volume transported by trucks, understanding the effects of population and employment characteristics on truck flows is critical for better transportation planning and investment decisions. The US Federal Highway Administration published a truck travel origin-destination data set as part of the Next Generation National Household Travel Survey program. This data set contains the total number of truck trips in 2020 within and between 583 predefined zones encompassing metropolitan and nonmetropolitan statistical areas within each state and Washington, DC. In this study, origin-destination-level truck trip flow data was augmented to include zone-level population and employment characteristics from the US Census Bureau. Census population and County Business Patterns data were included. The final data set was used to train a machine learning algorithm-based model, Extreme Gradient Boosting (XGBoost), where the target variable is the number of total truck trips. Shapley Additive ExPlanation (SHAP) was adopted to explain the model results. Results showed that the distance between the zones was the most important variable and had a nonlinear relationship with truck flows.
Combining T-learning and DR-learning: a framework for oracle-efficient estimation of causal contrasts
van der Laan, Lars, Carone, Marco, Luedtke, Alex
We introduce efficient plug-in (EP) learning, a novel framework for the estimation of heterogeneous causal contrasts, such as the conditional average treatment effect and conditional relative risk. The EP-learning framework enjoys the same oracle-efficiency as Neyman-orthogonal learning strategies, such as DR-learning and R-learning, while addressing some of their primary drawbacks, including that (i) their practical applicability can be hindered by loss function non-convexity; and (ii) they may suffer from poor performance and instability due to inverse probability weighting and pseudo-outcomes that violate bounds. To avoid these drawbacks, EP-learner constructs an efficient plug-in estimator of the population risk function for the causal contrast, thereby inheriting the stability and robustness properties of plug-in estimation strategies like T-learning. Under reasonable conditions, EP-learners based on empirical risk minimization are oracle-efficient, exhibiting asymptotic equivalence to the minimizer of an oracle-efficient one-step debiased estimator of the population risk function. In simulation experiments, we illustrate that EP-learners of the conditional average treatment effect and conditional relative risk outperform state-of-the-art competitors, including T-learner, R-learner, and DR-learner. Open-source implementations of the proposed methods are available in our R package hte3.
Why do Random Forests Work? Understanding Tree Ensembles as Self-Regularizing Adaptive Smoothers
Curth, Alicia, Jeffares, Alan, van der Schaar, Mihaela
Despite their remarkable effectiveness and broad application, the drivers of success underlying ensembles of trees (especially random forests and gradient boosting) are still not fully understood. In this paper, we highlight how interpreting tree ensembles as adaptive and self-regularizing smoothers can provide new intuition and deeper insight to this topic. We use this perspective to show that, when studied as smoothers - whose predictions can be understood as simple weighted averages of the training labels - randomized tree ensembles not only make predictions that are quantifiably more smooth than the predictions of the individual trees they consist of, but also further regulate their smoothness at test-time based on the dissimilarity between testing and training inputs. First, we use this insight to revisit, refine and reconcile two recent explanations of forest success by providing a new way of quantifying the conjectured behaviors of tree ensembles objectively by measuring the effective degree of smoothing they imply. Then, we move beyond existing explanations for the mechanisms by which tree ensembles improve upon individual trees and challenge the popular wisdom that the superior performance of forests should be understood as a consequence of variance reduction alone. We argue that the current high-level dichotomy into bias-and variance-reduction prevalent in statistics is insufficient to understand tree ensembles - because the prevailing definition of bias does not capture differences in the expressivity of the hypothesis classes formed by trees and forests. Instead, we show that forests can improve upon trees by three distinct mechanisms that are usually implicitly entangled. In particular, we demonstrate that the smoothing effect of ensembling can reduce variance in predictions due to noise in outcome generation, reduce variability in the quality of the learned function given fixed input data and reduce potential bias in learnable functions by enriching the available hypothesis space.
Beyond the Request: Harnessing HTTP Response Headers for Cross-Browser Web Tracker Classification in an Imbalanced Setting
Rieder, Wolf, Raschke, Philip, Cory, Thomas
The World Wide Web's connectivity is greatly attributed to the HTTP protocol, with HTTP messages offering informative header fields that appeal to disciplines like web security and privacy, especially concerning web tracking. Despite existing research employing HTTP/S request messages to identify web trackers, HTTP/S response headers are often overlooked. This study endeavors to design effective machine learning classifiers for web tracker detection using HTTP/S response headers. Data from the Chrome, Firefox, and Brave browsers, obtained through the traffic monitoring browser extension T.EX, serves as our data set. Eleven supervised models were trained on Chrome data and tested across all browsers. The results demonstrated high accuracy, F1-score, precision, recall, and minimal log-loss error for Chrome and Firefox, but subpar performance on Brave, potentially due to its distinct data distribution and feature set. The research suggests that these classifiers are viable for detecting web trackers in Chrome and Firefox. However, real-world application testing remains pending, and the distinction between tracker types and broader label sources could be explored in future studies.
Comparative Evaluation of Weather Forecasting using Machine Learning Models
Rahman, Md Saydur, Tumpa, Farhana Akter, Islam, Md Shazid, Arabi, Abul Al, Hossain, Md Sanzid Bin, Haque, Md Saad Ul
Gaining a deeper understanding of weather and being able to predict its future conduct have always been considered important endeavors for the growth of our society. This research paper explores the advancements in understanding and predicting nature's behavior, particularly in the context of weather forecasting, through the application of machine learning algorithms. By leveraging the power of machine learning, data mining, and data analysis techniques, significant progress has been made in this field. This study focuses on analyzing the contributions of various machine learning algorithms in predicting precipitation and temperature patterns using a 20-year dataset from a single weather station in Dhaka city. Algorithms such as Gradient Boosting, AdaBoosting, Artificial Neural Network, Stacking Random Forest, Stacking Neural Network, and Stacking KNN are evaluated and compared based on their performance metrics, including Confusion matrix measurements. The findings highlight remarkable achievements and provide valuable insights into their performances and features correlation.
Random Forest-Based Prediction of Stroke Outcome
Fernandez-Lozano, Carlos, Hervella, Pablo, Mato-Abad, Virginia, Rodriguez-Yanez, Manuel, Suarez-Garaboa, Sonia, Lopez-Dequidt, Iria, Estany-Gestal, Ana, Sobrino, Tomas, Campos, Francisco, Castillo, Jose, Rodriguez-Yanez, Santiago, Iglesias-Rey, Ramon
We research into the clinical, biochemical and neuroimaging factors associated with the outcome of stroke patients to generate a predictive model using machine learning techniques for prediction of mortality and morbidity 3 months after admission. The dataset consisted of patients with ischemic stroke (IS) and non-traumatic intracerebral hemorrhage (ICH) admitted to Stroke Unit of a European Tertiary Hospital prospectively registered. We identified the main variables for machine learning Random Forest (RF), generating a predictive model that can estimate patient mortality/morbidity. In conclusion, machine learning algorithms RF can be effectively used in stroke patients for long-term outcome prediction of mortality and morbidity.
Dynamic Model Switching for Improved Accuracy in Machine Learning
In the dynamic landscape of machine learning, where datasets vary widely in size and complexity, selecting the most effective model poses a significant challenge. Rather than fixating on a single model, our research propels the field forward with a novel emphasis on dynamic model switching. This paradigm shift allows us to harness the inherent strengths of different models based on the evolving size of the dataset. Consider the scenario where CatBoost demonstrates exceptional efficacy in handling smaller datasets, providing nuanced insights and accurate predictions. However, as datasets grow in size and intricacy, XGBoost, with its scalability and robustness, becomes the preferred choice. Our approach introduces an adaptive ensemble that intuitively transitions between CatBoost and XGBoost. This seamless switching is not arbitrary; instead, it's guided by a user-defined accuracy threshold, ensuring a meticulous balance between model sophistication and data requirements. The user sets a benchmark, say 80% accuracy, prompting the system to dynamically shift to the new model only if it guarantees improved performance. This dynamic model-switching mechanism aligns with the evolving nature of data in real-world scenarios. It offers practitioners a flexible and efficient solution, catering to diverse dataset sizes and optimising predictive accuracy at every juncture. Our research, therefore, stands at the forefront of innovation, redefining how machine learning models adapt and excel in the face of varying dataset dynamics.
Federated unsupervised random forest for privacy-preserving patient stratification
Pfeifer, Bastian, Sirocchi, Christel, Bloice, Marcus D., Kreuzthaler, Markus, Urschler, Martin
In the realm of precision medicine, effective patient stratification and disease subtyping demand innovative methodologies tailored for multi-omics data. Clustering techniques applied to multi-omics data have become instrumental in identifying distinct subgroups of patients, enabling a finer-grained understanding of disease variability. This work establishes a powerful framework for advancing precision medicine through unsupervised random-forest-based clustering and federated computing. We introduce a novel multi-omics clustering approach utilizing unsupervised random-forests. The unsupervised nature of the random forest enables the determination of cluster-specific feature importance, unraveling key molecular contributors to distinct patient groups. Moreover, our methodology is designed for federated execution, a crucial aspect in the medical domain where privacy concerns are paramount. We have validated our approach on machine learning benchmark data sets as well as on cancer data from The Cancer Genome Atlas (TCGA). Our method is competitive with the state-of-the-art in terms of disease subtyping, but at the same time substantially improves the cluster interpretability. Experiments indicate that local clustering performance can be improved through federated computing.
Empirical and Experimental Insights into Machine Learning-Based Defect Classification in Semiconductor Wafers
This survey paper offers a comprehensive review of methodologies utilizing machine learning (ML) classification techniques for identifying wafer defects in semiconductor manufacturing. Despite the growing body of research demonstrating the effectiveness of ML in wafer defect identification, there is a noticeable absence of comprehensive reviews on this subject. This survey attempts to fill this void by amalgamating available literature and providing an in-depth analysis of the advantages, limitations, and potential applications of various ML classification algorithms in the realm of wafer defect detection. An innovative taxonomy of methodologies that we present provides a detailed classification of algorithms into more refined categories and techniques. This taxonomy follows a three-tier structure, starting from broad methodology categories and ending with specific techniques. It aids researchers in comprehending the complex relationships between different algorithms and their techniques. We employ a rigorous empirical and experimental evaluation to rank these varying techniques. For the empirical evaluation, we assess techniques based on a set of five criteria. The experimental evaluation ranks the algorithms employing the same techniques, sub-categories, and categories. Also the paper illuminates the future prospects of ML classification techniques for wafer defect identification, underscoring potential advancements and opportunities for further research in this field