Goto

Collaborating Authors

 Ensemble Learning


Small Area Estimation with Random Forests and the LASSO

arXiv.org Machine Learning

We consider random forests and LASSO methods for model-based small area estimation when the number of areas with sampled data is a small fraction of the total areas for which estimates are required. Abundant auxiliary information is available for the sampled areas, from the survey, and for all areas, from an exterior source, and the goal is to use auxiliary variables to predict the outcome of interest. We compare areallevel random forests and LASSO approaches to a frequentist forward variable selection approach and a Bayesian shrinkage method. This work is motivated by Ghanaian data available from the sixth Living Standard Survey (GLSS) and the 2010 Population and Housing Census. We estimate the areal mean household log consumption using both datasets. The outcome variable is measured only in the GLSS for 3% of all the areas (136 out of 5019) and more than 170 potential covariates are available from both datasets. Among the four modelling methods considered, the Bayesian shrinkage performed the best in terms of bias, MSE and prediction interval coverages and scores, as assessed through a cross-validation study. We find substantial between-area variation, the log consumption areal point estimates showing a 1.3-fold variation across the GAMA region. The western areas are the poorest while the Accra Metropolitan Area district gathers the richest areas. In 2015, the United Nations (UN) released their 2030 agenda for sustainable development goals (SDGs) consisting of 17 goals, the first of which was to end poverty worldwide (Resolution, General Assembly and others, 2015). For their first SDG, the UN made seven guidelines explicit, including the implementation of "poverty eradication policies" at a disaggregated level. To that end, producing reliable and fine-grained pictures of socioeconomic status and income inequality is fundamental to help decision makers prioritise and target certain areas. These detailed maps help local communities understand their situation compared to their neighbours, which also helps when planning interventions (Bedi et al., 2007). In Ghana, household surveys are collected every few years to measure the living conditions of households across Ghanaian regions and districts and to monitor poverty.


On the Robustness of Random Forest Against Untargeted Data Poisoning: An Ensemble-Based Approach

arXiv.org Artificial Intelligence

Machine learning is becoming ubiquitous. From finance to medicine, machine learning models are boosting decision-making processes and even outperforming humans in some tasks. This huge progress in terms of prediction quality does not however find a counterpart in the security of such models and corresponding predictions, where perturbations of fractions of the training set (poisoning) can seriously undermine the model accuracy. Research on poisoning attacks and defenses received increasing attention in the last decade, leading to several promising solutions aiming to increase the robustness of machine learning. Among them, ensemble-based defenses, where different models are trained on portions of the training set and their predictions are then aggregated, provide strong theoretical guarantees at the price of a linear overhead. Surprisingly, ensemble-based defenses, which do not pose any restrictions on the base model, have not been applied to increase the robustness of random forest models. The work in this paper aims to fill in this gap by designing and implementing a novel hash-based ensemble approach that protects random forest against untargeted, random poisoning attacks. An extensive experimental evaluation measures the performance of our approach against a variety of attacks, as well as its sustainability in terms of resource consumption and performance, and compares it with a traditional monolithic model based on random forest. A final discussion presents our main findings and compares our approach with existing poisoning defenses targeting random forests.


AI in Thyroid Cancer Diagnosis: Techniques, Trends, and Future Directions

arXiv.org Artificial Intelligence

There has been a growing interest in creating intelligent diagnostic systems to assist medical professionals in analyzing and processing big data for the treatment of incurable diseases. One of the key challenges in this field is detecting thyroid cancer, where advancements have been made using machine learning (ML) and big data analytics to evaluate thyroid cancer prognosis and determine a patient's risk of malignancy. This review paper summarizes a large collection of articles related to artificial intelligence (AI)-based techniques used in the diagnosis of thyroid cancer. Accordingly, a new classification was introduced to classify these techniques based on the AI algorithms used, the purpose of the framework, and the computing platforms used. Additionally, this study compares existing thyroid cancer datasets based on their features. The focus of this study is on how AI-based tools can support the diagnosis and treatment of thyroid cancer, through supervised, unsupervised, or hybrid techniques. It also highlights the progress made and the unresolved challenges in this field. Finally, the future trends and areas of focus in this field are discussed.


Hyperbolic Random Forests

arXiv.org Artificial Intelligence

Hyperbolic space is becoming a popular choice for representing data due to the hierarchical structure - whether implicit or explicit - of many real-world datasets. Along with it comes a need for algorithms capable of solving fundamental tasks, such as classification, in hyperbolic space. Recently, multiple papers have investigated hyperbolic alternatives to hyperplane-based classifiers, such as logistic regression and SVMs. While effective, these approaches struggle with more complex hierarchical data. We, therefore, propose to generalize the well-known random forests to hyperbolic space. We do this by redefining the notion of a split using horospheres. Since finding the globally optimal split is computationally intractable, we find candidate horospheres through a large-margin classifier. To make hyperbolic random forests work on multi-class data and imbalanced experiments, we furthermore outline a new method for combining classes based on their lowest common ancestor and a class-balanced version of the large-margin loss. Experiments on standard and new benchmarks show that our approach outperforms both conventional random forest algorithms and recent hyperbolic classifiers.


Using Adamic-Adar Index Algorithm to Predict Volunteer Collaboration: Less is More

arXiv.org Artificial Intelligence

Social networks exhibit a complex graph-like structure due to the uncertainty surrounding potential collaborations among participants. Machine learning algorithms possess generic outstanding performance in multiple real-world prediction tasks. However, whether machine learning algorithms outperform specific algorithms designed for graph link prediction remains unknown to us. To address this issue, the Adamic-Adar Index (AAI), Jaccard Coefficient (JC) and common neighbour centrality (CNC) as representatives of graph-specific algorithms were applied to predict potential collaborations, utilizing data from volunteer activities during the Covid-19 pandemic in Shenzhen city, along with the classical machine learning algorithms such as random forest, support vector machine, and gradient boosting as single predictors and components of ensemble learning. This paper introduces that the AAI algorithm outperformed the traditional JC and CNC, and other machine learning algorithms in analyzing graph node attributes for this task.


On marginal feature attributions of tree-based models

arXiv.org Artificial Intelligence

Due to their power and ease of use, tree-based machine learning models, such as random forests and gradient-boosted tree ensembles, have become very popular. To interpret them, local feature attributions based on marginal expectations, e.g. marginal (interventional) Shapley, Owen or Banzhaf values, may be employed. Such methods are true to the model and implementation invariant, i.e. dependent only on the input-output function of the model. We contrast this with the popular TreeSHAP algorithm by presenting two (statistically similar) decision trees that compute the exact same function for which the "path-dependent" TreeSHAP yields different rankings of features, whereas the marginal Shapley values coincide. Furthermore, we discuss how the internal structure of tree-based models may be leveraged to help with computing their marginal feature attributions according to a linear game value. One important observation is that these are simple (piecewise-constant) functions with respect to a certain grid partition of the input space determined by the trained model. Another crucial observation, showcased by experiments with XGBoost, LightGBM and CatBoost libraries, is that only a portion of all features appears in a tree from the ensemble. Thus, the complexity of computing marginal Shapley (or Owen or Banzhaf) feature attributions may be reduced. This remains valid for a broader class of game values which we shall axiomatically characterize. A prime example is the case of CatBoost models where the trees are oblivious (symmetric) and the number of features in each of them is no larger than the depth. We exploit the symmetry to derive an explicit formula, with improved complexity and only in terms of the internal model parameters, for marginal Shapley (and Banzhaf and Owen) values of CatBoost models. This results in a fast, accurate algorithm for estimating these feature attributions.


Uncertainty and Explainable Analysis of Machine Learning Model for Reconstruction of Sonic Slowness Logs

arXiv.org Artificial Intelligence

Logs are valuable information for oil and gas fields as they help to determine the lithology of the formations surrounding the borehole and the location and reserves of subsurface oil and gas reservoirs. However, important logs are often missing in horizontal or old wells, which poses a challenge in field applications. In this paper, we utilize data from the 2020 machine learning competition of the SPWLA, which aims to predict the missing compressional wave slowness and shear wave slowness logs using other logs in the same borehole. We employ the NGBoost algorithm to construct an Ensemble Learning model that can predicate the results as well as their uncertainty. Furthermore, we combine the SHAP method to investigate the interpretability of the machine learning model. We compare the performance of the NGBosst model with four other commonly used Ensemble Learning methods, including Random Forest, GBDT, XGBoost, LightGBM. The results show that the NGBoost model performs well in the testing set and can provide a probability distribution for the prediction results. In addition, the variance of the probability distribution of the predicted log can be used to justify the quality of the constructed log. Using the SHAP explainable machine learning model, we calculate the importance of each input log to the predicted results as well as the coupling relationship among input logs. Our findings reveal that the NGBoost model tends to provide greater slowness prediction results when the neutron porosity and gamma ray are large, which is consistent with the cognition of petrophysical models. Furthermore, the machine learning model can capture the influence of the changing borehole caliper on slowness, where the influence of borehole caliper on slowness is complex and not easy to establish a direct relationship. These findings are in line with the physical principle of borehole acoustics.


Development and external validation of a lung cancer risk estimation tool using gradient-boosting

arXiv.org Artificial Intelligence

Lung cancer is a significant cause of mortality worldwide, emphasizing the importance of early detection for improved survival rates. In this study, we propose a machine learning (ML) tool trained on data from the PLCO Cancer Screening Trial and validated on the NLST to estimate the likelihood of lung cancer occurrence within five years. The study utilized two datasets, the PLCO (n=55,161) and NLST (n=48,595), consisting of comprehensive information on risk factors, clinical measurements, and outcomes related to lung cancer. Data preprocessing involved removing patients who were not current or former smokers and those who had died of causes unrelated to lung cancer. Additionally, a focus was placed on mitigating bias caused by censored data. Feature selection, hyper-parameter optimization, and model calibration were performed using XGBoost, an ensemble learning algorithm that combines gradient boosting and decision trees. The ML model was trained on the pre-processed PLCO dataset and tested on the NLST dataset. The model incorporated features such as age, gender, smoking history, medical diagnoses, and family history of lung cancer. The model was well-calibrated (Brier score=0.044). ROC-AUC was 82% on the PLCO dataset and 70% on the NLST dataset. PR-AUC was 29% and 11% respectively. When compared to the USPSTF guidelines for lung cancer screening, our model provided the same recall with a precision of 13.1% vs. 9.3% on the PLCO dataset and 3.2% vs. 3.1% on the NLST dataset. The developed ML tool provides a freely available web application for estimating the likelihood of developing lung cancer within five years. By utilizing risk factors and clinical data, individuals can assess their risk and make informed decisions regarding lung cancer screening. This research contributes to the efforts in early detection and prevention strategies, aiming to reduce lung cancer-related mortality rates.


Mixed-Integer Projections for Automated Data Correction of EMRs Improve Predictions of Sepsis among Hospitalized Patients

arXiv.org Artificial Intelligence

Machine learning (ML) models are increasingly pivotal in automating clinical decisions. Yet, a glaring oversight in prior research has been the lack of proper processing of Electronic Medical Record (EMR) data in the clinical context for errors and outliers. Addressing this oversight, we introduce an innovative projections-based method that seamlessly integrates clinical expertise as domain constraints, generating important meta-data that can be used in ML workflows. In particular, by using high-dimensional mixed-integer programs that capture physiological and biological constraints on patient vitals and lab values, we can harness the power of mathematical "projections" for the EMR data to correct patient data. Consequently, we measure the distance of corrected data from the constraints defining a healthy range of patient data, resulting in a unique predictive metric we term as "trust-scores". These scores provide insight into the patient's health status and significantly boost the performance of ML classifiers in real-life clinical settings. We validate the impact of our framework in the context of early detection of sepsis using ML. We show an AUROC of 0.865 and a precision of 0.922, that surpasses conventional ML models without such projections.


GBM-based Bregman Proximal Algorithms for Constrained Learning

arXiv.org Artificial Intelligence

As the complexity of learning tasks surges, modern machine learning encounters a new constrained learning paradigm characterized by more intricate and data-driven function constraints. Prominent applications include Neyman-Pearson classification (NPC) and fairness classification, which entail specific risk constraints that render standard projection-based training algorithms unsuitable. Gradient boosting machines (GBMs) are among the most popular algorithms for supervised learning; however, they are generally limited to unconstrained settings. In this paper, we adapt the GBM for constrained learning tasks within the framework of Bregman proximal algorithms. We introduce a new Bregman primal-dual method with a global optimality guarantee when the learning objective and constraint functions are convex. In cases of nonconvex functions, we demonstrate how our algorithm remains effective under a Bregman proximal point framework. Distinct from existing constrained learning algorithms, ours possess a unique advantage in their ability to seamlessly integrate with publicly available GBM implementations such as XGBoost (Chen and Guestrin, 2016) and LightGBM (Ke et al., 2017), exclusively relying on their public interfaces. We provide substantial experimental evidence to showcase the effectiveness of the Bregman algorithm framework. While our primary focus is on NPC and fairness ML, our framework holds significant potential for a broader range of constrained learning applications. The source code is currently freely available at https://github.com/zhenweilin/ConstrainedGBM}{https://github.com/zhenweilin/ConstrainedGBM.