Goto

Collaborating Authors

 Ensemble Learning


Greenhouse gases emissions: estimating corporate non-reported emissions using interpretable machine learning

arXiv.org Artificial Intelligence

As of 2022, greenhouse gases (GHG) emissions reporting and auditing are not yet compulsory for all companies and methodologies of measurement and estimation are not unified. We propose a machine learning-based model to estimate scope 1 and scope 2 GHG emissions of companies not reporting them yet. Our model, specifically designed to be transparent and completely adapted to this use case, is able to estimate emissions for a large universe of companies. It shows good out-of-sample global performances as well as good out-of-sample granular performances when evaluating it by sectors, by countries or by revenues buckets. We also compare our results to those of other providers and find our estimates to be more accurate. Thanks to the proposed explainability tools using Shapley values, our model is fully interpretable, the user being able to understand which factors split explain the GHG emissions for each particular company.


Clinical Deterioration Prediction in Brazilian Hospitals Based on Artificial Neural Networks and Tree Decision Models

arXiv.org Artificial Intelligence

Early recognition of clinical deterioration (CD) has vital importance in patients' survival from exacerbation or death. Electronic health records (EHRs) data have been widely employed in Early Warning Scores (EWS) to measure CD risk in hospitalized patients. Recently, EHRs data have been utilized in Machine Learning (ML) models to predict mortality and CD. The ML models have shown superior performance in CD prediction compared to EWS. Since EHRs data are structured and tabular, conventional ML models are generally applied to them, and less effort is put into evaluating the artificial neural network's performance on EHRs data. Thus, in this article, an extremely boosted neural network (XBNet) is used to predict CD, and its performance is compared to eXtreme Gradient Boosting (XGBoost) and random forest (RF) models. For this purpose, 103,105 samples from thirteen Brazilian hospitals are used to generate the models. Moreover, the principal component analysis (PCA) is employed to verify whether it can improve the adopted models' performance. The performance of ML models and Modified Early Warning Score (MEWS), an EWS candidate, are evaluated in CD prediction regarding the accuracy, precision, recall, F1-score, and geometric mean (G-mean) metrics in a 10-fold cross-validation approach. According to the experiments, the XGBoost model obtained the best results in predicting CD among Brazilian hospitals' data.


Learning Inter-Annual Flood Loss Risk Models From Historical Flood Insurance Claims and Extreme Rainfall Data

arXiv.org Artificial Intelligence

Flooding is one of the most disastrous natural hazards, responsible for substantial economic losses. A predictive model for flood-induced financial damages is useful for many applications such as climate change adaptation planning and insurance underwriting. This research assesses the predictive capability of regressors constructed on the National Flood Insurance Program (NFIP) dataset using neural networks (Conditional Generative Adversarial Networks), decision trees (Extreme Gradient Boosting), and kernel-based regressors (Gaussian Process). The assessment highlights the most informative predictors for regression. The distribution for claims amount inference is modeled with a Burr distribution permitting the introduction of a bias correction scheme and increasing the regressor's predictive capability. Aiming to study the interaction with physical variables, we incorporate Daymet rainfall estimation to NFIP as an additional predictor. A study on the coastal counties in the eight US South-West states resulted in an $R^2=0.807$. Further analysis of 11 counties with a significant number of claims in the NFIP dataset reveals that Extreme Gradient Boosting provides the best results, that bias correction significantly improves the similarity with the reference distribution, and that the rainfall predictor strengthens the regressor performance.


Simplification of Forest Classifiers and Regressors

arXiv.org Artificial Intelligence

We study the problem of sharing as many branching conditions of a given forest classifier or regressor as possible while keeping classification performance. As a constraint for preventing from accuracy degradation, we first consider the one that the decision paths of all the given feature vectors must not change. For a branching condition that a value of a certain feature is at most a given threshold, the set of values satisfying such constraint can be represented as an interval. Thus, the problem is reduced to the problem of finding the minimum set intersecting all the constraint-satisfying intervals for each set of branching conditions on the same feature. We propose an algorithm for the original problem using an algorithm solving this problem efficiently. The constraint is relaxed later to promote further sharing of branching conditions by allowing decision path change of a certain ratio of the given feature vectors or allowing a certain number of non-intersected constraint-satisfying intervals. We also extended our algorithm for both the relaxations. The effectiveness of our method is demonstrated through comprehensive experiments using 21 datasets (13 classification and 8 regression datasets in UCI machine learning repository) and 4 classifiers/regressors (random forest, extremely randomized trees, AdaBoost and gradient boosting).


MABSplit: Faster Forest Training Using Multi-Armed Bandits

arXiv.org Artificial Intelligence

Random forests are some of the most widely used machine learning models today, especially in domains that necessitate interpretability. We present an algorithm that accelerates the training of random forests and other popular tree-based learning methods. At the core of our algorithm is a novel node-splitting subroutine, dubbed MABSplit, used to efficiently find split points when constructing decision trees. Our algorithm borrows techniques from the multi-armed bandit literature to judiciously determine how to allocate samples and computational power across candidate split points. We provide theoretical guarantees that MABSplit improves the sample complexity of each node split from linear to logarithmic in the number of data points. In some settings, MABSplit leads to 100x faster training (an 99% reduction in training time) without any decrease in generalization performance. We demonstrate similar speedups when MABSplit is used across a variety of forest-based variants, such as Extremely Random Forests and Random Patches. We also show our algorithm can be used in both classification and regression tasks. Finally, we show that MABSplit outperforms existing methods in generalization performance and feature importance calculations under a fixed computational budget.


How to Choose n_estimators in Random Forest ? Get Solution

#artificialintelligence

Are you looking for how to choose n_estimators in the random forest? Actually, n_estimators defines in the underline decision tree in Random Forest. See! the Random Forest algorithms is a bagging Technique. Where we ensemble many weak learn to decrease the variance. The n_estimators is a hyperparameter for Random Forest.


Elevated PTH Levels Prediction Based on Machine Learning – Physician's Weekly

#artificialintelligence

Six machine-learning prediction models (logistic regression with and without splines, lasso regression, random forest, gradient-boosting machines [ …


Ace your Machine Learning Interview -- Part 8

#artificialintelligence

In this article of my series "Ace your Machine Learning Interview" I continue to talk about Ensemble Learning and in particular, I will focus on Boosting algorithms with special reference to AdaBoost. I hope that this series in which I review the basics of Machine Learning will be useful to you in facing your next interview! We talked in the last article in general about what Ensemble Learning is and we have seen and implemented simple Ensmble methods based on Majority Voting. Today we talk more in detail about an Ensemble method called Boosting by making special reference to Adaptive Boosting or AdaBoost. You may have heard of this algorithm before, it is often used to win Kaggle competitions for example.


MLC at HECKTOR 2022: The Effect and Importance of Training Data when Analyzing Cases of Head and Neck Tumors using Machine Learning

arXiv.org Artificial Intelligence

Head and neck cancers are the fifth most common cancer worldwide, and recently, analysis of Positron Emission Tomography (PET) and Computed Tomography (CT) images has been proposed to identify patients with a prognosis. Even though the results look promising, more research is needed to further validate and improve the results. This paper presents the work done by team MLC for the 2022 version of the HECKTOR grand challenge held at MICCAI 2022. For Task 1, the automatic segmentation task, our approach was, in contrast to earlier solutions using 3D segmentation, to keep it as simple as possible using a 2D model, analyzing every slice as a standalone image. In addition, we were interested in understanding how different modalities influence the results. We proposed two approaches; one using only the CT scans to make predictions and another using a combination of the CT and PET scans. For Task 2, the prediction of recurrence-free survival, we first proposed two approaches, one where we only use patient data and one where we combined the patient data with segmentations from the image model. For the prediction of the first two approaches, we used Random Forest. In our third approach, we combined patient data and image data using XGBoost. Low kidney function might worsen cancer prognosis. In this approach, we therefore estimated the kidney function of the patients and included it as a feature. Overall, we conclude that our simple methods were not able to compete with the highest-ranking submissions, but we still obtained reasonably good scores. We also got interesting insights into how the combination of different modalities can influence the segmentation and predictions.


Extracting personal information from anonymous cell phone data using machine learning

#artificialintelligence

A research team at Illinois Institute of Technology has extracted personal information, specifically protected characteristics like age and gender, from anonymous cell phone data using machine learning and artificial intelligence algorithms, raising questions about data security. The research was conducted by an interdisciplinary team of three Illinois Tech faculty including Vijay K. Gurbani, research associate professor of computer science; Matthew Shapiro, professor of political science; and Yuri Mansury, associate professor of social sciences. They were joined by Illinois Tech alumni Lida Kuang (M.S. CS '19) and Samruda Pobbathi (M.S. CS '19) who worked with Gurbani to publish "Predicting Age and Gender from Network Telemetry: Implications for Privacy and Impact on Policy" in PLOS One. The researchers used data from a Latin American cell phone company to successfully estimate the gender and age of individual users through their private communications with relative ease. The team developed a neural network model to estimate gender with 67% accuracy, which outperforms modern techniques such as decision tree, random forest, and gradient boosting models by a significant margin.