AITopics | Ensemble Learning


Ensemble Learning

Ensemble methods use multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone. (Wikipedia)


A random forest based approach for predicting spreads in the primary catastrophe bond market

Makariou, Despoina, Barrieu, Pauline, Chen, Yining

arXiv.org Machine Learning, Jan-28-2020

We introduce a random forest approach to enable spreads' prediction in the primary catastrophe bond market. We investigate whether all information provided to investors in the offering circular prior to a new issuance is equally important in predicting its spread. The whole population of non-life catastrophe bonds issued from December 2009 to May 2018 is used. The random forest shows an impressive predictive power on unseen primary catastrophe bond data explaining 93% of the total variability. For comparison, linear regression, our benchmark model, has inferior predictive performance explaining only 47% of the total variability. All details provided in the offering circular are predictive of spread but in a varying degree. The stability of the results is studied. The usage of random forest can speed up investment decisions in the catastrophe bond industry.

catastrophe bond, predictor, random forest, (15 more...)

arXiv.org Machine Learning

2001.10393

Country:

Europe (0.04)
South America (0.04)
North America > United States > Rhode Island > Providence County > Providence (0.04)
(3 more...)

Genre: Research Report > New Finding (0.93)

Industry:

Banking & Finance > Trading (1.00)
Banking & Finance > Insurance (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.35)
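
The abstract above reports the share of "total variability" explained by each model (93% for the random forest, 47% for linear regression). A minimal sketch of how such a figure is commonly computed, assuming it refers to the coefficient of determination (R squared) on held-out data; the function name and the toy numbers are hypothetical and do not come from the paper:

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    # Total variability of the observed values around their mean.
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    # Variability left unexplained by the predictions.
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    return 1.0 - ss_res / ss_tot


# Toy spreads (hypothetical values, not catastrophe bond data).
observed = [1.0, 2.0, 3.0, 4.0]
perfect = [1.0, 2.0, 3.0, 4.0]
mean_only = [2.5, 2.5, 2.5, 2.5]

print(r_squared(observed, perfect))    # a perfect model explains everything
print(r_squared(observed, mean_only))  # predicting the mean explains nothing
```

Computed on unseen data, this score lets two models (for example a random forest and a linear regression benchmark) be compared on the same scale.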


LiteMORT: A memory efficient gradient boosting tree system on adaptive compact distributions

Chen, Yingshi

arXiv.org Machine Learning, Jan-26-2020

Gradient boosted decision trees (GBDT) is the leading algorithm for many commercial and academic data applications. We give a deep analysis of this algorithm, especially the histogram technique, which is a basis for the regularized distribution with compact support. We present three new modifications. 1) A shared-memory technique to reduce memory usage. In many cases, it needs only the data source itself and no extra memory. 2) Implicit merging for the "merge overflow problem". "Merge overflow" means merging some small datasets into huge datasets that are too large to be processed. With implicit merging, we need only the original small datasets to train the GBDT model. 3) An adaptive resize algorithm for histogram bins to improve accuracy. Experiments on two large Kaggle competitions verified our methods. They use much less memory than LightGBM and have higher accuracy. We have implemented these algorithms in an open-source package, LiteMORT. The source code is available at https://github.com/closest-git/LiteMORT

algorithm, bin, litemort, (14 more...)

arXiv.org Machine Learning

2001.09419

Country:

Asia > China > Fujian Province > Xiamen (0.05)
North America > Montserrat (0.04)

Genre: Research Report (0.50)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (1.00)
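
The histogram technique mentioned in the LiteMORT abstract above can be illustrated with a generic sketch. It assumes squared-error loss with unit Hessians and fixed equal-width bins; it is not LiteMORT's adaptive bin-resizing algorithm, and the function name and parameters are hypothetical:

```python
def best_histogram_split(values, gradients, n_bins=4):
    """Find the best split of one feature using per-bin gradient sums.

    Returns (threshold, gain); threshold is None if no valid split exists.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return (None, 0.0)  # constant feature, nothing to split

    # Accumulate gradient sums and counts per bin (the "histogram").
    grad_sum = [0.0] * n_bins
    count = [0] * n_bins
    for v, g in zip(values, gradients):
        b = min(int((v - lo) / width), n_bins - 1)
        grad_sum[b] += g
        count[b] += 1

    total_g, total_n = sum(grad_sum), sum(count)
    best = (None, 0.0)
    left_g, left_n = 0.0, 0
    # Scan bin boundaries instead of every distinct feature value.
    for b in range(n_bins - 1):
        left_g += grad_sum[b]
        left_n += count[b]
        right_g, right_n = total_g - left_g, total_n - left_n
        if left_n == 0 or right_n == 0:
            continue
        gain = (left_g ** 2 / left_n + right_g ** 2 / right_n
                - total_g ** 2 / total_n)
        if gain > best[1]:
            best = (lo + (b + 1) * width, gain)
    return best


values = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
gradients = [-1.0, -1.0, -1.0, -1.0, 1.0, 1.0, 1.0, 1.0]
print(best_histogram_split(values, gradients))
```

Because only the per-bin sums are scanned, the cost of evaluating candidate splits depends on the number of bins rather than the number of rows, which is why bin placement matters for accuracy.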


Stratified cross-validation for unbiased and privacy-preserving federated learning

Bey, R., Goussault, R., Benchoufi, M., Porcher, R.

arXiv.org Machine Learning, Jan-23-2020

Large-scale collections of electronic records constitute both an opportunity for the development of more accurate prediction models and a threat to privacy. To limit privacy exposure, new privacy-enhancing techniques are emerging, such as federated learning, which enables large-scale data analysis while avoiding the centralization of records in a single database that would represent a critical point of failure. Although promising regarding privacy protection, federated learning prevents the use of some data-cleaning algorithms, thus inducing new biases. In this work we focus on the recurrent problem of duplicated records that, if not handled properly, may cause over-optimistic estimates of a model's performance. We introduce and discuss stratified cross-validation, a validation methodology that leverages stratification techniques to prevent data leakage in federated learning settings without relying on demanding deduplication algorithms.

covariate, federated learning, stratified cross-validation, (15 more...)

arXiv.org Machine Learning

2001.0809

Country:

Europe > France > Île-de-France > Paris > Paris (0.14)
Europe > France > Pays de la Loire > Loire-Atlantique > Nantes (0.04)
North America > United States > Texas > Dallas County > Dallas (0.04)
(5 more...)

Genre: Research Report (1.00)

Industry:

Information Technology > Security & Privacy (1.00)
Health & Medicine (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Cross Validation (0.67)
Information Technology > Data Science > Data Mining > Big Data (0.53)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (0.47)
(2 more...)
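
The idea in the abstract above, keeping duplicated records out of different cross-validation folds without an explicit deduplication step, can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: it assumes duplicates share the value of some covariate (here a hypothetical `birth_date` field) and assigns folds by hashing that value, so each site can compute the assignment locally.

```python
import hashlib


def fold_of(record, key, n_folds=5):
    """Map a record to a fold from the value of one covariate.

    Records with the same covariate value always get the same fold,
    so exact duplicates cannot straddle the train/validation boundary.
    """
    digest = hashlib.sha256(str(record[key]).encode("utf-8")).hexdigest()
    return int(digest, 16) % n_folds


def stratified_folds(records, key, n_folds=5):
    """Group records into folds using fold_of."""
    folds = [[] for _ in range(n_folds)]
    for record in records:
        folds[fold_of(record, key, n_folds)].append(record)
    return folds


# Hypothetical records held at two different sites; the first two are duplicates.
records = [
    {"site": "A", "birth_date": "1970-01-31", "outcome": 1},
    {"site": "B", "birth_date": "1970-01-31", "outcome": 1},
    {"site": "A", "birth_date": "1985-06-02", "outcome": 0},
    {"site": "B", "birth_date": "1992-11-17", "outcome": 0},
]
for i, fold in enumerate(stratified_folds(records, "birth_date")):
    print(i, [r["site"] for r in fold])
```

The assignment needs no communication between sites, which is the property that matters in a federated setting; the choice of covariate and the hashing scheme are assumptions of this sketch.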


Time series forecasting with random forest

#artificialintelligence, Jan-21-2020, 19:02:31 GMT

Benjamin Franklin said that only two things are certain in life: death and taxes. That explains why my colleagues at STATWORX were less than excited when they told me about their plans for the weekend a few weeks back: doing their income tax declaration. Man, I thought, that sucks, I'd rather spend this time outdoors. And then an idea was born. What could taxes and the outdoors possibly have in common?

forecasting, random forest, time series, (11 more...)

#artificialintelligence

Country:

North America > Trinidad and Tobago > Trinidad > Arima > Arima (0.06)
Europe > Switzerland > Zürich > Zürich (0.05)
Europe > Austria > Vienna (0.05)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (0.60)
Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (0.60)


SGLB: Stochastic Gradient Langevin Boosting

Ustimenko, Aleksei, Prokhorenkova, Liudmila

arXiv.org Machine Learning, Jan-20-2020

In this paper, we introduce Stochastic Gradient Langevin Boosting (SGLB) -- a powerful and efficient machine learning framework, which may deal with a wide range of loss functions and has provable generalization guarantees. The method is based on a special form of Langevin Diffusion equation specifically designed for gradient boosting. This allows us to guarantee the global convergence, while standard gradient boosting algorithms can guarantee only local optima, which is a problem for multimodal loss functions. To illustrate the advantages of SGLB, we apply it to a classification task with 0-1 loss function, which is known to be multimodal, and to a standard Logistic regression task that is convex. The algorithm is implemented as a part of the CatBoost gradient boosting library and outperforms classic gradient boosting methods.

algorithm, convergence, gradient, (14 more...)

arXiv.org Machine Learning

2001.07248

Country:

North America > United States (0.04)
Europe > Russia > Central Federal District > Moscow Oblast > Moscow (0.04)
Asia > Russia (0.04)

Genre:

Research Report > Experimental Study (0.48)
Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.73)
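
The SGLB abstract above builds on Langevin diffusion. A minimal sketch of the generic discretized Langevin update (a gradient step plus Gaussian noise scaled by an inverse temperature), applied to a scalar parameter rather than to a boosted ensemble; this is not the SGLB algorithm itself, and the function names, step size, and test function are assumptions for illustration:

```python
import math
import random


def langevin_step(x, grad, lr, inv_temp, rng):
    """One update: x - lr * grad(x) + sqrt(2 * lr / inv_temp) * N(0, 1)."""
    noise = rng.gauss(0.0, 1.0)
    return x - lr * grad(x) + math.sqrt(2.0 * lr / inv_temp) * noise


def langevin_run(x0, grad, lr=0.01, inv_temp=10.0, n_steps=1000, seed=0):
    """Iterate langevin_step from x0 and return the final point."""
    rng = random.Random(seed)
    x = x0
    for _ in range(n_steps):
        x = langevin_step(x, grad, lr, inv_temp, rng)
    return x


def double_well_grad(x):
    """Gradient of the multimodal function f(x) = (x**2 - 1)**2 + 0.3 * x."""
    return 4.0 * x * (x * x - 1.0) + 0.3


print(langevin_run(1.0, double_well_grad))
```

The injected noise is what allows the iterate to leave a local minimum of a multimodal loss; as `inv_temp` grows the noise vanishes and the update reduces to plain gradient descent.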


From local explanations to global understanding with explainable AI for trees

#artificialintelligence, Jan-18-2020, 22:15:00 GMT

Tree-based machine learning models such as random forests, decision trees and gradient boosted trees are popular nonlinear predictive models, yet comparatively little attention has been paid to explaining their predictions. Here we improve the interpretability of tree-based models through three main contributions. We apply these tools to three medical machine learning problems and show how combining many high-quality local explanations allows us to represent global structure while retaining local faithfulness to the original model. These tools enable us to (1) identify high-magnitude but low-frequency nonlinear mortality risk factors in the US population, (2) highlight distinct population subgroups with shared risk characteristics, (3) identify nonlinear interaction effects among risk factors for chronic kidney disease and (4) monitor a machine learning model deployed in a hospital by identifying which features are degrading the model's performance over time. Given the popularity of tree-based machine learning models, these improvements to their interpretability have implications across a broad set of domains.

explainable ai, explanation, local explanation, (4 more...)

#artificialintelligence

Country: North America > United States (0.29)

Industry: Health & Medicine > Therapeutic Area > Nephrology (0.54)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (0.63)
Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (0.63)
Information Technology > Artificial Intelligence > Natural Language > Explanation & Argumentation (0.54)
Information Technology > Artificial Intelligence > Issues > Social & Ethical Issues (0.40)


Forecasting Corn Yield with Machine Learning Ensembles

Shahhosseini, Mohsen, Hu, Guiping, Archontoulis, Sotirios V.

arXiv.org Machine Learning, Jan-17-2020

The emergence of new technologies to synthesize and analyze big data with high-performance computing has increased our capacity to more accurately predict crop yields. Recent research has shown that machine learning (ML) can provide reasonable predictions faster and with higher flexibility compared to simulation crop modeling. The earlier the prediction during the growing season the better, but this has not been thoroughly investigated, as previous studies considered all data available to predict yields. This paper provides a machine learning based framework to forecast corn yields in three US Corn Belt states (Illinois, Indiana, and Iowa) considering complete and partial in-season weather knowledge. Several ensemble models are designed using a blocked sequential procedure to generate out-of-bag predictions. The forecasts are made at the county level and aggregated to the agricultural district and state levels. Results show that ensemble models based on a weighted average of the base learners outperform individual models. Specifically, the proposed ensemble model could achieve the best prediction accuracy (RRMSE of 7.8%) and the least mean bias error (-6.06 bu/acre) compared to other developed models. Comparing our proposed model forecasts with the literature demonstrates the superiority of forecasts made by our proposed ensemble model. Results from the scenario of having partial in-season weather knowledge reveal that decent yield forecasts can be made as early as June 1st. To find the marginal effect of each input feature on the forecasts made by the proposed ensemble model, a methodology is suggested that is the basis for finding feature importance for the ensemble model. The findings suggest that weather features corresponding to weather in weeks 18-24 (May 1st to June 1st) are the most important input features.

ensemble, ensemble model, prediction, (14 more...)

arXiv.org Machine Learning

2001.09055

Country:

North America > United States > Indiana (0.24)
North America > United States > Illinois (0.24)
North America > United States > Iowa > Story County > Ames (0.04)
(3 more...)

Genre: Research Report > New Finding (0.68)

Industry:

Government > Regional Government > North America Government > United States Government (1.00)
Food & Agriculture > Agriculture (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (0.96)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.47)
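
The corn-yield abstract above describes ensembles built as a weighted average of base learners, with weights derived from out-of-bag predictions. A minimal sketch for two base learners, assuming the weight is chosen by a simple grid search that minimizes mean squared error on held-out predictions; the paper's actual optimization procedure is not given in the abstract, and all names and numbers below are hypothetical:

```python
def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)


def best_blend_weight(y_true, pred_a, pred_b, steps=100):
    """Grid-search w in [0, 1] for the blend w * pred_a + (1 - w) * pred_b."""
    best_w, best_err = 0.0, float("inf")
    for i in range(steps + 1):
        w = i / steps
        blend = [w * a + (1.0 - w) * b for a, b in zip(pred_a, pred_b)]
        err = mse(y_true, blend)
        if err < best_err:
            best_w, best_err = w, err
    return best_w, best_err


# Hypothetical held-out yields and two biased base learners.
observed = [10.0, 20.0, 30.0]
learner_a = [12.0, 22.0, 32.0]  # over-predicts by 2
learner_b = [8.0, 18.0, 28.0]   # under-predicts by 2

print(mse(observed, learner_a), mse(observed, learner_b))
print(best_blend_weight(observed, learner_a, learner_b))
```

In this toy case the two learners have opposite biases, so the blend beats both of them; choosing the weight on held-out rather than training predictions keeps the ensemble from simply favoring the most overfit learner.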


Machine learning for total cloud cover prediction

Baran, Ágnes, Lerch, Sebastian, Ayari, Mehrez El, Baran, Sándor

arXiv.org Machine Learning, Jan-16-2020

Accurate and reliable forecasting of total cloud cover (TCC) is vital for many areas such as astronomy, energy demand and production, or agriculture. Most meteorological centres issue ensemble forecasts of TCC, however, these forecasts are often uncalibrated and exhibit worse forecast skill than ensemble forecasts of other weather variables. Hence, some form of post-processing is strongly required to improve predictive performance. As TCC observations are usually reported on a discrete scale taking just nine different values called oktas, statistical calibration of TCC ensemble forecasts can be considered a classification problem with outputs given by the probabilities of the oktas. This is a classical area where machine learning methods are applied. We investigate the performance of post-processing using multilayer perceptron (MLP) neural networks, gradient boosting machines (GBM) and random forest (RF) methods. Based on the European Centre for Medium-Range Weather Forecasts global TCC ensemble forecasts for 2002-2014 we compare these approaches with the proportional odds logistic regression (POLR) and multiclass logistic regression (MLR) models, as well as the raw TCC ensemble forecasts. We further assess whether improvements in forecast skill can be obtained by incorporating ensemble forecasts of precipitation as additional predictor. Compared to the raw ensemble, all calibration methods result in a significant improvement in forecast skill. RF models provide the smallest increase in predictive performance, while MLP, POLR and GBM approaches perform best.
Key words: ensemble calibration; gradient boosting machine; logistic regression; multilayer perceptron; random forest; total cloud cover
1 Introduction
Reliable and accurate prediction of total cloud cover (TCC) has a principal importance in observational astronomy (Ye and Chen, 2013) and in the prediction of photovoltaic energy production, as it is the main cause of variation in solar-radiation energy supply (Matuszko, 2012; McEvoy et al., 2012), but it is also of great relevance in agriculture, tourism and in some other fields of economy.

ensemble forecast, forecast, tcc ensemble forecast, (15 more...)

arXiv.org Machine Learning

2001.05948

Country:

Europe > Hungary > Hajdú-Bihar County > Debrecen (0.04)
North America > United States > New York (0.04)
Europe > Netherlands > North Holland > Amsterdam (0.04)
Europe > Germany > Baden-Württemberg > Karlsruhe Region > Karlsruhe (0.04)

Genre: Research Report > Experimental Study (0.75)

Industry: Energy > Renewable > Solar (0.88)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.75)


Reproducible Bootstrap Aggregating

Liu, Meimei, Dunson, David B.

arXiv.org Machine Learning, Jan-12-2020

Heterogeneity between training and testing data degrades reproducibility of a well-trained predictive algorithm. In modern applications, how to deploy a trained algorithm in a different domain is becoming an urgent question raised by many domain scientists. In this paper, we propose a reproducible bootstrap aggregating (Rbagging) method coupled with a new algorithm, the iterative nearest neighbor sampler (INNs), effectively drawing bootstrap samples from training data to mimic the distribution of the test data. Rbagging is a general ensemble framework that can be applied to most classifiers. We further propose Rbagging+ to effectively detect anomalous samples in the testing data. Our theoretical results show that the resamples based on Rbagging have the same distribution as the testing data. Moreover, under suitable assumptions, we further provide a general bound to control the test excess risk of the ensemble classifiers. The proposed method is compared with several other popular domain adaptation methods via extensive simulation studies and real applications including medical diagnosis and imaging classifications.

classifier, rbagging, testing data, (16 more...)

arXiv.org Machine Learning

2001.03988

Country:

North America > United States > Wisconsin (0.04)
North America > United States > North Carolina > Durham County > Durham (0.04)
North America > United States > California > Orange County > Irvine (0.04)

Genre: Research Report > New Finding (0.34)

Industry: Health & Medicine > Therapeutic Area > Oncology (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.93)
(2 more...)
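
For readers unfamiliar with the baseline that Rbagging extends, here is a minimal sketch of ordinary bootstrap aggregating: resample the training data with replacement, fit a base learner on each resample, and combine predictions by majority vote. It uses uniform resampling and a 1-nearest-neighbour base learner chosen for brevity; it does not implement the INNs sampler or Rbagging+ described in the abstract above, and all names and data are hypothetical:

```python
import random
from collections import Counter


def bootstrap_sample(data, rng):
    """Draw len(data) items from data uniformly, with replacement."""
    return [rng.choice(data) for _ in data]


def nearest_label(sample, x):
    """1-nearest-neighbour base learner on (feature, label) pairs."""
    return min(sample, key=lambda pair: abs(pair[0] - x))[1]


def majority_vote(labels):
    """Return the most frequent label."""
    return Counter(labels).most_common(1)[0][0]


def bagged_predict(data, x, n_bags=25, seed=0):
    """Predict the label of x by voting over base learners fit on resamples."""
    rng = random.Random(seed)
    votes = [nearest_label(bootstrap_sample(data, rng), x)
             for _ in range(n_bags)]
    return majority_vote(votes)


# Hypothetical one-dimensional training set: (feature, label).
train = [(0.0, 0), (1.0, 0), (2.0, 0), (10.0, 1), (11.0, 1), (12.0, 1)]
print(bagged_predict(train, 1.5))
print(bagged_predict(train, 10.5))
```

In this plain version every training point is equally likely to be drawn. The abstract's proposal, as described there, instead draws the bootstrap samples so that they mimic the distribution of the test data.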


r/MachineLearning - [P] Me and a friend made an online insult classifier as a university assignment. We could use your help estimating its real life usage accuracy.

#artificialintelligence, Jan-11-2020, 19:01:37 GMT

Basically we had to clean up about 4000 labelled examples of insult / not insult, vectorize it (just bag of words at the moment) and find a good classifier. What we found worked best was random forest and gradient boosting, with random forest rated at ROC 0.962 running on the web version at the moment. It's set up so one can grade each attempt and will log it accordingly. Since the data set is rather small I don't think the usual metrics can give a realistic result in terms of actual accuracy. But if we get it tested enough we should be able to estimate it from the received grades and also use those examples to further extend the data set.

online insult classifier, real life usage accuracy, university assignment, (1 more...)

#artificialintelligence

Industry: Education > Educational Setting > Online (0.85)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (0.96)
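
The post above mentions vectorizing the labelled examples with a plain bag of words before training a classifier. A minimal sketch of that vectorization step, assuming lowercase word tokens and raw counts; the tokenizer, function names, and example sentences are hypothetical and not taken from the project:

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def build_vocabulary(documents):
    """Sorted list of every distinct token in the corpus."""
    return sorted({token for doc in documents for token in tokenize(doc)})


def vectorize(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]


docs = ["You are great", "you are a fool, a real fool"]
vocab = build_vocabulary(docs)
print(vocab)
print([vectorize(doc, vocab) for doc in docs])
```

Each document becomes a fixed-length count vector that a random forest or gradient boosting classifier can consume; word order is discarded, which is the main limitation of this representation.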
