AITopics | Ensemble Learning

Ensemble Learning

Ensemble methods use multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone. (Wikipedia)

Imputation of missing sub-hourly precipitation data in a large sensor network: a machine learning approach

Chivers, Benedict Delahaye, Wallbank, John, Cole, Steven J., Sebek, Ondrej, Stanley, Simon, Fry, Matthew, Leontidis, Georgios

arXiv.org Machine Learning, May-2-2020

Precipitation data collected at sub-hourly resolution represents specific challenges for missing data recovery by being largely stochastic in nature and highly unbalanced in the duration of rain vs nonrain. Here we present a two-step analysis utilising current machine learning techniques for imputing precipitation data sampled at 30-minute intervals by devolving the task into (a) the classification of rain or non-rain samples, and (b) regressing the absolute values of predicted rain samples. Investigating 37 weather stations in the UK, this machine learning process produces more accurate predictions for recovering precipitation data than an established surface fitting technique utilising neighbouring rain gauges. Increasing available features for the training of machine learning algorithms increases performance with the integration of weather data at the target site with externally sourced rain gauges providing the highest performance. This method informs machine learning models by utilising information in concurrently collected environmental data to make accurate predictions of missing rain data. Capturing complex nonlinear relationships from weakly correlated variables is critical for data recovery at sub-hourly resolutions. Such pipelines for data recovery can be developed and deployed for highly automated and near instantaneous imputation of missing values in ongoing datasets at high temporal resolutions.

Keywords: machine learning, data imputation, gradient boosted trees, environmental sensor networks, precipitation, soil moisture

1. Introduction

Precipitation data is of critical importance across multiple lines of enquiry, informing statistical models and analysis relating to weather forecasting, extreme weather events, climate change, water-resource management, droughts, flooding, agricultural impact, and hydroelectric power.
Historical rainfall data can reveal long term trends in environmental hydrological issues with real-time data input allowing for immediate forecasting of future conditions. Distributed networks of rain gauges are typically used to provide precipitation data at the earth's surface at varying temporal resolutions and can cover large geographical areas (Kidd, 2001). As is the case in many databases, particularly those utilising physical sensors, the problem of missing data arises. Missing data can be a result of sensor failure, data storage/transmission failure, or post-collection quality control procedures resulting in removal of identified problem data (Blenkinsop et al., 2017). Missing data in precipitation databases represents a serious limitation for the effective use of the data. Given the global scale and importance of precipitation and meteorological data (Sun et al., 2018), developing solutions to missing data is of paramount importance for maximising information gain.
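
The two-step structure described in the abstract, (a) classify rain vs non-rain and (b) regress the amount for predicted rain samples, can be sketched as follows. This is only an illustration under stated assumptions: the paper's keywords point to gradient boosted trees, whereas this sketch swaps in a single-threshold classifier and a least-squares line to stay short, uses one hypothetical feature (a neighbouring gauge reading), and runs on synthetic data generated in the script.

```python
import random

def fit_rain_classifier(xs, is_rain):
    """Step (a): choose the threshold on x that best separates rain from non-rain."""
    best_t, best_acc = None, -1.0
    for t in sorted(set(xs)):
        acc = sum((x >= t) == r for x, r in zip(xs, is_rain)) / len(xs)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

def fit_rain_regressor(xs, ys):
    """Step (b): least-squares line y = a + b*x, fitted on rain samples only."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b

def impute(x, threshold, line):
    """Predict 0 for samples classified as dry, otherwise the regressed amount."""
    if x < threshold:
        return 0.0
    a, b = line
    return max(0.0, a + b * x)

# Synthetic 30-minute records (assumption, not the paper's data):
# (neighbouring gauge reading, target gauge reading), both in mm.
rng = random.Random(0)
records = []
for _ in range(500):
    if rng.random() < 0.85:  # most intervals are dry, so the classes are unbalanced
        records.append((rng.uniform(0.0, 0.1), 0.0))
    else:
        x = rng.uniform(0.3, 5.0)
        records.append((x, max(0.0, 0.8 * x + rng.gauss(0.0, 0.1))))

xs = [x for x, _ in records]
is_rain = [y > 0 for _, y in records]
threshold = fit_rain_classifier(xs, is_rain)
line = fit_rain_regressor([x for x, y in records if y > 0],
                          [y for _, y in records if y > 0])

dry_estimate = impute(0.05, threshold, line)
wet_estimate = impute(2.0, threshold, line)
print(dry_estimate, round(wet_estimate, 2))
```

Fitting the regressor only on rain samples keeps the many zero readings from pulling the predicted amounts toward zero, which is one plausible motivation for splitting the task in two.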

artificial intelligence, gauge, machine learning, (19 more...)

arXiv.org Machine Learning

doi: 10.1016/j.jhydrol.2020.125126

2004.11123

Country:

Europe > Western Europe (0.04)
Europe > United Kingdom > Wales (0.04)
Europe > United Kingdom > Scotland (0.04)
(6 more...)

Genre: Research Report (0.82)

Industry:

Energy (1.00)
Health & Medicine > Pharmaceuticals & Biotechnology (0.93)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (0.69)

DriveML: An R Package for Driverless Machine Learning

Putatunda, Sayan, Ubrangala, Dayananda, Rama, Kiran, Kondapalli, Ravi

arXiv.org Machine Learning, May-1-2020

In recent years, the concept of automated machine learning has become very popular. Automated Machine Learning (AutoML) mainly refers to automated methods for model selection and hyper-parameter optimization of various algorithms such as random forests, gradient boosting, and neural networks. In this paper, we introduce a new package, DriveML, for automated machine learning. DriveML helps in implementing some of the pillars of an automated machine learning pipeline, such as automated data preparation, feature engineering, model building, and model explanation, by calling a function instead of writing lengthy R code. The DriveML package is available on CRAN. We compare the DriveML package with other relevant packages on CRAN/GitHub and find that DriveML performs the best across different parameters. We also provide an illustration by applying the DriveML package with its default configuration to a real-world dataset. Overall, the main benefits of DriveML are development time savings, fewer developer errors, optimal tuning of machine learning models, and reproducibility.

dataset, driveml package, machine learning, (13 more...)

arXiv.org Machine Learning

2005.00478

Country:

Europe > Austria > Vienna (0.14)
Europe > Portugal > Lisbon > Lisbon (0.04)
Asia > India (0.04)

Genre: Research Report (0.69)

Industry: Health & Medicine > Therapeutic Area (0.30)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (0.73)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.70)
Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (0.51)

Exploiting Categorical Structure Using Tree-Based Methods

Lucena, Brian

arXiv.org Artificial Intelligence, Apr-15-2020

Standard methods of using categorical variables as predictors either endow them with an ordinal structure or assume they have no structure at all. However, categorical variables often possess structure that is more complicated than a linear ordering can capture. We develop a mathematical framework for representing the structure of categorical variables and show how to generalize decision trees to make use of this structure. This approach is applicable to methods such as Gradient Boosted Trees which use a decision tree as the underlying learner. We show results on weather data to demonstrate the improvement yielded by this approach.

decision tree, partition, terrain, (14 more...)

arXiv.org Artificial Intelligence

2004.07383

Country:

North America > United States > California > San Francisco County > San Francisco (0.04)
North America > United States > Massachusetts (0.04)
North America > United States > District of Columbia (0.04)
(2 more...)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (0.92)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (0.90)

A Modified Bayesian Optimization based Hyper-Parameter Tuning Approach for Extreme Gradient Boosting

Putatunda, Sayan, Rama, Kiran

arXiv.org Machine Learning, Apr-10-2020

It is already reported in the literature that the performance of a machine learning algorithm is greatly impacted by proper hyper-parameter optimization. One way to perform hyper-parameter optimization is manual search, but that is time consuming. Common approaches to hyper-parameter optimization are grid search, random search, and Bayesian optimization using Hyperopt. In this paper, we propose a new approach to hyper-parameter optimization, Randomized-Hyperopt, and then tune the hyper-parameters of XGBoost, i.e. the Extreme Gradient Boosting algorithm, on ten datasets by applying random search, Randomized-Hyperopt, Hyperopt, and grid search. The performance of these four techniques was compared by taking both prediction accuracy and execution time into consideration. We find that Randomized-Hyperopt performs better than the other three conventional methods for hyper-parameter optimization of XGBoost.
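
Two of the baselines named above, grid search and random search, can be sketched in a few lines. This is a toy illustration only: `validation_score` is a hypothetical stand-in for a cross-validated model score, the parameter names and ranges are assumptions, and neither Hyperopt nor the paper's Randomized-Hyperopt method is reproduced here.

```python
import itertools
import random

def validation_score(learning_rate, max_depth):
    """Hypothetical stand-in for a cross-validated score; peaks at lr=0.1, depth=6."""
    return 1.0 - 10 * (learning_rate - 0.1) ** 2 - ((max_depth - 6) / 10) ** 2

def grid_search(grid):
    """Evaluate every combination of the listed values and keep the best one."""
    candidates = [dict(zip(grid, values))
                  for values in itertools.product(*grid.values())]
    return max(candidates, key=lambda params: validation_score(**params))

def random_search(space, n_iter, seed=0):
    """Draw n_iter random configurations from the samplers and keep the best one."""
    rng = random.Random(seed)
    candidates = [{name: sampler(rng) for name, sampler in space.items()}
                  for _ in range(n_iter)]
    return max(candidates, key=lambda params: validation_score(**params))

# Grid search: 3 x 3 = 9 evaluations on fixed values.
grid = {"learning_rate": [0.01, 0.1, 0.3], "max_depth": [3, 6, 9]}
best_grid = grid_search(grid)

# Random search with the same budget of 9 evaluations, sampling the
# learning rate on a log scale and the depth uniformly over integers.
space = {
    "learning_rate": lambda rng: 10 ** rng.uniform(-2.0, -0.5),
    "max_depth": lambda rng: rng.randint(2, 10),
}
best_random = random_search(space, n_iter=9)

print(best_grid, best_random)
```

Grid search cost grows multiplicatively with each added hyper-parameter, while random search lets the evaluation budget be fixed in advance; execution time is one of the two criteria the abstract compares.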

dataset, hyperparameter, optimization, (13 more...)

arXiv.org Machine Learning

doi: 10.1109/ICInPro47689.2019.9092025

2004.05041

Country:

Asia > India > Karnataka > Bengaluru (0.06)
Europe > France (0.04)
Asia > Taiwan (0.04)
Asia > China > Shanghai > Shanghai (0.04)

Genre: Research Report (0.82)

Industry: Health & Medicine (0.71)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (1.00)
(2 more...)

Adversarial Validation Approach to Concept Drift Problem in Automated Machine Learning Systems

Pan, Jing, Pham, Vincent, Dorairaj, Mohan, Chen, Huigang, Lee, Jeong-Yoon

arXiv.org Machine Learning, Apr-6-2020

In automated machine learning systems, concept drift in input data is one of the main challenges. It deteriorates model performance on new data over time. Previous research on concept drift mostly proposed model retraining after observing performance decreases. However, this approach is suboptimal because the system fixes the problem only after suffering from poor performance on new data. Here, we introduce an adversarial validation approach to concept drift problems in automated machine learning systems. With our approach, the system detects concept drift in new data before making inference, trains a model, and produces predictions adapted to the new data. We show that our approach addresses concept drift effectively with the AutoML3 Lifelong Machine Learning challenge data as well as in Uber's internal automated machine learning system, MaLTA.
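
The detection step of adversarial validation can be sketched as follows: label training rows 0 and new rows 1, fit a classifier to tell them apart, and read its held-out AUC. An AUC near 0.5 suggests the two sets look alike, while a clearly higher value suggests drift. This is a minimal sketch under assumptions: the classifier here is a simple difference-of-means linear score chosen for brevity, the data is synthetic, and the paper's subsequent model training and adaptation steps are not shown.

```python
import random

def auc(scores, labels):
    """Probability that a random positive outranks a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def mean_vector(rows):
    """Column-wise mean of a list of equal-length feature rows."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def adversarial_auc(train_rows, new_rows, seed=0):
    """Held-out AUC of a classifier trained to separate train rows from new rows."""
    rng = random.Random(seed)
    data = [(row, 0) for row in train_rows] + [(row, 1) for row in new_rows]
    rng.shuffle(data)
    half = len(data) // 2
    fit_part, held_out = data[:half], data[half:]
    # Linear score along the direction from the "train" mean to the "new" mean.
    mu_train = mean_vector([row for row, y in fit_part if y == 0])
    mu_new = mean_vector([row for row, y in fit_part if y == 1])
    weights = [b - a for a, b in zip(mu_train, mu_new)]
    scores = [sum(w * x for w, x in zip(weights, row)) for row, _ in held_out]
    return auc(scores, [y for _, y in held_out])

# Synthetic two-feature data (assumption): drift is a shift in the first feature.
rng = random.Random(1)

def sample(mean_x, n=200):
    return [[rng.gauss(mean_x, 1.0), rng.gauss(0.0, 1.0)] for _ in range(n)]

train = sample(0.0)
auc_same = adversarial_auc(train, sample(0.0))     # new data from the same distribution
auc_drifted = adversarial_auc(train, sample(1.5))  # new data with a shifted feature

print(round(auc_same, 2), round(auc_drifted, 2))
```

Because the check needs only features, not labels, it can run before any prediction is made on the new data, which matches the abstract's point about detecting drift before inference.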

classifier, dataset, validation, (12 more...)

arXiv.org Machine Learning

2004.03045

Country:

Europe > Middle East > Malta (0.27)
North America > United States > California > San Francisco County > San Francisco (0.14)
Europe > Spain > Galicia > Madrid (0.04)

Genre: Research Report > Experimental Study (0.71)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (0.47)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.47)

XtracTree for Regulator Validation of Bagging Methods Used in Retail Banking

Charlier, Jeremy, Makarenkov, Vladimir

arXiv.org Artificial Intelligence, Apr-5-2020

Bootstrap aggregation, known as bagging, is one of the most popular ensemble methods used in machine learning (ML). An ensemble method is a supervised ML method that combines multiple hypotheses to form a single hypothesis used for prediction. A bagging algorithm combines multiple classifiers modelled on different sub-samples of the same data set to build one large classifier. Large retail banks are nowadays using the power of ML algorithms, including decision trees and random forests, to optimize the retail banking activities. However, AI bank researchers face a strong challenge from their own model validation department as well as from national financial regulators. Each proposed ML model has to be validated and clear rules for every algorithm-based decision have to be established. In this context, we propose XtracTree, an algorithm that is capable of effectively converting an ML bagging classifier, such as a decision tree or a random forest, into simple "if-then" rules satisfying the requirements of model validation. Our algorithm is also capable of highlighting the decision path for each individual sample or a group of samples, addressing any concern from the regulators regarding ML "black-box". We use a public loan data set from Kaggle to illustrate the usefulness of our approach. Our experiments indicate that, using XtracTree, we are able to ensure a better understanding for our model, leading to an easier model validation by national financial regulators and the internal model validation department.
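
Bagging as defined above, several classifiers fitted on different resamples of the same data set and combined into one, can be sketched as follows. This is an illustrative sketch under assumptions: the base learner is a one-split decision stump rather than a full tree, the data is synthetic, and nothing here reproduces XtracTree or its rule extraction.

```python
import random
from collections import Counter

def fit_stump(rows, labels):
    """Best single-feature rule of the form: predict 1 when row[f] >= t."""
    best = None
    for f in range(len(rows[0])):
        for t in sorted({row[f] for row in rows}):
            acc = sum((row[f] >= t) == bool(y) for row, y in zip(rows, labels)) / len(rows)
            if best is None or acc > best[0]:
                best = (acc, f, t)
    return best[1], best[2]

def fit_bagging(rows, labels, n_estimators=15, seed=0):
    """Fit one stump per bootstrap sample (rows drawn with replacement)."""
    rng = random.Random(seed)
    n = len(rows)
    stumps = []
    for _ in range(n_estimators):
        idx = [rng.randrange(n) for _ in range(n)]
        stumps.append(fit_stump([rows[i] for i in idx], [labels[i] for i in idx]))
    return stumps

def predict(stumps, row):
    """Combine the individual stumps by majority vote."""
    votes = Counter(int(row[f] >= t) for f, t in stumps)
    return votes.most_common(1)[0][0]

# Synthetic data (assumption): the label depends on feature 0 only,
# and about 10% of the labels are flipped as noise.
rng = random.Random(0)
rows = [[rng.random(), rng.random()] for _ in range(100)]
labels = [int(row[0] >= 0.5) ^ int(rng.random() < 0.1) for row in rows]

stumps = fit_bagging(rows, labels)
high = predict(stumps, [0.9, 0.2])
low = predict(stumps, [0.1, 0.8])
print(high, low)
```

Each stump here is already a single readable threshold rule; a forest of deeper trees is harder to read, which is the validation problem the abstract says XtracTree addresses.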

algorithm, classifier, xtractree, (15 more...)

arXiv.org Artificial Intelligence

2004.02326

Country:

North America > Canada > Quebec > Montreal (0.14)
Asia > China (0.04)
Asia > Japan > Honshū > Tōhoku > Fukushima Prefecture > Fukushima (0.04)

Genre: Research Report (0.64)

Industry:

Law > Business Law (0.75)
Banking & Finance > Loans (0.48)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Gradient Boosting with Scikit-Learn, XGBoost, LightGBM, and CatBoost

#artificialintelligence, Apr-3-2020, 23:28:37 GMT

Gradient boosting is a powerful ensemble machine learning algorithm. It's popular for structured predictive modeling problems, such as classification and regression on tabular data, and is often the main algorithm or one of the main algorithms used in winning solutions to machine learning competitions, like those on Kaggle. There are many implementations of gradient boosting available, including standard implementations in scikit-learn and efficient third-party libraries. Each uses a different interface and even different names for the algorithm. In this tutorial, you will discover how to use gradient boosting models for classification and regression in Python. Standardized code examples are provided for the four major implementations of gradient boosting in Python, ready for you to copy-paste and use in your own predictive modeling project.
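
To show what the algorithm does independently of any library, here is a from-scratch sketch of gradient boosting for squared-error regression with one-split regression stumps. It is a teaching sketch under assumptions (one feature, a toy sine-curve data set, fixed learning rate) and does not reflect the interface of scikit-learn, XGBoost, LightGBM, or CatBoost.

```python
import math

def fit_stump(xs, residuals):
    """Regression stump: the split point and mean residual on each side with lowest squared error."""
    best = None
    for t in sorted(set(xs))[1:]:  # skip the smallest value so the left side is never empty
        left = [r for x, r in zip(xs, residuals) if x < t]
        right = [r for x, r in zip(xs, residuals) if x >= t]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, left_mean, right_mean)
    return best[1:]

def fit_gradient_boosting(xs, ys, n_rounds=100, learning_rate=0.1):
    """Start from the mean, then repeatedly fit a stump to the current residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        # For squared error, the negative gradient is simply the residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        t, left_mean, right_mean = fit_stump(xs, residuals)
        stumps.append((t, left_mean, right_mean))
        preds = [p + learning_rate * (left_mean if x < t else right_mean)
                 for x, p in zip(xs, preds)]
    return base, learning_rate, stumps

def predict(model, x):
    """Sum the shrunken stump outputs on top of the base prediction."""
    base, learning_rate, stumps = model
    return base + learning_rate * sum(l if x < t else r for t, l, r in stumps)

def mse(preds, ys):
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(ys)

# Toy data (assumption): 40 points on a sine curve.
xs = [i / 10 for i in range(40)]
ys = [math.sin(x) for x in xs]

model = fit_gradient_boosting(xs, ys)
mse_baseline = mse([model[0]] * len(ys), ys)             # predicting the mean everywhere
mse_boosted = mse([predict(model, x) for x in xs], ys)   # after 100 boosting rounds
print(mse_baseline, mse_boosted)
```

The library implementations add many refinements on top of this loop, such as deeper trees, regularisation, and faster split finding, but the additive fit-to-residuals structure is the common core.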

implementation, library, single prediction, (15 more...)

#artificialintelligence

Genre: Instructional Material > Course Syllabus & Notes (0.49)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (1.00)

Lightning Fast XGBoost on Multiple GPUs

#artificialintelligence, Apr-2-2020, 21:19:02 GMT

XGBoost is one of the most used libraries for data science. When XGBoost came into existence, it was lightning fast compared to its nearest rival, Python's Scikit-learn GBM. But as time has progressed, it has been rivaled by some awesome libraries like LightGBM and CatBoost, on both speed and accuracy. I, for one, use LightGBM for most of the use cases where I have only a CPU for training. But when I have a GPU or multiple GPUs at my disposal, I still love to train with XGBoost.

gpus, multiple gpus, xgboost, (8 more...)

#artificialintelligence

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (1.00)

Unpack Local Model Interpretation for GBDT

Fang, Wenjing, Zhou, Jun, Li, Xiaolong, Zhu, Kenny Q.

arXiv.org Machine Learning, Apr-2-2020

Because GBDT inherits good performance from its ensemble nature, much attention has been drawn to the optimization of this model. With its popularization, an increasing need for model interpretation arises. Besides the commonly used feature importance as a global interpretation, feature contribution is a local measure that reveals the relationship between a specific instance and the related output. This work focuses on local interpretation and proposes a unified computation mechanism to get the instance-level feature contributions for GBDT in any version. The practicality of this mechanism is validated by the listed experiments as well as applications in real industry scenarios.

feature contribution, gbdt, interpretation, (15 more...)

arXiv.org Machine Learning

2004.01358

Country:

Asia > China > Shanghai > Shanghai (0.04)
North America > United States > New York > New York County > New York City (0.04)
Asia > China > Zhejiang Province > Hangzhou (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (0.77)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (0.53)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.47)

Predicting Injectable Medication Adherence via a Smart Sharps Bin and Machine Learning

Gu, Yingqi, Zalkikar, Akshay, Kelly, Lara, Daly, Kieran, Ward, Tomas E.

arXiv.org Machine Learning, Apr-2-2020

Medication non-adherence is a widespread problem affecting over 50% of people who have chronic illness and need chronic treatment. Non-adherence exacerbates health risks and drives significant increases in treatment costs. In order to address these challenges, the importance of predicting patients' adherence has been recognised. In other words, it is important to improve the efficiency of interventions of the current healthcare system by prioritizing resources to the patients who are most likely to be non-adherent. Our objective in this work is to make predictions regarding individual patients' behaviour in terms of taking their medication on time during their next scheduled medication opportunity. We do this by leveraging a number of machine learning models. In particular, we demonstrate the use of a connected IoT device, a "Smart Sharps Bin" invented by HealthBeacon Ltd., to monitor and track injection disposal of patients in their home environment. Using extensive data collected from these devices, five machine learning models, namely Extra Trees Classifier, Random Forest, XGBoost, Gradient Boosting, and Multilayer Perceptron, were trained and evaluated on a large dataset comprising 165,223 historic injection disposal records collected from 5,915 HealthBeacon units over the course of 3 years. The testing work was conducted on real-time data generated by the smart device over a time period after the model training was complete, i.e. true future data. The proposed machine learning approach demonstrated very good predictive performance, exhibiting an Area Under the Receiver Operating Characteristic Curve (ROC AUC) of 0.86.

adherence, medication, on-time, (16 more...)

arXiv.org Machine Learning

2004.01144

Country:

North America > United States (0.46)
Europe > Ireland (0.05)
South America (0.04)
(5 more...)

Genre: Research Report (0.64)

Industry:

Health & Medicine > Health Care Providers & Services (1.00)
Health & Medicine > Government Relations & Public Policy (1.00)
Information Technology > Security & Privacy (0.94)
(3 more...)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (0.89)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.68)
