AITopics | Ensemble Learning

Ensemble Learning

Ensemble methods use multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone. (Wikipedia)
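As a minimal illustration of the definition above, the sketch below combines three base classifiers by majority vote. The labels and per-model predictions are hypothetical values chosen for illustration; they do not come from any item listed on this page.

```python
from collections import Counter


def majority_vote(votes):
    """Return the most common label among the base models' votes."""
    return Counter(votes).most_common(1)[0][0]


def accuracy(predicted, truth):
    """Fraction of positions where the prediction matches the true label."""
    return sum(p == t for p, t in zip(predicted, truth)) / len(truth)


# Hypothetical true labels and predictions from three base classifiers.
truth = [1, 0, 1, 1, 0]
model_a = [1, 0, 0, 1, 0]  # wrong on sample 2
model_b = [1, 1, 1, 1, 0]  # wrong on sample 1
model_c = [0, 0, 1, 1, 0]  # wrong on sample 0

# Each sample is labelled by the majority of the three models.
ensemble = [majority_vote(votes) for votes in zip(model_a, model_b, model_c)]

for name, pred in [("a", model_a), ("b", model_b), ("c", model_c), ("ensemble", ensemble)]:
    print(name, accuracy(pred, truth))
```

Because the three models err on different samples, the vote corrects every individual mistake in this toy case. Real ensembles only benefit to the extent that their members' errors are not strongly correlated.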

Scalable Feature Selection for (Multitask) Gradient Boosted Trees

Han, Cuize, Rao, Nikhil, Sorokina, Daria, Subbian, Karthik

arXiv.org Machine Learning, Sep-4-2021

Gradient Boosted Decision Trees (GBDTs) are widely used for building ranking and relevance models in search and recommendation. Considerations such as latency and interpretability dictate the use of as few features as possible to train these models. Feature selection in GBDT models typically involves heuristically ranking the features by importance and selecting the top few, or by performing a full backward feature elimination routine. On-the-fly feature selection methods proposed previously scale suboptimally with the number of features, which can be daunting in high dimensional settings. We develop a scalable forward feature selection variant for GBDT, via a novel group testing procedure that works well in high dimensions, and enjoys favorable theoretical performance and computational guarantees. We show via extensive experiments on both public and proprietary datasets that the proposed method offers significant speedups in training time, while being as competitive as existing GBDT methods in terms of model performance metrics. We also extend the method to the multitask setting, allowing the practitioner to select common features across tasks, as well as selecting task-specific features.
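To make the baseline concrete, here is a rough sketch of naive on-the-fly forward feature selection inside boosting: each round scans every allowed feature for the best regression stump on the current residuals and stops admitting new features once a budget is reached. This is not the paper's group testing procedure; the function names, the stump learner, and the toy data are all assumptions made for illustration. Scanning every feature in every round is the cost that grows with dimensionality.

```python
def best_stump(X, residuals, features):
    """Return (gain, feature, threshold, left_mean, right_mean) for the best split."""
    n = len(X)
    base_mean = sum(residuals) / n
    base_sse = sum((r - base_mean) ** 2 for r in residuals)
    best = (0.0, None, None, base_mean, base_mean)
    for j in features:
        values = sorted({row[j] for row in X})
        # Candidate thresholds are midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [r for row, r in zip(X, residuals) if row[j] <= t]
            right = [r for row, r in zip(X, residuals) if row[j] > t]
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
            gain = base_sse - sse  # reduction in squared error from this split
            if gain > best[0]:
                best = (gain, j, t, lm, rm)
    return best


def select_features(X, y, budget, rounds=10):
    """Boost stumps on residuals, admitting at most `budget` distinct features."""
    pred = [sum(y) / len(y)] * len(y)
    selected = []
    for _ in range(rounds):
        residuals = [t - p for t, p in zip(y, pred)]
        # Once the budget is spent, only already-selected features may be split on.
        candidates = list(range(len(X[0]))) if len(selected) < budget else list(selected)
        gain, j, t, lm, rm = best_stump(X, residuals, candidates)
        if j is None:  # no split reduces the error any further
            break
        if j not in selected:
            selected.append(j)
        pred = [p + (lm if row[j] <= t else rm) for row, p in zip(X, pred)]
    return selected


# Toy data: the target depends on features 0 and 2 only; feature 1 is irrelevant.
X = [[a, b, c] for a in (0, 1) for b in (0, 1) for c in (0, 1)]
y = [2 * a + c for a, b, c in X]
print(select_features(X, y, budget=2))
```

On this toy data the procedure picks feature 0 first (largest error reduction), then feature 2, and never selects the irrelevant feature 1.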

dataset, gt-gbm, scalable feature selection, (13 more...)

arXiv.org Machine Learning

2109.01965

Country:

North America > United States > California > Santa Clara County > Palo Alto (0.04)
Europe > Italy (0.04)

Genre: Research Report (0.82)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (1.00)

XGBoost Regression: Explain It To Me Like I'm 10

#artificialintelligence, Sep-3-2021, 11:05:28 GMT

When I was just starting on my quest to understand Machine Learning algorithms, I would get overwhelmed with all the math-y stuff. I found it difficult to understand the math behind an algorithm without fully grasping the intuition. So I would gravitate towards sources that completely broke down the algorithm into simple steps and made it digestible to someone who never even heard the word Algorithm before. Okay, that is a blatant exaggeration, but you know what I mean. So that's what I'm attempting to do now.

master, prediction, residual, (16 more...)

#artificialintelligence

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.43)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (0.43)

LightAutoML: AutoML Solution for a Large Financial Services Ecosystem

Vakhrushev, Anton, Ryzhkov, Alexander, Savchenko, Maxim, Simakov, Dmitry, Damdinov, Rinchin, Tuzhilin, Alexander

arXiv.org Machine Learning, Sep-3-2021

[...] satisfying the set of idiosyncratic requirements that this ecosystem has for AutoML solutions. Our framework was piloted and deployed in numerous applications and performed at the level of the experienced data scientists while building high-quality ML models significantly faster than these data scientists. We also compare the performance of our system with various general-purpose open source AutoML solutions and show that it performs better for most of the ecosystem and OpenML problems. We also present the [...]

In particular, our ecosystem has the following set of requirements:
- AutoML system should be able to work with different types of data that is collected from hundreds of different information systems and often changes more rapidly than these systems can be fully documented using metadata and painstakingly preprocessed by data scientists for the ML tasks using ETL tools.

dataset, ecosystem, lightautoml, (13 more...)

arXiv.org Machine Learning

2109.01528

Genre: Research Report (1.00)

Industry:

Information Technology > Software (0.93)
Information Technology > Services (0.68)
Banking & Finance > Financial Services (0.66)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
(2 more...)

RF-LighGBM: A probabilistic ensemble way to predict customer repurchase behaviour in community e-commerce

Yang, Liping, Niu, Xiaxia, Wu, Jun

arXiv.org Artificial Intelligence, Sep-2-2021

It is reported that the number of online payment users in China has reached 854 million; with the emergence of community e-commerce platforms, the trend of integration of e-commerce and social applications is increasingly intense. Community e-commerce is not yet a mature and sound form of comprehensive e-commerce: it has fewer categories and low brand value. Effectively retaining community users and fully exploring customer value has become an important challenge for community e-commerce operators. Given the above problems, this paper uses a data-driven method to study the prediction of community e-commerce customers' repurchase behaviour. The main research contents include: 1. Given the complex problem of feature engineering, the classic RFM model from the field of customer relationship management is improved, and an improved model with five indicators is proposed to describe the characteristics of customer buying behaviour. 2. In view of the imbalance of machine learning training samples, a training sample balancing method using SMOTE-ENN is proposed. The experimental results show that the machine learning model can be trained more effectively on balanced samples. 3. Aiming at the complexity of the parameter adjustment process, an automatic hyperparameter optimization method based on the TPE method is proposed. Compared with other methods, the model's prediction performance is improved, and the training time is reduced by more than 450%. 4. Aiming at the weak prediction ability of a single model, the soft voting based RF-LightGBM model is proposed. The experimental results show that the RF-LightGBM model proposed in this paper can effectively predict customer repurchase behaviour, with an F1 value of 0.859, which is better than the single models and previous research results.
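Soft voting, as mentioned in the abstract above, combines models by averaging their predicted class probabilities rather than counting hard votes. The sketch below shows only that averaging step; the probability values are hypothetical stand-ins for a random forest and a LightGBM model, and the function name and optional weights are assumptions, not the paper's implementation.

```python
def soft_vote(prob_vectors, weights=None):
    """Average per-model class probabilities; return (predicted_class, averaged_probs)."""
    if weights is None:
        weights = [1.0] * len(prob_vectors)
    total = sum(weights)
    n_classes = len(prob_vectors[0])
    averaged = [
        sum(w * probs[c] for w, probs in zip(weights, prob_vectors)) / total
        for c in range(n_classes)
    ]
    # The predicted class is the one with the highest averaged probability.
    predicted = max(range(n_classes), key=lambda c: averaged[c])
    return predicted, averaged


# Hypothetical [P(no repurchase), P(repurchase)] outputs for one customer.
rf_probs = [0.6, 0.4]    # the random forest leans towards "no repurchase"
lgbm_probs = [0.2, 0.8]  # the boosted model is confident in "repurchase"

label, probs = soft_vote([rf_probs, lgbm_probs])
print(label, probs)
```

With hard voting these two models would tie at one vote each. Soft voting resolves the case in favour of the more confident model, which is one reason it is often preferred when the base models output reasonably calibrated probabilities.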

algorithm, customer, repurchase behaviour, (13 more...)

arXiv.org Artificial Intelligence

2109.00724

Country: Asia > China > Beijing > Beijing (0.05)

Genre: Research Report > New Finding (0.54)

Industry: Information Technology > Services > e-Commerce Services (1.00)

Technology:

Information Technology > e-Commerce (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.88)
(2 more...)

When are Deep Networks really better than Random Forests at small sample sizes?

Xu, Haoyin, Ainsworth, Michael, Peng, Yu-Chung, Kusmanov, Madi, Panda, Sambit, Vogelstein, Joshua T.

arXiv.org Artificial Intelligence, Aug-31-2021

Random forests (RF) and deep networks (DN) are two of the most popular machine learning methods in the current scientific literature and yield differing levels of performance on different data modalities. We wish to further explore and establish the conditions and domains in which each approach excels, particularly in the context of sample size and feature dimension. To address these issues, we tested the performance of these approaches across tabular, image, and audio settings using varying model parameters and architectures. Our focus is on datasets with at most 10,000 samples, which represent a large fraction of scientific and biomedical datasets. In general, we found RF to excel at tabular and structured data (image and audio) with small sample sizes, whereas DN performed better on structured data with larger sample sizes. Although we plan to continue updating this technical report in the coming months, we believe the current preliminary results may be of interest to others.

forest and network, sample size, (12 more...)

arXiv.org Artificial Intelligence

2108.13637

Country:

North America > United States > New York > New York County > New York City (0.14)
North America > Canada > Ontario > Toronto (0.14)
Europe > Sweden > Stockholm > Stockholm (0.04)
Europe > Estonia > Harju County > Tallinn (0.04)

Genre: Research Report (0.93)

Industry: Health & Medicine > Therapeutic Area (0.67)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (0.85)

Ovarian Cancer Prediction from Ovarian Cysts Based on TVUS Using Machine Learning Algorithms

Akter, Laboni, Akhter, Nasrin

arXiv.org Machine Learning, Aug-30-2021

Ovarian cancer (OC) is a type of female reproductive malignancy that can be found among young girls and, mostly, women in their fertile or reproductive years. A small number of cysts are dangerous and may cause cancer. Early prediction is therefore very important; among the different types of screening used for detection, this work relies on Transvaginal Ultrasonography (TVUS) screening. In this research, we used a real dataset called PLCO with TVUS screening and three machine learning (ML) techniques, namely Random Forest, KNN, and XGBoost, with three target variables. The best performance obtained from these algorithms in terms of accuracy, recall, F1 score, and precision was approximately 99.50%, 99.50%, 99.49%, and 99.50%, respectively. AUC scores of 99.87%, 98.97%, and 99.88% were observed for the Random Forest, KNN, and XGBoost algorithms, respectively. This approach helps physicians and suspected patients identify ovarian risks early on, reducing ovarian malignancy-related complications and deaths.

artificial intelligence, machine learning, ovarian cancer, (14 more...)

arXiv.org Machine Learning

2108.13387

Country:

North America > United States (0.14)
Asia > Bangladesh (0.04)

Genre: Research Report (0.65)

Industry: Health & Medicine > Therapeutic Area > Oncology > Ovarian Cancer (0.70)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (0.98)

Survival Prediction of Heart Failure Patients using Stacked Ensemble Machine Learning Algorithm

Zaman, S. M Mehedi, Qureshi, Wasay Mahmood, Raihan, Md. Mohsin Sarker, Monjur, Ocean, Shams, Abdullah Bin

arXiv.org Machine Learning, Aug-30-2021

Cardiovascular disease, especially heart failure, is one of the major health hazards of our time and is a leading cause of death worldwide. Advancement in data mining techniques using machine learning (ML) models is paving the way for promising prediction approaches. Data mining is the process of converting massive volumes of raw data created by healthcare institutions into meaningful information that can aid in making predictions and crucial decisions. The key aim of this study is to collect various follow-up data from patients who have had heart failure, analyze those data, and use several ML models to predict the survival possibility of cardiovascular patients. Due to the class imbalance in the dataset, the Synthetic Minority Oversampling Technique (SMOTE) has been implemented. Two unsupervised models (K-Means and Fuzzy C-Means clustering) and three supervised classifiers (Random Forest, XGBoost and Decision Tree) have been used in our study. After thorough investigation, our results demonstrate a superior performance of the supervised ML algorithms over unsupervised models. Moreover, we designed and propose a supervised stacked ensemble learning model that can achieve an accuracy, precision, recall and F1 score of 99.98%. Our study shows that only certain attributes collected from the patients are imperative to successfully predict the survival possibility after heart failure, using supervised ML algorithms.
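The core idea of SMOTE, which the abstract above uses to handle class imbalance, is to create synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours. The sketch below is a simplified, standard-library illustration of that idea under assumed names and toy data. It is not the study's code and omits details of production implementations.

```python
import math
import random


def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points between minority samples and their neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbours of the chosen point, excluding itself.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        u = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(b + u * (o - b) for b, o in zip(base, other)))
    return synthetic


# Hypothetical minority-class samples with two features each.
minority = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (3.0, 3.0)]
new_points = smote(minority, n_new=4)
print(new_points)
```

Each synthetic point lies on the line segment between two real minority samples, so the new data stays inside the region the minority class already occupies. Oversampling of this kind is normally applied to the training split only, so that evaluation data remains untouched.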

algorithm, artificial intelligence, machine learning, (13 more...)

arXiv.org Machine Learning

2108.13367

Country:

North America > Canada > Ontario > Toronto (0.14)
Asia > Bangladesh > Dhaka Division > Dhaka District > Dhaka (0.04)
North America > United States > California > Orange County > Irvine (0.04)

Genre: Research Report > New Finding (1.00)

Industry: Health & Medicine > Therapeutic Area > Cardiology/Vascular Diseases (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (1.00)

Identification of the Resting Position Based on EGG, ECG, Respiration Rate and SpO2 Using Stacked Ensemble Learning

Raihan, Md. Mohsin Sarker, Islam, Muhammad Muinul, Fairoz, Fariha, Shams, Abdullah Bin

arXiv.org Machine Learning, Aug-26-2021

Rest is essential for high-level physiological and psychological performance. It is also necessary for the muscles to repair, rebuild, and strengthen. There is a significant correlation between the quality of rest and the resting posture. Therefore, identification of the resting position is of paramount importance for maintaining a healthy life. Resting postures can be classified into four basic categories: lying on the back (supine), facing the left or right side, and the free-fall position. The latter position is already unequivocally considered an unhealthy posture by researchers and hence can be eliminated. In this paper, we analyzed the other three resting positions based on data collected from the physiological parameters: Electrogastrogram (EGG), Electrocardiogram (ECG), Respiration Rate, Heart Rate, and Oxygen Saturation (SpO2). Based on these parameters, the resting position is classified using a hybrid stacked ensemble machine learning model designed using the Decision Tree, Random Forest, and XGBoost algorithms. Our study demonstrates a 100% accurate prediction of the resting position using the hybrid model. The proposed method of identifying the resting position based on physiological parameters has the potential to be integrated into wearable devices. This is a low-cost, highly accurate and autonomous technique to monitor body posture while maintaining user privacy by eliminating the use of the RGB camera conventionally used to conduct polysomnography (sleep monitoring) or resting position studies.

accuracy, algorithm, resting position, (14 more...)

arXiv.org Machine Learning

2108.11604

Country:

North America > Canada > Ontario > Toronto (0.14)
North America > United States (0.04)
Asia > China > Shanghai > Shanghai (0.04)
Asia > Bangladesh > Dhaka Division > Dhaka District > Dhaka (0.04)

Genre: Research Report > New Finding (0.69)

Industry:

Health & Medicine > Therapeutic Area > Cardiology/Vascular Diseases (0.89)
Health & Medicine > Diagnostic Medicine > Vital Signs (0.62)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (0.95)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (0.76)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.71)

A guide to XGBoost hyperparameters

#artificialintelligence, Aug-24-2021, 03:55:45 GMT

What is the one machine learning algorithm -- if you ask -- that consistently gives superior performance in regression and classification? It is arguably the most powerful algorithm and is increasingly being used in all industries and in all problem domains -- from customer analytics and sales prediction to fraud detection and credit approval and more. It is also a winning algorithm in many machine learning competitions. In fact, XGBoost was used in 17 out of 29 data science competitions on the Kaggle platform. Not just in businesses and competitions, XGBoost has been used in scientific experiments such as the Large Hadron Collider (the Higgs Boson machine learning challenge). A key to its performance is its hyperparameters.

algorithm, hyperparameter, xgboost, (12 more...)

#artificialintelligence

Industry: Law Enforcement & Public Safety > Fraud (0.36)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.98)

Predicting Census Survey Response Rates via Interpretable Nonparametric Additive Models with Structured Interactions

Ibrahim, Shibal, Mazumder, Rahul, Radchenko, Peter, Ben-David, Emanuel

arXiv.org Machine Learning, Aug-24-2021

Accurate and interpretable prediction of survey response rates is important from an operational standpoint. The US Census Bureau's well-known ROAM application uses principled statistical models trained on the US Census Planning Database data to identify hard-to-survey areas. An earlier crowdsourcing competition revealed that an ensemble of regression trees led to the best performance in predicting survey response rates; however, the corresponding models could not be adopted for the intended application due to limited interpretability. In this paper, we present new interpretable statistical methods to predict, with high accuracy, response rates in surveys. We study sparse nonparametric additive models with pairwise interactions via $\ell_0$-regularization, as well as hierarchically structured variants that provide enhanced interpretability. Despite strong methodological underpinnings, such models can be computationally challenging -- we present new scalable algorithms for learning these models. We also establish novel non-asymptotic error bounds for the proposed estimators. Experiments based on the US Census Planning Database demonstrate that our methods lead to high-quality predictive models that permit actionable interpretability for different segments of the population. Interestingly, our methods provide significant gains in interpretability without losing in predictive performance to state-of-the-art black-box machine learning methods based on gradient boosting and feedforward neural networks. Our code implementation in Python is available at https://github.com/ShibalIbrahim/Additive-Models-with-Structured-Interactions.
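For orientation, a sparse additive model with pairwise interactions under $\ell_0$-style regularization is often written in a generic form like the one below. The notation ($f_j$, $f_{jk}$, $\lambda_1$, $\lambda_2$) and the squared-error loss are assumptions chosen for illustration and may differ from the paper's own formulation.

```latex
% Prediction: main effects plus pairwise interaction effects
\hat{y}(x) = \sum_{j=1}^{p} f_j(x_j) + \sum_{1 \le j < k \le p} f_{jk}(x_j, x_k)

% Fitting: squared error plus penalties that count the nonzero components
\min_{\{f_j\},\,\{f_{jk}\}} \;
  \sum_{i=1}^{n} \bigl( y_i - \hat{y}(x_i) \bigr)^2
  + \lambda_1 \sum_{j=1}^{p} \mathbf{1}\bigl[ f_j \not\equiv 0 \bigr]
  + \lambda_2 \sum_{1 \le j < k \le p} \mathbf{1}\bigl[ f_{jk} \not\equiv 0 \bigr]
```

In this reading, the penalties count how many main effects and interaction effects are active, which is what keeps the fitted model small enough to inspect. A hierarchical variant would typically add a constraint such as allowing $f_{jk}$ to be nonzero only when the corresponding main effects are also selected.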

ac 13 17, interaction, interaction effect, (16 more...)

arXiv.org Machine Learning

2108.11328

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
North America > United States > New York > New York County > New York City (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
(7 more...)

Genre: Questionnaire & Opinion Survey (1.00)

Industry: Government > Regional Government > North America Government > United States Government (1.00)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Search (0.92)
(2 more...)
