 Ensemble Learning


Scalable Feature Selection for (Multitask) Gradient Boosted Trees

arXiv.org Machine Learning

Gradient Boosted Decision Trees (GBDTs) are widely used for building ranking and relevance models in search and recommendation. Considerations such as latency and interpretability dictate the use of as few features as possible to train these models. Feature selection in GBDT models typically involves heuristically ranking the features by importance and selecting the top few, or by performing a full backward feature elimination routine. On-the-fly feature selection methods proposed previously scale suboptimally with the number of features, which can be daunting in high dimensional settings. We develop a scalable forward feature selection variant for GBDT, via a novel group testing procedure that works well in high dimensions, and enjoys favorable theoretical performance and computational guarantees. We show via extensive experiments on both public and proprietary datasets that the proposed method offers significant speedups in training time, while being as competitive as existing GBDT methods in terms of model performance metrics. We also extend the method to the multitask setting, allowing the practitioner to select common features across tasks, as well as selecting task-specific features.
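For orientation, the sketch below shows plain greedy forward selection, the baseline idea that the abstract's method builds on. It is not the paper's group testing procedure. The `forward_select` helper, the `toy_score` function, and the `USEFULNESS` values are hypothetical stand-ins; in practice the score would be a validation metric from a trained GBDT.

```python
from typing import Callable, List, Sequence


def forward_select(
    n_features: int,
    score: Callable[[Sequence[int]], float],
    max_features: int,
    min_gain: float = 1e-9,
) -> List[int]:
    """Greedily add the feature that most improves `score`, until no feature helps."""
    selected: List[int] = []
    best = score(selected)
    while len(selected) < max_features:
        candidates = [j for j in range(n_features) if j not in selected]
        if not candidates:
            break
        # Score every remaining feature added to the current set.
        gains = [(score(selected + [j]) - best, j) for j in candidates]
        gain, j_best = max(gains)
        if gain <= min_gain:
            break  # no candidate improves the score enough
        selected.append(j_best)
        best += gain
    return selected


# Hypothetical per-feature usefulness; a real score would come from a fitted model.
USEFULNESS = [0.0, 0.5, 0.1, 0.3]


def toy_score(features: Sequence[int]) -> float:
    # Reward useful features and charge a small cost per feature used.
    return sum(USEFULNESS[j] for j in features) - 0.05 * len(features)


chosen = forward_select(n_features=4, score=toy_score, max_features=3)
print(chosen)
```

Each round of this baseline scores every remaining feature once, so the number of score evaluations grows with the feature count. That is the kind of cost that becomes expensive in high dimensional settings.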


XGBoost Regression: Explain It To Me Like I'm 10

#artificialintelligence

When I was just starting on my quest to understand Machine Learning algorithms, I would get overwhelmed with all the math-y stuff. I found it difficult to understand the math behind an algorithm without fully grasping the intuition. So I would gravitate towards sources that completely broke down the algorithm into simple steps and made it digestible to someone who never even heard the word Algorithm before. Okay, that is a blatant exaggeration, but you know what I mean. So that's what I'm attempting to do now.


LightAutoML: AutoML Solution for a Large Financial Services Ecosystem

arXiv.org Machine Learning

... satisfying the set of idiosyncratic requirements that this ecosystem has for AutoML solutions. Our framework was piloted and deployed in numerous applications and performed at the level of the experienced data scientists while building high-quality ML models significantly faster than these data scientists. We also compare the performance of our system with various general-purpose open source AutoML solutions and show that it performs better for most of the ecosystem and OpenML problems. We also present the ...

In particular, our ecosystem has the following set of requirements:
- AutoML system should be able to work with different types of data collected from hundreds of different information systems and often changes more rapidly than these systems can be fully documented using metadata and painstakingly preprocessed by data scientists for the ML tasks using ETL tools.


RF-LighGBM: A probabilistic ensemble way to predict customer repurchase behaviour in community e-commerce

arXiv.org Artificial Intelligence

It is reported that the number of online payment users in China has reached 854 million; with the emergence of community e-commerce platforms, the integration of e-commerce and social applications is intensifying. Community e-commerce is not yet a mature, comprehensive form of e-commerce; it has fewer categories and low brand value. Effectively retaining community users and fully exploring customer value have become an important challenge for community e-commerce operators. Given the above problems, this paper uses a data-driven method to study the prediction of community e-commerce customers' repurchase behaviour. The main research contents include: 1. Given the complex problem of feature engineering, the classic RFM model from the field of customer relationship management is improved, and an improved model with five indicators is proposed to describe the characteristics of customer buying behaviour. 2. In view of the imbalance of machine learning training samples, a training-sample balancing method using SMOTE-ENN is proposed. The experimental results show that the machine learning model can be trained more effectively on balanced samples. 3. Aiming at the complexity of the parameter adjustment process, an automatic hyperparameter optimization method based on the TPE method is proposed. Compared with other methods, the model's prediction performance is improved, and the training time is reduced by more than 450%. 4. Aiming at the weak prediction ability of a single model, the soft-voting-based RF-LightGBM model is proposed. The experimental results show that the RF-LightGBM model proposed in this paper can effectively predict customer repurchase behaviour, and the F1 value is 0.859, which is better than the single model and previous research results.
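Soft voting, as named in item 4 above, averages the class probabilities predicted by each model and picks the class with the highest average. The sketch below is a minimal pure-Python illustration. The `soft_vote` helper and the probability values are hypothetical and are not taken from the paper.

```python
from typing import List, Optional, Sequence


def soft_vote(
    model_probs: Sequence[Sequence[Sequence[float]]],
    weights: Optional[Sequence[float]] = None,
) -> List[int]:
    """model_probs[m][i][c] is the probability model m assigns to class c for sample i."""
    if weights is None:
        weights = [1.0] * len(model_probs)
    total = sum(weights)
    labels = []
    for i in range(len(model_probs[0])):
        n_classes = len(model_probs[0][i])
        # Weighted average of the per-model probabilities for each class.
        avg = [
            sum(w * probs[i][c] for w, probs in zip(weights, model_probs)) / total
            for c in range(n_classes)
        ]
        labels.append(max(range(n_classes), key=avg.__getitem__))
    return labels


# Hypothetical probabilities from a random forest and a gradient boosting model.
rf_probs = [[0.6, 0.4], [0.3, 0.7], [0.55, 0.45]]
gbm_probs = [[0.7, 0.3], [0.4, 0.6], [0.2, 0.8]]
predictions = soft_vote([rf_probs, gbm_probs])
print(predictions)
```

For the third sample the two models disagree. The averaged probabilities favour class 1 because the second model is much more confident, which is the difference between soft voting and a simple majority vote.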


When are Deep Networks really better than Random Forests at small sample sizes?

arXiv.org Artificial Intelligence

Random forests (RF) and deep networks (DN) are two of the most popular machine learning methods in the current scientific literature and yield differing levels of performance on different data modalities. We wish to further explore and establish the conditions and domains in which each approach excels, particularly in the context of sample size and feature dimension. To address these issues, we tested the performance of these approaches across tabular, image, and audio settings using varying model parameters and architectures. Our focus is on datasets with at most 10,000 samples, which represent a large fraction of scientific and biomedical datasets. In general, we found RF to excel at tabular and structured data (image and audio) with small sample sizes, whereas DN performed better on structured data with larger sample sizes. Although we plan to continue updating this technical report in the coming months, we believe the current preliminary results may be of interest to others.


Ovarian Cancer Prediction from Ovarian Cysts Based on TVUS Using Machine Learning Algorithms

arXiv.org Machine Learning

Ovarian Cancer (OC) is a type of female reproductive malignancy that can be found among young girls and mostly among women of fertile or reproductive age. A small number of cysts are dangerous and may cause cancer. It is therefore very important to predict it, and different types of screening are used for this detection, including Transvaginal Ultrasonography (TVUS) screening. In this research, we employed a real dataset called PLCO with TVUS screening and three machine learning (ML) techniques, namely Random Forest, KNN, and XGBoost, with three target variables. We obtained the best performance from these algorithms in terms of accuracy, recall, F1 score and precision, with approximate values of 99.50%, 99.50%, 99.49% and 99.50%, respectively. AUC scores of 99.87%, 98.97% and 99.88% are observed for the Random Forest, KNN and XGBoost algorithms, respectively. This approach helps physicians and suspects identify ovarian risks early on, reducing ovarian malignancy-related complications and deaths.
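The four metrics reported above can be computed from confusion-matrix counts. The sketch below covers the binary case only and uses made-up labels. The `binary_metrics` helper is hypothetical and does not reproduce the paper's evaluation, which involves three target variables.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, precision, recall and F1 for labels in {0, 1}, with 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Made-up labels: 3 positives and 5 negatives, with one miss and one false alarm.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

On imbalanced data, accuracy alone can look high while recall on the rare class stays low, which is one reason abstracts like this one report all four values.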


Survival Prediction of Heart Failure Patients using Stacked Ensemble Machine Learning Algorithm

arXiv.org Machine Learning

Cardiovascular disease, especially heart failure, is one of the major health hazards of our time and is a leading cause of death worldwide. Advances in data mining techniques using machine learning (ML) models are paving the way for promising prediction approaches. Data mining is the process of converting massive volumes of raw data created by healthcare institutions into meaningful information that can aid in making predictions and crucial decisions. The key aim of this study is to collect various follow-up data from patients who have had heart failure, analyze those data, and use several ML models to predict the survival possibility of cardiovascular patients. Due to the imbalance of the classes in the dataset, the Synthetic Minority Oversampling Technique (SMOTE) has been implemented. Two unsupervised models (K-Means and Fuzzy C-Means clustering) and three supervised classifiers (Random Forest, XGBoost and Decision Tree) have been used in our study. After thorough investigation, our results demonstrate a superior performance of the supervised ML algorithms over unsupervised models. Moreover, we designed and propose a supervised stacked ensemble learning model that can achieve an accuracy, precision, recall and F1 score of 99.98%. Our study shows that only certain attributes collected from the patients are imperative to successfully predict the possibility of survival after heart failure using supervised ML algorithms.
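The core step of SMOTE is to create a synthetic minority sample by interpolating between a minority point and one of its nearest minority neighbours. The sketch below illustrates only that step in pure Python. The `smote` helper, its parameters, and the data points are hypothetical, and a real study would more likely use an existing library implementation.

```python
import math
import random
from typing import List


def smote(minority: List[List[float]], n_new: int, k: int = 2, seed: int = 0) -> List[List[float]]:
    """Generate n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # The k nearest minority neighbours of x, excluding x itself.
        others = [p for p in minority if p is not x]
        neighbours = sorted(others, key=lambda p: math.dist(x, p))[:k]
        nb = rng.choice(neighbours)
        u = rng.random()  # position along the segment from x to nb
        synthetic.append([a + u * (b - a) for a, b in zip(x, nb)])
    return synthetic


# Made-up minority-class points in two dimensions.
minority = [[1.0, 1.0], [1.5, 1.2], [2.0, 0.8], [1.2, 1.9]]
new_points = smote(minority, n_new=5)
print(new_points)
```

Because every synthetic point lies on a segment between two existing minority points, the new samples stay inside the region already occupied by the minority class.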


Identification of the Resting Position Based on EGG, ECG, Respiration Rate and SpO2 Using Stacked Ensemble Learning

arXiv.org Machine Learning

Rest is essential for a high level of physiological and psychological performance. It is also necessary for the muscles to repair, rebuild, and strengthen. There is a significant correlation between the quality of rest and the resting posture. Therefore, identification of the resting position is of paramount importance to maintain a healthy life. Resting postures can be classified into four basic categories: lying on the back (supine), lying on the left or right side, and the free-fall position. The latter position is already unequivocally considered an unhealthy posture by researchers and hence can be eliminated. In this paper, we analyzed the other three states of resting position based on the data collected from the physiological parameters: Electrogastrogram (EGG), Electrocardiogram (ECG), Respiration Rate, Heart Rate, and Oxygen Saturation (SpO2). Based on these parameters, the resting position is classified using a hybrid stacked ensemble machine learning model designed using the Decision Tree, Random Forest, and XGBoost algorithms. Our study demonstrates a 100% accurate prediction of the resting position using the hybrid model. The proposed method of identifying the resting position based on physiological parameters has the potential to be integrated into wearable devices. This is a low-cost, highly accurate and autonomous technique to monitor body posture while maintaining user privacy by eliminating the RGB camera conventionally used in polysomnography (sleep monitoring) or resting position studies.


A guide to XGBoost hyperparameters

#artificialintelligence

What is the one machine learning algorithm, if you ask, that consistently gives superior performance in regression and classification? XGBoost is arguably the most powerful algorithm and is increasingly being used in all industries and in all problem domains, from customer analytics and sales prediction to fraud detection and credit approval. It is also a winning algorithm in many machine learning competitions. In fact, XGBoost was used in 17 of the 29 winning solutions published on Kaggle's blog in 2015. Not just in businesses and competitions, XGBoost has been used in scientific experiments such as the Large Hadron Collider (the Higgs Boson machine learning challenge). A key to its performance is its hyperparameters.
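To make "hyperparameters" concrete, the sketch below lists a few commonly tuned settings as a plain Python dictionary. The parameter names follow XGBoost's scikit-learn style interface, but the values are illustrative assumptions, not recommendations from the guide.

```python
# Illustrative values only; suitable settings depend on the dataset.
params = {
    "n_estimators": 300,      # number of boosting rounds (trees)
    "learning_rate": 0.05,    # shrinkage applied to each tree's contribution
    "max_depth": 4,           # maximum depth of each tree
    "min_child_weight": 1,    # minimum sum of instance weight needed in a child
    "subsample": 0.8,         # fraction of rows sampled per tree
    "colsample_bytree": 0.8,  # fraction of features sampled per tree
    "gamma": 0.0,             # minimum loss reduction required to make a split
    "reg_alpha": 0.0,         # L1 regularization on leaf weights
    "reg_lambda": 1.0,        # L2 regularization on leaf weights
}

for name, value in params.items():
    print(f"{name} = {value}")
```

If the `xgboost` package is installed, a dictionary like this could be passed as keyword arguments, for example `xgboost.XGBRegressor(**params)`. A lower learning rate usually needs more boosting rounds, so those two settings tend to be tuned together.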


Predicting Census Survey Response Rates via Interpretable Nonparametric Additive Models with Structured Interactions

arXiv.org Machine Learning

Accurate and interpretable prediction of survey response rates is important from an operational standpoint. The US Census Bureau's well-known ROAM application uses principled statistical models trained on the US Census Planning Database data to identify hard-to-survey areas. An earlier crowdsourcing competition revealed that an ensemble of regression trees led to the best performance in predicting survey response rates; however, the corresponding models could not be adopted for the intended application due to limited interpretability. In this paper, we present new interpretable statistical methods to predict, with high accuracy, response rates in surveys. We study sparse nonparametric additive models with pairwise interactions via $\ell_0$-regularization, as well as hierarchically structured variants that provide enhanced interpretability. Despite strong methodological underpinnings, such models can be computationally challenging -- we present new scalable algorithms for learning these models. We also establish novel non-asymptotic error bounds for the proposed estimators. Experiments based on the US Census Planning Database demonstrate that our methods lead to high-quality predictive models that permit actionable interpretability for different segments of the population. Interestingly, our methods provide significant gains in interpretability without losing in predictive performance to state-of-the-art black-box machine learning methods based on gradient boosting and feedforward neural networks. Our code implementation in python is available at https://github.com/ShibalIbrahim/Additive-Models-with-Structured-Interactions.