 Ensemble Learning


Sarus just released DP-XGBoost

#artificialintelligence

XGBoost is one of the most popular gradient boosted tree libraries and is featured in many winning solutions in Kaggle competitions. It's written in C++ and usable from many languages: Python, R, Java, Julia, or Scala. It can run on major distributed environments (Kubernetes, Apache Spark, or Dask) to handle datasets with billions of examples. XGBoost is often used to train models on sensitive data. Since it comes with no privacy guarantee, one can show that personal information may remain in the model weights.


Diagnosing Web Data of ICTs to Provide Focused Assistance in Agricultural Adoptions

arXiv.org Artificial Intelligence

The past decade has witnessed a rapid increase in technology ownership across rural areas of India, signifying the potential for ICT initiatives to empower rural households. In our work, we focus on the web infrastructure of one such ICT initiative, Digital Green, which started in 2008. Following a participatory approach for content production, Digital Green disseminates instructional agricultural videos to smallholder farmers via human mediators to improve the adoption of farming practices. Their web-based data tracker, CoCo, captures data related to these processes, storing the attendance and adoption logs of over 2.3 million farmers across three continents and twelve countries. Using this data, we model the components of the Digital Green ecosystem involving the past attendance-adoption behaviours of farmers, the content of the videos screened to them and their demographic features across five states in India. We use statistical tests to identify the factors that distinguish farmers with higher adoption rates, in order to understand why they adopt more than others. Our research finds that farmers with higher adoption rates adopt videos of shorter duration and belong to smaller villages. The co-attendance and co-adoption networks of farmers indicate that they greatly benefit from past adopters of a video from their village and group when it comes to adopting practices from the same video. Following our analysis, we model the adoption of practices from a video as a prediction problem to identify and assist farmers who might face challenges in adoption in each of the five states. We experiment with different model architectures and achieve macro-F1 scores ranging from 79% to 89% using a Random Forest classifier. Finally, we measure the importance of different features using SHAP values and provide implications for improving the adoption rates of nearly a million farmers across five states in India.
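For readers unfamiliar with the macro-F1 metric quoted in the abstract above, the following is a minimal sketch of how it is usually computed: the F1 score is calculated separately for each class and the per-class scores are averaged without weighting by class size. The function name `macro_f1` and the toy labels are hypothetical and are not taken from the paper.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[t] += 1  # true class t was missed
    scores = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never appears.
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical labels: 1 = adopted the practice, 0 = did not adopt.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
print(round(macro_f1(y_true, y_pred), 3))
```

Because every class contributes equally, macro-F1 penalizes a model that does well only on the majority class, which is presumably why it suits an imbalanced adoption-prediction task.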


Introduction to Boosted Trees

#artificialintelligence

Welcome to my new article series: Boosting algorithms in machine learning! This is Part 1 of the series. Here, I'll give you a short introduction to boosting, its objective, some key definitions and a list of boosting algorithms that we intend to cover in the next posts. You should be familiar with elementary tree-based machine learning models such as decision trees and random forests. In addition to that, it is recommended to have good knowledge of Python and its Scikit-learn library.


DP-XGBoost: Private Machine Learning at Scale

arXiv.org Artificial Intelligence

The big-data revolution announced ten years ago does not seem to have fully happened at the expected scale. One of the main obstacles to this has been the lack of data circulation. And one of the many reasons people and organizations did not share as much as expected is the privacy risk associated with data sharing operations. There have been many works on practical systems to compute statistical queries with Differential Privacy (DP). There have also been practical implementations of systems to train Neural Networks with DP, but relatively little effort has been dedicated to designing scalable classical Machine Learning (ML) models providing DP guarantees. In this work we describe and implement a DP fork of a battle-tested ML model: XGBoost. Our approach beats previous attempts at the task by a large margin, in terms of accuracy achieved for a given privacy budget. It is also the only DP implementation of boosted trees that scales to big data and can run in distributed environments such as Kubernetes, Dask or Apache Spark.
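To make the idea of a DP statistical query and a privacy budget concrete, here is a minimal sketch of the standard Laplace mechanism applied to a counting query. It is not the mechanism used inside DP-XGBoost, which the abstract does not detail; the names `laplace_noise` and `private_count` and the toy records are hypothetical, and Python's floating-point `random` module is used only for illustration, not as a production-grade noise source.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two i.i.d. exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon, rng):
    """Epsilon-DP count: true count plus Laplace noise of scale sensitivity/epsilon."""
    true_count = sum(1 for record in records if predicate(record))
    sensitivity = 1.0  # adding or removing one record changes a count by at most 1
    return true_count + laplace_noise(sensitivity / epsilon, rng)


# Hypothetical records: ages of individuals in a sensitive dataset.
ages = [23, 35, 41, 52, 67, 29, 38]
rng = random.Random(0)
noisy = private_count(ages, lambda age: age >= 40, epsilon=1.0, rng=rng)
print(noisy)  # the true count is 3; the released value is perturbed
```

A smaller `epsilon` (a tighter privacy budget) means a larger noise scale, which is the accuracy-versus-privacy trade-off the abstract refers to.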


BetaBoosting

#artificialintelligence

At this point, we all know of XGBoost due to the massive success it has had in numerous Data Science competitions held on platforms like Kaggle. Along with its success, we have seen several variations such as CatBoost and LightGBM. All of these implementations are based on the Gradient Boosting algorithm developed by Friedman¹, which involves iteratively building an ensemble of weak learners (usually decision trees) where each subsequent learner is trained on the previous learner's errors. Let's take a look at some general pseudo-code for the algorithm from Elements of Statistical Learning²: However, this is not complete! A core mechanism that allows boosting to work is a shrinkage parameter, commonly called the 'learning rate', that penalizes each learner at each boosting round.
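The loop described above can be sketched in a few lines. This is a minimal illustration, assuming squared-error loss (so the "errors" each learner is trained on are simply the residuals) and one-dimensional decision stumps as the weak learners; it is not the pseudo-code from Elements of Statistical Learning, and the names `fit_stump`, `gradient_boost` and `mse` are hypothetical.

```python
def fit_stump(xs, residuals):
    """Least-squares regression stump (a single split) on 1-D inputs."""
    best = None
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2.0
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def gradient_boost(xs, ys, n_rounds=50, learning_rate=0.1):
    """Fit each stump to the current residuals and add it, scaled by the learning rate."""
    base = sum(ys) / len(ys)  # initial constant prediction
    learners = []
    preds = [base] * len(xs)
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        learners.append(stump)
        # Shrinkage: only a fraction of each learner's output is added.
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
    return lambda x: base + learning_rate * sum(s(x) for s in learners)


def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Toy data: y = x^2 on ten points.
xs = [float(i) for i in range(10)]
ys = [x * x for x in xs]
for rounds in (0, 1, 50):
    model = gradient_boost(xs, ys, n_rounds=rounds)
    print(rounds, mse(model, xs, ys))
```

With a small learning rate, each round removes only part of the remaining error, so more rounds are needed, but no single weak learner dominates the ensemble.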


How and why to build your own gradient boosted-tree package

#artificialintelligence

In order to make accurate and fast travel-time predictions, Lyft built a gradient boosted tree (GBT) package from the ground up. It is slower to train than off-the-shelf packages, but can be customized to treat space and time more efficiently and yield less volatile predictions. Machine learning runs at the core of what we do at Lyft. Examples include predicting travel time between two locations, modeling the probability of a ride being canceled, forecasting supply and demand, and many more. These models enable us to match riders and drivers more efficiently, incentivize drivers to be where they can get more rides, and improve the ride experience.


SecureBoost+ : A High Performance Gradient Boosting Tree Framework for Large Scale Vertical Federated Learning

arXiv.org Artificial Intelligence

Gradient boosting decision tree (GBDT) is a widely used ensemble algorithm in industry. Its vertical federated learning version, SecureBoost, is one of the most popular algorithms used in cross-silo privacy-preserving modeling. As the area of privacy computation has thrived in recent years, demand for large-scale and high-performance federated learning has grown dramatically in real-world applications. In this paper, to fulfill these requirements, we propose SecureBoost+, a novel framework that improves on the prior work SecureBoost. SecureBoost+ integrates several ciphertext calculation optimizations and engineering optimizations. The experimental results demonstrate that SecureBoost+ has significant performance improvements on large and high-dimensional data sets compared to SecureBoost. It makes effective and efficient large-scale vertical federated learning possible.


Power Transformer Fault Diagnosis with Intrinsic Time-scale Decomposition and XGBoost Classifier

arXiv.org Machine Learning

An intrinsic time-scale decomposition (ITD) based method for power transformer fault diagnosis is proposed. Dissolved gas analysis (DGA) parameters are ranked according to their skewness, and then ITD-based feature extraction is performed. An optimal set of PRC features is determined by an XGBoost classifier. For classification, an XGBoost classifier is applied to the optimal PRC feature set. The proposed method's classification performance is studied using publicly available DGA data of 376 power transformers. The proposed method achieves more than 95% accuracy along with high sensitivity and F1-score, better than conventional methods and some recent machine learning-based fault diagnosis approaches. Moreover, it gives a better Cohen's kappa and F1-score than the recently introduced EMD-based hierarchical technique for fault diagnosis in power transformers.


Regression with Missing Data, a Comparison Study of Techniques Based on Random Forests

arXiv.org Machine Learning

Random forests and recursive trees are widely used in applied statistics and computer science. The popularity of recursive trees relies on several factors: their easy interpretability, the fact that they can be used for both regression and classification tasks, the small number of hyper-parameters to be tuned and, finally, their non-parametric nature, which allows them to infer arbitrarily complex relations between the input and the output space. A random forest combines several randomized trees, improving the prediction accuracy at the cost of a slight loss in interpretability. This technique is easily parallelizable, which has made it one of the most popular tools for handling high-dimensional data sets. It has been successfully applied to various practical problems, including chemoinformatics, ecology, 3D object recognition, bioinformatics and econometrics. Biau and Scornet (2016) present a detailed list of applications as well as a review of random forests. In the present work we focus on the ability of random forests to deal with missing values.


E-Commerce Dispute Resolution Prediction

arXiv.org Artificial Intelligence

E-Commerce marketplaces support millions of daily transactions, and some disagreements between buyers and sellers are unavoidable. Resolving disputes in an accurate, fast, and fair manner is of great importance for maintaining a trustworthy platform. Simple cases can be automated, but intricate cases are not sufficiently addressed by hard-coded rules, and therefore most disputes are currently resolved by people. In this work we take a first step towards automatically assisting human agents in dispute resolution at scale. We construct a large dataset of disputes from the eBay online marketplace, and identify several interesting behavioral and linguistic patterns. We then train classifiers to predict dispute outcomes with high accuracy. We explore the model and the dataset, reporting interesting correlations, important features, and insights.