Ensemble Learning
Context-aware Retail Product Recommendation with Regularized Gradient Boosting
Das, Sourya Dipta, Basak, Ayan
In the FARFETCH Fashion Recommendation challenge, the participants needed to predict the order in which various products would be shown to a user in a recommendation impression. The data was provided in two phases: a validation phase and a test phase. The validation phase had a labelled training set that contained a binary column indicating whether a product had been clicked or not. The dataset comprises over 5,000,000 recommendation events, 450,000 products and 230,000 unique users. It represents real, unbiased, but anonymised, interactions of actual users of the FARFETCH platform. The final evaluation was done according to the performance in the second phase. A total of 167 participants took part in the challenge, and we secured the 6th rank during the final evaluation with an MRR of 0.4658 on the test set. We have designed a unique context-aware system that takes the similarity of a product to the user context into account to rank products more effectively. Post evaluation, we have been able to fine-tune our approach to reach an MRR of 0.4784 on the test set, which would have placed us at the 3rd position.
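The MRR (mean reciprocal rank) figures quoted above can be read with the help of a minimal sketch. It assumes each impression is given as a list of 0/1 click labels already sorted by the model's ranking, and that an impression without any click contributes zero; the function name and the example data are hypothetical, and the challenge's exact scoring rules may differ.

```python
from typing import Sequence


def mean_reciprocal_rank(impressions: Sequence[Sequence[int]]) -> float:
    """Average of 1/position of the first clicked product per impression.

    Each impression is a list of 0/1 click labels in ranked order
    (assumption: position 1 is the product shown first).
    """
    total = 0.0
    for labels in impressions:
        reciprocal_rank = 0.0  # stays 0 if nothing was clicked
        for position, clicked in enumerate(labels, start=1):
            if clicked:
                reciprocal_rank = 1.0 / position
                break
        total += reciprocal_rank
    return total / len(impressions)


# Three made-up impressions, ordered by a hypothetical model's scores.
example = [[0, 1, 0], [1, 0, 0], [0, 0, 0, 1]]
mrr = mean_reciprocal_rank(example)
print(round(mrr, 4))
```

A higher MRR means the clicked product tends to appear nearer the top of the ranked impression.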
WildWood: a new Random Forest algorithm
Gaïffas, Stéphane, Merad, Ibrahim, Yu, Yiyang
We introduce WildWood (WW), a new ensemble algorithm for supervised learning of Random Forest (RF) type. While standard RF algorithms use bootstrap out-of-bag samples to compute out-of-bag scores, WW uses these samples to produce improved predictions given by an aggregation of the predictions of all possible subtrees of each fully grown tree in the forest. This is achieved by aggregation with exponential weights computed over out-of-bag samples, which are computed exactly and very efficiently thanks to an algorithm called context tree weighting. This improvement, combined with a histogram strategy to accelerate split finding, makes WW fast and competitive compared with other well-established ensemble methods, such as standard RF and extreme gradient boosting algorithms.
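The idea of aggregation with exponential weights can be illustrated with a small sketch. It assumes a handful of candidate predictors (standing in for subtrees) whose cumulative out-of-bag losses are already known, and it enumerates them explicitly; it does not implement context tree weighting, which is what lets WW handle all subtrees efficiently. The learning rate `eta`, the function names and all numbers are hypothetical.

```python
import math
from typing import List


def exponential_weights(losses: List[float], eta: float = 1.0) -> List[float]:
    """Weights proportional to exp(-eta * loss), normalized to sum to 1."""
    smallest = min(losses)  # shift by the minimum for numerical stability
    raw = [math.exp(-eta * (loss - smallest)) for loss in losses]
    normalizer = sum(raw)
    return [value / normalizer for value in raw]


def aggregate(predictions: List[float], weights: List[float]) -> float:
    """Weighted average of the candidate predictions."""
    return sum(p * w for p, w in zip(predictions, weights))


# Made-up out-of-bag losses and predictions for three candidate subtrees.
oob_losses = [0.9, 0.4, 0.4]
subtree_predictions = [0.2, 0.7, 0.9]

weights = exponential_weights(oob_losses, eta=2.0)
combined = aggregate(subtree_predictions, weights)
print(weights, combined)
```

Candidates with a lower out-of-bag loss receive a larger share of the final prediction, and candidates with equal loss are weighted equally.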
Fake News Detection Using Machine Learning Ensemble Methods
The advent of the World Wide Web and the rapid adoption of social media platforms (such as Facebook and Twitter) paved the way for information dissemination on a scale never before witnessed in human history. With the current usage of social media platforms, consumers are creating and sharing more information than ever before, some of which is misleading and has no relevance to reality. Automated classification of a text article as misinformation or disinformation is a challenging task. Even an expert in a particular domain has to explore multiple aspects before giving a verdict on the truthfulness of an article. In this work, we propose to use a machine learning ensemble approach for automated classification of news articles. Our study explores different textual properties that can be used to distinguish fake content from real. Using those properties, we train a combination of different machine learning algorithms using various ensemble methods and evaluate their performance on 4 real-world datasets. Experimental evaluation confirms the superior performance of our proposed ensemble learner approach in comparison to individual learners. Besides other use cases, news outlets benefitted from the widespread use of social media platforms by providing updated news in near real time to their subscribers. The news media evolved from newspapers, tabloids, and magazines to digital forms such as online news platforms, blogs, social media feeds, and other digital media formats [1]. It became easier for consumers to acquire the latest news at their fingertips.
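One common way to combine individual learners is hard (majority) voting. The sketch below assumes three classifiers have already produced a label per article; it is a generic illustration and not necessarily one of the ensemble methods evaluated in the study. The model outputs are made up.

```python
from collections import Counter
from typing import List


def majority_vote(predictions_per_model: List[List[str]]) -> List[str]:
    """Return, for each article, the label predicted by most models.

    Assumes every model labelled the same articles in the same order and
    that the number of models is odd, so binary votes cannot tie.
    """
    n_articles = len(predictions_per_model[0])
    combined = []
    for i in range(n_articles):
        votes = Counter(model[i] for model in predictions_per_model)
        combined.append(votes.most_common(1)[0][0])
    return combined


# Made-up predictions of three hypothetical classifiers on four articles.
model_a = ["fake", "real", "fake", "real"]
model_b = ["fake", "fake", "real", "real"]
model_c = ["real", "fake", "fake", "real"]

ensemble = majority_vote([model_a, model_b, model_c])
print(ensemble)
```

Each individual model disagrees with the other two on one article, while the vote settles every article by a two-to-one or unanimous majority.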
Generalized XGBoost Method
The XGBoost method has achieved excellent predictive performance in many fields and has exhibited many advantages; it is consequently considered especially suitable for the statistical analysis of big data. However, this method is limited because its loss function must be convex. For many scenario-specific problems, such as non-life insurance pricing, the distribution of predictor variables is often heavy-tailed, so the optimal prediction performance may not be obtained by setting convex loss functions. At the same time, it is important to estimate the probability distribution of predictor variables. When the chosen parametric probability distribution contains more than two parameters, it may be necessary to model multiple parameters to obtain better prediction performance. Therefore, a more generalized regularized tree boosting method is required, one whose loss function is not limited to convex functions and which models tree boosting for multiple parameters, so that it adapts to the most common parametric probability distributions.
Feature Importance in Gradient Boosting Trees with Cross-Validation Feature Selection
Adler, Afek Ilay, Painsky, Amichai
Gradient Boosting Machines (GBM) are among the go-to algorithms on tabular data, producing state-of-the-art results in many prediction tasks. Despite its popularity, the GBM framework suffers from a fundamental flaw in its base learners. Specifically, most implementations utilize decision trees that are typically biased towards categorical variables with large cardinalities. The effect of this bias was extensively studied over the years, mostly in terms of predictive performance. In this work, we extend the scope and study the effect of biased base learners on GBM feature importance (FI) measures. We show that although these implementations demonstrate highly competitive predictive performance, they still, surprisingly, suffer from bias in FI. By utilizing cross-validated (CV) unbiased base learners, we fix this flaw at a relatively low computational cost. We demonstrate the suggested framework in a variety of synthetic and real-world setups, showing a significant improvement in all GBM FI measures while maintaining relatively the same level of prediction accuracy.
Secondary control activation analysed and predicted with explainable AI
Kruse, Johannes, Schäfer, Benjamin, Witthaut, Dirk
The transition to a renewable energy system poses challenges for power grid operation and stability. Secondary control is key in restoring the power system to its reference following a disturbance. Underestimating the necessary control capacity may require emergency measures, such as load shedding. Hence, a solid understanding of the emerging risks and the driving factors of control is needed. In this contribution, we establish an explainable machine learning model for the activation of secondary control power in Germany. Training gradient boosted trees, we obtain an accurate description of control activation. Using SHapley Additive exPlanations (SHAP) values, we investigate the dependency between control activation and external features such as the generation mix, forecasting errors, and electricity market data. Thereby, our analysis reveals drivers that lead to high reserve requirements in the German power system. Our transparent approach, utilizing open data and making machine learning models interpretable, opens new scientific discovery avenues.
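SHAP values are built on Shapley values from cooperative game theory. The sketch below computes exact Shapley values by brute force for a toy value function with two features. It is only an illustration of the underlying definition: practical SHAP tooling for tree models relies on much faster algorithms, and the feature names and payoffs here are hypothetical, not taken from the study.

```python
from itertools import permutations
from typing import Callable, Dict, FrozenSet, List


def shapley_values(
    value_fn: Callable[[FrozenSet[str]], float], features: List[str]
) -> Dict[str, float]:
    """Average marginal contribution of each feature over all orderings."""
    contributions = {feature: 0.0 for feature in features}
    orderings = list(permutations(features))
    for order in orderings:
        coalition: set = set()
        for feature in order:
            before = value_fn(frozenset(coalition))
            coalition.add(feature)
            after = value_fn(frozenset(coalition))
            contributions[feature] += after - before
    return {f: total / len(orderings) for f, total in contributions.items()}


def toy_value(coalition: FrozenSet[str]) -> float:
    """Made-up model output given which features are 'known'."""
    value = 0.0
    if "forecast_error" in coalition:
        value += 3.0
    if "load_ramp" in coalition:
        value += 1.0
    if {"forecast_error", "load_ramp"} <= coalition:
        value += 2.0  # interaction term, split evenly by Shapley values
    return value


phi = shapley_values(toy_value, ["forecast_error", "load_ramp"])
print(phi)
```

The attributions sum to the difference between the value with all features and the value with none, which is the property that makes them usable as an additive explanation of a single prediction.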
Automated Security Assessment for the Internet of Things
Duan, Xuanyu, Ge, Mengmeng, Le, Triet H. M., Ullah, Faheem, Gao, Shang, Lu, Xuequan, Babar, M. Ali
Internet of Things (IoT) based applications face an increasing number of potential security risks, which need to be systematically assessed and addressed. Expert-based manual assessment of IoT security is a predominant approach, which is usually inefficient. To address this problem, we propose an automated security assessment framework for IoT networks. Our framework first leverages machine learning and natural language processing to analyze vulnerability descriptions for predicting vulnerability metrics. The predicted metrics are then input into a two-layered graphical security model, which consists of an attack graph at the upper layer to present the network connectivity and an attack tree for each node in the network at the bottom layer to depict the vulnerability information. This security model automatically assesses the security of the IoT network by capturing potential attack paths. We evaluate the viability of our approach using a proof-of-concept smart building system model which contains a variety of real-world IoT devices and potential vulnerabilities. Our evaluation of the proposed framework demonstrates its effectiveness in terms of automatically predicting the vulnerability metrics of new vulnerabilities with more than 90% accuracy, on average, and identifying the most vulnerable attack paths within an IoT network. The produced assessment results can serve as a guideline for cybersecurity professionals to take further actions and mitigate risks in a timely manner.
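The upper layer of such a model can be sketched as a directed graph whose simple paths from an entry point to a target are the potential attack paths. The example below is a minimal illustration under strong assumptions: each node's attack tree is collapsed into a single hypothetical exploit probability, and a path's risk is taken as the product of those probabilities. The device names, probabilities and scoring rule are all made up and are not the framework's actual metrics.

```python
from typing import Dict, List, Optional

# Hypothetical smart-building network: edges mean "can reach".
graph: Dict[str, List[str]] = {
    "attacker": ["camera", "thermostat"],
    "camera": ["gateway"],
    "thermostat": ["gateway"],
    "gateway": ["controller"],
}

# Assumed per-node exploit probabilities (stand-ins for predicted metrics).
exploit_prob: Dict[str, float] = {
    "camera": 0.8,
    "thermostat": 0.5,
    "gateway": 0.6,
    "controller": 0.9,
}


def attack_paths(
    source: str, target: str, path: Optional[List[str]] = None
) -> List[List[str]]:
    """Enumerate all simple paths from source to target by depth-first search."""
    path = (path or []) + [source]
    if source == target:
        return [path]
    found: List[List[str]] = []
    for neighbour in graph.get(source, []):
        if neighbour not in path:  # avoid revisiting nodes (no cycles)
            found.extend(attack_paths(neighbour, target, path))
    return found


def path_probability(path: List[str]) -> float:
    """Product of exploit probabilities of every node after the attacker."""
    probability = 1.0
    for node in path[1:]:
        probability *= exploit_prob[node]
    return probability


paths = attack_paths("attacker", "controller")
riskiest = max(paths, key=path_probability)
print(riskiest, path_probability(riskiest))
```

Ranking paths this way gives a simple notion of the most vulnerable route through the network, which is the kind of output a security professional could act on first.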
Kaggle Competition -- Finding Donors for a Charity with an AUC of 0.94
Comparing Random Forest, Gradient Boosting, and XGBoost to select the best model to predict potential donors for a Charity. This project will employ 3 supervised algorithms, including Random Forest, Gradient Boosting, and XGBoost, to accurately model individuals' income using the 1994 U.S. Census data. I will then choose the best candidate algorithm from preliminary results and further optimize this algorithm to best model the data. My goal with this implementation is to construct a model that accurately predicts whether an individual makes more than 50,000 dollars. This sort of task can arise in a non-profit setting, where organizations survive on donations.
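The AUC used to pick the best candidate can be computed directly from its definition: the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below uses only the standard library; the labels, the scores and the model names `model_a`, `model_b` and `model_c` are made-up placeholders, not results from the project.

```python
from typing import Dict, List


def roc_auc(labels: List[int], scores: List[float]) -> float:
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# 1 = income above 50,000 dollars, 0 = otherwise (toy labels).
labels = [1, 0, 1, 0, 1]

# Made-up predicted probabilities from three hypothetical candidate models.
candidate_scores: Dict[str, List[float]] = {
    "model_a": [0.9, 0.4, 0.6, 0.7, 0.8],
    "model_b": [0.9, 0.2, 0.8, 0.3, 0.7],
    "model_c": [0.6, 0.5, 0.4, 0.7, 0.9],
}

aucs = {name: roc_auc(labels, scores) for name, scores in candidate_scores.items()}
best = max(aucs, key=aucs.get)
print(aucs, best)
```

In practice the scores would come from held-out data for each fitted model, and the candidate with the highest AUC would be the one carried forward for further tuning.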