Goto

Collaborating Authors

 Ensemble Learning


Explainable expected goal models for performance analysis in football analytics

arXiv.org Artificial Intelligence

The expected goal provides a more representative measure of the team and player performance which also suit the low-scoring nature of football instead of score in modern football. The score of a match involves randomness and often may not represent the performance of the teams and players, therefore it has been popular to use the alternative statistics in recent years such as shots on target, ball possessions, and drills. To measure the probability of a shot being a goal by the expected goal, several features are used to train an expected goal model which is based on the event and tracking football data. The selection of these features, the size and date of the data, and the model which are used as the parameters that may affect the performance of the model. Using black-box machine learning models for increasing the predictive performance of the model decreases its interpretability that causes the loss of information that can be gathered from the model. This paper proposes an accurate expected goal model trained consisting of 315,430 shots from seven seasons between 2014-15 and 2020-21 of the top-five European football leagues. Moreover, this model is explained by using explainable artificial intelligence tool to obtain an explainable expected goal model for evaluating a team or player performance. To the best of our knowledge, this is the first paper that demonstrates a practical application of an explainable artificial intelligence tool aggregated profiles to explain a group of observations on an accurate expected goal model for monitoring the team and player performance. Moreover, these methods can be generalized to other sports branches.


Federated XGBoost on Sample-Wise Non-IID Data

arXiv.org Artificial Intelligence

Federated Learning (FL) is a paradigm for jointly training machine learning algorithms in a decentralized manner which allows for parties to communicate with an aggregator to create and train a model, without exposing the underlying raw data distribution of the local parties involved in the training process. Most research in FL has been focused on Neural Network-based approaches, however Tree-Based methods, such as XGBoost, have been underexplored in Federated Learning due to the challenges in overcoming the iterative and additive characteristics of the algorithm. Decision tree-based models, in particular XGBoost, can handle non-IID data, which is significant for algorithms used in Federated Learning frameworks since the underlying characteristics of the data are decentralized and have risks of being non-IID by nature. In this paper, we focus on investigating the effects of how Federated XGBoost is impacted by non-IID distributions by performing experiments on various sample size-based data skew scenarios and how these models perform under various non-IID scenarios. We conduct a set of extensive experiments across multiple different datasets and different data skew partitions. Our experimental results demonstrate that despite the various partition ratios, the performance of the models stayed consistent and performed close to or equally well against models that were trained in a centralized manner.


Improving debris flow evacuation alerts in Taiwan using machine learning

arXiv.org Artificial Intelligence

Taiwan has the highest susceptibility to and fatalities from debris flows worldwide. The existing debris flow warning system in Taiwan, which uses a time-weighted measure of rainfall, leads to alerts when the measure exceeds a predefined threshold. However, this system generates many false alarms and misses a substantial fraction of the actual debris flows. Towards improving this system, we implemented five machine learning models that input historical rainfall data and predict whether a debris flow will occur within a selected time. We found that a random forest model performed the best among the five models and outperformed the existing system in Taiwan. Furthermore, we identified the rainfall trajectories strongly related to debris flow occurrences and explored trade-offs between the risks of missing debris flows versus frequent false alerts. These results suggest the potential for machine learning models trained on hourly rainfall data alone to save lives while reducing false alerts.


Nethra Sambamoorthi, PhD on LinkedIn: #datasciences #machinelearning #artificialintelligence

#artificialintelligence

Productive Analytics managers are investigative analytics managers. Top 5 models one should try are (1) XGboost, (2) L1, L2, glmnet regressions, (3) random forest, (4) SVM, (5) regularized NN, if time permits. Usually XGBoost and glmnet (hybrid of L1 and L2), and SVM should be good for most of the purposes that also provides interpretable models, while NN is good for voice/txt/images/videos. - Ask to verify "Be very clear as to which one is "1" or " " - is it target or reference in binary target variable? Also whether confusion matrix takes care of that correctly and hence whether the "key" ratios of row percentages or column percentages . For various internal programming setup in a software, different software or the modeling specific API deals may deal with this differently. Top models will use the leaky data rather than be good general model of the underlying problem. When you are a company providing your data. Reversing an anonymization and obfuscation can result in a privacy breach that you did not expect. It is a problem when you are developing your own predictive models. You may be creating overly optimistic models that are practically useless and cannot be used in production. " The solution is to use proper verification of actual distributional values of validation data set and the inherent data distributions of the features - For more, see: https://lnkd.in/gyWhs2sg


Predicting spatial distribution of Palmer Drought Severity Index

arXiv.org Artificial Intelligence

The probability of a drought for a particular region is crucial when making decisions related to agriculture. Forecasting this probability is critical for management and challenging at the same time. The prediction model should consider multiple factors with complex relationships across the region of interest and neighbouring regions. We approach this problem by presenting an end-to-end solution based on a spatio-temporal neural network. The model predicts the Palmer Drought Severity Index (PDSI) for subregions of interest. Predictions by climate models provide an additional source of knowledge of the model leading to more accurate drought predictions. Our model has better accuracy than baseline Gradient boosting solutions, as the $R^2$ score for it is $0.90$ compared to $0.85$ for Gradient boosting. Specific attention is on the range of applicability of the model. We examine various regions across the globe to validate them under different conditions. We complement the results with an analysis of how future climate changes for different scenarios affect the PDSI and how our model can help to make better decisions and more sustainable economics.


The Infinitesimal Jackknife and Combinations of Models

arXiv.org Artificial Intelligence

The Infinitesimal Jackknife is a general method for estimating variances of parametric models, and more recently also for some ensemble methods. In this paper we extend the Infinitesimal Jackknife to estimate the covariance between any two models. This can be used to quantify uncertainty for combinations of models, or to construct test statistics for comparing different models or ensembles of models fitted using the same training dataset. Specific examples in this paper use boosted combinations of models like random forests and M-estimators. We also investigate its application on neural networks and ensembles of XGBoost models. We illustrate the efficacy of variance estimates through extensive simulations and its application to the Beijing Housing data, and demonstrate the theoretical consistency of the Infinitesimal Jackknife covariance estimate.


Compound virtual screening by learning-to-rank with gradient boosting decision tree and enrichment-based cumulative gain

arXiv.org Artificial Intelligence

Learning-to-rank, a machine learning technique widely used in information retrieval, has recently been applied to the problem of ligand-based virtual screening, to accelerate the early stages of new drug development. Ranking prediction models learn based on ordinal relationships, making them suitable for integrating assay data from various environments. Existing studies of rank prediction in compound screening have generally used a learning-to-rank method called RankSVM. However, they have not been compared with or validated against the gradient boosting decision tree (GBDT)-based learning-to-rank methods that have gained popularity recently. Furthermore, although the ranking metric called Normalized Discounted Cumulative Gain (NDCG) is widely used in information retrieval, it only determines whether the predictions are better than those of other models. In other words, NDCG is incapable of recognizing when a prediction model produces worse than random results. Nevertheless, NDCG is still used in the performance evaluation of compound screening using learning-to-rank. This study used the GBDT model with ranking loss functions, called lambdarank and lambdaloss, for ligand-based virtual screening; results were compared with existing RankSVM methods and GBDT models using regression. We also proposed a new ranking metric, Normalized Enrichment Discounted Cumulative Gain (NEDCG), which aims to properly evaluate the goodness of ranking predictions. Results showed that the GBDT model with learning-to-rank outperformed existing regression methods using GBDT and RankSVM on diverse datasets. Moreover, NEDCG showed that predictions by regression were comparable to random predictions in multi-assay, multi-family datasets, demonstrating its usefulness for a more direct assessment of compound screening performance.


Machine Learning Models Evaluation and Feature Importance Analysis on NPL Dataset

arXiv.org Artificial Intelligence

Predicting the probability of non-performing loans for individuals has a vital and beneficial role for banks to decrease credit risk and make the right decisions before giving the loan. The trend to make these decisions are based on credit study and in accordance with generally accepted standards, loan payment history, and demographic data of the clients. In this work, we evaluate how different Machine learning models such as Random Forest, Decision tree, KNN, SVM, and XGBoost perform on the dataset provided by a private bank in Ethiopia. Further, motivated by this evaluation we explore different feature selection methods to state the important features for the bank. Our findings show that XGBoost achieves the highest F1 score on the KMeans SMOTE over-sampled data. We also found that the most important features are the age of the applicant, years of employment, and total income of the applicant rather than collateral-related features in evaluating credit risk. Work done when the authors were a research intern at Chapa. Equally contributed to this work.


Extreme Gradient Boosting for Yield Estimation compared with Deep Learning Approaches

arXiv.org Artificial Intelligence

Accurate prediction of crop yield before harvest is of great importance for crop logistics, market planning, and food distribution around the world. Yield prediction requires monitoring of phenological and climatic characteristics over extended time periods to model the complex relations involved in crop development. Remote sensing satellite images provided by various satellites circumnavigating the world are a cheap and reliable way to obtain data for yield prediction. The field of yield prediction is currently dominated by Deep Learning approaches. While the accuracies reached with those approaches are promising, the needed amounts of data and the ``black-box'' nature can restrict the application of Deep Learning methods. The limitations can be overcome by proposing a pipeline to process remote sensing images into feature-based representations that allow the employment of Extreme Gradient Boosting (XGBoost) for yield prediction. A comparative evaluation of soybean yield prediction within the United States shows promising prediction accuracies compared to state-of-the-art yield prediction systems based on Deep Learning. Feature importances expose the near-infrared spectrum of light as an important feature within our models. The reported results hint at the capabilities of XGBoost for yield prediction and encourage future experiments with XGBoost for yield prediction on other crops in regions all around the world.


Credit card fraud detection - Classifier selection strategy

arXiv.org Artificial Intelligence

Machine learning has opened up new tools for financial fraud detection. Using a sample of annotated transactions, a machine learning classification algorithm learns to detect frauds. With growing credit card transaction volumes and rising fraud percentages there is growing interest in finding appropriate machine learning classifiers for detection. However, fraud data sets are diverse and exhibit inconsistent characteristics. As a result, a model effective on a given data set is not guaranteed to perform on another. Further, the possibility of temporal drift in data patterns and characteristics over time is high. Additionally, fraud data has massive and varying imbalance. In this work, we evaluate sampling methods as a viable pre-processing mechanism to handle imbalance and propose a data-driven classifier selection strategy for characteristic highly imbalanced fraud detection data sets. The model derived based on our selection strategy surpasses peer models, whilst working in more realistic conditions, establishing the effectiveness of the strategy.