Goto

Collaborating Authors

 Regression


Betting the system: Using lineups to predict football scores

arXiv.org Artificial Intelligence

This paper aims to reduce randomness in football by analysing the role of lineups in final scores using machine learning prediction models we have developed. Football clubs invest millions of dollars on lineups and knowing how individual statistics translate to better outcomes can optimise investments. Moreover, sports betting is growing exponentially and being able to predict the future is profitable and desirable. We use machine learning models and historical player data from English Premier League (2020-2022) to predict scores and to understand how individual performance can improve the outcome of a match. We compared different prediction techniques to maximise the possibility of finding useful models. We created heuristic and machine learning models predicting football scores to compare different techniques. We used different sets of features and shown goalkeepers stats are more important than attackers stats to predict goals scored. We applied a broad evaluation process to assess the efficacy of the models in real world applications. We managed to predict correctly all relegated teams after forecast 100 consecutive matches. We show that Support Vector Regression outperformed other techniques predicting final scores and that lineups do not improve predictions. Finally, our model was profitable (42% return) when emulating a betting system using real world odds data.


Feature Transformation for Multiple Linear Regression in Python

#artificialintelligence

Data processing and transformation is an iterative process and in a way, it can never be'perfect'. Because as we gain more understanding on the dataset, such as the inner relationships between target variable and features, or the business context, we think of new ways to deal with them. Recently I started working on media mix models and some predictive models utilizing multiple linear regression. In this post, I will introduce the thought process and different ways to deal with variables for modeling purpose. I will use King County house price data set (a modified version for more fun) as an example.


Data-Driven Estimation of Heterogeneous Treatment Effects

arXiv.org Artificial Intelligence

Estimating the effect of a treatment on an outcome is a fundamental problem in many fields such as medicine [33, 34, 61], public policy [20] and more [2, 37]. For example, doctors might be interested in how a treatment, such as a drug, affects the recovery of patients [18], economists may be interested in how a job training program affects employment prospectives [35], and advertisers may want to model the average effect an advertisement has on sales [36]. However, individuals may react differently to the treatment of interest, and knowing only the average treatment effect in the population is insufficient. For example, a drug may have adverse effects on some individuals but not others [61], or a person's education and background may affect how much they benefit from job training [35, 50]. Measuring the extent to which different individuals react differently to treatment is known as heterogeneous treatment effect (HTE) estimation. Traditionally, HTE estimation has been done through subgroup analysis [9, 19]. However, this can lead to cherry-picking since the practitioner is the one who identifies subgroups for estimating effects. Recently, there has been more focus on data-driven estimation of heterogeneous treatment effects by letting the data identify which features are important for treatment effect estimation using machine learning techniques [28, 39, 61, 69]. A straightforward approach is to create interaction terms between all covariates and use them in a regression [6].


Computational Assessment of Hyperpartisanship in News Titles

arXiv.org Artificial Intelligence

We first adopt a human-guided machine learning framework to develop a new dataset for hyperpartisan news title detection with 2,200 manually labeled and 1.8 million machine-labeled titles that were posted from 2014 to the present by nine representative media organizations across three media bias groups - Left, Central, and Right in an active learning manner. The fine-tuned transformer-based language model achieves an overall accuracy of 0.84 and an F1 score of 0.78 on an external validation set. Next, we conduct a computational analysis to quantify the extent and dynamics of partisanship in news titles. While some aspects are as expected, our study reveals new or nuanced differences between the three media groups. We find that overall the Right media tends to use proportionally more hyperpartisan titles. Roughly around the 2016 Presidential Election, the proportions of hyperpartisan titles increased in all media bias groups where the relative increase in the proportion of hyperpartisan titles of the Left media was the most. We identify three major topics including foreign issues, political systems, and societal issues that are suggestive of hyperpartisanship in news titles using logistic regression models and the Shapley values. Through an analysis of the topic distribution, we find that societal issues gradually receive more attention from all media groups. We further apply a lexicon-based language analysis tool to the titles of each topic and quantify the linguistic distance between any pairs of the three media groups. Three distinct patterns are discovered. The Left media is linguistically more different from Central and Right in terms of foreign issues. The linguistic distance between the three media groups becomes smaller over recent years. In addition, a seasonal pattern where linguistic difference is associated with elections is observed for societal issues.


ML Basics (Part-1): REGRESSION -- A Gateway Method to Machine Learning

#artificialintelligence

There has been growing interest in the introductory posts on the elementary topics in Machine Learning. So, I am writing on such topics in the coming posts starting from this one. This article is mostly self-contained however, it requires basic understanding of linear algebra, and calculus. Regression is the process of estimating the relationship of a dependent variable (Y) with one or more independent variables (Xi). It is used primarily for finding patterns in a given set of data samples and forecasting the value of a variable while given a set of values of other variables.


Kernel-based off-policy estimation without overlap: Instance optimality beyond semiparametric efficiency

arXiv.org Machine Learning

We study optimal procedures for estimating a linear functional based on observational data. In many problems of this kind, a widely used assumption is strict overlap, i.e., uniform boundedness of the importance ratio, which measures how well the observational data covers the directions of interest. When it is violated, the classical semi-parametric efficiency bound can easily become infinite, so that the instance-optimal risk depends on the function class used to model the regression function. For any convex and symmetric function class $\mathcal{F}$, we derive a non-asymptotic local minimax bound on the mean-squared error in estimating a broad class of linear functionals. This lower bound refines the classical semi-parametric one, and makes connections to moduli of continuity in functional estimation. When $\mathcal{F}$ is a reproducing kernel Hilbert space, we prove that this lower bound can be achieved up to a constant factor by analyzing a computationally simple regression estimator. We apply our general results to various families of examples, thereby uncovering a spectrum of rates that interpolate between the classical theories of semi-parametric efficiency (with $\sqrt{n}$-consistency) and the slower minimax rates associated with non-parametric function estimation.


Doubly Robust Counterfactual Classification

arXiv.org Artificial Intelligence

We study counterfactual classification as a new tool for decision-making under hypothetical (contrary to fact) scenarios. We propose a doubly-robust nonparametric estimator for a general counterfactual classifier, where we can incorporate flexible constraints by casting the classification problem as a nonlinear mathematical program involving counterfactuals. We go on to analyze the rates of convergence of the estimator and provide a closed-form expression for its asymptotic distribution. Our analysis shows that the proposed estimator is robust against nuisance model misspecification, and can attain fast $\sqrt{n}$ rates with tractable inference even when using nonparametric machine learning approaches. We study the empirical performance of our methods by simulation and apply them for recidivism risk prediction.


Interpretable and Scalable Graphical Models for Complex Spatio-temporal Processes

arXiv.org Artificial Intelligence

This thesis focuses on data that has complex spatio-temporal structure and on probabilistic graphical models that learn the structure in an interpretable and scalable manner. We target two research areas of interest: Gaussian graphical models for tensor-variate data and summarization of complex time-varying texts using topic models. This work advances the state-of-the-art in several directions. First, it introduces a new class of tensor-variate Gaussian graphical models via the Sylvester tensor equation. Second, it develops an optimization technique based on a fast-converging proximal alternating linearized minimization method, which scales tensor-variate Gaussian graphical model estimations to modern big-data settings. Third, it connects Kronecker-structured (inverse) covariance models with spatio-temporal partial differential equations (PDEs) and introduces a new framework for ensemble Kalman filtering that is capable of tracking chaotic physical systems. Fourth, it proposes a modular and interpretable framework for unsupervised and weakly-supervised probabilistic topic modeling of time-varying data that combines generative statistical models with computational geometric methods. Throughout, practical applications of the methodology are considered using real datasets. This includes brain-connectivity analysis using EEG data, space weather forecasting using solar imaging data, longitudinal analysis of public opinions using Twitter data, and mining of mental health related issues using TalkLife data. We show in each case that the graphical modeling framework introduced here leads to improved interpretability, accuracy, and scalability.


A Coreset Learning Reality Check

arXiv.org Artificial Intelligence

Subsampling algorithms are a natural approach to reduce data size before fitting models on massive datasets. In recent years, several works have proposed methods for subsampling rows from a data matrix while maintaining relevant information for classification. While these works are supported by theory and limited experiments, to date there has not been a comprehensive evaluation of these methods. In our work, we directly compare multiple methods for logistic regression drawn from the coreset and optimal subsampling literature and discover inconsistencies in their effectiveness. In many cases, methods do not outperform simple uniform subsampling.


Novelty Detection in Network Traffic: Using Survival Analysis for Feature Identification

arXiv.org Artificial Intelligence

Intrusion Detection Systems are an important component of many organizations' cyber defense and resiliency strategies. However, one downside of these systems is their reliance on known attack signatures for detection of malicious network events. When it comes to unknown attack types and zero-day exploits, modern Intrusion Detection Systems often fall short. In this paper, we introduce an unconventional approach to identifying network traffic features that influence novelty detection based on survival analysis techniques. Specifically, we combine several Cox proportional hazards models and implement Kaplan-Meier estimates to predict the probability that a classifier identifies novelty after the injection of an unknown network attack at any given time. The proposed model is successful at pinpointing PSH Flag Count, ACK Flag Count, URG Flag Count, and Down/Up Ratio as the main features to impact novelty detection via Random Forest, Bayesian Ridge, and Linear Support Vector Regression classifiers.