Cross-validation fold


OpenFPL: An open-source forecasting method rivaling state-of-the-art Fantasy Premier League services

Groos, Daniel

arXiv.org Artificial Intelligence

Fantasy Premier League engages the football community in selecting the Premier League players who will perform best from gameweek to gameweek. Access to accurate performance forecasts gives participants an edge over competitors by guiding expectations about player outcomes and reducing uncertainty in squad selection. However, high-accuracy forecasts are currently limited to commercial services whose inner workings are undisclosed and that rely on proprietary data. This paper aims to democratize access to highly accurate forecasts of player performance by presenting OpenFPL, an open-source Fantasy Premier League forecasting method developed exclusively from public data. Comprising position-specific ensemble models optimized on Fantasy Premier League and Understat data from four previous seasons (2020-21 to 2023-24), OpenFPL achieves accuracy comparable to a leading commercial service when tested prospectively on data from the 2024-25 season. OpenFPL also surpasses the commercial benchmark for high-return players (> 2 points), which are most influential for rank gains. These findings hold across one-, two-, and three-gameweek forecast horizons, supporting long-term planning of transfers and strategies while also informing final-day decisions.
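The position-specific ensemble idea can be illustrated with a toy sketch. Everything here is hypothetical (the base "models", features, and coefficients are invented for illustration and are not OpenFPL's actual components); the point is only that each position gets its own set of base regressors whose point forecasts are averaged.

```python
import numpy as np

def ensemble_forecast(models, features):
    """Average point forecasts from a list of fitted base models."""
    preds = np.array([m(features) for m in models])
    return preds.mean(axis=0)

# Toy base "models" for midfielders: callables mapping a per-player
# feature vector to expected fantasy points (coefficients are made up).
mid_models = [
    lambda x: 0.5 * x[0] + 2.0,   # e.g. a goals-based regressor
    lambda x: 0.3 * x[1] + 1.5,   # e.g. an xG-based regressor
]

features = np.array([4.0, 6.0])   # illustrative per-player features
forecast = ensemble_forecast(mid_models, features)
print(forecast)                   # roughly 3.65, the mean of 4.0 and 3.3
```

In practice one such ensemble would be fitted per position (GK, DEF, MID, FWD), since the point-scoring rules differ by position.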


The Harmonic Structure of Information Contours

Tsipidi, Eleftheria, Kiegeland, Samuel, Nowak, Franz, Xu, Tianyang, Wilcox, Ethan, Warstadt, Alex, Cotterell, Ryan, Giulianelli, Mario

arXiv.org Artificial Intelligence

The uniform information density (UID) hypothesis proposes that speakers aim to distribute information evenly throughout a text, balancing production effort and listener comprehension difficulty. However, language typically does not maintain a strictly uniform information rate; instead, it fluctuates around a global average. These fluctuations are often explained by factors such as syntactic constraints, stylistic choices, or audience design. In this work, we explore an alternative perspective: that these fluctuations may be influenced by an implicit linguistic pressure towards periodicity, where the information rate oscillates at regular intervals, potentially across multiple frequencies simultaneously. We apply harmonic regression and introduce a novel extension called time scaling to detect and test for such periodicity in information contours. Analyzing texts in English, Spanish, German, Dutch, Basque, and Brazilian Portuguese, we find consistent evidence of periodic patterns in information rate. Many dominant frequencies align with discourse structure, suggesting these oscillations reflect meaningful linguistic organization. Beyond highlighting the connection between information rate and discourse structure, our approach offers a general framework for uncovering structural pressures at various levels of linguistic granularity.
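The core of harmonic regression can be sketched in a few lines: regress an information contour (e.g. a per-word surprisal series) on sine/cosine pairs at a candidate frequency and check how much variance the harmonic terms explain. This is a minimal illustration on synthetic data, not the paper's actual pipeline or its time-scaling extension.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
t = np.arange(n)
freq = 1 / 25                                # one cycle every 25 words
signal = 5 + 2 * np.sin(2 * np.pi * freq * t)
surprisal = signal + rng.normal(0, 0.5, n)   # synthetic information contour

# Design matrix: intercept + sine + cosine at the candidate frequency.
X = np.column_stack([
    np.ones(n),
    np.sin(2 * np.pi * freq * t),
    np.cos(2 * np.pi * freq * t),
])
beta, *_ = np.linalg.lstsq(X, surprisal, rcond=None)
fitted = X @ beta
r2 = 1 - ((surprisal - fitted) ** 2).sum() / ((surprisal - surprisal.mean()) ** 2).sum()
print(f"R^2 at period 25: {r2:.3f}")  # a high R^2 suggests periodicity
```

Scanning a grid of candidate frequencies and keeping those with significant fit is the basic recipe for detecting the dominant periodicities the abstract describes.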


MONSTER: Monash Scalable Time Series Evaluation Repository

Dempster, Angus, Foumani, Navid Mohammadi, Tan, Chang Wei, Miller, Lynn, Mishra, Amish, Salehi, Mahsa, Pelletier, Charlotte, Schmidt, Daniel F., Webb, Geoffrey I.

arXiv.org Artificial Intelligence

We introduce Monster--the MONash Scalable Time Series Evaluation Repository--a collection of large datasets for time series classification. The field of time series classification has benefitted from common benchmarks set by the UCR and UEA time series classification repositories. However, the datasets in these benchmarks are small, with median sizes of 217 and 255 examples, respectively. In consequence, they favour a narrow subspace of models that are optimised to achieve low classification error on a wide variety of smaller datasets, that is, models that minimise variance and give little weight to computational issues such as scalability. Our hope is to diversify the field by introducing benchmarks using larger datasets. We believe that there is enormous potential for new progress in the field by engaging with the theoretical and practical challenges of learning effectively from larger quantities of data.


Evaluating Deep Regression Models for WSI-Based Gene-Expression Prediction

Gustafsson, Fredrik K., Rantalainen, Mattias

arXiv.org Artificial Intelligence

Prediction of mRNA gene-expression profiles directly from routine whole-slide images (WSIs) using deep learning models could potentially offer cost-effective and widely accessible molecular phenotyping. While such WSI-based gene-expression prediction models have recently emerged within computational pathology, the high-dimensional nature of the corresponding regression problem offers numerous design choices which remain to be analyzed in detail. This study provides recommendations on how deep regression models should be trained for WSI-based gene-expression prediction. For example, we conclude that training a single model to simultaneously regress all 20530 genes is a computationally efficient yet very strong baseline.
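The "single model regressing all genes simultaneously" baseline can be illustrated with multi-output ridge regression on synthetic data. This is only a conceptual sketch under assumed toy dimensions (the paper targets 20530 genes and deep WSI models, not a linear closed form): one solve covers every gene column because all genes share the same input representation.

```python
import numpy as np

rng = np.random.default_rng(1)
n_slides, n_features, n_genes = 64, 32, 100   # toy sizes for illustration
X = rng.normal(size=(n_slides, n_features))   # e.g. pooled WSI embeddings
W_true = rng.normal(size=(n_features, n_genes))
Y = X @ W_true + 0.1 * rng.normal(size=(n_slides, n_genes))

lam = 1.0  # ridge penalty
# Closed-form multi-output ridge: a single linear solve fits all genes
# at once, instead of training one model per gene.
W = np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ Y)
Y_hat = X @ W
mse = float(((Y - Y_hat) ** 2).mean())
print("per-gene training MSE:", mse)
```

The computational argument carries over to deep models: a shared backbone with one output head per gene amortises the expensive feature extraction across all 20530 targets.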


The CAST package for training and assessment of spatial prediction models in R

Meyer, Hanna, Ludwig, Marvin, Milà, Carles, Linnenbrink, Jan, Schumacher, Fabian

arXiv.org Machine Learning

One key task in environmental science is to map environmental variables continuously in space, or even in space and time. Machine learning algorithms are frequently used to learn from local field observations and make spatial predictions by estimating the value of the variable of interest in places where it has not been measured. However, the application of machine learning strategies to spatial mapping involves additional challenges compared to "non-spatial" prediction tasks, challenges that often originate from spatial autocorrelation and from training data that are not independent and identically distributed. In the past few years, we have developed a number of methods to support the application of machine learning to spatial data, including suitable cross-validation strategies for performance assessment and model selection, spatial feature selection, and methods to assess the area of applicability of trained models. The intention of the CAST package is to support the application of machine learning strategies for predictive mapping by implementing such methods and making them available for easy integration into modelling workflows. Here we introduce the CAST package and its core functionalities. Using the case study of mapping plant species richness, we go through the different steps of the modelling workflow and show how CAST can be used to support more reliable spatial predictions.
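The idea behind spatial cross-validation can be sketched conceptually. CAST itself is an R package, so the Python below is only an illustration of the principle, with invented block sizes and coordinates: fold membership follows spatial blocks rather than random rows, so nearby, autocorrelated observations never end up on both sides of a train/test split.

```python
import numpy as np

rng = np.random.default_rng(2)
coords = rng.uniform(0, 100, size=(500, 2))     # x/y sample locations

def spatial_block_folds(coords, block_size=25.0):
    """Assign each point to a fold based on the grid cell it falls in."""
    cells = np.floor(coords / block_size).astype(int)
    # Map each unique (row, col) grid cell to a fold id.
    _, fold_id = np.unique(cells, axis=0, return_inverse=True)
    return fold_id

folds = spatial_block_folds(coords)
for k in np.unique(folds)[:2]:                  # leave-one-block-out
    train, test = folds != k, folds == k
    print(f"fold {k}: {train.sum()} train / {test.sum()} test points")
```

Random k-fold CV on such data would place spatially adjacent points in both train and test sets, inflating the apparent accuracy; block-wise folds give a more honest estimate of performance in unsampled regions.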


Seeing Numbers: Bayesian Optimisation of a LightGBM model

#artificialintelligence

In a classic case of "be careful what you search for," reading a couple of online articles on model hyper-parameter optimisation has led to my news feed being bombarded with how-to guides guaranteeing "the most powerful model possible" "in a few easy steps." What I do notice, however, is that few articles actually mention that hyper-parameter tuning is only part of the process and is not a silver-bullet solution for predictive power. Even fewer articles mention that the gains in predictive power from hyper-parameter optimisation are modest, and likely smaller than the gains from decent feature engineering. LightGBM is a gradient boosting framework which uses tree-based learning algorithms. It is an example of an ensemble technique, combining weak individual models into a single accurate model.
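The "weak learners combined into a strong model" idea behind gradient boosting can be shown in plain NumPy. This is a heavily simplified sketch of the principle, not LightGBM's actual algorithm (no leaf-wise tree growth, histograms, or regularisation): each round fits a one-split "stump" to the current residuals, and the ensemble prediction is the learning-rate-weighted sum of all stumps.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(-3, 3, 200)
y = np.sin(x) + 0.1 * rng.normal(size=200)

def fit_stump(x, residual):
    """Best single-threshold split minimising squared error."""
    best = None
    for thr in np.quantile(x, np.linspace(0.05, 0.95, 19)):
        left, right = residual[x <= thr].mean(), residual[x > thr].mean()
        err = ((residual - np.where(x <= thr, left, right)) ** 2).sum()
        if best is None or err < best[0]:
            best = (err, thr, left, right)
    _, thr, left, right = best
    return lambda z: np.where(z <= thr, left, right)

pred, lr, stumps = np.zeros_like(y), 0.3, []
for _ in range(50):                    # boosting rounds
    stump = fit_stump(x, y - pred)     # fit the residuals (MSE gradient)
    stumps.append(stump)
    pred += lr * stump(x)

mse = float(((y - pred) ** 2).mean())
print("train MSE after 50 rounds:", mse)
```

Note that the learning rate and number of rounds here are exactly the kind of hyper-parameters the article says tuning can refine, but no amount of tuning substitutes for informative input features.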