Statistical Learning
Forecast Sports Outcomes under Efficient Market Hypothesis: Theoretical and Experimental Analysis of Odds-Only and Generalised Linear Models
Goto, Kaito, Takeishi, Naoya, Yairi, Takehisa
Converting betting odds into accurate outcome probabilities is a fundamental challenge in order to use betting odds as a benchmark for sports forecasting and market efficiency analysis. In this study, we propose two methods to overcome the limitations of existing conversion methods. Firstly, we propose an odds-only method to convert betting odds to probabilities without using historical data for model fitting. While existing odds-only methods, such as Multiplicative, Shin, and Power exist, they do not adjust for biases or relationships we found in our betting odds dataset, which consists of 90014 football matches across five different bookmakers. To overcome these limitations, our proposed Odds-Only-Equal-Profitability-Confidence (OO-EPC) method aligns with the bookmakers' pricing objectives of having equal confidence in profitability for each outcome. We provide empirical evidence from our betting odds dataset that, for the majority of bookmakers, our proposed OO-EPC method outperforms the existing odds-only methods. Beyond controlled experiments, we applied the OO-EPC method under real-world uncertainty by using it for six iterations of an annual basketball outcome forecasting competition. Secondly, we propose a generalised linear model that utilises historical data for model fitting and then converts betting odds to probabilities. Existing generalised linear models attempt to capture relationships that the Efficient Market Hypothesis already captures. To overcome this shortcoming, our proposed Favourite-Longshot-Bias-Adjusted Generalised Linear Model (FL-GLM) fits just one parameter to capture the favourite-longshot bias, providing a more interpretable alternative. We provide empirical evidence from historical football matches where, for all bookmakers, our proposed FL-GLM outperforms the existing multinomial and logistic generalised linear models.
PAC-Bayes Bounds for Gibbs Posteriors via Singular Learning Theory
We derive explicit non-asymptotic PAC-Bayes generalization bounds for Gibbs posteriors, that is, data-dependent distributions over model parameters obtained by exponentially tilting a prior with the empirical risk. Unlike classical worst-case complexity bounds based on uniform laws of large numbers, which require explicit control of the model space in terms of metric entropy (integrals), our analysis yields posterior-averaged risk bounds that can be applied to overparameterized models and adapt to the data structure and the intrinsic model complexity. The bound involves a marginal-type integral over the parameter space, which we analyze using tools from singular learning theory to obtain explicit and practically meaningful characterizations of the posterior risk. Applications to low-rank matrix completion and ReLU neural network regression and classification show that the resulting bounds are analytically tractable and substantially tighter than classical complexity-based bounds. Our results highlight the potential of PAC-Bayes analysis for precise finite-sample generalization guarantees in modern overparameterized and singular models.
A Bayesian Updating Framework for Long-term Multi-Environment Trial Data in Plant Breeding
Bark, Stephan, Malik, Waqas Ahmed, Prus, Maryna, Piepho, Hans-Peter, Schmid, Volker
In variety testing, multi-environment trials (MET) are essential for evaluating the genotypic performance of crop plants. A persistent challenge in the statistical analysis of MET data is the estimation of variance components, which are often still inaccurately estimated or shrunk to exactly zero when using residual (restricted) maximum likelihood (REML) approaches. At the same time, institutions conducting MET typically possess extensive historical data that can, in principle, be leveraged to improve variance component estimation. However, these data are rarely incorporated sufficiently. The purpose of this paper is to address this gap by proposing a Bayesian framework that systematically integrates historical information to stabilize variance component estimation and better quantify uncertainty. Our Bayesian linear mixed model (BLMM) reformulation uses priors and Markov chain Monte Carlo (MCMC) methods to maintain the variance components as positive, yielding more realistic distributional estimates. Furthermore, our model incorporates historical prior information by managing MET data in successive historical data windows. Variance component prior and posterior distributions are shown to be conjugate and belong to the inverse gamma and inverse Wishart families. While Bayesian methodology is increasingly being used for analyzing MET data, to the best of our knowledge, this study comprises one of the first serious attempts to objectively inform priors in the context of MET data. This refers to the proposed Bayesian updating approach. To demonstrate the framework, we consider an application where posterior variance component samples are plugged into an A-optimality experimental design criterion to determine the average optimal allocations of trials to agro-ecological zones in a sub-divided target population of environments (TPE).
PRIM-cipal components analysis
Liu, Tianhao, Dรญaz-Pachรณn, Daniel Andrรฉs, Rao, J. Sunil
EVEN supervised learning is subject to the famous NoFree Lunch Theorems [1]-[3], which say that, in combinatorial optimization, there is no universal algorithm that works better than its competitors for every objective function [4]-[6]. Indeed, David Wolpert has recently proven that, on average, cross-validation performs as well as anti-crossvalidation (choosing among a set of candidate algorithms based on which has the worst out-of-sample behavior) for supervised learning. Still, he acknowledges that "it is hard to imagine any scientist who would not prefer to use [crossvalidation] to using anti-cross-validation" [7]. On the other hand, unsupervised learning has seldom been studied from the perspective of the NFLTs. This may be because the adjective "unsupervised" suggests that no human input is needed, which is misleading as many unsupervised tasks are combinatorial optimization problems that depend on the choice of the objective function. For instance, it is well known that, among the eigenvectors of the covariance matrix, Principal Components Analysis selects those with the largest variances [8]. However, mode-hunting techniques that rely on spectral manipulation aim at the opposite objective: selecting the eigenvectors of the covariance matrix with the smallest variances [9], [10]. Therefore, unlike in supervised learning, where it is difficult to identify reasons to optimize with respect to anti-cross-validation, in unsupervised learning there are strong reasons to reduce dimensionality for variance minimization. D. A. D ฤฑaz-Pach on and T. Liu are with the Division of Biostatistics, University of Miami, Miami, FL, 33136 USA (e-mail: ddiaz3@miami.edu,
Spurious Predictability in Financial Machine Learning
Adaptive specification search generates statistically significant backtests even under martingale-difference nulls. We introduce a falsification audit testing complete predictive workflows against synthetic reference classes, including zero-predictability environments and microstructure placebos. Workflows generating significant walk-forward evidence in these environments are falsified. For passing workflows, we quantify selection-induced performance inflation using an absolute magnitude gap linking optimized in-sample evidence to disjoint walk-forward realizations, adjusted for effective multiplicity. Simulations validate extreme-value scaling under correlated searches and demonstrate detection power under genuine structure. Empirical case studies confirm that many apparent findings represent methodological artifacts rather than genuine predictability.
Beyond Augmented-Action Surrogates for Multi-Expert Learning-to-Defer
Montreuil, Yannis, Carlier, Axel, Ng, Lai Xing, Ooi, Wei Tsang
Existing multi-expert learning-to-defer surrogates are statistically consistent, yet they can underfit, suppress useful experts, or degrade as the expert pool grows. We trace these failures to a shared architectural choice: casting classes and experts as actions inside one augmented prediction geometry. Consistency governs the population target; it says nothing about how the surrogate distributes gradient mass during training. We analyze five surrogates along both axes and show that each trades a fix on one for a failure on the other. We then introduce a decoupled surrogate that estimates the class posterior with a softmax and each expert utility with an independent sigmoid. It admits an $\mathcal{H}$-consistency bound whose constant is $J$-independent for fixed per-expert weight $ฮฒ{=}ฮป/J$, and its gradients are free of the amplification, starvation, and coupling pathologies of the augmented family. Experiments on synthetic benchmarks, CIFAR-10, CIFAR-10H, and Covertype confirm that the decoupled surrogate is the only method that avoids amplification under redundancy, preserves rare specialists, and consistently improves over a standalone classifier across all settings.
Beyond Fixed False Discovery Rates: Post-Hoc Conformal Selection with E-Variables
Conformal selection (CS) uses calibration data to identify test inputs whose unobserved outcomes are likely to satisfy a pre-specified minimal quality requirement, while controlling the false discovery rate (FDR). Existing methods fix the target FDR level before observing data, which prevents the user from adapting the balance between number of selected test inputs and FDR to downstream needs and constraints based on the available data. For example, in genomics or neuroimaging, researchers often inspect the distribution of test statistics, and decide how aggressively to pursue candidates based on observed evidence strength and available follow-up resources. To address this limitation, we introduce {post-hoc CS} (PH-CS), which generates a path of candidate selection sets, each paired with a data-driven false discovery proportion (FDP) estimate. PH-CS lets the user select any operating point on this path by maximizing a user-specified utility, arbitrarily balancing selection size and FDR. Building on conformal e-variables and the e-Benjamini-Hochberg (e-BH) procedure, PH-CS is proved to provide a finite-sample post-hoc reliability guarantee whereby the ratio between estimated FDP level and true FDP is, on average, upper bounded by $1$, so that the average estimated FDP is, to first order, a valid upper bound on the true FDR. PH-CS is extended to control quality defined in terms of a general risk. Experiments on synthetic and real-world datasets demonstrate that, unlike CS, PH-CS can consistently satisfy user-imposed utility constraints while producing reliable FDP estimates and maintaining competitive FDR control.
Scalable Model-Based Clustering with Sequential Monte Carlo
Trojan, Connie, Myshkov, Pavel, Fearnhead, Paul, Hensman, James, Minka, Tom, Nemeth, Christopher
In online clustering problems, there is often a large amount of uncertainty over possible cluster assignments that cannot be resolved until more data are observed. This difficulty is compounded when clusters follow complex distributions, as is the case with text data. Sequential Monte Carlo (SMC) methods give a natural way of representing and updating this uncertainty over time, but have prohibitive memory requirements for large-scale problems. We propose a novel SMC algorithm that decomposes clustering problems into approximately independent subproblems, allowing a more compact representation of the algorithm state. Our approach is motivated by the knowledge base construction problem, and we show that our method is able to accurately and efficiently solve clustering problems in this setting and others where traditional SMC struggles.
Improving Machine Learning Performance with Synthetic Augmentation
Sohm, Mel, Dezons, Charles, Sellami, Sami, Ninou, Oscar, Pincon, Axel
Synthetic augmentation is increasingly used to mitigate data scarcity in financial machine learning, yet its statistical role remains poorly understood. We formalize synthetic augmentation as a modification of the effective training distribution and show that it induces a structural bias--variance trade-off: while additional samples may reduce estimation error, they may also shift the population objective whenever the synthetic distribution deviates from regions relevant under evaluation. To isolate informational gains from mechanical sample-size effects, we introduce a size-matched null augmentation and a finite-sample, non-parametric block permutation test that remains valid under weak temporal dependence. We evaluate this framework in both controlled Markov-switching environments and real financial datasets, including high-frequency option trade data and a daily equity panel. Across generators spanning bootstrap, copula-based models, variational autoencoders, diffusion models, and TimeGAN, we vary augmentation ratio, model capacity, task type, regime rarity, and signal-to-noise. We show that synthetic augmentation is beneficial only in variance-dominant regimes, such as persistent volatility forecasting-while it deteriorates performance in bias-dominant settings, including near-efficient directional prediction. Rare-regime targeting can improve domain-specific metrics but may conflict with unconditional permutation inference. Our results provide a structural perspective on when synthetic data improves financial learning performance and when it induces persistent distributional distortion.
Generative Augmented Inference
Lu, Cheng, Wang, Mengxin, Zhang, Dennis J., Zhang, Heng
Data-driven operations management often relies on parameters estimated from costly human-generated labels. Recent advances in large language models (LLMs) and other AI systems offer inexpensive auxiliary data, but introduce a new challenge: AI outputs are not direct observations of the target outcomes, but could involve high-dimensional representations with complex and unknown relationships to human labels. Conventional methods leverage AI predictions as direct proxies for true labels, which can be inefficient or unreliable when this relationship is weak or misspecified. We propose Generative Augmented Inference (GAI), a general framework that incorporates AI-generated outputs as informative features for estimating models of human-labeled outcomes. GAI uses an orthogonal moment construction that enables consistent estimation and valid inference with flexible, nonparametric relationship between LLM-generated outputs and human labels. We establish asymptotic normality and show a "safe default" property: relative to human-data-only estimators, GAI weakly improves estimation efficiency under arbitrary auxiliary signals and yields strict gains whenever the auxiliary information is predictive. Empirically, GAI outperforms benchmarks across diverse settings. In conjoint analysis with weak auxiliary signals, GAI reduces estimation error by about 50% and lowers human labeling requirements by over 75%. In retail pricing, where all methods access the same auxiliary inputs, GAI consistently outperforms alternative estimators, highlighting the value of its construction rather than differences in information. In health insurance choice, it cuts labeling requirements by over 90% while maintaining decision accuracy. Across applications, GAI improves confidence interval coverage without inflating width. Overall, GAI provides a principled and scalable approach to integrating AI-generated information.