Goto

Collaborating Authors

 empirical analysis



An empirical analysis of compute-optimal large language model training

Neural Information Processing Systems

We investigate the optimal model size and number of tokens for training a transformer language model under a given compute budget. We find that current large language models are significantly undertrained, a consequence of the recent focus on scaling language models whilst keeping the amount of training data constant. By training over 400 language models ranging from 70 million to over 16 billion parameters on 5 to 500 billion tokens, we find that for compute-optimal training, the model size and the number of training tokens should be scaled equally: for every doubling of model size the number of training tokens should also be doubled. We test this hypothesis by training a predicted compute-optimal model, Chinchilla, that uses the same compute budget as Gopher but with 70B parameters and 4$\times$ more data.


Bias Out-of-the-Box: An Empirical Analysis of Intersectional Occupational Biases in Popular Generative Language Models

Neural Information Processing Systems

The capabilities of natural language models trained on large-scale data have increased immensely over the past few years. Open source libraries such as HuggingFace have made these models easily available and accessible. While prior research has identified biases in large language models, this paper considers biases contained in the most popular versions of these models when applied `out-of-the-box' for downstream tasks. We focus on generative language models as they are well-suited for extracting biases inherited from training data. Specifically, we conduct an in-depth analysis of GPT-2, which is the most downloaded text generation model on HuggingFace, with over half a million downloads per month. We assess biases related to occupational associations for different protected categories by intersecting gender with religion, sexuality, ethnicity, political affiliation, and continental name origin. Using a template-based data collection pipeline, we collect 396K sentence completions made by GPT-2 and find: (i) The machine-predicted jobs are less diverse and more stereotypical for women than for men, especially for intersections; (ii) Intersectional interactions are highly relevant for occupational associations, which we quantify by fitting 262 logistic models; (iii) For most occupations, GPT-2 reflects the skewed gender and ethnicity distribution found in US Labor Bureau data, and even pulls the societally-skewed distribution towards gender parity in cases where its predictions deviate from real labor market observations. This raises the normative question of what language models \textit{should} learn - whether they should reflect or correct for existing inequalities.


BitSkip: An Empirical Analysis of Quantization and Early Exit Composition

arXiv.org Artificial Intelligence

The pursuit of efficient Large Language Models (LLMs) has led to increasingly complex techniques like extreme quantization and dynamic routing. While individual benefits of these methods are well-documented, their compositional effects remain poorly understood. This paper introduces BitSkip, a hybrid architectural framework for systematically exploring these interactions. Counter-intuitively, our findings reveal that a simple 8-bit quantized model without Hadamard transform (BitSkip-V1) not only outperforms its more complex 4-bit and Hadamard-enhanced counterparts but also competes the full-precision baseline in quality (perplexity of 1.13 vs 1.19) . The introduction of Hadamard transforms, even at 8-bit precision, catastrophically degraded performance by over 37,000%, tracing fundamental training instability. Our BitSkip-V1 recipe demonstrates superior early-exit characteristics, with layer 18 providing optimal 32.5% speed gain for minimal 4% quality loss.


On Evaluating Loss Functions for Stock Ranking: An Empirical Analysis With Transformer Model

arXiv.org Artificial Intelligence

Quantitative trading strategies rely on accurately ranking stocks to identify profitable investments. Effective portfolio management requires models that can reliably order future stock returns. Transformer models are promising for understanding financial time series, but how different training loss functions affect their ability to rank stocks well is not yet fully understood. Financial markets are challenging due to their changing nature and complex relationships between stocks. Standard loss functions, which aim for simple prediction accuracy, often aren't enough. They don't directly teach models to learn the correct order of stock returns. While many advanced ranking losses exist from fields such as information retrieval, there hasn't been a thorough comparison to see how well they work for ranking financial returns, especially when used with modern Transformer models for stock selection. This paper addresses this gap by systematically evaluating a diverse set of advanced loss functions including pointwise, pairwise, listwise for daily stock return forecasting to facilitate rank-based portfolio selection on S&P 500 data. We focus on assessing how each loss function influences the model's ability to discern profitable relative orderings among assets. Our research contributes a comprehensive benchmark revealing how different loss functions impact a model's ability to learn cross-sectional and temporal patterns crucial for portfolio selection, thereby offering practical guidance for optimizing ranking-based trading strategies.


Mitigating the Noise Shift for Denoising Generative Models via Noise Awareness Guidance

arXiv.org Artificial Intelligence

Existing denoising generative models rely on solving discretized reverse-time SDEs or ODEs. In this paper, we identify a long-overlooked yet pervasive issue in this family of models: a misalignment between the pre-defined noise level and the actual noise level encoded in intermediate states during sampling. We refer to this misalignment as noise shift. Through empirical analysis, we demonstrate that noise shift is widespread in modern diffusion models and exhibits a systematic bias, leading to sub-optimal generation due to both out-of-distribution generalization and inaccurate denoising updates. To address this problem, we propose Noise Awareness Guidance (NAG), a simple yet effective correction method that explicitly steers sampling trajectories to remain consistent with the pre-defined noise schedule. We further introduce a classifier-free variant of NAG, which jointly trains a noise-conditional and a noise-unconditional model via noise-condition dropout, thereby eliminating the need for external classifiers. Extensive experiments, including ImageNet generation and various supervised fine-tuning tasks, show that NAG consistently mitigates noise shift and substantially improves the generation quality of mainstream diffusion models.



Export Reviews, Discussions, Author Feedback and Meta-Reviews

Neural Information Processing Systems

Show an illustration, what does it do intuitively? Example of how to choose the wild bootstrap process: for instance, the statistical learning theory reader might wonder whether it makes sense to use a correlated Rademacher or Gaussian process here.


Learning From Small Samples: An Analysis of Simple Decision Heuristics

Neural Information Processing Systems

Simple decision heuristics are models of human and animal behavior that use few pieces of information--perhaps only a single piece of information--and integrate the pieces in simple ways, for example, by considering them sequentially, one at a time, or by giving them equal weight. We focus on three families of heuristics: single-cue decision making, lexicographic decision making, and tallying. It is unknown how quickly these heuristics can be learned from experience. We show, analytically and empirically, that substantial progress in learning can be made with just a few training samples. When training samples are very few, tallying performs substantially better than the alternative methods tested. Our empirical analysis is the most extensive to date, employing 63 natural data sets on diverse subjects.


Empirical Analysis Of Heuristic and Approximation Algorithms for the The Mutual-Visibility Problem

arXiv.org Artificial Intelligence

The NP-complete mutual-visibility (MV) problem currently lacks empirical analysis on its practical behaviour despite theoretical studies. This paper addresses this gap by implementing and evaluating three distinct algorithms -- a direct random heuristic, a hypergraph-based approximation, and a genetic algorithm -- on diverse synthetic graph datasets, including those with analytically known $μ(G)$ values and general graph models. Our results demonstrate that for smaller graphs, the algorithms consistently achieve MV set sizes aligning with theoretical bounds. However, for larger instances, achieved solution sizes notably diverge from theoretical limits; this, combined with the absence of tight bounds, complicates absolute quality assessment. Nevertheless, validation on known optimal graphs showed the Genetic Algorithm and other heuristics empirically performing best among tested methods.