functional form
Online Generalised Predictive Coding
Bazargani, Mehran H. Z., Urbas, Szymon, Razi, Adeel, Murphy, Thomas Brendan, Friston, Karl
Despite being confined within the interior darkness of the skull, the human brain possesses a remarkable ability to interpret, understand and analyse the world out there, plan for unseen futures, and make decisions that can alter the course of events. This extraordinary capability is conjectured to come from the brain's function as a predictive machine, constantly inferring the hidden causes of its sensory inputs to maintain a coherent model of its environment. This view, which dates back to Helmholtz's idea of "perception as unconscious inference" (von Helmholtz, 1866) and evolved into the "Bayesian brain" hypothesis (Doya et al., 2007), suggests that the brain operates as a constructive statistical organ. It updates its beliefs about the external world based on incoming sensory data under a generative model (GM). The GM furnishes the brain with a structured representation that supports probabilistic beliefs over both the latent dynamical states of the external world, corresponding to the generative process (GP), and the observation mappings through which these states give rise to sensory signals. Essentially, the brain continually refines its probabilistic beliefs about both the latent states and the causal mechanisms of the world through a process of online triple estimation, jointly optimising beliefs over hidden states, model parameters, and their associated uncertainties in accordance with the principles of Bayesian inference (Eells, 2004; Parr et al., 2022). More technically, given a sensory observation $y_t$ at time $t$, perception can be formulated as an online triple estimation scheme with three components: 1) online hidden state inference, 2) online parameter learning, and 3) online uncertainty estimation. All three are core components of our proposed online generalised PC scheme and are elaborated in Section.
Large language models transition from integrating across position-yoked, exponential windows to structure-yoked, power-law windows
Modern language models excel at integrating across long temporal scales needed to encode linguistic meaning and show non-trivial similarities to biological neural systems. Prior work suggests that human brain responses to language exhibit hierarchically organized "integration windows" that substantially constrain the overall influence of an input token (e.g., a word) on the neural response. However, little prior work has attempted to use integration windows to characterize computations in large language models (LLMs). We developed a simple word-swap procedure for estimating integration windows from black-box language models that does not depend on access to gradients or knowledge of the model architecture (e.g., attention weights). Using this method, we show that trained LLMs exhibit stereotyped integration windows that are well fit by a convex combination of an exponential and a power-law function, with a partial transition from exponential to power-law dynamics across network layers. We then introduce a metric for quantifying the extent to which these integration windows vary with structural boundaries (e.g., sentence boundaries), and using this metric, we show that integration windows become increasingly yoked to structure at later network layers. None of these findings were observed in an untrained model, which, as expected, integrated uniformly across its input. These results suggest that LLMs learn to integrate information in natural language using a stereotyped pattern: integrating across position-yoked, exponential windows at early layers, followed by structure-yoked, power-law windows at later layers. The methods we describe in this paper provide a general-purpose toolkit for understanding temporal integration in language models, facilitating cross-disciplinary research at the intersection of biological and artificial intelligence.
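To make the "convex combination of an exponential and a power-law function" concrete, here is a minimal Python sketch of such a window as a function of token distance. The exact parameterisation is not given in the abstract, so the functional forms, the `+1` shift in the power law, and the names `tau`, `alpha` and `weight` are all illustrative assumptions, not the authors' fitted model.

```python
import math


def exponential_window(d, tau):
    """Exponential decay in token distance d; tau is a hypothetical timescale."""
    return math.exp(-d / tau)


def power_law_window(d, alpha):
    """Power-law decay; the +1 shift (an assumption) keeps the value finite at d = 0."""
    return (1.0 + d) ** (-alpha)


def mixture_window(d, weight, tau, alpha):
    """Convex combination of the two shapes; weight in [0, 1] goes to the exponential."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return (weight * exponential_window(d, tau)
            + (1.0 - weight) * power_law_window(d, alpha))


# Illustrative (made-up) settings: a mostly exponential "early layer"
# versus a mostly power-law "late layer".
distances = [0, 1, 4, 16, 64]
early = [mixture_window(d, weight=0.9, tau=4.0, alpha=1.0) for d in distances]
late = [mixture_window(d, weight=0.1, tau=4.0, alpha=1.0) for d in distances]
```

With these made-up settings, the power-law-dominated window retains far more influence at long distances than the exponential-dominated one, which is the qualitative contrast the abstract draws between later and earlier layers.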
Dynamic Pricing with Monotonicity Constraint under Unknown Parametric Demand Model
We consider the Continuum Bandit problem where the goal is to find the optimal action under an unknown reward function, with an additional monotonicity constraint (or, markdown constraint) that requires that the action sequence be non-increasing. This problem faithfully models a natural single-product dynamic pricing problem, called markdown pricing, where the objective is to adaptively reduce the price over a finite sales horizon to maximize expected revenues. Jia et al '21 and Chen '21 independently showed a tight $T^{3/4}$ regret bound over $T$ rounds under *minimal* assumptions of unimodality and Lipschitzness in the reward (or, revenue) function. This bound shows that the demand learning in markdown pricing is harder than unconstrained (i.e., without the monotonicity constraint) pricing under unknown demand which suffers regret only of the order of $T^{2/3}$ under the same assumptions (Kleinberg '04). However, in practice the demand functions are usually assumed to have certain functional forms (e.g.
Revisiting the Scaling Properties of Downstream Metrics in Large Language Model Training
Krajewski, Jakub, Shidani, Amitis, Busbridge, Dan, Wiseman, Sam, Ramapuram, Jason
Large Language Models (OpenAI et al., 2024; Team et al., 2025; DeepSeek-AI et al., 2025) based on the Transformer (Vaswani et al., 2023) architecture have achieved impressive results, approaching or exceeding human-level performance across multiple domains. Scaling laws (Hestness et al., 2017; Kaplan et al., 2020) are an established method for modeling the performance of these networks, enabling researchers to plan large-scale training runs based on curated sets of smaller experiments. Traditionally, these laws focus on predicting proxy metrics for model quality, such as pre-training log-perplexity. This has proven invaluable for optimizing training hyperparameters, like the optimal ratio of tokens to parameters. Another important direction in understanding the scaling of LLMs is tracking the behavior of more interpretable indicators of model capabilities, like accuracy on downstream benchmarks measuring performance on general knowledge, reasoning, math and coding tasks. Despite early attempts to solve this problem (Grattafiori et al., 2024; Isik et al., 2025; Chen et al., 2025), the scaling of downstream metrics has often been described as noisy and unreliable (Schaeffer et al., 2025; Lourie et al., 2025). Current approaches to modeling the downstream performance of LLMs (Grattafiori et al., 2024; Chen et al., 2025; Bhagia et al., 2024) typically rely on a two-stage approach, where the training budget is first mapped to a proxy metric like mean log-probability of the correct answer, and then a second dependence is established, mapping the proxy metric to benchmark accuracy. Work done as an intern at Apple.
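The two-stage approach described above can be sketched as a composition of two fitted curves. The abstract does not state the functional forms, so the power law for the first stage, the logistic link for the second, the use of a negative log-probability as the proxy, and every coefficient below are hypothetical choices made purely for illustration.

```python
import math


def proxy_from_compute(compute, a, b, floor):
    """Stage 1 (assumed form): power law from training compute to a loss-like proxy,
    here the negative mean log-probability of the correct answer."""
    return floor + a * compute ** (-b)


def accuracy_from_proxy(proxy, chance, slope, midpoint):
    """Stage 2 (assumed form): logistic link from the proxy to benchmark accuracy,
    bounded below by chance-level accuracy and above by 1."""
    return chance + (1.0 - chance) / (1.0 + math.exp(slope * (proxy - midpoint)))


def two_stage_accuracy(compute, stage1, stage2):
    """Compose the two stages: compute -> proxy -> accuracy."""
    return accuracy_from_proxy(proxy_from_compute(compute, **stage1), **stage2)


# Made-up coefficients, not fitted to any real data.
stage1 = {"a": 20.0, "b": 0.1, "floor": 0.5}
stage2 = {"chance": 0.25, "slope": 20.0, "midpoint": 0.7}

budgets = [1e18, 1e19, 1e20, 1e21, 1e22]
predicted = [two_stage_accuracy(c, stage1, stage2) for c in budgets]
```

In practice each stage would be fitted separately on small-scale runs, so errors in the first fit propagate into the second; the sketch only shows how the two mappings chain together.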