sharpness
Calibrating Scientific Foundation Models with Inference-Time Stochastic Attention
Yadav, Akash, Adebiyi, Taiwo A., Zhang, Ruda
Transformer-based scientific foundation models are increasingly deployed in high-stakes settings, but current architectures give deterministic outputs and provide limited support for calibrated predictive uncertainty. We propose Stochastic Attention, a lightweight inference-time modification that randomizes attention by replacing softmax weights with normalized multinomial samples controlled by a single concentration parameter, and produces predictive ensembles without retraining. To set this parameter, we introduce a calibration objective that matches the stochastic attention output with the target, yielding an efficient univariate post-hoc tuning problem. We evaluate this mechanism on two scientific foundation models for weather and timeseries forecasting along with an additional regression task. Across benchmarks against uncertainty-aware baselines, we find that Stochastic Attention achieves the strongest native calibration and the sharpest prediction intervals at comparable coverage, while requiring only minutes of post-hoc tuning versus days of retraining for competitive baselines.
- North America > United States > Texas > Harris County > Houston (0.14)
- North America > United States > New York > New York County > New York City (0.04)
- North America > United States > California > Monterey County > Monterey (0.04)
- Europe > France > Hauts-de-France > Nord > Lille (0.04)
Generalization at the Edge of Stability
Tuci, Mario, Korkmaz, Caner, Şimşekli, Umut, Birdal, Tolga
Training modern neural networks often relies on large learning rates, operating at the edge of stability, where the optimization dynamics exhibit oscillatory and chaotic behavior. Empirically, this regime often yields improved generalization performance, yet the underlying mechanism remains poorly understood. In this work, we represent stochastic optimizers as random dynamical systems, which often converge to a fractal attractor set (rather than a point) with a smaller intrinsic dimension. Building on this connection and inspired by Lyapunov dimension theory, we introduce a novel notion of dimension, coined the `sharpness dimension', and prove a generalization bound based on this dimension. Our results show that generalization in the chaotic regime depends on the complete Hessian spectrum and the structure of its partial determinants, highlighting a complexity that cannot be captured by the trace or spectral norm considered in prior work. Experiments across various MLPs and transformers validate our theory while also providing new insights into the recently observed phenomenon of grokking.
- Europe > France (0.14)
- North America > United States > Illinois > Cook County > Chicago (0.04)
- Europe > Italy (0.04)
When to Trust the Cheap Check: Weak and Strong Verification for Reasoning
Kiyani, Shayan, Noorani, Sima, Pappas, George, Hassani, Hamed
Reasoning with LLMs increasingly unfolds inside a broader verification loop. Internally, systems use cheap checks, such as self-consistency or proxy rewards, which we call weak verification. Externally, users inspect outputs and steer the model through feedback until results are trustworthy, which we call strong verification. These signals differ sharply in cost and reliability: strong verification can establish trust but is resource-intensive, while weak verification is fast and scalable but noisy and imperfect. We formalize this tension through weak--strong verification policies, which decide when to accept or reject based on weak verification and when to defer to strong verification. We introduce metrics capturing incorrect acceptance, incorrect rejection, and strong-verification frequency. Over population, we show that optimal policies admit a two-threshold structure and that calibration and sharpness govern the value of weak verifiers. Building on this, we develop an online algorithm that provably controls acceptance and rejection errors without assumptions on the query stream, the language model, or the weak verifier.
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (0.93)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.14)
- North America > United States > New Jersey > Mercer County > Princeton (0.04)
- Europe > France > Île-de-France > Paris > Paris (0.04)
- North America > United States > California > San Diego County > San Diego (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- North America > United States > Maryland > Prince George's County > College Park (0.04)
- North America > United States > California > Santa Clara County > Palo Alto (0.04)
- (2 more...)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
Super Consistency of Neural Network Landscapes and Learning Rate Transfer Lorenzo Noci
Recently, there has been growing evidence that if the width and depth of a neural network are scaled toward the so-called rich feature learning limit ( µ P and its depth extension), then some hyperparameters -- such as the learning rate -- exhibit transfer from small to very large models. From an optimization perspective, this phenomenon is puzzling, as it implies that the loss landscape is consistently similar across very different model sizes. In this work, we study the landscape through the lens of the loss Hessian, with a focus on its largest eigenvalue (i.e. the sharpness), and find that certain spectral properties under µ P are largely independent of the size of the network, and remain consistent as training progresses. We name this property Super Consistency of the landscape. On the other hand, we show that in the Neural Tangent Kernel (NTK) and other scaling regimes, the sharpness exhibits very different dynamics at different scales.
- North America > Canada > Ontario > Toronto (0.14)
- Europe > Germany > Baden-Württemberg > Tübingen Region > Tübingen (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Europe > Switzerland > Zürich > Zürich (0.04)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (0.92)
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Oceania > Australia > New South Wales > Sydney (0.04)
- North America > United States > Nevada (0.04)
- North America > Canada > Ontario > Toronto (0.04)
How Sparse Can We Prune A Deep Network: A Fundamental Limit Perspective
Network pruning is a commonly used measure to alleviate the storage and computational burden of deep neural networks. However, the fundamental limit of network pruning is still lacking. To close the gap, in this work we'll take a first-principles approach, i.e. we'll directly impose the sparsity constraint on the loss function and leverage the framework of statistical dimension in convex geometry, thus enabling us to characterize the sharp phase transition point, which can be regarded as the fundamental limit of the pruning ratio. Through this limit, we're able to identify two key factors that determine the pruning ratio limit, namely, weight magnitude and network sharpness .
- North America > Canada > Ontario > Toronto (0.14)
- Asia > China (0.04)
- Europe > Netherlands > North Holland > Amsterdam (0.04)
- Asia > Middle East > Israel (0.04)
- Leisure & Entertainment (0.46)
- Information Technology (0.46)