- North America > United States > New York > Erie County > Buffalo (0.05)
- Oceania > Australia > Victoria > Melbourne (0.04)
- Oceania > Australia > New South Wales > Sydney (0.04)
- (4 more...)
Critical attention scaling in long-context transformers
Chen, Shi, Lin, Zhengjiang, Polyanskiy, Yury, Rigollet, Philippe
As large language models scale to longer contexts, attention layers suffer from a fundamental pathology: attention scores collapse toward uniformity as the context length $n$ increases, causing tokens to cluster excessively, a phenomenon known as rank collapse. While $\textit{attention scaling}$ effectively addresses this deficiency by rescaling attention scores with a polylogarithmic factor $\beta_n$, theoretical justification for this approach remains lacking. We analyze a simplified yet tractable model that magnifies the effect of attention scaling. In this model, attention exhibits a phase transition governed by the scaling factor $\beta_n$: insufficient scaling collapses all tokens to a single direction, while excessive scaling reduces attention to the identity, thereby eliminating meaningful interactions between tokens. Our main result identifies the critical scaling $\beta_n \asymp \log n$ and provides a rigorous justification for the attention scaling used in YaRN and Qwen, clarifying why logarithmic scaling maintains sparse, content-adaptive attention at large context lengths.
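A minimal sketch of the mechanism, assuming a plain single-head attention layer (illustrative code with hypothetical names, not the paper's model): the pre-softmax scores are multiplied by $\beta_n \propto \log n$, and driving that scale toward zero or toward infinity reproduces the two failure modes described in the abstract.

```python
import numpy as np

def scaled_attention(Q, K, V, beta_scale=1.0):
    """Single-head attention with logarithmic attention scaling (sketch).

    Pre-softmax scores are multiplied by beta_n = beta_scale * log(n),
    where n is the context length; beta_n ~ log n is the critical
    scaling identified in the abstract.
    """
    n, d = Q.shape
    beta_n = beta_scale * np.log(n)               # beta_n ~ log n
    scores = beta_n * (Q @ K.T) / np.sqrt(d)      # rescaled attention scores
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

# beta_scale -> 0: weights tend to uniform (tokens cluster);
# beta_scale too large: each row concentrates on its argmax,
# a near-identity map when Q = K.
rng = np.random.default_rng(0)
Q = K = V = rng.standard_normal((1024, 64))
out = scaled_attention(Q, K, V)
```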
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.14)
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
Supplementary Material: Additional Notation
A.1 Robust Mean Estimation from Subset Stability
The upper bound is always less than ... for $m \le n$. Let $m$ be the largest value of $f(x)$ for any $x \in T$ with $w(x) \neq 0$. Thus, by the weighted version of Lemma 2.4 of [DK19], we have that ... In Section B.1, we show a result stating that pre-processing on i.i.d. points yields a set that contains ... Then, in Section B.2, we use a coupling argument to show ... We recall the median-of-means principle. We now state our main result in this section, proved using minimax duality, that Theorem B.1 implies ... We first consider the case of i.i.d. ... In particular, Lemma E.2 shows that we can deterministically round ... We now prove Theorem 1.7, i.e., stability of a subset after corruption, using Theorem B.2.
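The median-of-means principle recalled above admits a compact illustration; the following sketch (illustrative only, not the estimator analyzed in this supplement) splits the sample into buckets, averages within each, and reports the median of the bucket means, which a handful of gross outliers cannot move far.

```python
import numpy as np

def median_of_means(x, k=25):
    """Median-of-means estimate of the mean of a 1-D sample.

    Split the n points into k buckets at random, average each bucket,
    and return the median of the k bucket means. A few adversarial
    points corrupt at most a few buckets, so the median of the bucket
    means stays close to the true mean.
    """
    x = np.asarray(x, dtype=float)
    rng = np.random.default_rng(0)
    x = rng.permutation(x)                  # random bucket assignment
    buckets = np.array_split(x, k)
    means = np.array([b.mean() for b in buckets])
    return np.median(means)

# Example: 4 gross outliers corrupt at most 4 of 25 buckets.
rng = np.random.default_rng(1)
sample = rng.standard_normal(1000)
sample[:4] = 1e6
print(median_of_means(sample, k=25))        # close to 0
```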
Anchor-MoE: A Mean-Anchored Mixture of Experts For Probabilistic Regression
Regression under uncertainty is fundamental across science and engineering. We present the Anchored Mixture of Experts (Anchor-MoE), a model that handles both probabilistic and point regression. For simplicity, we use a tuned gradient-boosting model to furnish the anchor mean; however, any off-the-shelf point regressor can serve as the anchor. The anchor prediction is projected into a latent space, where a learnable metric-window kernel scores locality and a soft router dispatches each sample to a small set of mixture-density-network experts; the experts produce a heteroscedastic correction and a predictive variance. We train by minimizing the negative log-likelihood and, on a disjoint calibration split, fit a post-hoc linear map on the predicted means to improve point accuracy. On the theory side, assuming a Hölder-smooth regression function of order~$\alpha$ and fixed Lipschitz partition-of-unity weights with bounded overlap, we show that Anchor-MoE attains the minimax-optimal $L^2$ risk rate $O\!\big(N^{-2\alpha/(2\alpha+d)}\big)$. In addition, the CRPS test generalization gap scales as $\widetilde{O}\!\Big(\sqrt{(\log(Mh)+P+K)/N}\Big)$; it is logarithmic in $Mh$ and grows as the square root of $P$ and $K$. Under bounded-overlap routing, $K$ can be replaced by $k$, and any dependence on a latent dimension is absorbed into $P$. Under uniformly bounded means and variances, an analogous $\widetilde{O}\!\big(\sqrt{(\log(Mh)+P+K)/N}\big)$ scaling holds for the test NLL up to constants. Empirically, across standard UCI regression benchmarks, Anchor-MoE consistently matches or surpasses the strong NGBoost baseline in RMSE and NLL; on several datasets it achieves new state-of-the-art probabilistic regression results on our benchmark suite. Code is available at https://github.com/BaozhuoSU/Probabilistic_Regression.
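Read as a predictive density, the architecture described above amounts to a mixture of Gaussians whose means are anchored at the point regressor's prediction; a minimal sketch of the forward pass follows (hypothetical function names and interfaces, not the released code at the repository above).

```python
import numpy as np

def anchor_moe_density(y, x_latent, anchor_mean, experts, router):
    """Predictive density of a mean-anchored mixture of experts (sketch).

    anchor_mean : the anchor regressor's point prediction for this sample
    experts     : list of callables z -> (delta, sigma), each expert's
                  heteroscedastic correction and predictive std
    router      : callable z -> routing weights over experts (softmax over
                  metric-window kernel scores; sums to 1)
    """
    w = router(x_latent)                       # soft routing weights
    density = 0.0
    for w_k, expert in zip(w, experts):
        delta, sigma = expert(x_latent)        # correction and spread
        mu = anchor_mean + delta               # anchored expert mean
        density += w_k * np.exp(-0.5 * ((y - mu) / sigma) ** 2) \
                   / (sigma * np.sqrt(2 * np.pi))
    return density  # training minimizes -log(density) over the data

# Toy usage: two experts, uniform router (illustrative only).
experts = [lambda z: (0.0, 1.0), lambda z: (0.5, 2.0)]
router = lambda z: np.array([0.5, 0.5])
print(anchor_moe_density(y=1.0, x_latent=None, anchor_mean=0.8,
                         experts=experts, router=router))
```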
9ecff5455677b38d19f49ce658ef0608-AuthorFeedback.pdf
We thank the reviewers for their positive and constructive feedback. We address several of the points raised in the reviews below. The bias reduction technique in Section 5 is designed for DP-SGD with clipping; when it is applied to DP-SGD, the update rule is shown below. Typos: thank you for pointing them out; we will correct them.
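For orientation only (the specific rule from Section 5 is not reproduced here), a standard DP-SGD step with per-example clipping (Abadi et al., 2016) looks like the sketch below; the authors' bias-reduction modification is not included.

```python
import numpy as np

def dp_sgd_step(params, per_example_grads, rng,
                lr=0.1, clip_norm=1.0, noise_mult=1.0):
    """One standard DP-SGD update with per-example gradient clipping.

    Textbook rule (Abadi et al., 2016): clip each example's gradient to
    norm clip_norm, sum, add Gaussian noise of scale noise_mult * clip_norm,
    average, and take a gradient step. The bias-reduction modification
    from the paper's Section 5 is NOT reproduced here.
    """
    clipped = [g * min(1.0, clip_norm / max(np.linalg.norm(g), 1e-12))
               for g in per_example_grads]
    noise = rng.normal(0.0, noise_mult * clip_norm, size=params.shape)
    g_tilde = (np.sum(clipped, axis=0) + noise) / len(clipped)
    return params - lr * g_tilde

# Example usage with toy per-example gradients.
rng = np.random.default_rng(0)
params = np.zeros(10)
grads = [rng.standard_normal(10) for _ in range(32)]
params = dp_sgd_step(params, grads, rng)
```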
Supplementary Material: Fairness in Ranking under Uncertainty
A Related Work
The group fairness perspective imposes constraints like demographic parity (Calders et al., 2009; Zliobaite, 2015) and equalized odds (Hardt et al., 2016). Although similar in spirit, our work sidesteps the need to define a similarity metric between agents in the feature space; rather, we view an agent's ... Ranking has been widely studied in the field of Information Retrieval (IR), mostly in the context of optimizing user utility. The Probability Ranking Principle (PRP) (Robertson, 1977), a guiding principle for ranking in IR, states that user utility is optimal when documents (i.e., the agents) are ... Besides ranking diversity, IR methods have dealt with uncertainty in relevance that comes via users' implicit or explicit feedback (Penha and Hauff, 2021; Soufiani et al., 2012), as well as stochasticity arising ... Kearns et al. (2017) present a way to fairly select ... Hence, they propose using the true CDF rank as a derived merit criterion that can be compared. Thus, a fair principal stands to gain more by obtaining perfect information.
Point Prediction for Streaming Data
Chanda, Aleena, Vinodchandran, N. V., Clarke, Bertrand
We present two new approaches for point prediction with streaming data. One is based on the Count-Min sketch (CMS) and the other on Gaussian process priors with a random bias. These methods are intended for the most general predictive problems, where no true model can be usefully formulated for the data stream. In statistical contexts, this is often called the $\mathcal{M}$-open problem class. Under the assumption that the data consist of i.i.d. samples from a fixed distribution function $F$, we show that the CMS-based estimates of the distribution function are consistent. We compare our new methods with two established predictors in terms of cumulative $L^1$ error: one based on the Shtarkov solution (often called the normalized maximum likelihood) in the normal experts setting, and the other based on Dirichlet process priors. These comparisons are made in two settings. The first is one-pass, meaning that the predictors are updated from each data point as it arrives, using the fact that the CMS is a sketch. For predictors that are not one-pass, we use streaming $K$-means to maintain a representative subset of fixed size that can be updated as data accumulate. Preliminary computational work suggests that the one-pass median version of the CMS method is rarely outperformed by the other methods on sufficiently complex data. We also find that predictors based on Gaussian process priors with random biases perform well. The Shtarkov predictors we use here did not perform as well, probably because we used only the simplest example. The other predictors seemed to perform well mainly when the data did not look like they came from an $\mathcal{M}$-open data generator.
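As a hedged illustration of the CMS-based approach (not the authors' implementation), a Count-Min sketch maintains approximate counts over a stream in fixed memory; binning real-valued observations and summing bin counts up to a threshold yields the kind of distribution-function estimate the abstract describes.

```python
import numpy as np

class CountMinSketch:
    """Count-Min sketch: fixed-memory approximate counts for a stream.

    depth independent hash rows of width w; a query returns the minimum
    over rows, which overestimates the true count by at most ~ n/w
    with high probability.
    """
    def __init__(self, width=2048, depth=5, seed=0):
        self.w, self.d = width, depth
        self.table = np.zeros((depth, width), dtype=np.int64)
        rng = np.random.default_rng(seed)
        self.salts = rng.integers(1, 2**31, size=depth)

    def _idx(self, item):
        return [hash((int(s), item)) % self.w for s in self.salts]

    def update(self, item):
        for row, col in enumerate(self._idx(item)):
            self.table[row, col] += 1

    def query(self, item):
        return min(self.table[row, col]
                   for row, col in enumerate(self._idx(item)))

# Sketch of a distribution-function estimate: bin the stream, count bins,
# then sum bin counts up to x (illustrative, not the paper's estimator).
cms, n, edges = CountMinSketch(), 0, np.linspace(-4, 4, 81)
rng = np.random.default_rng(2)
for x in rng.standard_normal(10000):
    cms.update(int(np.digitize(x, edges)))
    n += 1
F_hat = lambda x: sum(cms.query(b)
                      for b in range(int(np.digitize(x, edges)) + 1)) / n
print(F_hat(0.0))  # approximately 0.5 for standard normal data
```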
- North America > United States > Nebraska > Lancaster County > Lincoln (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- North America > United States > Hawaii > Honolulu County > Honolulu (0.04)
- (4 more...)
- Information Technology > Modeling & Simulation (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.34)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.34)