Large Language Model


Sink vs. diagonal patterns as mechanisms for attention switch and oversmoothing prevention

arXiv.org Machine Learning

This paper studies the role of sinks and diagonal patterns as attention switch and anti-oversmoothing mechanisms. We analyze geometric conditions under which sinks can be represented, showing a necessary alignment between the embedding of the sink and all other embeddings. Next, we refine the current understanding of the role of sinks in oversmoothing prevention: we specify the conditions under which dense attention provably smooths more than sparse attention, and empirically verify that such conditions are often satisfied in practice. We further prove an equivalence between sinks and hard attention switch, in which the output of the attention is identically 0. Finally, we relax the hard attention switch by allowing token self-communication: we provide a quantitative comparison of the costs of representing sinks vs. diagonal patterns, showing why sinks are favored in pretrained transformers. The introduction and analysis of diagonal patterns and the generalization of the attention switch close the gap between what oversmoothing prevention requires and what sinks provide, while also establishing when and why attention layers act like MLPs if token communication is not necessary.
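As a rough illustration of the sink-as-switch idea in the abstract above, the following toy sketch shows a single attention query whose weights collapse onto one "sink" token. It assumes, purely for illustration, that the sink carries a zero value vector; the scores, values, and function names are hypothetical and are not taken from the paper, whose exact construction is not given here. With softmax the output is only approximately zero, whereas the abstract describes a hard switch with output identically 0.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def attend(scores, values):
    # Attention output for one query: softmax-weighted sum of value vectors.
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

# Assumption: token 0 is the sink and has a zero value vector.
values = [[0.0, 0.0], [1.0, 2.0], [3.0, -1.0]]

sink_scores = [30.0, 0.0, 0.0]   # nearly all attention mass goes to the sink
dense_scores = [0.0, 0.0, 0.0]   # uniform (dense) attention over all tokens

switched_off = attend(sink_scores, values)   # close to the zero vector
switched_on = attend(dense_scores, values)   # plain average of the values

print(switched_off, switched_on)
```

In this toy setting, routing attention to the sink effectively turns the layer's token mixing off, while uniform attention averages the token values.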


CoDistill-GRPO: A Co-Distillation Recipe for Efficient Group Relative Policy Optimization

arXiv.org Machine Learning

Group Relative Policy Optimization (GRPO) has emerged as a powerful algorithm for improving the reasoning capabilities of language models, but often fails to improve small models due to sparse rewards on difficult tasks. Existing works mitigate this issue by leveraging a larger model, either to provide hints for rollouts or to provide dense reward signals through knowledge distillation (KD). However, this assumes the existence of such an oracle, and training one can significantly increase total training time. In this work, we propose CoDistill-GRPO, a co-distillation algorithm that simultaneously trains a large and a small model by maximizing carefully designed GRPO objectives. The two models learn from each other: the small model uses an on-policy KD reward to learn from the large model's distribution, while the large model is updated using rollouts generated by the small model with importance reweighting, reducing the computational overhead of rollout generation. We show that CoDistill-GRPO substantially improves small model performance over standard GRPO on mathematical benchmarks across both Qwen and Llama models. Specifically, with Qwen2.5-Math-1.5B, we observe an accuracy increase of over 11.6 percentage points over the base model and an additional 6.0 percentage points over GRPO on the Minerva dataset. Interestingly, the larger model (Qwen2.5-Math-7B) trained with CoDistill-GRPO nearly matches standard GRPO performance despite training on small-model rollouts. This highlights CoDistill-GRPO as a cost-effective alternative to GRPO for larger models, yielding an approximate 18% speedup, which may be of independent interest.
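To make the sparse-reward issue mentioned above concrete, here is a minimal sketch of the group-relative advantage commonly used in GRPO: each rollout's reward is normalized by the mean and standard deviation of its group. The function name, the `eps` constant, and the use of the population standard deviation are assumptions for illustration; this is not the CoDistill-GRPO objective, which the abstract does not spell out.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    # Normalize each reward against its own rollout group.
    # eps (an assumed constant) avoids division by zero when all rewards match.
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Sparse-reward case: every rollout fails, so every advantage is zero
# and the group contributes no learning signal.
all_failed = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

# One success in the group: the successful rollout gets a positive
# advantage, the failures a negative one.
one_success = group_relative_advantages([1.0, 0.0, 0.0, 0.0])

print(all_failed, one_success)
```

When a small model rarely solves a hard problem, most groups look like the first case, which is one way to read the abstract's remark that GRPO often fails to improve small models.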


Federated Language Models Under Bandwidth Budgets: Distillation Rates and Conformal Coverage

arXiv.org Machine Learning

Training a language model on data scattered across bandwidth-limited nodes that cannot be centralized is a setting that arises in clinical networks, enterprise knowledge bases, and scientific consortia. We study the regime in which data must remain distributed across nodes, and ask what statistical guarantees are in principle achievable under explicit bandwidth budgets; we aim to characterize what is provably possible, not to demonstrate a deployment-ready system. Existing theory treats either training-time consistency or inference-time calibration in isolation, and none makes bandwidth a first-class statistical parameter. We analyze two protocols, Federated Probe-Logit Distillation (FPLD) for training and Federated Conformal RAG (FC-RAG) for inference, as the analytical vehicles for our results. Our first main result is an explicit high-probability KL-consistency rate for FPLD with simultaneous dependence on node count $K$, per-node sample size $n$, quantization budget $B$, probe-set size $m$, and vocabulary size $V$; bandwidth enters only through an exponentially vanishing quantization term. Our second main result is a distribution-free marginal-coverage bound for FC-RAG, whose novel retrieval-bandwidth slack $\Delta_{\mathrm{RAG}} = f_{\max}\sqrt{K^{-2}\sum_i v(B_i)}$ makes per-node retrieval bandwidth a first-class statistical parameter, with arithmetic aggregation across $K$ nodes shrinking the slack as $K^{-1/2}$ in the per-node-uniform regime. A Pinsker-type corollary composes the two bounds into an end-to-end coverage guarantee. Synthetic experiments verify the predicted scaling along the bounds' parameters; small-scale experiments on a GPT-2 testbed illustrate that the qualitative bandwidth-accuracy tradeoff survives on a real language model. A deployment-scale empirical evaluation is out of scope.
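The $K^{-1/2}$ scaling stated above can be checked directly from the slack formula. The short derivation below assumes that "per-node-uniform" means every node has the same budget, so that $v(B_i) = v(B)$ for all $i = 1, \dots, K$; that reading is an assumption, since the abstract does not define the regime.

```latex
\Delta_{\mathrm{RAG}}
  = f_{\max}\sqrt{K^{-2}\sum_{i=1}^{K} v(B_i)}
  = f_{\max}\sqrt{K^{-2}\cdot K\, v(B)}
  = f_{\max}\sqrt{\frac{v(B)}{K}}
  = f_{\max}\sqrt{v(B)}\; K^{-1/2}
```

Under that assumption, quadrupling the number of nodes halves the slack when the per-node budget is held fixed.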


When Can Digital Personas Reliably Approximate Human Survey Findings?

arXiv.org Machine Learning

Digital personas powered by Large Language Models (LLMs) are increasingly proposed as substitutes for human survey respondents, yet it remains unclear when they can reliably approximate human survey findings. We answer this question using the LISS panel, constructing personas from respondents' background variables and pre-2023 survey histories, then testing them against the same respondents' held-out post-cutoff answers. Across four persona architectures, three LLMs, and two prediction tasks, we assess performance at the question, respondent, distributional, equity, and clustering levels. Digital personas improve alignment with human response distributions, especially in domains tied to stable attributes and values, but remain limited for individual prediction and fail to recover multivariate respondent structure. Retrieval-augmented architectures provide the clearest gains, but performance depends more on human response structure than on model choice: personas perform best for low-variability questions and common respondent patterns, and worst for subjective, heterogeneous, or rare responses. Our results provide practical guidance on when digital personas could be appropriate for survey research and when human validation remains necessary.


What should post-training optimize? A test-time scaling law perspective

arXiv.org Machine Learning

Large language models are increasingly deployed with test-time strategies: sample $N$ responses, score them with a reward model or verifier, and return the best. This deployment rule exposes a mismatch in post-training: standard objectives optimize the mean reward of a single response, whereas best-of-$N$ performance is governed by the upper tail of the reward distribution. Recent test-time-aware objectives partly address this mismatch, but typically assume that training can use the same per-prompt rollout budget as deployment, which is impractical when post-training must cover many prompts while deployment can allocate much larger per-prompt test-time compute. We study this budget-mismatch regime, where only $m\ll N$ per-prompt rollouts are available during training but the target objective is best-of-$N$ deployment. Under structural assumptions on the reward tails, we show that the policy gradient of the best-of-$N$ objective can be approximated from a much smaller rollout group by extrapolating upper-tail statistics. This yields a family of Tail-Extrapolated estimators for best-of-$N$-oriented post-training: a simple direct estimator, Tail-Extrapolated Advantage (TEA), and a fixed-order debiased Prefix-TEA estimator based on moment cancellation. Experiments on instruction-following tasks show that TEA and Prefix-TEA improve best-of-$N$ performance across different language models, reward models and datasets under various training and test-time budget settings.
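The deployment rule described above (sample $N$ responses, score them, return the best) can be sketched in a few lines. This is a toy illustration only: the "responses" are random numbers, the "reward model" is the identity function, and all names are hypothetical. It does not implement TEA or Prefix-TEA, which the abstract does not specify in detail.

```python
import random

def best_of_n(sample, score, n):
    # Draw n candidate responses and return the highest-scoring one.
    candidates = [sample() for _ in range(n)]
    return max(candidates, key=score)

rng = random.Random(0)

def sample():
    # Toy stand-in for a model response: a standard normal draw.
    return rng.gauss(0.0, 1.0)

def score(response):
    # Toy stand-in for a reward model or verifier.
    return response

def average_best(n, trials=2000):
    # Monte Carlo estimate of the expected reward of the best-of-n response.
    return sum(score(best_of_n(sample, score, n)) for _ in range(trials)) / trials

single = average_best(1)    # tracks the mean reward of one response
best_4 = average_best(4)    # driven by the upper tail
best_16 = average_best(16)  # driven even further into the tail

print(single, best_4, best_16)
```

The single-response average tracks the mean of the reward distribution, while the best-of-$N$ averages grow with $N$ because they depend on its upper tail, which is the mismatch the abstract points to.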


Ilya Sutskever Stands by His Role in Sam Altman's OpenAI Ouster: 'I Didn't Want It to Be Destroyed'

WIRED

The former OpenAI chief scientist may be estranged from the company, but he still came to its defense as he testified on Monday. Elon Musk's trial against OpenAI and Microsoft entered its final stretch on Monday, with testimony from Microsoft CEO Satya Nadella, former OpenAI chief scientist Ilya Sutskever, and current OpenAI chairman Bret Taylor. Sutskever drew the spotlight, revealing an ownership stake in OpenAI's $850-billion for-profit arm that is currently worth about $7 billion. That makes him one of the largest known individual shareholders of OpenAI. Earlier in the trial, OpenAI president Greg Brockman acknowledged for the first time that he has around $30 billion worth of OpenAI shares.


AI-powered hacking has exploded into industrial-scale threat, Google says

The Guardian

'There's a misconception that the AI vulnerability race is imminent. The reality is it's already begun,' said John Hultquist at Google's threat intelligence group. In just three months, AI-powered hacking has gone from a nascent problem to an industrial-scale threat, according to a report from Google.


The Download: the hantavirus outbreak and Musk v. Altman week 2

MIT Technology Review

Plus: Meta's embrace of AI is making employees miserable. Here's what you need to know about the cruise ship hantavirus outbreak: Last week, eight passengers aboard a Dutch-flagged cruise ship contracted a type of hantavirus transmitted by rats. But health experts stress that this situation is nothing like the coronavirus outbreak in 2020. The Andes virus is known to spread between people, and there are no specific antiviral treatments or vaccines. Yet transmission appears to require a specific form of contact that the cruise ship fostered. Here's what you need to know about the outbreak--and why experts believe it can be contained.


I Work in Hollywood. Everyone Who Used to Make TV Is Now Secretly Training AI

WIRED

For screenwriters like me--and job seekers all over--AI gig work is the new waiting tables. In eight months, I've done 20 of these soul-crushing contracts for five different platforms. My name on the platform is ri611. I work as an AI trainer. I assess whether a chatbot's tone is natural or flat, affected or annoying. I identify patterns in pictures of furniture; search the internet for group photos of strangers whom I'll eliminate from the portrait, one by one. I trawl through bizarre videos so I can annotate and time-stamp the barking of a dog, the moment a stranger walks past a window, the precise millisecond a balloon pops. I generate anime sex scenes and decapitate young women, coax LLMs into giving me recipes for bombs made of household items, and generate invites to a reprise of January 6 at the White House, all as part of a red team whose purpose is to test safety precautions and probe weaknesses. I work for companies with names like Mercor and Outlier and Task-ify and Turing and Handshake and Micro1. In my "other" career, I am a Hollywood writer and showrunner. I create prime-time TV, usually featuring a middle-class white lady having the worst day of her life, with some salt-of-the-earth police interference to raise the stakes. You can find my shows on Paramount and Hulu and the BBC.


CUDA Proves Nvidia Is a Software Company

WIRED

There's a deep, forbidding moat that surrounds Nvidia--and it has nothing to do with hardware. Forgive me for starting with a cliché, a piece of finance jargon that has recently slipped into the tech lexicon, but I'm afraid I must talk about "moats." Popularized decades ago by Warren Buffett to refer to a company's competitive advantage, the word found its way into Silicon Valley pitch decks when a memo purportedly leaked from Google, titled "We Have No Moat, and Neither Does OpenAI," fretted that open-source AI would pillage Big Tech's castle. A few years on, the castle walls remain safe. Apart from a brief bout of panic when DeepSeek first appeared, open-source AI models have not vastly outperformed proprietary models.