 Large Language Model


When Is Next-Token Prediction Useful? Marginalization, Ergodicity, Mixture Identifiability, Local Sufficiency, RAG, Tools, and Programming

arXiv.org Machine Learning

Language models trained on observed sequences are often described as learning the conditional distribution of the next token given previous tokens. This description is only conditionally correct. A model trained on realized token trajectories does not observe full conditional laws; it receives sampled continuations. Moreover, real language generation is conditioned not only on previous words but also on non-textual circumstances: facts, events, intentions, goals, beliefs, social context, and task-specific constraints. This paper distinguishes three objects that are often conflated: the full conditional language process conditioned on latent circumstances, the marginal text-only process obtained by integrating those circumstances out, and the model-induced distribution learned from finite observed corpora. The paper argues that interpreting model training as estimating the marginal text-only law requires strong assumptions of stationarity, representativeness, and ergodicity, assumptions that are standard in statistical estimation but problematic when applied to heterogeneous language corpora. Even if these assumptions hold, the marginal text-only law is useful only when the observed prefix is an approximately sufficient statistic for the latent circumstances relevant to continuation. In information-theoretic terms, usefulness requires that the residual conditional mutual information between the next token and the omitted circumstances, given the observed text, be small. The paper then extends this argument to heterogeneous training corpora. Finally, the paper interprets Retrieval Augmented Generation (RAG) and tool use as conditional sufficiency devices.
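
The usefulness condition stated in the abstract can be written compactly. The notation here is an assumption introduced for illustration and is not taken from the paper: $X_{1:t}$ is the observed prefix, $X_{t+1}$ the next token, $C$ the omitted latent circumstances, and $\varepsilon$ a small tolerance.

$$
I(X_{t+1}; C \mid X_{1:t}) \;=\; H(X_{t+1} \mid X_{1:t}) \;-\; H(X_{t+1} \mid X_{1:t}, C) \;\le\; \varepsilon
$$

Read this way, the prefix is approximately sufficient when knowing $C$ removes almost no further uncertainty about the next token. Under the same reading, retrieval and tool calls add information to the conditioning side, which can shrink the residual term.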


Instance-Optimal Estimation with Multiple LLM Judges on a Budget

arXiv.org Machine Learning

Evaluating large language models increasingly relies on LLM-as-a-judge protocols, but such evaluations remain costly: different judges have different prices and reliabilities, and the difficulty of each prompt-response pair can vary substantially. This raises a basic allocation question: under a fixed budget, how should one distribute evaluation queries across heterogeneous judges and instances to obtain the most accurate score estimates? We formalize this question as *budgeted heteroskedastic multi-judge estimation*. Given $K$ prompt-response pairs, $J$ judges with known costs, and unknown query-judge variances, the goal is to estimate a bounded score vector while minimizing an $\ell_p$-error. Our first contribution is to analyze the inverse-variance weighted estimator (IVWE) and to derive the oracle allocation that minimizes its error rate. Since this allocation depends on the unknown variances, we then address the practical unknown-variance setting by proposing EST-IVWE, an adaptive algorithm that constructs and leverages *optimistically biased* variance estimates to stabilize the empirical allocation. We prove that EST-IVWE matches the oracle IVWE rate up to lower-order terms in the budget. Our second and central theoretical contribution is a matching *local* minimax lower bound, which establishes the instance-optimality of the proposed algorithms. A key technical insight is that Fano-type high-probability arguments are too coarse for this problem: their packing construction loses the local variance structure that governs the optimal allocation. We instead use an Assouad-type in-expectation argument, based on local perturbations, which preserves this structure and yields the sharp allocation-dependent lower bound. Finally, we numerically validate the superiority of our approach over naïve uniform allocation on synthetic and HelpSteer2 datasets.
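
As a rough illustration of the inverse-variance weighting idea named in the abstract, the sketch below combines per-judge mean scores for a single prompt-response pair. It is the textbook estimator with known variances, not the paper's EST-IVWE algorithm or its oracle allocation; the function name and arguments are hypothetical.

```python
from typing import Sequence


def inverse_variance_weighted_mean(
    means: Sequence[float],
    variances: Sequence[float],
    counts: Sequence[int],
) -> float:
    """Combine per-judge mean scores for one prompt-response pair.

    Assumption (illustrative, not from the paper): judge j was queried
    counts[j] times, each score has known variance variances[j] > 0, and
    means[j] is the average of those scores. That average has variance
    variances[j] / counts[j], so its weight is counts[j] / variances[j].
    """
    weights = []
    used_means = []
    for mean, var, n in zip(means, variances, counts):
        if n <= 0:
            continue  # a judge that was never queried contributes nothing
        if var <= 0:
            raise ValueError("variances must be strictly positive")
        weights.append(n / var)
        used_means.append(mean)
    if not weights:
        raise ValueError("at least one judge must have been queried")
    total = sum(weights)
    return sum(w * m for w, m in zip(weights, used_means)) / total


# Two judges, one query each: the lower-variance judge dominates the estimate.
estimate = inverse_variance_weighted_mean(
    means=[0.8, 0.4], variances=[0.01, 0.04], counts=[1, 1]
)
print(estimate)  # approximately 0.72
```

The weights here are 100 and 25, so the estimate sits much closer to the more reliable judge's score than a plain average of 0.6 would.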


Training-Free Looped Transformers

arXiv.org Machine Learning

We introduce training-free looped transformers, in which a lightweight inference-time wrapper loops a contiguous mid-stack block of layers of a frozen checkpoint without additional fine-tuning, continued training, or architectural changes. Unlike prior looped transformer methods that train with the looped structure end-to-end, we retrofit recurrence onto pretrained models at test time. We show that naive block reapplication usually degrades performance, highlighting the importance of the loop application strategy. Motivated by viewing a pre-norm transformer block as a forward Euler step on an ODE, we instead treat looping as a refinement of the same approximation, replacing one large update with smaller damped sub-steps. Across seven dense, sparse MoE, and MLA+MoE model families, our method improves Qwen3-4B-Instruct by +2.64 pp on MMLU-Pro, Qwen3-30B-A3B-Instruct by +1.14 pp on CommonsenseQA, and Moonlight-16B-A3B-Instruct by +1.20 pp on OpenBookQA.
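
The forward-Euler view mentioned in the abstract can be illustrated with a toy scalar example. This is a minimal sketch only: the abstract does not specify the damping schedule, so the uniform step size `1 / num_substeps`, the function names, and the toy residual branch are all assumptions made for illustration.

```python
from typing import Callable


def single_step(x: float, block: Callable[[float], float]) -> float:
    """One residual update x + f(x), read as a forward Euler step of size 1."""
    return x + block(x)


def damped_substeps(
    x: float, block: Callable[[float], float], num_substeps: int
) -> float:
    """Replace the single update with num_substeps smaller updates.

    Assumption: each sub-step uses the uniform step size 1 / num_substeps,
    so the total integration time matches the single step.
    """
    if num_substeps < 1:
        raise ValueError("num_substeps must be at least 1")
    step = 1.0 / num_substeps
    for _ in range(num_substeps):
        x = x + step * block(x)
    return x


def toy_block(x: float) -> float:
    """Hypothetical stand-in for a residual branch: f(x) = -x."""
    return -x


print(single_step(1.0, toy_block))         # 0.0
print(damped_substeps(1.0, toy_block, 2))  # 0.25
print(damped_substeps(1.0, toy_block, 4))  # 0.31640625
```

With one sub-step the two functions agree. With more sub-steps the result moves toward the exact solution of dx/dt = -x at t = 1, which is exp(-1), about 0.368. Naive reapplication, by contrast, would apply the full-size update repeatedly.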


Scotland's 'green datacentres' policy ignores emissions impact of AI, analysis shows

The Guardian

Facilities can be branded as aligned with Scotland's climate goals despite significant emissions, said APRS. A Scottish government policy designed to encourage datacentres to build in Scotland could lead to a massive volume of carbon emissions being ignored, according to an analysis by a Scottish charity. "Green datacentres" are at the heart of Scotland's ambitions to develop economically. Enshrined in national policy, they are part of a larger, UK-wide effort to attract big AI investment to Scotland.


I'm a Professional Writer Who Uses a Very Controversial Tool. It's Not As Scary As I Thought.

Slate

I was skeptical about ChatGPT and Claude at first. Then I started to come around--and I'm glad I did.


I avoid AI tools because thinking is supposed to be hard. It's what makes us human Wendy Liu

The Guardian

Long before the age of multi-billion-dollar AI companies promising to disrupt the field of software development, I was learning to code the hard way. It was the mid-2000s, and I was a child with unmonitored access to the family computer. With the help of a basic text editor program, I learned how to make websites - first basic, then increasingly complex - from scratch. The results were never as beautiful or polished as in my imagination, but I could live with that, because I was learning a craft. The painstaking hours of debugging and poring over arcane documentation for projects that I eventually abandoned never felt wasted.


DeepSeek permanently reduces the price of its flagship V4 model by 75 percent

Engadget

The lower prices could be aimed at undercutting the competition. DeepSeek is leaning hard into being the cost-effective choice for AI agents. According to its website, the Chinese startup is dropping the price for its latest flagship model, DeepSeek V4 Pro, to a fourth of its original price. This latest price update makes permanent the 75 percent discount promotion that was previously supposed to end on May 31, 2026. As seen on the website's pricing page, DeepSeek V4 Pro prices now range from $0.003625 to $0.87 per one million tokens, compared with the previous range of $0.0145 to $3.48 per one million tokens.


Anthropic says Mythos has already found more than 10,000 vulnerabilities

Engadget

The company has published an update about Project Glasswing, a month after its launch. Anthropic has published an initial report for Project Glasswing, the cybersecurity initiative it launched in April that aims to prevent AI cyberattacks with, well, AI. The initiative is powered by Claude Mythos Preview, the company's unreleased model, which Anthropic says has already helped its partners find more than ten thousand vulnerabilities overall just a month after Glasswing's launch. In addition, it says most of its partners have each found hundreds of critical- or high-severity vulnerabilities in their software using the model. The company said that its partners' rate of bug-finding has increased by more than a factor of ten.


The Download: coding's future, the 'Steroid Olympics,' and AI-driven science

MIT Technology Review

Plus: Trump has postponed an AI order due to overregulation fears. Anthropic's Code with Claude showed off coding's future--whether you like it or not. At Anthropic's developer event in London this week, Code with Claude, attendees were asked if they'd shipped code written entirely by Claude. Almost half the room raised their hands. Many admitted they hadn't even read the code before pushing it live. As tools like Claude Code get better, more and more developers are happy to hand their work off to AI. Anthropic says it wants to push automation as far as it will go. But not everyone is convinced that's the right approach.


Can OpenAI's 'Master of Disaster' Fix AI's Reputation Crisis?

WIRED

Global affairs chief Chris Lehane wants to tone down the debate over AI's societal impacts--and get states to pass laws that won't derail OpenAI's meteoric rise. Three months ago, OpenAI cofounder Greg Brockman told me his concerns about a mounting public relations crisis facing artificial intelligence companies: Despite the popularity of tools like ChatGPT, an increasingly large share of the population said they viewed AI negatively. Since then, the backlash has only intensified. College commencement speakers are now getting booed for talking about AI in optimistic terms. Last month, someone threw a Molotov cocktail at OpenAI CEO Sam Altman's San Francisco home and wrote a manifesto advocating for crimes against AI executives.