Large Language Model
Augmenting Human Evaluation with LLM Judges: How Many Human Reviews Do You Need?
Large language models (LLMs) are increasingly used as automated evaluators of AI systems, including in high-stakes applications. In this role, LLMs are used to generate judgments about the quality, appropriateness, or even safety of model outputs. This approach is motivated by practical constraints. Expert human ratings are costly and difficult to scale, whereas LLM ratings can be produced quickly at low cost. However, current approaches to deploying LLM evaluators are ad hoc, typically limited to reporting agreement metrics between human and LLM judges as a justification for substituting LLM ratings for human ratings, and lack a formal basis for study design. This paper (1) shifts the role of the LLM judge from substitutive to auxiliary, and (2) formulates the LLM-as-a-judge paradigm as one of augmenting human evaluation through a two-stage sampling design, where LLM evaluations are measured for all observations at the first stage and human ratings are partially observed for a subsample at the second stage. We propose to use a doubly robust estimator from the missing data literature, which takes advantage of the robustness property against the prediction model, since the missingness model is known by design. Using the asymptotic variance of this estimator, we show how the sample sizes of human and LLM ratings can be determined to achieve a targeted level of power. We also show that a study can be efficiently designed by allocating more human ratings to types of evaluations where the predictability of LLM ratings is not high. To the best of our knowledge, there is very little guidance on how much human oversight should be retained when validating benchmarks.
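To make the two-stage idea concrete, here is a minimal Python sketch of a doubly robust (augmented inverse-probability-weighted) estimate of the mean human rating. It is an illustration under stated assumptions, not the paper's implementation: the LLM rating itself serves as the prediction of the human rating, every item is sent to human review with the same known probability `pi`, and the function names `dr_mean` and `simulate` and the simulated rating model are hypothetical.

```python
import random
import statistics


def dr_mean(llm_scores, human_scores, sampled, pi):
    """Doubly robust estimate of the mean human rating.

    llm_scores:   LLM rating for every item (assumed prediction of the human rating)
    human_scores: human rating where sampled, None elsewhere
    sampled:      1 if the item received a human rating, else 0
    pi:           known probability of selection for human review (set by design)
    """
    terms = []
    for m, y, r in zip(llm_scores, human_scores, sampled):
        # The LLM prediction is corrected by the inverse-probability-weighted
        # residual, which is only available on the human-rated subsample.
        correction = (y - m) / pi if r else 0.0
        terms.append(m + correction)
    estimate = statistics.fmean(terms)
    std_error = statistics.stdev(terms) / len(terms) ** 0.5
    return estimate, std_error


def simulate(n=20000, pi=0.1, seed=0):
    """Hypothetical data: the LLM judge is noisy and biased upward by 0.5."""
    rng = random.Random(seed)
    llm, human, sampled = [], [], []
    for _ in range(n):
        y = rng.gauss(3.0, 1.0)              # human rating, true mean 3.0
        m = y + 0.5 + rng.gauss(0.0, 0.5)    # LLM rating for the same item
        r = 1 if rng.random() < pi else 0    # second-stage selection
        llm.append(m)
        human.append(y if r else None)
        sampled.append(r)
    return llm, human, sampled


if __name__ == "__main__":
    llm, human, sampled = simulate()
    estimate, std_error = dr_mean(llm, human, sampled, pi=0.1)
    print("LLM-only mean:", statistics.fmean(llm))
    print("doubly robust mean:", estimate, "+/-", std_error)
```

In this sketch the LLM-only mean inherits the judge's bias, while the corrected estimate targets the human mean because the selection probability is known. The standard error shrinks as the LLM ratings track the human ratings more closely, which is the quantity a sample-size calculation would trade off against the number of human reviews.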
Your SaaS Is an Insurance Product: A Modeling Framework
Capped-usage SaaS products -- LLM subscriptions such as Claude Code and ChatGPT, cloud platforms such as Vercel and Cloudflare Workers, corporate benefit platforms, identity-verification services with liability transfer -- share a structural signature with insurance products: a fixed premium decoupled from realized consumption, stochastic per-user demand with heavy-tailed severity, a non-fungible cap that resets on a fixed schedule, and a portfolio-level exposure that requires reserve adequacy under tail risk. We argue that this is not an analogy. It is the same operational problem that actuarial science has spent decades building tools to address, restated with new dependent variables (tokens, bandwidth bytes, function invocations, gym check-ins) in place of medical claims. This paper proposes a modeling framework for capped-usage SaaS pricing built from frequency-severity decomposition, premium calculation principles, and Monte Carlo reserve adequacy. We map the framework to publicly observable subscription tiers in two domains (LLM services and cloud platforms), ground it in canonical health-insurance economics (Arrow 1963; Pauly 1968; Manning et al. 1987; Brot-Goldberg et al. 2017), and demonstrate divergence from traditional unit economics through a worked example. The contribution is operational rather than theoretical: not a new theorem, but vocabulary and tools currently absent from cs.LG/stat.ML practice.
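The frequency-severity decomposition and Monte Carlo reserve check described above can be sketched in a few lines of Python. This is a rough illustration, not the paper's model: the Poisson frequency, lognormal severity, per-period cap, unit cost, and every parameter value are assumptions chosen only for the example, and the function names are hypothetical.

```python
import math
import random
import statistics


def poisson(rng, lam):
    """Draw a Poisson count with Knuth's method (adequate for moderate lam)."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def user_cost(rng, freq, mu, sigma, cap, unit_cost):
    """Cost of one user in one period: capped sum of heavy-tailed usage events."""
    n_events = poisson(rng, freq)                                        # frequency
    usage = sum(rng.lognormvariate(mu, sigma) for _ in range(n_events))  # severity
    return min(usage, cap) * unit_cost                                   # cap limits exposure


def portfolio_costs(n_users, n_trials, seed=0, **params):
    """Simulate total provider cost across the user portfolio, n_trials times."""
    rng = random.Random(seed)
    return [
        sum(user_cost(rng, **params) for _ in range(n_users))
        for _ in range(n_trials)
    ]


def reserve_summary(costs, premium, n_users, level=0.995):
    """Compare premium revenue with mean cost and a tail quantile of cost."""
    ordered = sorted(costs)
    tail_cost = ordered[min(len(ordered) - 1, int(level * len(ordered)))]
    revenue = premium * n_users
    return {
        "mean_cost": statistics.fmean(costs),
        "tail_cost": tail_cost,
        "revenue": revenue,
        "shortfall": max(0.0, tail_cost - revenue),  # reserve needed at this level
    }


if __name__ == "__main__":
    params = dict(freq=20, mu=0.0, sigma=1.5, cap=100.0, unit_cost=0.5)
    costs = portfolio_costs(n_users=100, n_trials=200, **params)
    print(reserve_summary(costs, premium=20.0, n_users=100))
```

Pricing on `mean_cost` alone corresponds to the traditional unit-economics view; the `tail_cost` and `shortfall` entries are the insurance-style quantities, showing how much reserve the portfolio would need to cover a bad period at the chosen confidence level.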
StatQAT: Statistical Quantizer Optimization for Deep Networks
Aktukmak, Mehmet, Huang, Daniel, Ding, Ke
Quantization is essential for reducing the computational cost and memory usage of deep neural networks, enabling efficient inference on low-precision hardware. Despite the growing adoption of uniform and floating-point quantization schemes, selecting optimal quantization parameters remains a key challenge, particularly for diverse data distributions encountered during training and inference. This work presents a novel statistical error analysis framework for uniform and floating-point quantization, providing theoretical insight into error behavior across quantization configurations. Building on this analysis, we propose iterative quantizers designed for arbitrary data distributions and analytic quantizers tailored for Gaussian-like weight distributions. These methods enable efficient, low-error quantization suitable for both activations and weights. We incorporate our quantizers into quantization-aware training and evaluate them across integer and floating-point formats. Experiments demonstrate improved accuracy and stability, highlighting the effectiveness of our approach for training low-precision neural networks.
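As a point of reference for what "selecting quantization parameters" means in practice, the following Python sketch picks the clipping range of a symmetric uniform quantizer by brute-force search over the empirical mean squared error. It is a generic baseline written for illustration, not the iterative or analytic quantizers proposed in the paper; the function names, the candidate grid, and the Gaussian test data are assumptions.

```python
import random


def quantize_uniform(x, clip, bits):
    """Symmetric uniform quantizer: snap x to a signed integer grid over [-clip, clip]."""
    levels = 2 ** (bits - 1) - 1            # e.g. 7 positive levels for 4 bits
    scale = clip / levels
    q = round(x / scale)                    # Python rounds halves to the nearest even integer
    q = max(-levels, min(levels, q))        # saturate values outside the clip range
    return q * scale


def mse(values, clip, bits):
    """Empirical mean squared quantization error for a given clip range."""
    return sum((v - quantize_uniform(v, clip, bits)) ** 2 for v in values) / len(values)


def search_clip(values, bits, steps=50):
    """Try evenly spaced clip ranges up to max|x| and keep the lowest-error one."""
    max_abs = max(abs(v) for v in values)
    candidates = [max_abs * (i + 1) / steps for i in range(steps)]
    return min(candidates, key=lambda c: mse(values, c, bits))


if __name__ == "__main__":
    rng = random.Random(0)
    weights = [rng.gauss(0.0, 1.0) for _ in range(5000)]  # stand-in for Gaussian-like weights
    best = search_clip(weights, bits=4)
    max_abs = max(abs(w) for w in weights)
    print("clip at max|x|: mse =", mse(weights, max_abs, 4))
    print("searched clip", best, ": mse =", mse(weights, best, 4))
```

The search exposes the trade-off any quantizer parameter choice has to balance: a wide clip range wastes resolution on rare outliers, while a narrow one saturates them.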
Continuous Diffusion Scales Competitively with Discrete Diffusion for Language
Yang, Zhihan, Guo, Wei, Zhang, Shuibai, Sahoo, Subham Sekhar, Chen, Yongxin, Vahdat, Arash, Mardani, Morteza, Thickstun, John
While diffusion has drawn considerable recent attention from the language modeling community, continuous diffusion has appeared less scalable than discrete approaches. To challenge this belief we revisit Plaid, a likelihood-based continuous diffusion language model (DLM), and construct RePlaid by aligning the architecture of Plaid with modern discrete DLMs. In this unified setting, we establish the first scaling law for continuous DLMs that rivals discrete DLMs: RePlaid exhibits a compute gap of only $20\times$ compared to autoregressive models, outperforms Duo while using fewer parameters, and outperforms MDLM in the over-trained regime. We benchmark RePlaid against recent continuous DLMs: on OpenWebText, RePlaid achieves a new state-of-the-art PPL bound of $22.1$ among continuous DLMs and superior generation quality. These results suggest that continuous diffusion, when trained via likelihood, is a highly competitive and scalable alternative to discrete DLMs. Moreover, we offer theoretical insights to understand the advantage of likelihood-based training. We show that optimizing the noise schedule to minimize the ELBO's variance naturally yields linear cross-entropy (information loss) over time. This evenly distributes denoising difficulty without any case-specific time reparameterization. In addition, we find that optimizing embeddings via likelihood creates structured geometries and drives the most significant likelihood gain.
Probing for Representation Manifolds in Superposition
This paper introduces the Manifold Probe, a supervised method for discovering representation manifolds in superposition. The method generalizes linear regression probes by learning the space of features of a concept that can be linearly predicted from the representations, and then learning the directions used to encode them. We demonstrate the probe on representations of time and space in Llama 2-7b, finding manifolds which linearly represent an interpretable set of features in each case. In the case of time, we show that by steering along the manifold, we can influence the model's completions about the years in which famous songs, movies and books were released, providing evidence that the Manifold Probe can discover manifolds which are causally involved in model behaviour.
How Sam Altman's victory over Elon Musk clears way for OpenAI's trillion-dollar ambitions
OpenAI's plans now seem all but guaranteed, given that the world's richest man couldn't put a stop to them. On Monday morning, a jury in Oakland, California, handed a resounding victory to Sam Altman and OpenAI in their long, bitter courtroom battle with Elon Musk. The federal jury found Altman, OpenAI and its president, Greg Brockman, not liable for Elon Musk's claims that they unjustly enriched themselves and broke a founding contract made with Musk when founding the startup. The unanimous verdict, delivered after less than two hours of deliberation, is a stark rebuke of Musk and his lawyer's claims that Altman "stole a charity" through his leadership of OpenAI.
Jury hands victory to Sam Altman and OpenAI in battle with Elon Musk
The federal jury in Oakland, California, found Altman, OpenAI and its president, Greg Brockman, not liable for Elon Musk's claims that they unjustly enriched themselves and broke a founding contract made with Musk when founding the startup. The verdict, delivered after less than two hours of deliberation, is a stark rebuke of Musk and his lawyer's claims that Altman "stole a charity" through his leadership of OpenAI. It also provides the AI firm with a clear path ahead to pursue going public later this year at about a $1tn valuation. The jury's finding is a non-binding, advisory verdict that left Judge Yvonne Gonzalez Rogers with ultimate power to issue her own ruling in the case. Gonzalez Rogers immediately said that she would agree with the jury's decision and dismissed Musk's claims.
Google just made big changes to Gemini usage limits
Google is switching Gemini from fixed daily request limits to a compute-based system that considers prompt complexity, features used, and chat length. PCWorld reports this change reflects increasing demands from advanced agentic AI features, with limits refreshing every five hours under a weekly cap. Paid plans offer significantly higher usage allowances, with Ultra providing 20x standard limits compared to free tier access. Google is changing how it calculates your weekly Gemini usage limits, and it's another reflection of how powerful agentic AI features have broken flat-rate consumer AI plans. As of now, Google says it's switching to "compute-based" usage rather than a fixed number of requests per day.
Jury tosses Elon Musk's lawsuit against OpenAI and its boss Sam Altman
A California jury has tossed out Elon Musk's high-profile lawsuit against OpenAI and its boss Sam Altman. In a unanimous verdict, the case was thrown out because Musk had filed his lawsuit after a statute of limitations to bring such claims had expired. Musk had accused Altman of breaching a non-profit contract by shifting the ChatGPT-maker to a for-profit company after Musk donated $38m (£28.5m). Musk had argued Altman deceived him by accepting his money and then reneging on OpenAI's original non-profit mission to develop artificial intelligence (AI) technology for the benefit of humanity. Jurors spent three weeks viewing internal correspondence and hearing testimony, and arrived at a verdict on Monday after deliberating for roughly two hours.
Elon Musk loses US lawsuit against OpenAI
A United States jury has ruled against Elon Musk in his lawsuit against OpenAI, finding the artificial intelligence (AI) company not liable to the world's richest person for having allegedly strayed from its original mission to benefit humanity. In a unanimous verdict on Monday, the jury in the US federal court in Oakland, California, said Musk had brought his case too late. Following the verdict, Musk's lawyer said he reserved the right to appeal, but the judge suggested he may have an uphill battle because whether the statute of limitations ran out before Musk sued was a factual issue. "There's a substantial amount of evidence to support the jury's finding, which is why I was prepared to dismiss on the spot," US District Judge Yvonne Gonzalez Rogers said. Musk was a co-founder of OpenAI, the company that launched in 2015 and went on to create ChatGPT.