Goto

Collaborating Authors

 Large Language Model


Multi-LLM Adaptive Conformal Inference for Reliable LLM Responses

arXiv.org Machine Learning

Ensuring factuality is essential for the safe use of Large Language Models (LLMs) in high-stakes domains such as medicine and law. Conformal inference provides distribution-free guarantees, but existing approaches are either overly conservative, discarding many true-claims, or rely on adaptive error rates and simple linear models that fail to capture complex group structures. To address these challenges, we reformulate conformal inference in a multiplicative filtering setting, modeling factuality as a product of claim-level scores. Our method, Multi-LLM Adaptive Conformal Inference (MACI), leverages ensembles to produce more accurate factuality-scores, which in our experiments led to higher retention, while validity is preserved through group-conditional calibration. Experiments show that MACI consistently achieves user-specified coverage with substantially higher retention and lower time cost than baselines. Our repository is available at https://github.com/MLAI-Yonsei/MACI


Hessian Spectral Analysis at Foundation Model Scale

arXiv.org Machine Learning

Accurate Hessian spectra of foundation models have remained out of reach, leading most prior work to rely on small models or strong structural approximations. We show that faithful spectral analysis of the true Hessian is tractable at frontier scale. Using shard-local finite-difference Hessian vector products compatible with Fully Sharded Data Parallelism, we perform stochastic Lanczos quadrature on open-source language models with up to 100B parameters, producing the first large-scale spectral density estimates beyond the sub-10B regime. We characterize the numerical behavior of this pipeline, including finite-difference bias, floating-point noise amplification, and their effect on Krylov stability in fp32 and bf16, and derive practical operating regimes that are validated empirically. We further provide end-to-end runtime and memory scaling laws, showing that full-operator spectral probing incurs only a modest constant-factor overhead over first-order training. Crucially, direct access to the Hessian reveals that widely used block-diagonal curvature approximations can fail catastrophically, exhibiting order-one relative error and poor directional alignment even in mid-scale LLMs. Together, our results demonstrate that foundation-model Hessian spectra are both computable and qualitatively misrepresented by prevailing approximations, opening the door to principled curvature-based analysis at scale.


Predicting and improving test-time scaling laws via reward tail-guided search

arXiv.org Machine Learning

Test-time scaling has emerged as a critical avenue for enhancing the reasoning capabilities of Large Language Models (LLMs). Though the straight-forward ''best-of-$N$'' (BoN) strategy has already demonstrated significant improvements in performance, it lacks principled guidance on the choice of $N$, budget allocation, and multi-stage decision-making, thereby leaving substantial room for optimization. While many works have explored such optimization, rigorous theoretical guarantees remain limited. In this work, we propose new methodologies to predict and improve scaling properties via tail-guided search. By estimating the tail distribution of rewards, our method predicts the scaling law of LLMs without the need for exhaustive evaluations. Leveraging this prediction tool, we introduce Scaling-Law Guided (SLG) Search, a new test-time algorithm that dynamically allocates compute to identify and exploit intermediate states with the highest predicted potential. We theoretically prove that SLG achieves vanishing regret compared to perfect-information oracles, and achieves expected rewards that would otherwise require a polynomially larger compute budget required when using BoN. Empirically, we validate our framework across different LLMs and reward models, confirming that tail-guided allocation consistently achieves higher reward yields than Best-of-$N$ under identical compute budgets. Our code is available at https://github.com/PotatoJnny/Scaling-Law-Guided-search.


Universal Redundancies in Time Series Foundation Models

arXiv.org Machine Learning

Time Series Foundation Models (TSFMs) leverage extensive pretraining to accurately predict unseen time series during inference, without the need for task-specific fine-tuning. Through large-scale evaluations on standard benchmarks, we find that leading transformer-based TSFMs exhibit redundant components in their intermediate layers. We introduce a set of tools for mechanistic interpretability of TSFMs, including ablations of specific components and direct logit attribution on the residual stream. Our findings are consistent across several leading TSFMs with diverse architectures, and across a diverse set of real-world and synthetic time-series datasets. We discover that all models in our study are robust to ablations of entire layers. Furthermore, we develop a theoretical framework framing transformers as kernel regressors, motivating a purely intrinsic strategy for ablating heads based on the stable rank of the per-head projection matrices. Using this approach, we uncover the specific heads responsible for degenerate phenomena widely observed in TSFMs, such as parroting of motifs from the context and seasonality bias. Our study sheds light on the universal properties of this emerging class of architectures for continuous-time sequence modeling.


OpenAI brings its Codex coding app to Mac, with new multi-agent abilities included

Engadget

Codex can now manage several AI assistants to complete complex tasks. Since last spring, OpenAI has offered Codex . What started life as the company's response to Claude Code is becoming something more sophisticated with the release of a new dedicated macOS app. At its most basic form, Codex is a programming agent capable of writing code for users, but now it can also manage multiple AI assistants that can work together to complete more complex tasks. OpenAI gives an example of how this could work in practice.


Do You Feel the AGI Yet?

The Atlantic - Technology

Do You Feel the AGI Yet? According to some predictions, 2026 is the year that an all-powerful AI will arrive. H undreds of billions of dollars have been poured into the AI industry in pursuit of a loosely defined goal: artificial general intelligence, a system powerful enough to perform at least as well as a human at any task that involves thinking. Will this be the year it finally arrives? Anthropic CEO Dario Amodei and xAI CEO Elon Musk think so.


It's Causing People to Lose Jobs, Shatter Relationships, and Drain Their Savings. One Support Group Is Sounding the Alarm.

Slate

A.I.-related psychosis has cost people their marriages, life savings, and grip on reality. Last August, Adam Thomas found himself wandering the dunes of Christmas Valley, Oregon, after a chatbot kept suggesting he mystically "follow the pattern" of his own consciousness. Thomas was running on very little sleep--he'd been talking to his chatbot around the clock for months by that point, asking it to help improve his life. Instead it sent him on empty assignments, like meandering the vacuous desert sprawl. He'd lost his job as a funeral director and was living out of a van, draining his savings, and now he found himself stranded in the desert. When he woke up outside on a stranger's futon with no money to his name, he knew he'd hit rock bottom. "I wasn't aware of the dangers at the time, and I thought that the A.I. had statistical analysis abilities that would allow it to assist me if I opened up about my life," Thomas told me.


Skip AI roulette and compare top models instantly with this lifetime plan for a flat 79

PCWorld

When you purchase through links in our articles, we may earn a small commission. Pay $79 once to compare GPT, Claude, Gemini, and 25+ AI models side by side -- unlimited prompts, no subscriptions. If you've ever copied the same prompt into ChatGPT, Claude, Gemini, and three other tools just to see which one nails it, congratulations -- you've already done the hard part. ChatPlayground AI simply removes the busywork. Instead of juggling tabs and second-guessing outputs, it brings dozens of top AI models into a single clean interface and lets them all answer at once. For a limited time, a lifetime subscription is on sale for $79 (MSRP $619).


Large Language Models: A Mathematical Formulation

arXiv.org Machine Learning

Large language models (LLMs) process and predict sequences containing text to answer questions, and address tasks including document summarization, providing recommendations, writing software and solving quantitative problems. We provide a mathematical framework for LLMs by describing the encoding of text sequences into sequences of tokens, defining the architecture for next-token prediction models, explaining how these models are learned from data, and demonstrating how they are deployed to address a variety of tasks. The mathematical sophistication required to understand this material is not high, and relies on straightforward ideas from information theory, probability and optimization. Nonetheless, the combination of ideas resting on these different components from the mathematical sciences yields a complex algorithmic structure; and this algorithmic structure has demonstrated remarkable empirical successes. The mathematical framework established here provides a platform from which it is possible to formulate and address questions concerning the accuracy, efficiency and robustness of the algorithms that constitute LLMs. The framework also suggests directions for development of modified and new methodologies.


Dependence-Aware Label Aggregation for LLM-as-a-Judge via Ising Models

arXiv.org Machine Learning

Large-scale AI evaluation increasingly relies on aggregating binary judgments from $K$ annotators, including LLMs used as judges. Most classical methods, e.g., Dawid-Skene or (weighted) majority voting, assume annotators are conditionally independent given the true label $Y\in\{0,1\}$, an assumption often violated by LLM judges due to shared data, architectures, prompts, and failure modes. Ignoring such dependencies can yield miscalibrated posteriors and even confidently incorrect predictions. We study label aggregation through a hierarchy of dependence-aware models based on Ising graphical models and latent factors. For class-dependent Ising models, the Bayes log-odds is generally quadratic in votes; for class-independent couplings, it reduces to a linear weighted vote with correlation-adjusted parameters. We present finite-$K$ examples showing that methods based on conditional independence can flip the Bayes label despite matching per-annotator marginals. We prove separation results demonstrating that these methods remain strictly suboptimal as the number of judges grows, incurring nonvanishing excess risk under latent factors. Finally, we evaluate the proposed method on three real-world datasets, demonstrating improved performance over the classical baselines.