Large Language Model
Feature Starvation as Geometric Instability in Sparse Autoencoders
Chaudhry, Faris, Yano, Keisuke, Monod, Anthea
Sparse autoencoders (SAEs) are used to disentangle the dense, polysemantic internal representations of large language models (LLMs) into interpretable, monosemantic concepts. However, standard $\ell_1$-regularized SAEs suffer from feature starvation (dead neurons) and shrinkage bias, often requiring computationally expensive heuristic resampling and nondifferentiable hard-masking methods to bypass these challenges. We argue that feature starvation is not merely an empirical artifact of poor data diversity, but a fundamental optimization-geometric pathology of overcomplete dictionaries: the $\ell_1$-induced sparse coding map is unstable and fundamentally misaligned with shallow, amortized encoders. To address this structural instability, we introduce adaptive elastic net SAEs (AEN-SAEs), a fully differentiable architecture grounded in classical sparse regression. AEN-SAEs combine an $\ell_2$ structural term that enforces strong convexity and Lipschitz stability with adaptive $\ell_1$ reweighting that eliminates shrinkage bias and suppresses spurious features, thereby jointly controlling the curvature and interaction structure of the induced polyhedral geometry. Theoretically, we show that AEN-SAEs yield a Lipschitz-continuous sparse coding map and recover the global feature support under mild assumptions. Empirically, across synthetic settings and LLMs (Pythia 70M, Llama 3.1 8B), AEN-SAEs mitigate feature starvation without auxiliary heuristics while maintaining competitive reconstruction abilities.
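To make the regularizer concrete, here is a minimal sketch of the classical adaptive elastic net sparse coding problem, solved by proximal gradient descent (ISTA). It is not the AEN-SAE architecture itself, which uses an amortized encoder. The function names, the pilot-based weighting rule, and all hyperparameter values are assumptions chosen for illustration.

```python
import numpy as np

def soft_threshold(v, thresh):
    # Proximal operator of the (weighted) l1 norm.
    return np.sign(v) * np.maximum(np.abs(v) - thresh, 0.0)

def adaptive_weights(z_pilot, gamma=1.0, eps=1e-3):
    # Classical adaptive-lasso style weights: features with a large pilot
    # estimate are penalized less. (Hypothetical choice of pilot and eps.)
    return 1.0 / (np.abs(z_pilot) + eps) ** gamma

def aen_sparse_code(x, D, weights, lam1=0.1, lam2=0.1, n_iter=500):
    # Minimizes 0.5*||x - D z||^2 + lam1 * sum_i w_i |z_i| + 0.5*lam2*||z||^2.
    # The lam2 term makes the objective strongly convex, so the minimizer is unique.
    step = 1.0 / (np.linalg.norm(D, 2) ** 2)  # 1 / Lipschitz constant of the smooth part
    z = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ z - x)
        v = z - step * grad
        # Prox of the combined penalty: soft-threshold, then shrink by the l2 term.
        z = soft_threshold(v, step * lam1 * weights) / (1.0 + step * lam2)
    return z
```

With an identity dictionary the minimizer has the closed form `soft_threshold(x, lam1 * w) / (1 + lam2)`, which is a convenient sanity check for the solver.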
In-Context Positive-Unlabeled Learning
Liu, Siyan, Chang, Yi, Cheng, Manli, Tian, Qinglong, Li, Pengfei
Positive-unlabeled (PU) learning addresses binary classification when only a set of labeled positives is available alongside a pool of unlabeled samples drawn from a mixture of positives and negatives. Existing PU methods typically require dataset-specific training or iterative optimization, which limits their applicability when many tasks must be solved quickly or with little tuning. We introduce PUICL, a pretrained transformer that solves PU classification entirely through in-context learning. PUICL is pretrained on synthetic PU datasets generated from randomly instantiated structural causal models, exposing it to a wide range of feature-label relationships and class-prior configurations. At inference time, PUICL receives the labeled positives and the unlabeled samples as a single input and returns class probabilities for the unlabeled rows in one forward pass, with no gradient updates or per-task fitting. On 20 semi-synthetic PU benchmarks derived from the UCI Machine Learning Repository, OpenML, and scikit-learn, PUICL outperforms four standard PU learning baselines in average AUC and accuracy, and is competitive on F1-score. These results show that the in-context learning paradigm extends naturally beyond fully supervised tabular prediction to the semi-supervised PU setting.
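As an illustration of the PU setting described above, the following sketch turns a fully labeled binary dataset into a semi-synthetic PU task by revealing only a fraction of the positives. It is a generic construction, not the PUICL benchmark pipeline; the function name and the `labeled_frac` parameter are assumptions.

```python
import numpy as np

def make_pu_split(X, y, labeled_frac=0.3, seed=0):
    # Reveal a random subset of the positives; everything else becomes unlabeled.
    rng = np.random.default_rng(seed)
    pos_idx = np.flatnonzero(y == 1)
    n_labeled = max(1, int(round(labeled_frac * pos_idx.size)))
    labeled = rng.choice(pos_idx, size=n_labeled, replace=False)
    mask = np.zeros(y.shape[0], dtype=bool)
    mask[labeled] = True
    # Returns labeled positives, the unlabeled pool, and the pool's hidden
    # labels (used for evaluation only, never shown to the PU learner).
    return X[mask], X[~mask], y[~mask]
```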
Spectral Lens: Activation and Gradient Spectra as Diagnostics of LLM Optimization
Liu, Andy Zeyi, Paquette, Elliot, Sous, John
Training loss and throughput can hide distinct internal representations in language-model training. To examine these hidden mechanics, we use spectral measurements as practical, operational diagnostics. Using a controlled family of decoder-only models adapted from the modded NanoGPT codebase, we introduce an empirical protocol based on activation covariance and per-sample gradient SVD spectra. This dual view yields three empirical findings and one mechanistic explanation. First, batch size acts as a latent determinant of representation geometry: runs that reach equal loss settle into systematically distinct activation spectra. Second, the activation covariance tail measured early in training reliably forecasts downstream token efficiency. Third, movement of the activation spectrum head (leading modes), together with gradient spectra, characterizes underlying learning-dynamics changes, separating learning-side architectural improvements from primarily execution-side gains. These predictive and diagnostic signals persist across the 12-, 36-, and 48-layer model tiers. Finally, a mechanistic model proves the main observations and explains how activation covariance spectra correlate with task-aligned feature learning.
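The two spectra named in the abstract can be computed along the following lines. This is a minimal sketch under assumed array layouts (activations as a tokens-by-features matrix, per-sample gradients flattened into rows); the function names are hypothetical and the paper's exact protocol may differ.

```python
import numpy as np

def activation_covariance_spectrum(acts):
    # acts: (n_tokens, d_model) activations collected at one layer.
    centered = acts - acts.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / (acts.shape[0] - 1)
    # eigvalsh returns ascending eigenvalues; reverse so the head comes first.
    eigvals = np.linalg.eigvalsh(cov)[::-1]
    return np.clip(eigvals, 0.0, None)  # remove tiny negative round-off

def per_sample_gradient_spectrum(grads):
    # grads: (n_samples, n_params), one flattened gradient per sample.
    # Singular values are returned in descending order.
    return np.linalg.svd(grads, compute_uv=False)
```

The leading entries of the first spectrum correspond to the "head" and the trailing entries to the "tail" discussed above.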
CITE: Anytime-Valid Statistical Inference in LLM Self-Consistency
Ota, Hirofumi, Iwase, Naoto, Ichihara, Yuki, Komiyama, Junpei, Imaizumi, Masaaki
Large language models often improve reasoning by sampling multiple outputs and aggregating their final answers, but precise and efficient control of error levels remains a challenging task. In particular, deciding when to stop sampling remains difficult when the stopping rule is data-dependent and the set of possible response labels is not known in advance. We study anytime-valid certification of a prespecified target answer as the unique mode of the model's response distribution, a guarantee distinct from answer correctness. We propose the Certification by Intersection-union Testing with E-processes (CITE) algorithm, which provably controls false certification at any prescribed level under arbitrary data-driven stopping, without requiring prior knowledge of the answer category set. We also prove a category-set-size-free stopping-time rate, establish matching minimax lower bounds up to constants in the main regime, and extend the construction to confidence-weighted voting. Simulations and LLM self-consistency experiments show empirical error control and improved certification in diffuse-tail settings.
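The idea of combining e-processes with an intersection-union test can be sketched with a simple fixed-bet construction. This is an illustrative simplification, not the CITE algorithm: the constant bet size, the function name, and the stopping interface are assumptions. For each rival label, the wealth is multiplied by `1 + bet` when the target is drawn and by `1 - bet` when that rival is drawn. Under the null that the rival is at least as likely as the target, this wealth is a nonnegative supermartingale, so by Ville's inequality the minimum wealth over rivals crosses `1 / alpha` with probability at most `alpha`.

```python
import math
from collections import Counter

def certify_mode(samples, target, alpha=0.05, bet=0.5):
    # Returns the number of samples drawn when the target is certified as the
    # unique mode, or None if the stream ends without certification.
    counts = Counter()
    threshold = math.log(1.0 / alpha)
    for t, label in enumerate(samples, start=1):
        counts[label] += 1
        n_target = counts[target]
        # The smallest pairwise wealth belongs to the most frequent rival;
        # a label never seen so far behaves like a rival with count zero.
        n_rival = max((c for lab, c in counts.items() if lab != target), default=0)
        log_wealth = n_target * math.log1p(bet) + n_rival * math.log1p(-bet)
        if log_wealth >= threshold:
            return t
    return None
```

Because only the target count and the largest rival count enter the statistic, the sketch needs no prior list of possible answers.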
Towards Reliable LLM Evaluation: Correcting the Winner's Curse in Adaptive Benchmarking
Xu, Yang, Zhang, Jiefu, Sun, Haixiang, Zhou, Zihan, Cao, Tianyu, Aggarwal, Vaneet
Adaptive prompt and program search makes LLM evaluation selection-sensitive. Once benchmark items are reused inside tuning, the observed winner's score need not estimate the fresh-data performance of the full tune-then-deploy procedure. We study inference for this procedure-level target under explicit tuning budgets. We propose SIREN, a selection-aware repeated-split reporting protocol that freezes the post-search shortlist, separates splitwise selection from held-out evaluation, and uses an item-level Gaussian multiplier bootstrap for uncertainty quantification. In a fixed-shortlist regime with smooth stabilized selection, the estimator admits a first-order item-level representation, and the bootstrap yields valid simultaneous inference on a finite budget grid. This supports confidence intervals for procedure-performance curves and pre-specified equal-budget and cross-budget comparisons. Controlled simulations and MMLU-Pro tuning experiments show that winner-based reporting can be optimistic and can change deployment conclusions, while SIREN remains close to the finite-sample reporting target. Code is available at https://github.com/jznmsl/siren.
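An item-level Gaussian multiplier bootstrap for simultaneous intervals over a budget grid can be sketched as follows. This is a generic textbook-style construction, not the SIREN implementation (see the linked repository for that). The input layout, one row per benchmark item and one column per budget, and the function name are assumptions.

```python
import numpy as np

def multiplier_bootstrap_band(scores, alpha=0.05, n_boot=2000, seed=0):
    # scores: (n_items, n_budgets) item-level held-out scores.
    # Assumes every column has nonzero variance.
    rng = np.random.default_rng(seed)
    n = scores.shape[0]
    means = scores.mean(axis=0)
    sds = scores.std(axis=0, ddof=1)
    resid = (scores - means) / sds            # standardized item-level residuals
    xi = rng.standard_normal((n_boot, n))     # one Gaussian multiplier per item
    # Sup-statistic over the budget grid for each bootstrap draw.
    stats = np.abs(xi @ resid / np.sqrt(n)).max(axis=1)
    crit = np.quantile(stats, 1.0 - alpha)
    half = crit * sds / np.sqrt(n)
    return means - half, means + half, crit
```

Using one critical value for all budgets is what makes the band simultaneous, which is why it is wider than a set of per-budget intervals.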
Attributions All the Way Down? The Metagame of Interpretability
Baniecki, Hubert, Biecek, Przemyslaw, Fumagalli, Fabian
We introduce the metagame, a conceptual framework for quantifying second-order interaction effects of model explanations. For any first-order attribution $\phi(f)$ explaining a model $f$, we measure the directional influence of feature $j$ on the attribution of feature $i$, denoted as meta-attribution $\varphi_{j \to i}(f)$, by treating the attribution method itself as a cooperative game and computing its Shapley value. Theoretically, we prove that attributions hierarchically decompose into meta-attributions, and establish these as directional extensions of existing interaction indices. Empirically, we demonstrate that the metagame delivers insights across diverse interpretability applications: (i) quantifying token interactions in instruction-tuned language models, (ii) explaining cross-modal similarity in vision-language encoders, and (iii) interpreting text-to-image concepts in multimodal diffusion transformers.
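The building block here is the Shapley value of a cooperative game. The sketch below computes it exactly by enumerating coalitions, which is only feasible for a handful of players. It shows the standard Shapley formula and not the metagame construction itself; the function name and the representation of coalitions as frozensets are assumptions.

```python
from itertools import combinations
from math import factorial

def shapley_values(value_fn, n_players):
    # value_fn maps a frozenset of player indices to a real-valued payoff.
    players = list(range(n_players))
    phi = [0.0] * n_players
    for i in players:
        others = [p for p in players if p != i]
        for size in range(len(others) + 1):
            # Probability that a coalition of this size precedes player i
            # in a uniformly random ordering of all players.
            weight = factorial(size) * factorial(n_players - size - 1) / factorial(n_players)
            for subset in combinations(others, size):
                coalition = frozenset(subset)
                phi[i] += weight * (value_fn(coalition | {i}) - value_fn(coalition))
    return phi
```

In a toy game where players 0 and 1 contribute 1 and 3 on their own plus a joint bonus of 2, the bonus is split evenly between them and an idle third player receives zero.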
The Structural Origin of Attention Sink: Variance Discrepancy, Super Neurons, and Dimension Disparity
Li, Siquan, Jiang, Kaiqi, Sun, Jiacheng, Hu, Tianyang
Despite the prevalence of the attention sink phenomenon in Large Language Models (LLMs), where initial tokens disproportionately monopolize attention scores, its structural origins remain elusive. This work provides a mechanistic explanation for this phenomenon. First, we trace its root to the value aggregation process inherent in self-attention, which induces a systematic variance discrepancy. We further demonstrate that this discrepancy is drastically amplified by the activation of super neurons within Feed-Forward Network (FFN) layers. Specifically, the channel-sparse down-projections trigger a dimension disparity of the first-token representation, necessitating the formation of attention sinks as a structural anchor. Then, we validate this causal chain through two controlled interventions: (i) isolating the aggregation effect via attention mask modifications and (ii) amplifying the variance of targeted token representations. Both interventions can replicate attention sinks at arbitrary positions. Our mechanistic understanding offers a foundation for the systematic control of sink formation. Finally, as a proof of concept, we propose head-wise RMSNorm, an architectural modification that stabilizes value aggregation outputs during pre-training. Our experiments demonstrate that restoring statistical parity across positions significantly accelerates convergence.
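One plausible reading of head-wise RMSNorm is an RMS normalization applied separately to each head's value-aggregation output. The sketch below follows that reading. The tensor layout, the per-head gain shape, and the placement of the normalization are assumptions, since the abstract does not specify them.

```python
import numpy as np

def headwise_rmsnorm(x, gain, eps=1e-6):
    # x: (batch, n_heads, seq_len, head_dim) attention-weighted value outputs.
    # gain: (n_heads, head_dim) learnable scale, one vector per head (assumed).
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    # Each position in each head is rescaled to unit RMS, so a position with
    # unusually large variance (such as the first token) is brought to parity.
    return x / rms * gain[None, :, None, :]
```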
OpenAI debuts a Codex plugin for Chrome
Coding is one of the leading applications of artificial intelligence tools, and OpenAI continues to expand its offerings in that space. The company has launched a Chrome extension for its Codex platform. The plugin's new browser-based capabilities include testing web apps, collecting context from across open tabs, and using Chrome DevTools in parallel while the user performs other tasks. Because so many computing tasks happen in browsers, the extension could also make Codex more appealing to casual users and to professions beyond developers. Codex can now take on more of your browser dev work.
ChatGPT Has 'Goblin' Mania in the US. In China It Will 'Catch You Steadily'
OpenAI's chatbot has some weird linguistic tics in Chinese that are driving users crazy. Are you even online in 2026 if you haven't experienced the verbal tics of ChatGPT? It loves goblins, em dashes, and "it's not A; it's B" sentence constructions. But what you might not know is that the chatbot also has plenty of strange phrases it loves to say in Chinese, and they are driving Chinese users crazy. ChatGPT does a decent job answering questions in Chinese, which is why it's widely used in China despite being blocked by the government.
This 'anti-goal' prompt trick keeps ChatGPT from going rogue
A simple prompt structure using XML tags can stop ChatGPT, Claude, and Gemini from doing things you never asked for. All too often, ChatGPT, Claude, and Gemini overstep their instructions because they're so focused on making you happy. For example, an AI may jump ahead and completely rewrite a document when all you wanted was some focused feedback, or it may draft a brand-new recipe when you just wanted help substituting an ingredient. You might think the solution is to tell the AI chatbot what to do in your prompt.