Goto

Collaborating Authors

 softmax


Rethinking Approximate Gaussian Inference in Classification

Neural Information Processing Systems

In classification tasks, softmax functions are ubiquitously used as output activations to produce predictive probabilities. Such outputs only capture aleatoric uncertainty. To capture epistemic uncertainty, approximate Gaussian inference methods have been proposed. We develop a common formalism to describe such methods, which we view as outputting Gaussian distributions over the logit space. Predictives are then obtained as the expectations of the Gaussian distributions pushed forward through the softmax.


CAT: Circular-Convolutional Attention for Sub-Quadratic Transformers Yoshihiro Yamada Preferred Networks yyamada@preferred.jp

Neural Information Processing Systems

Transformers have driven remarkable breakthroughs in natural language processing 2and computer vision, yet their standard attention mechanism still imposes O(N) complexity, hindering scalability to longer sequences. We introduce Circularconvolutional ATtention (CAT), a Fourier-based approach that efficiently applies circular power. CA con T volutions achieves to O reduce (N log comple N) computations, xity without requires sacrificing fewer representational learnable parameters by streamlining fully connected layers, and introduces no additional heavy operations, resulting in consistent accuracy improvements and about a 10% speedup in naive PyTorch implementations. Based on the Engineering-Isomorphic Transformers (EITs) framework, CAT's design not only offers practical efficiency and ease of implementation, but also provides insights to guide the development of


Mind the Gap Removing the Gap in Differentiable Logic Gate Networks

Neural Information Processing Systems

Modern neural networks exhibit state-of-the-art performance on many existing benchmarks, but their high computational requirements and energy usage cause researchers to explore more efficient solutions for real-world deployment. Differentiable logic gate networks (DLGNs) learns a large network of logic gates for efficient image classification. However, learning a network that can solve simple problems like CIFAR-10 or CIFAR-100 can take days to weeks to train. Even then, almost half of the neurons remains unused, causing a discretization gap. This discretization gap hinders real-world deployment of DLGNs, as the performance drop between training and inference negatively impacts accuracy. We inject Gumbel noise with a straight-through estimator during training to significantly speed up training, improve neuron utilization, and decrease the discretization gap. We theoretically show that this results from implicit Hessian regularization, which improves the convergence properties of DLGNs. We train networks 4.5 faster in wall-clock time, reduce


Attention Mechanism, Max-Affine Partition, and Universal Approximation

Neural Information Processing Systems

We establish the universal approximation capability of single-layer, single-head self-and cross-attention mechanisms with minimal attached structures. Our key insight is to interpret single-head attention as an input domain-partition mechanism that assigns distinct values to subregions. This allows us to engineer the attention weights such that this assignment imitates the target function. Building on this, we prove that a single self-attention layer, preceded by sum-of-linear transformations, is capable of approximating any continuous function on a compact domain under the L -norm. Furthermore, we extend this construction to approximate any Lebesgue integrable function under Lp-norm for 1 p < . Lastly, we also extend our techniques and show that, for the first time, single-head cross-attention achieves the same universal approximation guarantees.


Limitations of Normalization in Attention Mechanism

Neural Information Processing Systems

This paper investigates the limitations of the normalization in attention mechanisms. We begin with a theoretical framework that enables the identification of the model's selective ability and the geometric separation involved in token selection. Our analysis includes explicit bounds on distances and separation criteria for token vectors under softmax scaling. Through experiments with pre-trained GPT-2 model, we empirically validate our theoretical results and analyze key behaviors of the attention mechanism. Notably, we demonstrate that as the number of selected tokens increases, the model's ability to distinguish informative tokens declines, often converging toward a uniform selection pattern. We also show that gradient sensitivity under softmax normalization presents challenges during training, especially at low temperature settings. These findings advance current understanding of softmax-based attention mechanism and motivate the need for more robust normalization and selection strategies in future attention architectures.



TreeGen: ABayesian Generative Model for Hierarchies

Neural Information Processing Systems

In this work, we introduce TreeGen, a novel generative framework modeling distributions over hierarchies. We extend Bayesian Flow Networks (BFNs) to enable transitions between probabilistic and discrete hierarchies parametrized via categorical distributions. Our proposed scheduler provides smooth and consistent entropy decay across varying numbers of categories. We empirically evaluate TreeGen on the jet-clustering task in high-energy physics, demonstrating that it consistently generates valid trees that adhere to physical constraints and closely align with ground-truth log-likelihoods. Finally, by comparing TreeGen's samples to the exact posterior distribution and performing likelihood maximization via rejection sampling, we demonstrate that TreeGen outperforms various baselines.


Value-Guided KVCompression for LLMs via Approximated CURDecomposition

Neural Information Processing Systems

Key-value (KV) cache compression has emerged as a critical technique for reducing the memory and latency overhead of autoregressive language models during inference. Prior approaches predominantly rely on query-key attention scores to rank and evict cached tokens, assuming that attention intensity correlates with semantic importance. However, this heuristic overlooks the contribution of value vectors, which directly influence the attention output. In this paper, we propose CurDKV, a novel, value-centric KV compression method that selects keys and values based on leverage scores computed from CUR matrix decomposition. Our approach approximates the dominant subspace of the attention output softmax(QK)V, ensuring that the retained tokens best preserve the model's predictive behavior. Theoretically, we show that attention score approximation does not guarantee output preservation, and demonstrate that CUR-based selection minimizes end-to-end attention reconstruction loss.


Analyzing the Power of Chain of Thought through Memorization Capabilities

Neural Information Processing Systems

It has been shown that the chain of thought (CoT) can enhance the power of large language models (LLMs) to solve certain mathematical reasoning problems. However, the capacity of CoT is still not fully explored. As an important instance, the following basic question has not yet been answered: Does CoT expand the capability of transformers across all reasoning tasks? We demonstrate that reasoning with transformers is essentially a memorization problem for reasoning datasets.


Causal LLMRouting: End-to-End Regret Minimization from Observational Data

Neural Information Processing Systems

LLM routing aims to select the most appropriate model for each query, balancing competing performance metrics such as accuracy and cost across a pool of language models. Prior approaches typically adopt a decoupled strategy, where metrics are first predicted and the model is then selected based on these estimates. This setup is prone to compounding errors and often relies on full-feedback data, where each query is evaluated by all candidate models, which is costly to obtain and maintain in practice. In contrast, we learn from observational data, which records only the outcome of the model actually deployed. We propose a causal end-to-end framework that learns routing policies by minimizing decision-making regret from observational data. To enable efficient optimization, we introduce two theoretically grounded surrogate objectives: a classification-based upper bound, and a softmaxweighted regret approximation shown to recover the optimal policy at convergence. We further extend our framework to handle heterogeneous cost preferences via an interval-conditioned architecture. Experiments on public benchmarks show that our method outperforms existing baselines, achieving state-of-the-art performance across different embedding models.