
Collaborating Authors


Neural Information Processing Systems

A central challenge in training classification models in real-world federated systems is learning with non-IID data. To cope with this, most existing works either enforce regularization in local optimization or improve the model-aggregation scheme at the server.
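As a concrete illustration of the local-regularization strategy mentioned above, here is a minimal sketch in the style of FedProx (an example of the general idea, not necessarily this paper's method): the client's task loss is augmented with a proximal term penalizing drift from the last global model. The names `fedprox_loss`, `global_params`, and the coefficient `mu` are illustrative assumptions.

```python
import torch

def fedprox_loss(task_loss, local_model, global_params, mu=0.01):
    """Local objective = task loss + (mu/2) * ||w_local - w_global||^2.
    The proximal term limits client drift under non-IID data."""
    prox = sum(((w - wg.detach()) ** 2).sum()
               for w, wg in zip(local_model.parameters(), global_params))
    return task_loss + 0.5 * mu * prox

# Toy usage: one local step on a client.
client = torch.nn.Linear(4, 2)
global_params = [p.clone() for p in client.parameters()]
x, y = torch.randn(8, 4), torch.randint(0, 2, (8,))
loss = fedprox_loss(torch.nn.functional.cross_entropy(client(x), y),
                    client, global_params, mu=0.1)
loss.backward()
```

On the server side, the aggregation counterpart would typically remain a weighted average of client updates (as in FedAvg), with the proximal term changing only each client's local optimization.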



Probabilistic Skip Connections for Deterministic Uncertainty Quantification in Deep Neural Networks

Jimenez, Felix, Katzfuss, Matthias

arXiv.org Machine Learning

Deterministic uncertainty quantification (UQ) in deep learning aims to estimate uncertainty with a single pass through a network by leveraging outputs from the network's feature extractor. Existing methods require that the feature extractor be both sensitive and smooth, ensuring meaningful input changes produce meaningful changes in feature vectors. Smoothness enables generalization, while sensitivity prevents feature collapse, where distinct inputs are mapped to identical feature vectors. To meet these requirements, current deterministic methods often retrain networks with spectral normalization. Instead of modifying training, we propose using measures of neural collapse to identify an existing intermediate layer that is both sensitive and smooth. We then fit a probabilistic model to the feature vector of this intermediate layer, which we call a probabilistic skip connection (PSC). Through empirical analysis, we explore the impact of spectral normalization on neural collapse and demonstrate that PSCs can effectively disentangle aleatoric and epistemic uncertainty. Additionally, we show that PSCs achieve uncertainty quantification and out-of-distribution (OOD) detection performance that matches or exceeds existing single-pass methods requiring training modifications. By retrofitting existing models, PSCs enable high-quality UQ and OOD capabilities without retraining.
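The abstract does not pin down the probabilistic model fitted to the intermediate features, so the sketch below uses class-conditional Gaussian means with a shared covariance (a common Mahalanobis-style stand-in, not necessarily the paper's choice). The tiny network, layer index, and data here are placeholders; the hooked layer stands in for the one selected via neural-collapse measures.

```python
import torch

torch.manual_seed(0)

# Stand-in for a trained network; `model[2]` plays the role of the
# intermediate layer identified as both sensitive and smooth.
model = torch.nn.Sequential(
    torch.nn.Linear(10, 32), torch.nn.ReLU(),
    torch.nn.Linear(32, 16), torch.nn.ReLU(),
    torch.nn.Linear(16, 3),
)
train_x = torch.randn(200, 10)                 # placeholder training data
train_y = torch.randint(0, 3, (200,))

feats = {}
def cache_features(module, inputs, output):
    # Cache the intermediate feature vector used by the PSC.
    feats["z"] = output.flatten(1).detach()

model[2].register_forward_hook(cache_features)

# Fit the probabilistic model to frozen features in a single pass.
model.eval()
with torch.no_grad():
    model(train_x)                             # fills feats["z"]
z = feats["z"]
means = torch.stack([z[train_y == c].mean(0) for c in train_y.unique()])
cov = torch.cov(z.T) + 1e-4 * torch.eye(z.shape[1])
prec = torch.linalg.inv(cov)

def epistemic_score(x):
    """Min squared Mahalanobis distance to any class mean:
    large values flag OOD / epistemically uncertain inputs."""
    with torch.no_grad():
        model(x)
    diff = feats["z"].unsqueeze(1) - means     # (n, C, d)
    return torch.einsum("ncd,de,nce->nc", diff, prec, diff).min(1).values

print(epistemic_score(torch.randn(5, 10)))
```

Because the probabilistic model is fitted post hoc to a frozen layer, the retrofit requires no retraining, which is the property the abstract emphasizes.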


Beyond Unconstrained Features: Neural Collapse for Shallow Neural Networks with General Data

Hong, Wanli, Ling, Shuyang

arXiv.org Machine Learning

Neural collapse (NC) is a phenomenon that emerges at the terminal phase of training (TPT) of deep neural networks (DNNs): the features of the data in the same class collapse to their respective sample means, and the sample means exhibit a simplex equiangular tight frame (ETF). In the past few years, there has been a surge of works focusing on explaining why NC occurs and how it affects generalization. Since DNNs are notoriously difficult to analyze, most works mainly focus on the unconstrained feature model (UFM). While the UFM explains NC to some extent, it fails to provide a complete picture of how the network architecture and the dataset affect NC. In this work, we focus on shallow ReLU neural networks and try to understand how the width, depth, data dimension, and statistical properties of the training dataset influence neural collapse. We provide a complete characterization of when NC occurs for two- and three-layer neural networks. For two-layer ReLU neural networks, a sufficient condition for the global minimizer of the regularized empirical risk function to exhibit the NC configuration depends on the data dimension, sample size, and signal-to-noise ratio (SNR) in the data rather than on the network width. For three-layer neural networks, we show that NC occurs as long as the first layer is sufficiently wide. Regarding the connection between NC and generalization, we show that generalization depends heavily on the SNR in the data: even if NC occurs, generalization can still be poor provided that the SNR in the data is too low. Our results significantly extend the state-of-the-art theoretical analysis of NC under the UFM by characterizing the emergence of NC under shallow nonlinear networks and showing how it depends on data properties and network architecture.
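For readers unfamiliar with the simplex ETF configuration mentioned above, the illustrative check below constructs the canonical $K$-point simplex ETF, $M = \sqrt{K/(K-1)}\,(I - \frac{1}{K}\mathbf{1}\mathbf{1}^{\top})$, and verifies its defining property: unit-norm class means whose pairwise inner products all equal $-1/(K-1)$.

```python
import numpy as np

def simplex_etf(K: int) -> np.ndarray:
    """Columns are K unit vectors forming a simplex equiangular
    tight frame: M = sqrt(K/(K-1)) * (I - (1/K) * 11^T)."""
    return np.sqrt(K / (K - 1)) * (np.eye(K) - np.ones((K, K)) / K)

K = 4
G = simplex_etf(K).T @ simplex_etf(K)      # Gram matrix of class means
assert np.allclose(np.diag(G), 1.0)        # unit-norm means
assert np.allclose(G[~np.eye(K, dtype=bool)], -1.0 / (K - 1))  # equiangular
```

At TPT, NC says the (centered, normalized) class means of a trained network converge to exactly this maximally separated configuration.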


The Illusion of State in State-Space Models

Merrill, William, Petty, Jackson, Sabharwal, Ashish

arXiv.org Artificial Intelligence

State-space models (SSMs) have emerged as a potential alternative architecture for building large language models (LLMs) compared to the previously ubiquitous transformer architecture. One theoretical weakness of transformers is that they cannot express certain kinds of sequential computation and state tracking (Merrill & Sabharwal, 2023), which SSMs are explicitly designed to address via their close architectural similarity to recurrent neural networks (RNNs). But do SSMs truly have an advantage (over transformers) in expressive power for state tracking? Surprisingly, the answer is no. Our analysis reveals that the expressive power of SSMs is limited very similarly to transformers: SSMs cannot express computation outside the complexity class $\mathsf{TC}^0$. In particular, this means they cannot solve simple state-tracking problems like permutation composition. It follows that SSMs are provably unable to accurately track chess moves with certain notation, evaluate code, or track entities in a long narrative. To supplement our formal analysis, we report experiments showing that Mamba-style SSMs indeed struggle with state tracking. Thus, despite its recurrent formulation, the "state" in an SSM is an illusion: SSMs have similar expressiveness limitations to non-recurrent models like transformers, which may fundamentally limit their ability to solve real-world state-tracking problems.
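To make the permutation-composition problem concrete: the word problem over the symmetric group $S_5$ asks, for a sequence of permutations, what their running composition is at each position; it is $\mathsf{NC}^1$-complete, so no $\mathsf{TC}^0$ model can solve it at all lengths unless $\mathsf{TC}^0 = \mathsf{NC}^1$. Below is a minimal data generator for this state-tracking task (illustrative, not the paper's benchmark code).

```python
import itertools
import random

# All 120 permutations of 5 elements: the alphabet of the S_5 word problem.
S5 = list(itertools.permutations(range(5)))

def compose(p, q):
    """(p o q)(i) = p(q(i)): apply q first, then p."""
    return tuple(p[q[i]] for i in range(5))

def make_example(length: int, seed: int = 0):
    """A random permutation sequence and its running composition --
    the hidden state a sequence model must track at every step."""
    rng = random.Random(seed)
    seq = [rng.choice(S5) for _ in range(length)]
    state, states = tuple(range(5)), []
    for p in seq:
        state = compose(p, state)     # left-multiply by the new move
        states.append(state)
    return seq, states

seq, states = make_example(8)
print(states[-1])   # composed permutation after 8 moves
```

A model that genuinely maintains state solves this with a constant-size recurrent update, whereas the paper's analysis shows fixed-depth SSMs, like transformers, cannot express it for unbounded sequence lengths.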


Rethinking Transformers in Solving POMDPs

Lu, Chenhao, Shi, Ruizhe, Liu, Yuyao, Hu, Kaizhe, Du, Simon S., Xu, Huazhe

arXiv.org Artificial Intelligence

Sequential decision-making algorithms such as reinforcement learning (RL) in real-world scenarios inevitably face environments with partial observability. This paper scrutinizes the effectiveness of a popular architecture, namely Transformers, in Partially Observable Markov Decision Processes (POMDPs) and reveals its theoretical limitations. We establish that regular languages, which Transformers struggle to model, are reducible to POMDPs. This poses a significant challenge for Transformers in learning POMDP-specific inductive biases, due to their lack of the inherent recurrence found in other models like RNNs. This paper casts doubt on the prevalent belief in Transformers as sequence models for RL and proposes introducing a point-wise recurrent structure. The Deep Linear Recurrent Unit (LRU) emerges as a well-suited alternative for partially observable RL, with empirical results highlighting the sub-optimal performance of the Transformer and the considerable strength of the LRU.
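As a sketch of the point-wise recurrence advocated above, here is a minimal diagonal Linear Recurrent Unit step in the general LRU style; the shapes, random initialization, and absence of input/output normalization are illustrative simplifications, not the paper's exact configuration.

```python
import numpy as np

def lru_scan(x, lam, B, C):
    """Diagonal complex linear recurrence applied point-wise over time:
    h_t = lam * h_{t-1} + B x_t,   y_t = Re(C h_t).
    x: (T, d_in); lam: (d_h,) complex with |lam| < 1 for stability."""
    h = np.zeros(B.shape[0], dtype=complex)
    ys = []
    for t in range(x.shape[0]):
        h = lam * h + B @ x[t]       # O(1) state update per step
        ys.append((C @ h).real)
    return np.stack(ys)

rng = np.random.default_rng(0)
d_in, d_h, d_out, T = 3, 8, 2, 10
lam = rng.uniform(0.9, 0.999, d_h) * np.exp(1j * rng.uniform(0, 2 * np.pi, d_h))
B = (rng.normal(size=(d_h, d_in)) + 1j * rng.normal(size=(d_h, d_in))) / np.sqrt(2 * d_in)
C = (rng.normal(size=(d_out, d_h)) + 1j * rng.normal(size=(d_out, d_h))) / np.sqrt(d_h)
print(lru_scan(rng.normal(size=(T, d_in)), lam, B, C).shape)   # (10, 2)
```

The recurrent state h_t carries information across arbitrarily long horizons at constant cost per step, which is the inductive bias the paper argues Transformers lack for POMDPs.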


Attacking Power Indices by Manipulating Player Reliability

Istrate, Gabriel, Bonchiş, Cosmin, Brînduşescu, Alin

arXiv.org Artificial Intelligence

We investigate the manipulation of power indices in TU-cooperative games by stimulating (subject to a budget constraint) changes in the propensity of other players to participate in the game. We present several algorithms showing that the problem is often tractable for so-called network centrality games and influence attribution games, as well as an example where optimal manipulation is intractable, even though computing the power indices is feasible.
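To ground the claim that computing power indices is feasible while optimal manipulation may not be, here is a brute-force computation of the raw Banzhaf index in a small weighted voting game (an illustrative example with made-up weights; the reliability-manipulation problem studied above sits on top of such computations).

```python
from itertools import combinations

def banzhaf(weights, quota):
    """Raw Banzhaf index: for each player i, the fraction of coalitions
    of the remaining players that i swings from losing to winning."""
    n = len(weights)
    scores = [0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for r in range(n):
            for coal in combinations(others, r):
                w = sum(weights[j] for j in coal)
                if w < quota <= w + weights[i]:     # i is pivotal
                    scores[i] += 1
    return [s / 2 ** (n - 1) for s in scores]

# Weighted voting game [q; w] = [5; 3, 2, 1, 1]
print(banzhaf([3, 2, 1, 1], quota=5))   # [0.625, 0.375, 0.125, 0.125]
```

Manipulating player reliabilities changes the probability that each player shows up at all, and hence the expected pivotality that indices like this one measure.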


Canonical Logic Programs are Succinctly Incomparable with Propositional Formulas

Shen, Yuping, Zhao, Xishun

arXiv.org Artificial Intelligence

\emph{Canonical (logic) programs} (CP) refer to normal logic programs augmented with the connective $not\ not$. In this paper we address the question of whether CP are \emph{succinctly incomparable} with \emph{propositional formulas} (PF). Our main result shows that the PARITY problem can be polynomially represented in PF but \emph{only} has exponential representations in CP; in other words, PARITY \emph{separates} PF from CP. Simply speaking, this means that an exponential size blowup is generally inevitable when translating a set of formulas in PF into an equivalent program in CP (without introducing new variables). Furthermore, since it has been shown by Lifschitz and Razborov that there is also a problem that separates CP from PF (assuming $\mathsf{P}\nsubseteq \mathsf{NC^1/poly}$), it follows that CP and PF are indeed succinctly incomparable. From the view of the theory of computation, the above result may also be considered as the separation of two \emph{models of computation}: we identify a language in $\mathsf{NC^1/poly}$ that is not in the set of languages computable by polynomial-size CP programs.
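To make the PF side of the separation concrete: PARITY over $n$ variables has polynomial-size formulas over $\{\wedge, \vee, \neg\}$, obtained by divide-and-conquer with $x \oplus y = (x \wedge \neg y) \vee (\neg x \wedge y)$; each recursive half appears twice, so the size recurrence $S(n) = 4\,S(n/2) + O(1)$ gives $S(n) = O(n^2)$. A small illustrative generator:

```python
def parity_formula(vars):
    """Propositional formula (over &, |, ~) for PARITY of `vars`,
    built via x XOR y = (x & ~y) | (~x & y). Size grows as O(n^2)
    because each recursive half is duplicated."""
    if len(vars) == 1:
        return vars[0]
    mid = len(vars) // 2
    a, b = parity_formula(vars[:mid]), parity_formula(vars[mid:])
    return f"(({a} & ~{b}) | (~{a} & {b}))"

for n in (2, 4, 8, 16):
    print(n, len(parity_formula([f"x{i}" for i in range(n)])))
```

The paper's result is that no analogous polynomial-size construction exists on the CP side: any CP program equivalent to PARITY (without new variables) must be exponentially large.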