Goto

Collaborating Authors

 threshold circuit


Learning to Think from Multiple Thinkers

arXiv.org Machine Learning

We study learning with Chain-of-Thought (CoT) supervision from multiple thinkers, all of whom provide correct but possibly systematically different solutions, e.g., step-by-step solutions to math problems written by different thinkers, or step-by-step execution traces of different programs solving the same problem. We consider classes that are computationally easy to learn using CoT supervision from a single thinker, but hard to learn with only end-result supervision, i.e., without CoT (Joshi et al. 2025). We establish that, under cryptographic assumptions, learning can be hard from CoT supervision provided by two or a few different thinkers, in passive data-collection settings. On the other hand, we provide a generic computationally efficient active learning algorithm that learns with a small amount of CoT data per thinker that is completely independent of the target accuracy $\varepsilon$, a moderate number of thinkers that scales as $\log \frac{1}{\varepsilon}\log \log \frac{1}{\varepsilon}$, and sufficient passive end-result data that scales as $\frac{1}{\varepsilon}\cdot poly\log\frac{1}{\varepsilon}$.






A Proofs A.1 Proof of Theorem 3.1

Neural Information Processing Systems

Then, combining Lemmas A.1 and A.2 with Eq. 1 and 2, we have null ˆ N f null We start with an intuitive explanation, and then turn to the formal proof. We show that w.h.p. we obtain We use the standard two's complement binary By Lemma A.4, there is = The weights in layers 2,...,k are all in { 1, 1} . We extend this result to real-valued functions. We now turn to the formal proof. Thus, this inequality is linear.



On the Computational Capability of Graph Neural Networks: A Circuit Complexity Bound Perspective

arXiv.org Artificial Intelligence

Graph Neural Networks (GNNs) have become the standard approach for learning and reasoning over relational data, leveraging the message-passing mechanism that iteratively propagates node embeddings through graph structures. While GNNs have achieved significant empirical success, their theoretical limitations remain an active area of research. Existing studies primarily focus on characterizing GNN expressiveness through Weisfeiler-Lehman (WL) graph isomorphism tests. In this paper, we take a fundamentally different approach by exploring the computational limitations of GNNs through the lens of circuit complexity. Specifically, we analyze the circuit complexity of common GNN architectures and prove that under constraints of constant-depth layers, linear or sublinear embedding sizes, and polynomial precision, GNNs cannot solve key problems such as graph connectivity and graph isomorphism unless $\mathsf{TC}^0 = \mathsf{NC}^1$. These results reveal the intrinsic expressivity limitations of GNNs behind their empirical success and introduce a novel framework for analyzing GNN expressiveness that can be extended to a broader range of GNN models and graph decision problems.


Circuit Complexity Bounds for Visual Autoregressive Model

arXiv.org Machine Learning

Understanding the expressive ability of a specific model is essential for grasping its capacity limitations. Recently, several studies have established circuit complexity bounds for Transformer architecture. Besides, the Visual AutoRegressive (VAR) model has risen to be a prominent method in the field of image generation, outperforming previous techniques, such as Diffusion Transformers, in generating high-quality images. We investigate the circuit complexity of the VAR model and establish a bound in this study. Our primary result demonstrates that the VAR model is equivalent to a simulation by a uniform $\mathsf{TC}^0$ threshold circuit with hidden dimension $d \leq O(n)$ and $\mathrm{poly}(n)$ precision. This is the first study to rigorously highlight the limitations in the expressive power of VAR models despite their impressive performance. We believe our findings will offer valuable insights into the inherent constraints of these models and guide the development of more efficient and expressive architectures in the future.


Theoretical Constraints on the Expressive Power of $\mathsf{RoPE}$-based Tensor Attention Transformers

arXiv.org Artificial Intelligence

Tensor Attention extends traditional attention mechanisms by capturing high-order correlations across multiple modalities, addressing the limitations of classical matrix-based attention. Meanwhile, Rotary Position Embedding ($\mathsf{RoPE}$) has shown superior performance in encoding positional information in long-context scenarios, significantly enhancing transformer models' expressiveness. Despite these empirical successes, the theoretical limitations of these technologies remain underexplored. In this study, we analyze the circuit complexity of Tensor Attention and $\mathsf{RoPE}$-based Tensor Attention, showing that with polynomial precision, constant-depth layers, and linear or sublinear hidden dimension, they cannot solve fixed membership problems or $(A_{F,r})^*$ closure problems, under the assumption that $\mathsf{TC}^0 \neq \mathsf{NC}^1$. These findings highlight a gap between the empirical performance and theoretical constraints of Tensor Attention and $\mathsf{RoPE}$-based Tensor Attention Transformers, offering insights that could guide the development of more theoretically grounded approaches to Transformer model design and scaling.