Feature learning via mean-field Langevin dynamics: classifying sparse parities and beyond
Suzuki, Taiji, Wu, Denny
Mean-field Langevin dynamics (MFLD) (Mei et al., 2018; Hu et al., 2019) is particularly attractive: MFLD arises from a noisy gradient descent update on the parameters, in which Gaussian noise is injected into the gradient to encourage "exploration". Furthermore, uniform-in-time estimates of the particle discretization error have also been established (Suzuki et al.). The goal of this work is to address the following question.
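A minimal sketch of the noisy gradient descent update that MFLD discretizes, assuming a generic gradient function `grad_fn`; the step size and temperature here are illustrative choices, not the paper's settings:

```python
import numpy as np

def mfld_step(theta, grad_fn, lr=0.01, temperature=1e-3, rng=None):
    """One noisy gradient descent step of the kind MFLD discretizes:
    theta <- theta - lr * grad + sqrt(2 * lr * temperature) * xi,
    with standard Gaussian noise xi injected to encourage exploration."""
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.standard_normal(theta.shape)
    return theta - lr * grad_fn(theta) + np.sqrt(2.0 * lr * temperature) * noise
```

Applying this step to every neuron (particle) of a mean-field two-layer network gives the particle system whose uniform-in-time discretization error the abstract refers to.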
Matching the Statistical Query Lower Bound for $k$-Sparse Parity Problems with Sign Stochastic Gradient Descent
The $k$-sparse parity problem is a classical problem in computational complexity and algorithmic theory, serving as a key benchmark for understanding computational classes. In this paper, we solve the $k$-sparse parity problem with sign stochastic gradient descent, a variant of stochastic gradient descent (SGD) on two-layer fully-connected neural networks. We demonstrate that this approach can efficiently solve the $k$-sparse parity problem on a $d$-dimensional hypercube ($k\le O(\sqrt{d})$) with a sample complexity of $\tilde{O}(d^{k-1})$ using $2^{\Theta(k)}$ neurons, matching the established $\Omega(d^{k})$ lower bound of Statistical Query (SQ) models. Our theoretical analysis begins by constructing a good neural network capable of correctly solving the $k$-parity problem. We then demonstrate how a neural network trained with sign SGD can effectively approximate this good network, solving the $k$-parity problem with small statistical error. To the best of our knowledge, this is the first result that matches the SQ lower bound for solving the $k$-sparse parity problem using gradient-based methods.
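A minimal sketch of the sign SGD update rule on a toy two-layer network for $k$-sparse parity. The loss, layer sizes, which parameters are trained, and all hyperparameters are illustrative assumptions, not the paper's construction; only the update $W \leftarrow W - \eta\,\mathrm{sign}(\nabla_W L)$ is the technique named in the abstract:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, width, lr, steps, batch = 20, 3, 64, 0.05, 500, 256
support = rng.choice(d, size=k, replace=False)  # hidden parity coordinates

# Toy two-layer net f(x) = a . relu(W x); only W is trained here.
W = rng.standard_normal((width, d)) / np.sqrt(d)
a = rng.choice([-1.0, 1.0], size=width) / width

for _ in range(steps):
    X = rng.choice([-1.0, 1.0], size=(batch, d))
    y = np.prod(X[:, support], axis=1)        # k-sparse parity labels
    H = X @ W.T
    out = np.maximum(H, 0) @ a
    err = out - y
    # Gradient of the squared loss w.r.t. W, then keep only its sign.
    grad_W = ((err[:, None] * (H > 0)) * a).T @ X / batch
    W -= lr * np.sign(grad_W)                 # sign SGD update
```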
Task Generalization With AutoRegressive Compositional Structure: Can Learning From $d$ Tasks Generalize to $d^{T}$ Tasks?
Abedsoltan, Amirhesam, Zhang, Huaqing, Wen, Kaiyue, Lin, Hongzhou, Zhang, Jingzhao, Belkin, Mikhail
Large language models (LLMs) exhibit remarkable task generalization, solving tasks they were never explicitly trained on with only a few demonstrations. This raises a fundamental question: When can learning from a small set of tasks generalize to a large task family? In this paper, we investigate task generalization through the lens of AutoRegressive Compositional (ARC) structure, where each task is a composition of $T$ operations, and each operation is among a finite family of $d$ subtasks. This yields a total class of size $d^{T}$. We first show that generalization to all $d^{T}$ tasks is theoretically achievable by training on only $\tilde{O}(d)$ tasks. Empirically, we demonstrate that Transformers achieve such exponential task generalization on sparse parity functions via in-context learning (ICL) and Chain-of-Thought (CoT) reasoning. We further demonstrate this generalization in arithmetic and language translation, extending beyond parity functions.
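A toy illustration of the ARC counting structure: with $d$ subtask operations composed over $T$ steps, the task family has size $d^{T}$, and executing a task step by step yields a CoT-style trace. The operation family and names below are hypothetical, chosen only to make the combinatorics concrete:

```python
from itertools import product

# A finite family of d subtask operations on integers (illustrative).
ops = {
    "inc": lambda x: x + 1,
    "dbl": lambda x: 2 * x,
    "neg": lambda x: -x,
}
d, T = len(ops), 4

def compose(task, x):
    """Apply a length-T task (a tuple of op names) step by step,
    recording intermediate values as a chain-of-thought trace."""
    trace = [x]
    for name in task:
        x = ops[name](x)
        trace.append(x)
    return x, trace

# The task family has d**T members, even though each step draws
# from only d operations.
all_tasks = list(product(ops, repeat=T))
assert len(all_tasks) == d ** T
print(compose(("inc", "dbl", "neg", "inc"), 3))  # (-7, [3, 4, 8, -8, -7])
```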
From Sparse Dependence to Sparse Attention: Unveiling How Chain-of-Thought Enhances Transformer Sample Efficiency
Wen, Kaiyue, Zhang, Huaqing, Lin, Hongzhou, Zhang, Jingzhao
Chain-of-thought (CoT) has proven to be a powerful technique for enhancing reasoning in large language models [29, 63]. By instructing the model to break complex problems into smaller, manageable steps, CoT facilitates more efficient reasoning and better generalization, particularly in algorithmic and logical tasks [32, 45, 60]. Building on this, performance can be further improved through multi-step prompting and multi-path sampling techniques [10, 20, 59, 74, 75]. This focus on CoT within in-context learning has since expanded to more structured learning approaches [6, 69]. By adding CoT-style reasoning examples to the instruction-tuning dataset, models improve their problem-solving abilities more effectively than by relying on CoT at prompting time alone [11, 72]. As a result, CoT is now shaping a new paradigm in language model development, marking a shift from simply scaling data [22, 25] toward advanced reasoning strategies [39] that lead to more effective learning outcomes. While CoT's success is well established, why it works remains a hotly debated question [48, 51]. Recent theoretical studies suggest that CoT increases a model's expressiveness, enlarging its representational capacity when the sequence is long enough [18, 37]. However, expressivity alone does not guarantee success.