
Collaborating Authors: Poggio, Tomaso


What if Eye...? Computationally Recreating Vision Evolution

arXiv.org Artificial Intelligence

Vision systems in nature show remarkable diversity, from simple light-sensitive patches to complex camera eyes with lenses. While natural selection has produced these eyes through countless mutations over millions of years, they represent just one set of realized evolutionary paths. Testing hypotheses about how environmental pressures shaped eye evolution remains challenging, since we cannot experimentally isolate individual factors. Computational evolution offers a way to systematically explore alternative trajectories. Here we show how environmental demands drive three fundamental aspects of visual evolution through an artificial evolution framework that co-evolves both physical eye structure and neural processing in embodied agents. First, we provide computational evidence that task-specific selection drives a bifurcation in eye evolution: orientation tasks such as maze navigation lead to distributed compound-type eyes, while an object discrimination task leads to the emergence of high-acuity camera-type eyes. Second, we reveal how optical innovations like lenses naturally emerge to resolve the fundamental tradeoff between light collection and spatial precision. Third, we uncover systematic scaling laws between visual acuity and neural processing, showing how task complexity drives coordinated evolution of sensory and computational capabilities. Our work introduces a novel paradigm that illuminates the evolutionary principles shaping vision by creating targeted single-player games in which embodied agents must simultaneously evolve visual systems and learn complex behaviors. Through our unified genetic encoding framework, these embodied agents serve as next-generation hypothesis-testing machines while providing a foundation for designing manufacturable bio-inspired vision systems. Website: http://eyes.mit.edu/
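The abstract does not spell out the optimization loop, so the following is only a minimal sketch of what co-evolving eye morphology under task-specific selection could look like; the names (`EyeGenome`, `evaluate_agent`, `mutate`) and the toy fitness function are hypothetical and not the paper's framework.

```python
# Minimal sketch of evolving eye morphology under task-specific selection.
# All names and the fitness heuristic are illustrative, not the paper's API:
# a real framework would train and evaluate an embodied agent with this eye.
import random
from dataclasses import dataclass

@dataclass
class EyeGenome:
    n_photoreceptors: int   # spatial sampling of the eye
    aperture: float         # trades light collection vs. acuity
    has_lens: bool          # optical innovation that can emerge

def mutate(g: EyeGenome) -> EyeGenome:
    return EyeGenome(
        n_photoreceptors=max(1, g.n_photoreceptors + random.choice([-1, 0, 1])),
        aperture=min(1.0, max(0.05, g.aperture + random.gauss(0, 0.05))),
        has_lens=(not g.has_lens) if random.random() < 0.05 else g.has_lens,
    )

def evaluate_agent(g: EyeGenome, task: str) -> float:
    """Placeholder fitness: discrimination rewards acuity, orientation
    rewards a balance of acuity and light collection."""
    acuity = g.n_photoreceptors * (2.0 if g.has_lens else 1.0) / (1.0 + g.aperture)
    light = g.aperture
    return acuity if task == "discrimination" else 0.5 * acuity + light

def evolve(task: str, pop_size: int = 32, generations: int = 100) -> EyeGenome:
    pop = [EyeGenome(4, 0.5, False) for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(pop, key=lambda g: evaluate_agent(g, task), reverse=True)
        parents = scored[: pop_size // 4]              # truncation selection
        pop = [mutate(random.choice(parents)) for _ in range(pop_size)]
    return max(pop, key=lambda g: evaluate_agent(g, task))

print(evolve("discrimination"))
```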


Parameter Symmetry Breaking and Restoration Determines the Hierarchical Learning in AI Systems

arXiv.org Machine Learning

A growing number of phenomena that appear virtually universal to the learning process have been discovered in contemporary AI systems. These phenomena are shared by models with different architectures, trained on different datasets, and with different training techniques. The existence of these universal phenomena calls for one or a few universal explanations. To date, however, most of these phenomena are instead described by narrow theories tailored to explain each one separately, often focusing on specific models trained on specific tasks or loss functions, in isolation from other interesting phenomena that are indispensable parts of the deep learning phenomenology. It is clearly desirable to have a universal perspective, if not a universal theory, that explains as many phenomena as possible. In the spirit of science, a universal perspective should be independent of system details such as minor variations in architecture, the choice of loss function, or the training technique. A universal theory would give the field a simplified paradigm for thinking about and understanding AI systems, as well as a potential design principle for a new generation of more efficient and capable models.


On Generalization Bounds for Neural Networks with Low Rank Layers

arXiv.org Machine Learning

Deep learning has achieved remarkable success across a wide range of applications, including computer vision [2, 3], natural language processing [4, 5], decision-making in novel environments [6], and code generation [7], among others. Understanding the reasons behind the effectiveness of deep learning is a multifaceted challenge that involves questions about architectural choices, optimizer selection, and the types of inductive biases that can guarantee generalization. A long-standing question in this field is how deep learning finds solutions that generalize well. Good generalization by overparameterized models is not unique to deep learning; it can be explained by the implicit bias of learning algorithms towards low-norm solutions in linear models and kernel machines [8, 9]. In deep learning, however, identifying the right implicit bias and obtaining generalization bounds that depend on this bias are still open questions. In recent years, Rademacher bounds have been developed to explain the complexity control induced by an important bias in deep network training: the minimization of weight matrix norms, which occurs due to explicit or implicit regularization [10, 11, 12, 13]. For rather general network architectures, Golowich et al. [14] showed that the Rademacher complexity scales linearly with the product of the Frobenius norms of the layers. Although the associated bounds are usually orders of magnitude larger than the generalization gap for dense networks, very recent results by Galanti et al. [15] demonstrate that for networks with structural sparsity in their weight matrices, such as convolutional networks, norm-based Rademacher bounds approach non-vacuity.
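As a concrete illustration of the capacity measure these norm-based bounds depend on, the sketch below computes the product of layer Frobenius norms for a small dense network; this is the quantity appearing in Golowich et al.-style bounds, not the exact bound derived in the paper, and the network is arbitrary.

```python
# Sketch: the product of Frobenius norms of the layer weight matrices, the
# capacity measure appearing in norm-based Rademacher bounds (Golowich et al.).
import torch.nn as nn

net = nn.Sequential(nn.Linear(784, 256), nn.ReLU(),
                    nn.Linear(256, 128), nn.ReLU(),
                    nn.Linear(128, 10))

def frobenius_norm_product(model: nn.Module) -> float:
    prod = 1.0
    for m in model.modules():
        if isinstance(m, nn.Linear):
            prod *= m.weight.norm(p='fro').item()
    return prod

print(f"prod_l ||W_l||_F = {frobenius_norm_product(net):.2f}")
```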


Training the Untrainable: Introducing Inductive Bias via Representational Alignment

arXiv.org Artificial Intelligence

We demonstrate that architectures which are traditionally considered ill-suited for a task can be trained using inductive biases from another architecture. Networks are considered untrainable when they overfit, underfit, or converge to poor results even when their hyperparameters are tuned. For example, plain fully connected networks overfit on object recognition, while deep convolutional networks without residual connections underfit. The traditional answer is to change the architecture to impose some inductive bias, although what that bias is remains unknown. We introduce guidance, in which a guide network guides a target network using a neural distance function. The target is optimized to perform well and to match its internal representations, layer by layer, to those of the guide; the guide is unchanged. If the guide is trained, this transfers part of the architectural prior and knowledge of the guide to the target. If the guide is untrained, this transfers only part of the architectural prior of the guide. In this manner, we can investigate what kinds of priors different architectures place on untrainable networks such as fully connected networks. We demonstrate that this method overcomes the immediate overfitting of fully connected networks on vision tasks, makes plain CNNs competitive with ResNets, closes much of the gap between plain vanilla RNNs and Transformers, and can even help Transformers learn tasks that RNNs perform more easily. We also find evidence that better initializations of fully connected networks likely exist to avoid overfitting. Our method provides a mathematical tool to investigate priors and architectures and, in the long term, may demystify the dark art of architecture creation, perhaps even turning the architecture into a continuously optimizable parameter of the network.
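The layer-wise matching described here can be sketched as a joint objective: the target minimizes its task loss plus a representational distance to the frozen guide at corresponding layers. Linear centered kernel alignment is used below only as a stand-in for the neural distance function; the paper's exact distance and weighting may differ.

```python
# Sketch of "guidance": a target network is trained on its task while also
# matching its hidden representations, layer by layer, to a frozen guide.
# Linear CKA serves as an example distance between activation matrices.
import torch

def linear_cka(X: torch.Tensor, Y: torch.Tensor) -> torch.Tensor:
    X = X - X.mean(dim=0, keepdim=True)          # center features
    Y = Y - Y.mean(dim=0, keepdim=True)
    num = (Y.T @ X).norm(p='fro') ** 2
    den = (X.T @ X).norm(p='fro') * (Y.T @ Y).norm(p='fro')
    return num / (den + 1e-8)

def guidance_loss(task_loss, target_acts, guide_acts, weight=1.0):
    # target_acts / guide_acts: lists of [batch, features] activations from
    # corresponding layers; the guide is frozen, so its activations are detached.
    align = sum(1.0 - linear_cka(t, g.detach())
                for t, g in zip(target_acts, guide_acts))
    return task_loss + weight * align
```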


Formation of Representations in Neural Networks

arXiv.org Artificial Intelligence

Understanding neural representations will help open the black box of neural networks and advance our scientific understanding of modern AI systems. However, how complex, structured, and transferable representations emerge in modern neural networks has remained a mystery. Building on previous results, we propose the Canonical Representation Hypothesis (CRH), which posits a set of six alignment relations that universally govern the formation of representations in most hidden layers of a neural network. Under the CRH, the latent representations (R), weights (W), and neuron gradients (G) become mutually aligned during training. This alignment implies that neural networks naturally learn compact representations, where neurons and weights are invariant to task-irrelevant transformations. We then show that breaking the CRH leads to the emergence of reciprocal power-law relations between R, W, and G, which we refer to as the Polynomial Alignment Hypothesis (PAH). We present a minimal-assumption theory demonstrating that the balance between gradient noise and regularization is crucial for the emergence of the canonical representations. The CRH and PAH raise the exciting possibility of unifying key deep learning phenomena, including neural collapse and the neural feature ansatz, in a single framework.
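One way to probe the kind of alignment the CRH posits is to compare second-moment matrices of the representations, weights, and gradients at a single layer. The sketch below uses a cosine similarity between such matrices as an illustrative diagnostic; it is not the paper's definition of the six alignment relations.

```python
# Sketch: measuring alignment between second moments of representations (R),
# weights (W), and neuron gradients (G) at one layer, in the spirit of the CRH.
import torch

def matrix_cosine(A: torch.Tensor, B: torch.Tensor) -> float:
    return ((A * B).sum() / (A.norm() * B.norm() + 1e-8)).item()

def crh_alignments(reps, weight, grads):
    # reps:   [batch, d_in]  layer inputs
    # weight: [d_out, d_in]  layer weight matrix
    # grads:  [batch, d_out] gradients w.r.t. the layer's pre-activations
    R = reps.T @ reps / reps.shape[0]      # d_in  x d_in
    W_in = weight.T @ weight               # d_in  x d_in, compared with R
    G = grads.T @ grads / grads.shape[0]   # d_out x d_out
    W_out = weight @ weight.T              # d_out x d_out, compared with G
    return {"R-W": matrix_cosine(R, W_in), "G-W": matrix_cosine(G, W_out)}
```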


On the Power of Decision Trees in Auto-Regressive Language Modeling

arXiv.org Artificial Intelligence

Originally proposed for handling time series data, Auto-regressive Decision Trees (ARDTs) have not yet been explored for language modeling. This paper delves into both the theoretical and practical applications of ARDTs in this new context. We theoretically demonstrate that ARDTs can compute complex functions, such as simulating automata, Turing machines, and sparse circuits, by leveraging "chain-of-thought" computations. Our analysis provides bounds on the size, depth, and computational efficiency of ARDTs, highlighting their surprising computational power. Empirically, we train ARDTs on simple language generation tasks, showing that they can learn to generate coherent and grammatically correct text on par with a smaller Transformer model. Additionally, we show that ARDTs can be used on top of transformer representations to solve complex reasoning tasks. This research reveals the unique computational abilities of ARDTs, aiming to broaden the architectural diversity in language model development.
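A minimal sketch of the auto-regressive use of a decision tree for next-token prediction is given below, using scikit-learn's DecisionTreeClassifier on a fixed-length context window over a toy corpus; the feature construction, tokenization, and tree configuration are illustrative and may differ from the paper's setup.

```python
# Sketch: an auto-regressive decision tree (ARDT) as a next-token predictor
# over a fixed context window, fed back on its own outputs at generation time.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def make_context_dataset(tokens, context=4):
    X = [tokens[i:i + context] for i in range(len(tokens) - context)]
    y = [tokens[i + context] for i in range(len(tokens) - context)]
    return np.array(X), np.array(y)

# Toy corpus: token ids for a repeating pattern the tree can learn.
corpus = np.tile(np.array([1, 2, 3, 4, 5]), 200)
X, y = make_context_dataset(corpus, context=4)

tree = DecisionTreeClassifier(max_depth=8).fit(X, y)

def generate(prompt, steps=10, context=4):
    seq = list(prompt)
    for _ in range(steps):
        nxt = tree.predict(np.array(seq[-context:]).reshape(1, -1))[0]
        seq.append(int(nxt))          # feed the prediction back in (auto-regression)
    return seq

print(generate([1, 2, 3, 4]))
```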


How Neural Networks Learn the Support is an Implicit Regularization Effect of SGD

arXiv.org Machine Learning

We investigate the ability of deep neural networks to identify the support of the target function. Our findings reveal that mini-batch SGD effectively learns the support in the first layer of the network by shrinking to zero the weights associated with irrelevant components of the input. In contrast, we demonstrate that while vanilla GD also approximates the target function, it requires an explicit regularization term to learn the support in the first layer. We prove that this property of mini-batch SGD is due to a second-order implicit regularization effect proportional to $\eta / b$ (step size / batch size). Our results not only provide further evidence that implicit regularization has a significant impact on training dynamics, but also shed light on the structure of the features learned by the network. Additionally, they suggest that smaller batches enhance feature interpretability and reduce dependence on initialization.
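A small experiment in the spirit of this result: train a two-layer network on a target that depends only on a few input coordinates and inspect the first-layer column norms. Under mini-batch SGD the norms on irrelevant coordinates should shrink, with the effect scaling roughly with $\eta / b$. The data, architecture, and hyperparameters below are illustrative choices, not the paper's.

```python
# Sketch: does mini-batch SGD shrink first-layer weights on irrelevant inputs?
# The target depends only on the first 2 of 20 coordinates; the strength of
# the effect is predicted to scale with eta / b (step size / batch size).
import torch
import torch.nn as nn

torch.manual_seed(0)
d, relevant, n = 20, 2, 4096
X = torch.randn(n, d)
y = torch.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2       # support = first two coordinates

net = nn.Sequential(nn.Linear(d, 64), nn.ReLU(), nn.Linear(64, 1))
eta, b = 0.05, 16
opt = torch.optim.SGD(net.parameters(), lr=eta)

for step in range(2000):
    idx = torch.randint(0, n, (b,))
    loss = ((net(X[idx]).squeeze() - y[idx]) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

col_norms = net[0].weight.norm(dim=0)              # per-input-coordinate norms
print("relevant coords:  ", col_norms[:relevant])
print("irrelevant (mean):", col_norms[relevant:].mean())
```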


Characterizing the Implicit Bias of Regularized SGD in Rank Minimization

arXiv.org Machine Learning

We study the bias of Stochastic Gradient Descent (SGD) to learn low-rank weight matrices when training deep neural networks. Our results show that training neural networks with mini-batch SGD and weight decay causes a bias towards rank minimization over the weight matrices. Specifically, we show, both theoretically and empirically, that this bias is more pronounced when using smaller batch sizes, higher learning rates, or increased weight decay. Additionally, we predict and observe empirically that weight decay is necessary to achieve this bias. Unlike previous literature, our analysis does not rely on assumptions about the data, convergence, or optimality of the weight matrices and applies to a wide range of neural network architectures of any width or depth. Finally, we empirically investigate the connection between this bias and generalization, finding that it has a marginal effect on generalization.
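One way to observe the reported bias is to track an effective rank of a weight matrix while varying batch size, learning rate, and weight decay. The thresholded singular-value count below is a crude proxy used only to illustrate the kind of measurement; it is not the paper's metric, and the toy regression setup is an assumption.

```python
# Sketch: tracking an effective rank of a hidden weight matrix during training
# with mini-batch SGD and weight decay. Smaller batches, larger learning rates,
# or stronger weight decay are predicted to push this rank down.
import torch
import torch.nn as nn

def effective_rank(W: torch.Tensor, tol: float = 1e-2) -> int:
    s = torch.linalg.svdvals(W)
    return int((s > tol * s[0]).sum())

torch.manual_seed(0)
X = torch.randn(2048, 32)
y = torch.randn(2048, 1)
net = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 1))

opt = torch.optim.SGD(net.parameters(), lr=0.1, weight_decay=5e-3)
for step in range(3000):
    idx = torch.randint(0, X.shape[0], (8,))
    loss = ((net(X[idx]) - y[idx]) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    if step % 500 == 0:
        print(step, effective_rank(net[0].weight.detach()))
```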


System identification of neural systems: If we got it right, would we know?

arXiv.org Artificial Intelligence

Artificial neural networks are being proposed as models of parts of the brain. The networks are compared to recordings of biological neurons, and good performance in reproducing neural responses is considered to support the model's validity. A key question is how much this system identification approach tells us about brain computation. Does it validate one model architecture over another? We evaluate the ability of the most commonly used comparison techniques, such as linear encoding models and centered kernel alignment, to correctly identify a model, by replacing brain recordings with responses from known ground-truth models. System identification performance turns out to be quite variable; it also depends significantly on factors independent of the ground-truth architecture, such as the stimulus images. In addition, we show the limitations of using functional similarity scores for identifying higher-level architectural motifs.
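The two comparison techniques named here can be sketched as follows: fit a ridge-regression encoding model from candidate-model activations to "neural" responses and report held-out correlation, or compute linear CKA between the two activation matrices. In the paper's setting the "neural" responses come from a known ground-truth network rather than recordings; the helper names and the ridge/split choices below are illustrative.

```python
# Sketch of the two scores: a ridge encoding model and linear CKA, applied to
# activation matrices of shape [stimuli, units].
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

def encoding_score(model_acts, neural_acts, alpha=1.0):
    Xtr, Xte, Ytr, Yte = train_test_split(model_acts, neural_acts,
                                          test_size=0.25, random_state=0)
    pred = Ridge(alpha=alpha).fit(Xtr, Ytr).predict(Xte)
    # mean per-unit Pearson correlation on held-out stimuli
    corrs = [np.corrcoef(pred[:, i], Yte[:, i])[0, 1] for i in range(Yte.shape[1])]
    return float(np.nanmean(corrs))

def linear_cka(X, Y):
    X = X - X.mean(0)
    Y = Y - Y.mean(0)
    return np.linalg.norm(Y.T @ X, 'fro') ** 2 / (
        np.linalg.norm(X.T @ X, 'fro') * np.linalg.norm(Y.T @ Y, 'fro'))
```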


Norm-based Generalization Bounds for Compositionally Sparse Neural Networks

arXiv.org Artificial Intelligence

In this paper, we investigate the Rademacher complexity of deep sparse neural networks, where each neuron receives a small number of inputs. We prove generalization bounds for multilayered sparse ReLU neural networks, including convolutional neural networks. These bounds differ from previous ones in that they consider the norms of the convolutional filters rather than the norms of the associated Toeplitz matrices, independently of weight sharing between neurons. As we show theoretically, these bounds can be orders of magnitude better than standard norm-based generalization bounds, and empirically they are almost non-vacuous in estimating generalization in various simple classification problems. Taken together, these results suggest that compositional sparsity of the underlying target function is critical to the success of deep neural networks.
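To make the distinction concrete: for a convolutional layer, these bounds use the norm of the small filter itself rather than the norm of the much larger Toeplitz (unrolled) matrix that implements the convolution. The sketch below compares the two quantities for a simplified 1-D, stride-1, no-padding convolution; it illustrates why the filter-based quantity is smaller, not the bound itself.

```python
# Sketch: for a 1-D convolution, compare the norm of the filter with the
# Frobenius norm of the Toeplitz matrix implementing the same convolution.
# The Toeplitz norm grows with the number of filter placements, which is why
# filter-norm-based bounds can be much tighter.
import numpy as np

def toeplitz_from_filter(w: np.ndarray, input_len: int) -> np.ndarray:
    k = len(w)
    T = np.zeros((input_len - k + 1, input_len))
    for i in range(input_len - k + 1):
        T[i, i:i + k] = w                  # one row per filter placement
    return T

w = np.random.randn(5)                     # a small convolutional filter
T = toeplitz_from_filter(w, input_len=128)

print("filter norm:  ", np.linalg.norm(w))
print("Toeplitz norm:", np.linalg.norm(T, 'fro'))   # ~ sqrt(#placements) * filter norm
```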