AITopics | attention path

Despite the remarkable empirical performance of Transformers, their theoretical understanding remains elusive. Here, we consider a deep multi-head self-attention network, that is closely related to Transformers yet analytically tractable. We develop a statistical mechanics theory of Bayesian learning in this model, deriving exact equations for the network's predictor statistics under the finite-width thermodynamic limit, i.e., $N,P\rightarrow\infty$, $P/N=\mathcal{O}(1)$, where $N$ is the network width and $P$ is the number of training examples. Our theory shows that the predictor statistics are expressed as a sum of independent kernels, each one pairing different "attention paths", defined as information pathways through different attention heads across layers. The kernels are weighted according to a "task-relevant kernel combination" mechanism that aligns the total kernel with the task labels.

attention path, statistical mechanics theory, transformer, (6 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Inductive Learning (0.62)

Dissecting the Interplay of Attention Paths in a Statistical Mechanics Theory of Transformers

Tiberi, Lorenzo, Mignacco, Francesca, Irie, Kazuki, Sompolinsky, Haim

arXiv.org Machine Learning, May-24-2024

Despite the remarkable empirical performance of Transformers, their theoretical understanding remains elusive. Here, we consider a deep multi-head self-attention network, that is closely related to Transformers yet analytically tractable. We develop a statistical mechanics theory of Bayesian learning in this model, deriving exact equations for the network's predictor statistics under the finite-width thermodynamic limit, i.e., $N,P\rightarrow\infty$, $P/N=\mathcal{O}(1)$, where $N$ is the network width and $P$ is the number of training examples. Our theory shows that the predictor statistics are expressed as a sum of independent kernels, each one pairing different 'attention paths', defined as information pathways through different attention heads across layers. The kernels are weighted according to a 'task-relevant kernel combination' mechanism that aligns the total kernel with the task labels. As a consequence, this interplay between attention paths enhances generalization performance. Experiments confirm our findings on both synthetic and real-world sequence classification tasks. Finally, our theory explicitly relates the kernel combination mechanism to properties of the learned weights, allowing for a qualitative transfer of its insights to models trained via gradient descent. As an illustration, we demonstrate an efficient size reduction of the network, by pruning those attention heads that are deemed less relevant by our theory.

artificial intelligence, attention path, machine learning, (16 more...)

arXiv.org Machine Learning

2405.15926

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > United States > Maryland > Baltimore (0.14)
Europe > Austria > Vienna (0.14)
(14 more...)

Genre: Research Report > New Finding (0.34)

Investigating the dynamics of hand and lips in French Cued Speech using attention mechanisms and CTC-based decoding

Sankar, Sanjana, Beautemps, Denis, Elisei, Frédéric, Perrotin, Olivier, Hueber, Thomas

arXiv.org Artificial Intelligence, Jun-14-2023

Hard of hearing or profoundly deaf people make use of cued speech (CS) as a communication tool to understand spoken language. By delivering cues that are relevant to the phonetic information, CS offers a way to enhance lipreading. In literature, there have been several studies on the dynamics between the hand and the lips in the context of human production. This article proposes a way to investigate how a neural network learns this relation for a single speaker while performing a recognition task using attention mechanisms. Further, an analysis of the learnt dynamics is utilized to establish the relationship between the two modalities and extract automatic segments. For the purpose of this study, a new dataset has been recorded for French CS. Along with the release of this dataset, a benchmark will be reported for word-level recognition, a novelty in the automatic recognition of French CS.

cued speech, machine learning, recognition, (17 more...)

arXiv.org Artificial Intelligence

2306.0829

Country:

Europe > France > Auvergne-Rhône-Alpes > Isère > Grenoble (0.05)
Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
Europe > Netherlands (0.04)
Europe > France > Hauts-de-France > Nord > Lille (0.04)

Genre: Research Report (0.65)

Industry: Health & Medicine > Therapeutic Area > Otolaryngology (0.54)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Vision (0.94)
Information Technology > Artificial Intelligence > Speech (0.90)
(2 more...)

Dissecting the Interplay of Attention Paths in a Statistical Mechanics Theory of Transformers Lorenzo Tiberi 1,2 Francesca Mignacco

8523a98265ceae12afd34113aa6c5cca-Paper-Conference.pdf
