attn
Subcritical Signal Propagation at Initialization in Normalization-Free Transformers
We study signal propagation at initialization in transformers through the averaged partial Jacobian norm (APJN), a measure of gradient amplification across layers. We extend APJN analysis to transformers with bidirectional attention and permutation-symmetric input token configurations by deriving recurrence relations for activation statistics and APJNs across layers. Our theory predicts how attention modifies the asymptotic behavior of the APJN at large depth and matches APJNs measured in deep vision transformers. The criticality picture known from residual networks carries over to transformers: the pre-LayerNorm architecture exhibits power-law APJN growth, whereas transformers with LayerNorm replaced by elementwise $\tanh$-like nonlinearities have stretched-exponential APJN growth, indicating that the latter are subcritical. Applied to Dynamic Tanh (DyT) and Dynamic erf (Derf) transformers, the theory explains why these architectures can be more sensitive to initialization and optimization choices and require careful tuning for stable training.
- North America > United States > Georgia > Fulton County > Atlanta (0.04)
- Europe > Italy > Sardinia (0.04)
LearningtoReasonIterativelyandParallellyfor ComplexVisualReasoningScenarios
Meanwhile, its"parallel" computation allowsforthesimultaneous explorationofdifferent reasoning paths andbenefits more robust and efficient execution of operations that are mutually independent (e.g. when counting individual colors for the query:"determine the maximum occurring color amongst all t-shirts"). We design IPRM as a lightweight and fully-differentiable neural module thatcanbeconveniently applied toboth transformer and non-transformer vision-language backbones.
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
- Asia > Singapore (0.04)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)
- Information Technology > Artificial Intelligence > Vision (0.68)
- Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)
- Asia > Middle East > Israel (0.04)
- Information Technology > Artificial Intelligence > Vision (0.96)
- Information Technology > Artificial Intelligence > Natural Language (0.93)
- Information Technology > Sensing and Signal Processing > Image Processing (0.69)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.48)
- Europe > Austria > Vienna (0.13)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- (2 more...)
- Europe > Romania > Sud - Muntenia Development Region > Giurgiu County > Giurgiu (0.04)
- North America > United States > California (0.04)
- Asia > Singapore (0.04)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Sensing and Signal Processing > Image Processing (0.68)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
- Europe > Portugal > Lisbon > Lisbon (0.14)
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Europe > Italy > Tuscany > Florence (0.05)
- (9 more...)
- Asia > China > Beijing > Beijing (0.04)
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Asia > Middle East > Jordan (0.04)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
DelayedPropagationTransformer: AUniversalComputationEnginetowardsPractical ControlinCyber-PhysicalSystems
DePT induces a cone-shaped spatial-temporal attention prior,which injects theinformation propagation and aggregation principles and enables a global view. With physical constraint inductive bias baked into its design, our DePT is ready to plug and play for a broad class of multi-agent systems. The experimental results on one of the most challenging CPS - network-scale traffic signal control system in the open world - show that our model outperformed the state-of-the-art expert methods on synthetic and real-world datasets.
- Transportation > Ground > Road (0.66)
- Transportation > Infrastructure & Services (0.48)