Start Making Sense(s): A Developmental Probe of Attention Specialization Using Lexical Ambiguity
Rivière, Pamela D., Trott, Sean
Despite an in-principle understanding of self-attention matrix operations in Transformer language models (LMs), it remains unclear precisely how these operations map onto interpretable computations or functions--and how or when individual attention heads develop specialized attention patterns. Here, we present a pipeline to systematically probe attention mechanisms, and we illustrate its value by leveraging lexical ambiguity--where a single word has multiple meanings--to isolate attention mechanisms that contribute to word sense disambiguation. We take a "developmental" approach: first, using publicly available Pythia LM checkpoints, we identify inflection points in disambiguation performance for each LM in the suite; in the 14M and 410M models, we then identify heads whose attention to disambiguating words covaries with overall disambiguation performance across development. We next stress-test the robustness of these heads to stimulus perturbations: in 14M we find limited robustness, but in 410M we identify multiple heads with surprisingly generalizable behavior. In a causal analysis, we find that ablating the target heads demonstrably impairs disambiguation performance, particularly in 14M. We additionally reproduce the developmental analyses of 14M across all of its random seeds. Together, these results suggest that disambiguation benefits from a constellation of mechanisms, some of which (especially in 14M) are highly sensitive to the position and part-of-speech of the disambiguating cue, and that larger models (410M) may contain heads with more robust disambiguation behavior. They also join a growing body of work highlighting the value of adopting a developmental perspective when probing LM mechanisms.
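As a concrete illustration of the ablation step, the sketch below zero-ablates a single attention head in a public Pythia checkpoint and compares the log-probability of a sense-consistent continuation with and without the ablation. It assumes the transformer_lens library; the layer/head indices and the example stimulus are illustrative placeholders, not the paper's identified heads or materials.

```python
# Minimal sketch of zero-ablating one attention head in a Pythia
# checkpoint and measuring the effect on a disambiguation probe.
# Assumes transformer_lens; layer/head indices and the stimulus below
# are illustrative placeholders.
from functools import partial

from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("pythia-14m")

def zero_head(z, hook, head_idx):
    # z has shape [batch, position, head_index, d_head]
    z[:, :, head_idx, :] = 0.0
    return z

def continuation_logprob(prompt, continuation, fwd_hooks=()):
    tokens = model.to_tokens(prompt + continuation)
    n_prompt = model.to_tokens(prompt).shape[1]
    logits = model.run_with_hooks(tokens, fwd_hooks=list(fwd_hooks))
    logprobs = logits.log_softmax(dim=-1)
    # Score each continuation token given the preceding context.
    total = 0.0
    for pos in range(n_prompt, tokens.shape[1]):
        total += logprobs[0, pos - 1, tokens[0, pos]].item()
    return total

prompt = "She deposited the check at the bank, which was"  # "bank" is ambiguous
continuation = " closed for the holiday"
layer, head = 3, 2  # placeholder target head

baseline = continuation_logprob(prompt, continuation)
ablated = continuation_logprob(
    prompt, continuation,
    fwd_hooks=[(f"blocks.{layer}.attn.hook_z", partial(zero_head, head_idx=head))],
)
print(f"log p(continuation): baseline={baseline:.3f}, ablated={ablated:.3f}")
```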
We thank all the reviewers for their constructive comments and useful suggestions. Q (R1): "Comparison with other methods like encoder" & "why do we need this technique". This is an important point that we will clarify and expand on in the paper. Compared to GD-based methods, our algorithm is much more efficient; see the appendix for timing comparisons.
Leveraging Zipformer Model for Effective Language Identification in Code-Switched Child-Directed Speech
Shankar, Lavanya, Perera, Leibny Paola Garcia
This paper addresses the challenge of language identification in code-switched child-directed speech by using a Zipformer to handle utterances containing two imbalanced languages, Mandarin and English. This work demonstrates that the internal layers of the Zipformer effectively encode language characteristics, which can be leveraged for language identification. We present a methodology for selecting the inner layers from which to extract embeddings, and we compare them across different back-ends. Our analysis shows that the Zipformer is robust across these back-ends. Our approach effectively handles the imbalanced data, achieving a Balanced Accuracy (BAC) of 81.89%, a 15.47% improvement over the language identification baseline.
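To make the back-end comparison concrete, the sketch below trains two simple classifiers on utterance-level embeddings and scores them with balanced accuracy. The embedding matrix X and labels y are random stand-ins for features actually pooled from an internal Zipformer layer; sklearn's class_weight="balanced" option is one plausible way to handle the language imbalance.

```python
# Sketch of comparing back-end classifiers on layer embeddings for
# language identification. X and y are placeholders for the real
# extraction pipeline: X would hold [n_utterances, dim] embeddings
# pooled from one internal Zipformer layer.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 256))                        # stand-in embeddings
y = rng.choice(["en", "zh"], size=1000, p=[0.8, 0.2])   # imbalanced labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# class_weight="balanced" counteracts the language imbalance.
for name, clf in [
    ("logreg", LogisticRegression(max_iter=1000, class_weight="balanced")),
    ("svm", SVC(class_weight="balanced")),
]:
    clf.fit(X_tr, y_tr)
    bac = balanced_accuracy_score(y_te, clf.predict(X_te))
    print(f"{name}: BAC = {bac:.4f}")
```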
Enhancing variational quantum algorithms by balancing training on classical and quantum hardware
Bhowmick, Rahul, Wadhwa, Harsh, Singh, Avinash, Sidana, Tania, Tran, Quoc Hoan, Sabapathy, Krishna Kumar
Quantum computers offer a promising route to tackling problems that are classically intractable, such as prime factorization, large-scale linear algebra, and the simulation of complex quantum systems, but they require fault-tolerant quantum hardware. Variational quantum algorithms (VQAs), on the other hand, have the potential to provide a near-term route to quantum utility or advantage, and are usually constructed using parametrized quantum circuits (PQCs) in combination with a classical optimizer for training. Although VQAs have been proposed for a multitude of tasks such as ground-state estimation, combinatorial optimization, and unitary compilation, major challenges remain in their trainability and resource costs on quantum hardware. Here we address these challenges by adopting a Hardware Efficient and dynamical LIe algebra Supported Ansatz (HELIA), and we propose two training schemes that combine an existing g-sim method (which uses the underlying group structure of the operators) with the Parameter-Shift Rule (PSR). Our improvement comes from distributing the resources required for gradient estimation and training across both classical and quantum hardware. We numerically test our proposal for ground-state estimation using the Variational Quantum Eigensolver (VQE) and for classification of quantum phases using quantum neural networks. Our methods show better accuracy and a higher rate of successful trials, and on average they need fewer calls to the quantum hardware (up to a 60% reduction) than using PSR alone, which runs exclusively on quantum hardware. We also numerically demonstrate the capability of HELIA in mitigating barren plateaus, paving the way for training large-scale quantum models.
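The quantum-hardware half of such a hybrid scheme rests on the Parameter-Shift Rule. The sketch below demonstrates PSR on a toy single-qubit circuit RY(theta)|0> measured in Z, where the expectation is cos(theta) and the exact gradient is known; the expectation function is a stand-in for a call to quantum hardware, and no claim is made here about the paper's actual g-sim/PSR split.

```python
# Sketch of the parameter-shift rule (PSR) for gradient estimation.
# For the toy circuit RY(theta)|0> with observable Z, the expectation
# is E(theta) = cos(theta), so the exact gradient is -sin(theta);
# PSR recovers it from two expectation evaluations.
import numpy as np

def expectation(theta):
    # Stand-in for a call to quantum hardware or a simulator.
    return np.cos(theta)

def psr_gradient(theta, shift=np.pi / 2):
    # Exact for gates generated by Pauli operators.
    return (expectation(theta + shift) - expectation(theta - shift)) / 2.0

theta = 0.7
print("PSR gradient:  ", psr_gradient(theta))
print("exact gradient:", -np.sin(theta))
```

In the paper's scheme, evaluations like `expectation` would be split between quantum hardware (for PSR) and classical simulation of the operator group structure (for g-sim); only the PSR side is shown here.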
Powerformer: A Transformer with Weighted Causal Attention for Time-series Forecasting
Hegazy, Kareem, Mahoney, Michael W., Erichson, N. Benjamin
Transformers have recently shown strong performance in time-series forecasting, but their all-to-all attention mechanism overlooks the (temporal) causal and often (temporally) local nature of data. We introduce Powerformer, a novel Transformer variant that replaces noncausal attention weights with causal weights that are reweighted according to a smooth heavy-tailed decay. This simple yet effective modification endows the model with an inductive bias favoring temporally local dependencies, while still allowing sufficient flexibility to learn the unique correlation structure of each dataset. Our empirical results demonstrate that Powerformer not only achieves state-of-the-art accuracy on public time-series benchmarks but also offers improved interpretability of attention patterns. Our analyses show that the model's locality bias is amplified during training, demonstrating an interplay between time-series data and power-law-based attention. These findings highlight the importance of domain-specific modifications to the Transformer architecture for time-series forecasting, and they establish Powerformer as a strong, efficient, and principled baseline for future research and real-world applications.
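One plausible instantiation of the idea (an assumption, not necessarily the paper's exact formulation) is to compute standard causal attention and then reweight the attention probabilities by a power-law decay in the query-key lag, as sketched below; the decay exponent p and the post-softmax placement of the decay are illustrative choices.

```python
import torch

def powerlaw_causal_attention(q, k, v, p=1.0):
    """Causal attention with weights decayed as (1 + lag)^(-p).

    q, k, v: [batch, seq, dim]. The decay form and its post-softmax
    placement are illustrative assumptions.
    """
    b, t, d = q.shape
    scores = q @ k.transpose(-2, -1) / d**0.5           # [b, t, t]
    causal = torch.tril(torch.ones(t, t, dtype=torch.bool))
    scores = scores.masked_fill(~causal, float("-inf"))
    attn = scores.softmax(dim=-1)

    lag = torch.arange(t)[:, None] - torch.arange(t)[None, :]  # query - key
    decay = (1.0 + lag.clamp(min=0).float()) ** (-p)    # heavy-tailed decay
    attn = attn * decay
    attn = attn / attn.sum(dim=-1, keepdim=True)        # renormalize rows
    return attn @ v

q = k = v = torch.randn(2, 16, 32)
out = powerlaw_causal_attention(q, k, v, p=0.5)
print(out.shape)  # torch.Size([2, 16, 32])
```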
Low-Rank Agent-Specific Adaptation (LoRASA) for Multi-Agent Policy Learning
Zhang, Beining, Kapoor, Aditya, Sun, Mingfei
Multi-agent reinforcement learning (MARL) often relies on parameter sharing (PS) to scale efficiently. However, purely shared policies can stifle each agent's unique specialization, reducing overall performance in heterogeneous environments. We propose Low-Rank Agent-Specific Adaptation (LoRASA), a novel approach that treats each agent's policy as a specialized "task" fine-tuned from a shared backbone. Drawing inspiration from parameter-efficient transfer methods, LoRASA appends small, low-rank adaptation matrices to each layer of the shared policy, naturally inducing parameter-space sparsity that promotes both specialization and scalability. We evaluate LoRASA on challenging benchmarks including the StarCraft Multi-Agent Challenge (SMAC) and Multi-Agent MuJoCo (MAMuJoCo), implementing it atop widely used algorithms such as MAPPO and A2PO. Across diverse tasks, LoRASA matches or outperforms existing baselines while reducing memory and computational overhead. Ablation studies on adapter rank, placement, and timing validate the method's flexibility and efficiency. Our results suggest LoRASA's potential to establish a new norm for MARL policy parameterization: combining a shared foundation for coordination with low-rank agent-specific refinements for individual specialization.
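The core parameterization can be sketched as a shared linear layer plus a per-agent low-rank update, so that agent i effectively uses W + B_i A_i. The module below is a minimal illustration; the rank, initialization, and layout are generic LoRA-style assumptions rather than LoRASA's exact design.

```python
import torch
import torch.nn as nn

class LoRAAgentLinear(nn.Module):
    """Shared linear layer with per-agent low-rank adapters.

    Agent i's effective weight is W + B_i @ A_i, where A_i and B_i are
    rank-r matrices. Zero-initializing B (so adapters start as no-ops)
    follows common LoRA practice and is an illustrative assumption.
    """
    def __init__(self, d_in, d_out, n_agents, rank=4):
        super().__init__()
        self.shared = nn.Linear(d_in, d_out)
        self.A = nn.Parameter(torch.randn(n_agents, rank, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(n_agents, d_out, rank))

    def forward(self, x, agent_id):
        # x: [batch, d_in]; adapter output is B_i (A_i x).
        delta = x @ self.A[agent_id].T @ self.B[agent_id].T
        return self.shared(x) + delta

layer = LoRAAgentLinear(d_in=64, d_out=64, n_agents=8, rank=4)
obs = torch.randn(32, 64)
out = layer(obs, agent_id=3)
print(out.shape)  # torch.Size([32, 64])
```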
Attention Sinks and Outlier Features: A 'Catch, Tag, and Release' Mechanism for Embeddings
Zhang, Stephen, Khan, Mustafa, Papyan, Vardan
Two prominent features of large language models (LLMs) are the presence of large-norm (outlier) features and the tendency for tokens to attend very strongly to a select few tokens. Despite often having no semantic relevance, these select tokens, called attention sinks, along with the large outlier features, have proven important for model performance, compression, and streaming. Consequently, investigating the roles of these phenomena within models, and exploring how they might manifest in the model parameters, has become an area of active interest. Through an empirical investigation, we demonstrate that attention sinks utilize outlier features to: catch a sequence of tokens, tag the captured tokens by applying a common perturbation, and then release the tokens back into the residual stream, where the tagged tokens are eventually retrieved. We prove that simple tasks, like averaging, necessitate the 'catch, tag, and release' mechanism, hence explaining why it arises organically in modern LLMs. Our experiments also show that the creation of attention sinks can be completely captured in the model parameters using low-rank matrices, which has important implications for model compression and substantiates the success of recent approaches that incorporate a low-rank term to offset performance degradation.
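A simple way to see sink behavior empirically is to measure, per head, the average attention mass assigned to the first token of a sequence. The sketch below does this with the transformer_lens library; the model choice and the 0.5 cutoff for flagging "sink-like" heads are arbitrary placeholders.

```python
# Sketch of quantifying attention-sink strength per head: the average
# attention mass each head places on the first token. The model and
# the 0.5 threshold are placeholders.
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
tokens = model.to_tokens("The quick brown fox jumps over the lazy dog.")
_, cache = model.run_with_cache(tokens)

for layer in range(model.cfg.n_layers):
    pattern = cache["pattern", layer]  # [batch, head, query_pos, key_pos]
    # Average attention each head pays to key position 0, skipping
    # query 0 (which can only attend to itself).
    sink_mass = pattern[0, :, 1:, 0].mean(dim=-1)
    for head in range(sink_mass.shape[0]):
        mass = sink_mass[head].item()
        if mass > 0.5:  # arbitrary cutoff for "sink-like" heads
            print(f"L{layer}H{head}: mean attention to first token = {mass:.2f}")
```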
Bayesian Optimization with Preference Exploration by Monotonic Neural Network Ensemble
Wang, Hanyang, Branke, Juergen, Poloczek, Matthias
Many real-world black-box optimization problems have multiple conflicting objectives. Rather than attempting to approximate the entire set of Pareto-optimal solutions, interactive preference learning, i.e., optimization with a decision maker in the loop, allows the search to focus on the most relevant subset. However, few previous studies have exploited the fact that utility functions are usually monotonic. In multi-objective optimization (MOO), there is usually not a single optimal solution, but a range of so-called Pareto-optimal or non-dominated solutions with different trade-offs. A widely adopted approach aims to search for a good representation of these Pareto-optimal solutions by maximizing their hypervolume. Two prominent methods stand out in this regard: ParEGO (Knowles, 2006), which employs random augmented Chebyshev scalarizations for optimization in each iteration, and expected hypervolume maximization (Yang et al., 2019; Daulton et al., 2020), which directly maximizes the hypervolume.
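The monotonicity ingredient can be illustrated with a small network whose weights are constrained to be positive, making the learned utility non-decreasing in every objective. The construction below (softplus-reparameterized weights with monotone activations) is a generic sketch, not the paper's exact ensemble architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MonotonicMLP(nn.Module):
    """MLP that is non-decreasing in every input dimension.

    Each weight matrix is made positive via softplus, and the
    activations are monotone, so the composition is monotone. A generic
    construction, offered only to illustrate the monotonic-utility idea.
    """
    def __init__(self, d_in, hidden=32):
        super().__init__()
        self.w1 = nn.Parameter(torch.randn(hidden, d_in) * 0.1)
        self.b1 = nn.Parameter(torch.zeros(hidden))
        self.w2 = nn.Parameter(torch.randn(1, hidden) * 0.1)
        self.b2 = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        h = torch.tanh(x @ F.softplus(self.w1).T + self.b1)
        return h @ F.softplus(self.w2).T + self.b2

# Utility over two objectives; higher objective values never lower utility.
net = MonotonicMLP(d_in=2)
f = torch.tensor([[0.2, 0.8], [0.3, 0.9]])  # second point dominates the first
u = net(f)
assert u[1] >= u[0]  # guaranteed by the monotone construction
print(u.squeeze(-1))
```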