On Zero-Initialized Attention: Optimal Prompt and Gating Factor Estimation
Diep, Nghiem T., Nguyen, Huy, Nguyen, Chau, Le, Minh, Nguyen, Duy M. H., Sonntag, Daniel, Niepert, Mathias, Ho, Nhat
arXiv.org Artificial Intelligence
The LLaMA-Adapter has recently emerged as an efficient fine-tuning technique for LLaMA models, leveraging zero-initialized attention to stabilize training and enhance performance. However, despite its empirical success, the theoretical foundations of zero-initialized attention remain largely unexplored. In this paper, we provide a rigorous theoretical analysis, establishing a connection between zero-initialized attention and mixture-of-experts models. We prove that both linear and non-linear prompts, along with gating functions, can be optimally estimated, with non-linear prompts offering greater flexibility for future applications. Empirically, we validate our findings on open LLM benchmarks, demonstrating that non-linear prompts outperform linear ones. Notably, even with limited training data, both prompt types consistently surpass vanilla attention, highlighting the robustness and adaptability of zero-initialized attention.
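To make the mechanism concrete, the following is a minimal NumPy sketch of gated prompt attention in the spirit of zero-initialized attention. It is a simplified additive variant, not the paper's exact formulation (LLaMA-Adapter gates the prompt scores inside a concatenated softmax): the prompt branch is scaled by `tanh(gate)` with `gate` initialized to zero, so at initialization the layer reduces exactly to vanilla attention over the text tokens. All names (`zero_init_attention`, `k_prompt`, etc.) are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def zero_init_attention(q, k_txt, v_txt, k_prompt, v_prompt, gate):
    """Simplified zero-initialized attention: text attention plus a
    tanh-gated attention branch over learnable adapter prompts."""
    d = q.shape[-1]
    # standard scaled dot-product attention over the text tokens
    out_txt = softmax(q @ k_txt.T / np.sqrt(d)) @ v_txt
    # attention over the adapter prompt tokens, scaled by tanh(gate);
    # with gate = 0 this branch contributes nothing, so training
    # starts from the behavior of the frozen vanilla attention
    out_prompt = softmax(q @ k_prompt.T / np.sqrt(d)) @ v_prompt
    return out_txt + np.tanh(gate) * out_prompt
```

Because `tanh(0) = 0`, the adapter injects the prompt signal gradually as the gating factor is learned, which is the stabilization property the paper analyzes.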
Feb-5-2025