Transformers, parallel computation, and logarithmic depth

Clayton Sanford, Daniel Hsu, Matus Telgarsky

arXiv.org Artificial Intelligence 

The transformer (Vaswani et al., 2017) has emerged as the dominant neural architecture for many sequential modeling tasks such as machine translation (Radford et al., 2019) and protein folding (Jumper et al., 2021). Reasons for the success of transformers include their suitability for modern hardware and their training stability: unlike in recurrent models, inference and training can be efficiently parallelized across sequence positions, and training is less vulnerable to vanishing and exploding gradients. However, the advantages of transformers over other neural architectures can be understood more fundamentally through the lens of representation, which regards neural networks as parameterized functions and asks what they can efficiently compute. Many previous theoretical studies of transformers establish (approximation-theoretic and computational) universality properties, but only at large model sizes (Yun et al., 2020; Pérez et al., 2021). These results are not unique to transformers and reveal little about which tasks can be solved in a size-efficient manner.
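As a minimal illustrative sketch (not from the paper), the snippet below contrasts the sequential recurrence of an RNN with the parallel matrix computation of single-head self-attention; the dimensions and weight names (W_h, W_q, etc.) are arbitrary placeholders chosen only to make the parallelization point concrete.

```python
import numpy as np

T, d = 8, 16                      # sequence length, embedding dimension (illustrative)
rng = np.random.default_rng(0)
X = rng.standard_normal((T, d))   # token embeddings, one row per position

# RNN: each hidden state depends on the previous one, so the T steps must run in order.
W_h = rng.standard_normal((d, d)) / np.sqrt(d)
W_x = rng.standard_normal((d, d)) / np.sqrt(d)
h = np.zeros(d)
for t in range(T):                # inherently sequential: T dependent steps
    h = np.tanh(W_h @ h + W_x @ X[t])

# Self-attention: queries, keys, and values for all positions are computed at once,
# and the T x T attention pattern reduces to a few matrix products with no sequential
# dependence across positions, which is what makes it easy to parallelize on modern hardware.
W_q, W_k, W_v = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
Q, K, V = X @ W_q, X @ W_k, X @ W_v
scores = Q @ K.T / np.sqrt(d)
attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
attn /= attn.sum(axis=-1, keepdims=True)
Y = attn @ V                      # outputs for all T positions, computed together
```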