Why Are Positional Encodings Nonessential for Deep Autoregressive Transformers? Revisiting a Petroglyph
arXiv.org Artificial Intelligence
Do autoregressive Transformer language models require explicit positional encodings (PEs)? The answer is "no" as long as they have more than one layer: without explicit PEs, such models can still distinguish sequences whose tokens have been permuted. This property has been known since early efforts (contemporary with GPT-2) to adopt the Transformer for language modeling. However, the result does not appear to have been well disseminated, and it was even rediscovered recently. This may be partly due to the sudden growth of the language modeling community after the advent of GPT-2, but perhaps also due to the lack of a clear explanation in prior publications, even though the property was commonly understood by practitioners at the time. Here we review this long-forgotten explanation of why explicit PEs are nonessential for multi-layer autoregressive Transformers (in contrast, one-layer models do require PEs to discern the order of their input tokens). We also review the origin of this result, and we hope to re-establish it as common knowledge.
Dec-31-2024
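To make the paper's claim concrete, here is a minimal numpy sketch (our illustration, not code from the paper): a single-head causal self-attention layer with no PEs, stripped of residual connections, normalization, and feed-forward blocks. With one layer, the output at the last position is invariant to permutations of the preceding tokens, because it is a softmax-weighted sum over an unordered set of key/value pairs. With a second layer stacked on top, the permutation becomes detectable, because the first layer's intermediate outputs depend on which tokens each position was allowed to attend to.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # model width (arbitrary for this illustration)

# Fixed random projection weights for two attention layers; no PEs anywhere.
Wq = [rng.standard_normal((d, d)) for _ in range(2)]
Wk = [rng.standard_normal((d, d)) for _ in range(2)]
Wv = [rng.standard_normal((d, d)) for _ in range(2)]

def causal_attention(x, Wq, Wk, Wv):
    """Single-head causal self-attention over x of shape (seq_len, d)."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(d)
    # Causal mask: position t attends only to positions <= t.
    scores[np.triu(np.ones_like(scores, dtype=bool), k=1)] = -np.inf
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

# Embeddings for three distinct tokens a, b, c; no position information.
emb = rng.standard_normal((3, d))
seq_abc = emb[[0, 1, 2]]  # a b c
seq_bac = emb[[1, 0, 2]]  # b a c  (prefix permuted, last token fixed)

# One layer: the last position sees the same *multiset* of keys/values
# under both orders, so its output is identical -> order is invisible.
h_abc = causal_attention(seq_abc, Wq[0], Wk[0], Wv[0])
h_bac = causal_attention(seq_bac, Wq[0], Wk[0], Wv[0])
print(np.allclose(h_abc[-1], h_bac[-1]))  # True

# Two layers: intermediate positions attended to different prefixes in
# layer 1, so layer 2's last-position outputs differ -> order recovered.
g_abc = causal_attention(h_abc, Wq[1], Wk[1], Wv[1])
g_bac = causal_attention(h_bac, Wq[1], Wk[1], Wv[1])
print(np.allclose(g_abc[-1], g_bac[-1]))  # False
```

The same reasoning should carry over to full Transformer blocks: residual connections, layer normalization, and MLPs are position-wise maps, so they neither break the one-layer invariance nor prevent the second layer from exploiting the position-dependent intermediate states.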