Arbitrary-Length Generalization for Addition in a Tiny Transformer

Jun-11-2024–arXiv.org Machine Learning

The Transformer architecture, introduced by Vaswani et al. (2017), appears sufficiently robust to learn to generalize addition, a fundamental operation (a + b = c) taught in elementary school. However, Nogueira et al. (2021) demonstrated that Transformers struggle to generalize this simple procedure. Although some researchers have used both simplified and complex scratchpads to aid in training Transformers (Nye et al., 2021; Lee et al., 2024), they have not achieved generalization to numbers of arbitrary digit length. More recently, McLeish et al. (2024) argued that, by adding an embedding for each digit that encodes its position relative to the start of the number, it is possible to train Transformers on 20-digit numbers and achieve approximately 99% accuracy on addition problems involving up to 100 digits. However, the authors do not study accuracy for numbers exceeding 100 digits, which leaves open the question of how well this approach scales to even larger numbers. This gap is an opportunity for future research to explore the limits of Transformer generalization in arithmetic operations.

I would like to thank Fernanda Cristiane de Oliveira for helping me make parts of this work clearer.
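The idea of a per-digit position "relative to the start of the number", attributed above to McLeish et al. (2024), can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `digit_position_ids`, the 1-based offsets, and the use of 0 for non-digit characters are assumptions made purely for illustration. In a real model, indices like these would be used to look up a learned embedding that is added to each token.

```python
def digit_position_ids(text: str) -> list[int]:
    """Give each digit its 1-based offset within its own number.

    Non-digit characters (such as '+' and '=') get index 0 and reset
    the counter, so every number starts counting again from 1.
    Hypothetical helper for illustration only.
    """
    ids = []
    offset = 0
    for ch in text:
        if ch in "0123456789":
            offset += 1
            ids.append(offset)
        else:
            offset = 0
            ids.append(0)
    return ids


print(digit_position_ids("123+45=168"))
# prints [1, 2, 3, 0, 1, 2, 0, 1, 2, 3]
```

Because the index depends only on where a digit sits within its own number, the same index values recur no matter where the number appears in the sequence. That is the property the paragraph describes.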

digit, second type, transformer

Country:
- South America > Brazil
  - São Paulo (0.05)
- Europe > Austria
  - Vienna (0.14)

Genre:
- Research Report (0.65)

Industry:
- Education > Educational Setting > K-12 Education (0.34)

Technology:
- Information Technology > Artificial Intelligence > Machine Learning
  - Neural Networks (0.34)
  - Inductive Learning (0.33)
