Arbitrary-Length Generalization for Addition in a Tiny Transformer
The Transformer architecture, as introduced by Vaswani et al. (2017), would appear expressive enough to learn addition (a+b=c), a fundamental operation taught in elementary school, and to generalize it to unseen inputs. However, Nogueira et al. (2021) demonstrated that Transformers struggle to generalize this simple procedure. Although some researchers have used both simplified and complex scratchpads to aid in training Transformers (Nye et al., 2021; Lee et al., 2024), they have not achieved generalization to numbers of arbitrary digit length. More recently, McLeish et al. (2024) argued that, by adding an embedding for each digit that encodes its position relative to the start of the number, it is possible to train Transformers on 20-digit numbers and achieve approximately 99% accuracy on addition problems involving up to 100 digits. However, the authors do not study the accuracy for numbers exceeding 100 digits, which leaves open the question of whether this approach scales to even larger numbers. This gap presents a significant opportunity for future research to explore the limits of Transformer generalization in arithmetic operations.
I would like to thank Fernanda Cristiane de Oliveira for helping me to make parts of this work clearer.
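To make the digit-position idea attributed to McLeish et al. (2024) more concrete, the following is a minimal sketch of how a per-digit position index, counted from the start of each number, could be computed for an input string. The function name `digit_positions`, the character-level tokenization, and the convention of assigning 0 to non-digit characters are assumptions made for illustration; this is not the authors' implementation, and it covers only the index computation, not the embedding lookup or the training setup.

```python
DIGITS = "0123456789"


def digit_positions(text: str) -> list[int]:
    """Return, for each character, its 1-based position within its number.

    Non-digit characters (operators, '=', spaces) get index 0. This is a
    hypothetical convention chosen for the sketch, not one from the source.
    """
    positions = []
    run = 0  # how many consecutive digits we have seen in the current number
    for ch in text:
        if ch in DIGITS:
            run += 1
            positions.append(run)
        else:
            run = 0  # a non-digit ends the current number
            positions.append(0)
    return positions


print(digit_positions("12+345="))
```

In a model, each index would typically select a learned vector that is added to the corresponding token embedding, so that digits in the same place within different operands share a positional signal regardless of where the operands sit in the sequence.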
Jun-11-2024