Understanding Counting in Small Transformers: The Interplay between Attention and Feed-Forward Layers
