Implementing a Transformer From Scratch
To get intimately familiar with the nuts and bolts of transformers, I decided to implement the original architecture from Vaswani et al.'s "Attention Is All You Need" paper from scratch. I thought I knew everything there was to know, but to my surprise, I ran into several unexpected implementation details that helped me better understand how everything works under the hood. The goal of this post is not to discuss the entire implementation (there are plenty of great resources for that) but to highlight seven things that I found particularly surprising or insightful, and that you might not know about. I will make this concrete by linking to specific lines in my code. The code should be easy to follow: it is well documented, and it is automatically unit tested and type checked using GitHub Actions.
Mar-24-2022, 00:19:36 GMT