Transformers from an Optimization Perspective
Neural Information Processing Systems
Deep learning models such as the Transformer are often constructed by heuristics and experience. To provide a complementary foundation, in this work we study the following problem: Is it possible to find an energy function underlying the Transformer model, such that descent steps along this energy correspond with the Transformer forward pass?
Dec-25-2025, 16:22:57 GMT