Sequence Length Independent Norm-Based Generalization Bounds for Transformers
Since Vaswani et al. (2017) introduced the Transformer, it has become one of the preeminent architectures of its time. It has achieved state-of-the-art predictive performance in a variety of fields (Dosovitskiy et al., 2020; Wu et al., 2022; Vaswani et al., 2017; Pettersson and Falkman, 2023), and an implementation of it has even passed the bar exam (Katz et al., 2023). With such widespread use, the theoretical underpinnings of this architecture are of great interest. Specifically, this paper is concerned with bounding the generalization error of the Transformer in supervised learning. Upper bounds on this error help explain how the sample size must scale with different architecture parameters, and they are a common theoretical tool for understanding machine learning algorithms (Kakade et al., 2008; Garg et al., 2020; Truong, 2022; Lin and Zhang, 2019).
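For concreteness, one standard way to formalize the quantity being bounded (a generic textbook definition, not a statement of this paper's specific setup) is the gap between the population risk and the empirical risk of a learned hypothesis $f$:
\[
\mathrm{gen}(f) \;=\; \underbrace{\mathbb{E}_{(x,y)\sim \mathcal{D}}\big[\ell(f(x),y)\big]}_{\text{population risk}} \;-\; \underbrace{\frac{1}{n}\sum_{i=1}^{n} \ell(f(x_i),y_i)}_{\text{empirical risk on } n \text{ samples}},
\]
where $\ell$ is a loss function and $\{(x_i,y_i)\}_{i=1}^{n}$ are i.i.d. training examples drawn from $\mathcal{D}$. Norm-based bounds control this gap in terms of norms of the network's weight matrices rather than raw parameter counts.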