Beyond the Permutation Symmetry of Transformers: The Role of Rotation for Model Fusion

Binchi Zhang, Zaiyi Zheng, Zhengzhang Chen, Jundong Li

arXiv.org Artificial Intelligence 

Symmetry in the parameter space of deep neural networks (DNNs) has proven beneficial for various deep learning applications. A well-known example is the permutation symmetry in Multi-Layer Perceptrons (MLPs), where permuting the rows of weight matrices in one layer and applying the inverse permutation to adjacent layers yields a functionally equivalent model. While permutation symmetry fully characterizes the equivalence set for MLPs, its discrete nature limits its utility for transformers. In this paper, we introduce rotation symmetry, a novel form of parameter space symmetry for transformers that generalizes permutation symmetry.

For instance, in a two-layer MLP, permuting the rows of the weight matrix in the first layer and applying the corresponding inverse permutation to the second layer results in a functionally equivalent model, i.e., the outputs of the original and permuted models remain identical for any given input (Ainsworth et al., 2023). All functionally equivalent models corresponding to weight permutations form an equivalence set, which provides theoretical insights into neural network optimization, such as the linear mode connectivity of loss landscapes (Entezari et al., 2022; Zhou et al., 2023). In addition, permutation symmetry has also proven helpful in advancing neural network applications, such as model fusion (Singh & Jaggi, 2020; Ainsworth et al., 2023) and optimization (Zhao et al., 2024).
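The permutation symmetry described above can be checked numerically. The following minimal NumPy sketch (not taken from the paper; the dimensions, seed, and helper function mlp are illustrative) builds a small two-layer ReLU MLP, permutes the rows of the first-layer weights and bias, applies the inverse permutation to the columns of the second-layer weights, and verifies that the outputs are unchanged.

# Minimal sketch of MLP permutation symmetry: f(x) = W2 @ relu(W1 @ x + b1) + b2.
# Permuting the hidden units with P in layer 1 and applying P^{-1} = P^T in
# layer 2 leaves f(x) unchanged, since ReLU acts elementwise and P.T @ P = I.
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden, d_out = 4, 8, 3

W1 = rng.standard_normal((d_hidden, d_in))
b1 = rng.standard_normal(d_hidden)
W2 = rng.standard_normal((d_out, d_hidden))
b2 = rng.standard_normal(d_out)

def mlp(x, W1, b1, W2, b2):
    # Two-layer MLP with an elementwise ReLU nonlinearity.
    return W2 @ np.maximum(W1 @ x + b1, 0.0) + b2

# Random permutation of the hidden units, expressed as a permutation matrix.
perm = rng.permutation(d_hidden)
P = np.eye(d_hidden)[perm]

W1_p, b1_p = P @ W1, P @ b1   # permute rows of the first layer
W2_p = W2 @ P.T               # apply the inverse permutation to the second layer

x = rng.standard_normal(d_in)
assert np.allclose(mlp(x, W1, b1, W2, b2), mlp(x, W1_p, b1_p, W2_p, b2))
print("Outputs match: the permuted model is functionally equivalent.")

The equivalence holds because the elementwise nonlinearity commutes with permutations and P.T @ P is the identity; rotation symmetry, as the abstract indicates, generalizes this discrete notion of equivalence for transformers.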