Transformer Fusion with Optimal Transport
Moritz Imfeld, Jacopo Graldi, Marco Giordano, Thomas Hofmann, Sotiris Anagnostidis, Sidak Pal Singh
Fusion is a technique for merging multiple independently trained neural networks in order to combine their capabilities. Past attempts have been restricted to fully-connected, convolutional, and residual networks. In this paper, we present a systematic approach for fusing two or more transformer-based networks that exploits Optimal Transport to (soft-)align their architectural components. We flesh out an abstraction for layer alignment that can, in principle, generalize to arbitrary architectures, and we apply it to the key ingredients of Transformers, namely multi-head self-attention, layer normalization, and residual connections, discussing how to handle each via various ablation studies. Furthermore, our method allows the fusion of models of different sizes (heterogeneous fusion), providing a new and efficient way to compress Transformers. The proposed approach is evaluated on image classification tasks with Vision Transformers and on natural language modeling tasks with BERT. Our approach consistently outperforms vanilla fusion and, after surprisingly short finetuning, also outperforms the individual converged parent models. In our analysis, we uncover intriguing insights about the significant role of soft alignment in the case of Transformers.

Transformers, as introduced by Vaswani et al. (2017), have profoundly impacted machine learning, establishing a prevailing neural network architecture across various domains. Transformers consistently excel in different fields, including natural language processing (Lin et al., 2022), time series forecasting (Wen et al., 2022), and computer vision (Dosovitskiy et al., 2020).
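To make the core idea concrete, below is a minimal sketch (not the authors' implementation) of OT-based soft alignment for a single pair of corresponding linear layers: a small entropy-regularized Sinkhorn solver computes a soft correspondence between the neurons of the two layers, and the second layer's weights are mapped into the first layer's neuron ordering before averaging. The function names and hyperparameters (`sinkhorn`, `reg`, `n_iters`) are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def sinkhorn(cost, reg=0.05, n_iters=200):
    """Entropy-regularized optimal transport between two uniform marginals."""
    n, m = cost.shape
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    K = np.exp(-cost / reg)            # Gibbs kernel
    u = np.ones(n)
    for _ in range(n_iters):           # alternating Sinkhorn scaling updates
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]  # transport plan T of shape (n, m)

def fuse_layers(W_a, W_b):
    """Soft-align the neurons of layer B to layer A, then average the weights."""
    # Cost: pairwise distance between the neurons' incoming weight vectors (rows).
    cost = np.linalg.norm(W_a[:, None, :] - W_b[None, :, :], axis=-1)
    cost = cost / cost.max()            # normalize for numerical stability
    T = sinkhorn(cost)                  # soft correspondence between neurons
    # Rows of n*T sum to ~1, so this expresses B's neurons in A's ordering.
    W_b_aligned = (T * W_a.shape[0]) @ W_b
    return 0.5 * (W_a + W_b_aligned)

# Toy usage: fuse two randomly initialized 8x16 weight matrices.
rng = np.random.default_rng(0)
W_a, W_b = rng.normal(size=(8, 16)), rng.normal(size=(8, 16))
print(fuse_layers(W_a, W_b).shape)      # (8, 16)
```

In a full transformer, the transport plan computed for one layer's outputs would also have to be applied to the incoming weights of the following layer (and handled consistently across attention heads, layer normalization, and residual connections), which is where the paper's layer-alignment abstraction comes in.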
Oct-15-2023