Beyond Parallelism: Synergistic Computational Graph Effects in Multi-Head Attention
–arXiv.org Artificial Intelligence
Yet, the theoretical advantages of multi-head versus single-head attention, beyond mere parallel processing, remain underexplored. In this paper, we reframe multi-head attention as a system of potentially synergistic computational graphs, where each head functions as a feedforward directed acyclic graph (DAG) with a common sink state. We provide intuition and preliminary theoretical analysis of mixing time and minimax fidelity in this framework. Our results show that multi-head attention can synergistically enhance information propagation, yielding faster mixing times and minimax fidelity amplification under specific head-diversity conditions. Finally, we train single-head and multi-head Transformers, each with the same total number of parameters, on sequence manipulation tasks and empirically verify the predicted effects.
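One way to see how a single-head and a multi-head model can share the same total number of parameters is to count the attention weights directly. The sketch below assumes the common convention in which each of the h heads has width d_model / h; the abstract does not state how the parameter budgets were actually matched, so the function name `attention_param_count` and the sizes used are illustrative assumptions only.

```python
def attention_param_count(d_model: int, num_heads: int, bias: bool = False) -> int:
    """Count the weights in one attention layer (hypothetical helper).

    Assumes each head has width d_model // num_heads, which is a common
    convention and not necessarily the setup used in the paper.
    """
    if d_model % num_heads != 0:
        raise ValueError("d_model must be divisible by num_heads")
    d_head = d_model // num_heads

    # Query, key, and value projections: one (d_model x d_head) matrix per head each.
    qkv_weights = 3 * num_heads * d_model * d_head
    # Output projection maps the concatenated heads back to d_model.
    out_weights = (num_heads * d_head) * d_model

    total = qkv_weights + out_weights
    if bias:
        total += 3 * num_heads * d_head + d_model
    return total


single_head = attention_param_count(d_model=64, num_heads=1)
multi_head = attention_param_count(d_model=64, num_heads=8)
print(single_head, multi_head)  # both are 4 * 64 * 64 = 16384
```

Under this convention the head count changes how the projection weights are partitioned, not how many there are, so any difference in behavior between the two models cannot be attributed to capacity alone.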
Nov-11-2025