Since their introduction in 2017, transformers have become the go-to machine learning architecture for natural language processing (NLP) and computer vision. Although they have achieved state-of-the-art performance in these fields, the theoretical framework underlying transformers remains relatively underexplored.

In the new paper A Probabilistic Interpretation of Transformers, ML Collective researcher Alexander Shim provides a probabilistic explanation of transformers' exponential dot product attention and contrastive learning based on distributions of the exponential family.

An oft-proposed explanation for transformers' power and performance is their attention mechanisms' superior ability to model dependencies in long input sequences. But this does not directly address how and why transformer architecture choices such as exponential dot product attention outperform the alternatives.
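The exponential dot product attention discussed in the paper is the standard softmax attention: query-key dot products are exponentiated and normalized into weights over the values. A minimal NumPy sketch (illustrative only, not code from the paper):

```python
import numpy as np

def softmax_attention(Q, K, V):
    """Scaled dot-product attention: dot products are exponentiated,
    then normalized (softmax) into a weighting over the value vectors."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # pairwise query-key dot products
    # exponential of the dot products, shifted for numerical stability
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V  # convex combination of values

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))  # 4 queries, dimension 8
K = rng.normal(size=(6, 8))  # 6 keys
V = rng.normal(size=(6, 8))  # 6 values
out = softmax_attention(Q, K, V)
print(out.shape)  # (4, 8)
```

The exponential here is what the paper reinterprets probabilistically: each attention weight can be read as a likelihood term from an exponential-family distribution.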
May-20-2022, 07:07:56 GMT