On the Expressive Power of Contextual Relations in Transformers

Fraiman, Demián

arXiv.org Machine Learning

Transformer architectures have achieved remarkable empirical success in modeling contextual relationships in natural language, yet a precise mathematical characterization of their expressive power remains incomplete. In this work, we introduce a measure-theoretic framework for contextual representations in which texts are modeled as probability measures over a semantic embedding space, and contextual relations between words are represented as coupling measures between them. Within this setting, we introduce the Sinkhorn Transformer, a transformer-like architecture. Our main result is a universal approximation theorem: any continuous coupling function between probability measures that encodes the semantic relation coupling measure can be uniformly approximated by a Sinkhorn Transformer with appropriate parameters.
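
The abstract does not spell out the architecture, so the following is only an illustrative sketch of what a coupling between two discrete measures looks like, not the paper's construction. It assumes two texts represented as uniform measures over hypothetical token embeddings `X` and `Y`, and uses standard Sinkhorn scaling of a dot-product score kernel to obtain a matrix whose row and column sums match the two measures. The function name `sinkhorn_coupling`, the embedding sizes, and the iteration count are all assumptions made for the example.

```python
import numpy as np

def sinkhorn_coupling(scores, mu, nu, n_iters=200):
    """Scale exp(scores) into a coupling with row sums mu and column sums nu."""
    # Strictly positive kernel; subtracting the max only improves numerical stability.
    K = np.exp(scores - scores.max())
    u = np.ones_like(mu)
    v = np.ones_like(nu)
    for _ in range(n_iters):
        u = mu / (K @ v)      # rescale rows toward mu
        v = nu / (K.T @ u)    # rescale columns toward nu
    return u[:, None] * K * v[None, :]

# Hypothetical data: two "texts" with 4 and 5 tokens in an 8-dimensional embedding space.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
Y = rng.normal(size=(5, 8))
mu = np.full(4, 1 / 4)   # uniform measure over the tokens of the first text
nu = np.full(5, 1 / 5)   # uniform measure over the tokens of the second text

# Attention-style scores, then Sinkhorn scaling instead of a row-wise softmax.
P = sinkhorn_coupling(X @ Y.T / np.sqrt(8), mu, nu)
```

Unlike a row-wise softmax, which normalizes only the rows, the resulting `P` is a joint distribution over token pairs whose marginals are approximately `mu` and `nu`, which is the defining property of a coupling.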



A Additional definitions

Neural Information Processing Systems

We provide the definitions of important terms used throughout the paper. Assumption 2.3 when the demand distribution is exponential. Note that Lemma B.1 implies that In the following result, we show that there exist appropriate constants such that prior distribution satisfies Assumption 2.3 when the demand distribution is a multivariate Gaussian with unknown The proof is a direct consequence of Theorem 3.2, Lemmas B.6, B.7, B.8, B.9, and Proposition 3.2. Theorem 6.19] the prior induced by Assumption 2.2 is a direct consequence of Assumption 2.4 and 2.5 are straightforward to satisfy since the model risk function Lemma B.13. For a given Using the result above together with Proposition 3.2 implies that the RSVB posterior converges at C.1 Alternative derivation of LCVB We present the alternative derivation of LCVB. We prove our main result after a series of important lemmas.