Notes on the Mathematical Structure of GPT LLM Architectures
arXiv.org Artificial Intelligence
Introduction

When considered from a purely mathematical point of view, building and training a large (transformer) language model (LLM) amounts to constructing a function - which can be taken to be a map from one Euclidean space to another - with certain interesting properties. From a mathematician's point of view, it can therefore be frustrating that many key papers announcing significant new LLMs seem reluctant to spell out the details of the function they have constructed, whether in plain mathematical language or even in complete pseudo-code (the latter form of this complaint appears to be one of the motivations behind a recent article of Phuong and Hutter [1]). Here, we seek to give a relatively 'pure' mathematical description of the architecture of a GPT-3-style LLM. The architecture depends on parameters taking values in a space Θ; there is then a separate process - the training of the model - in which a particular value θ ∈ Θ is selected using a training algorithm. We will draw attention to such parameters as we introduce them, rather than attempting to give a definition of Θ up front.
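The framing above can be written compactly as a pair of displays; the symbols f, n, and m below are illustrative choices, not notation fixed by the source text:

```latex
% Illustrative sketch: the architecture defines a parameterised family of maps
\[
  f \colon \Theta \times \mathbb{R}^{n} \longrightarrow \mathbb{R}^{m},
\]
% and training selects a particular parameter value $\theta^{*}$,
% yielding the trained model as a single map between Euclidean spaces:
\[
  f_{\theta^{*}} \colon \mathbb{R}^{n} \longrightarrow \mathbb{R}^{m},
  \qquad \theta^{*} \in \Theta .
\]
```

In this reading, "architecture" specifies f (and the spaces involved), while "training" is the selection of θ* from Θ.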
Oct-25-2024