Towards smaller, faster decoder-only transformers: Architectural variants and their implications
Suresh, Sathya Krishnan, P, Shunmugapriya
arXiv.org Artificial Intelligence
Since the debut of ChatGPT, research on Large Language Models (LLMs) has increased notably across a broad range of disciplines, made possible by the accessibility of this technology to a diverse user base. This rapidly growing field has largely pursued two distinct paths: one scales the model size, the training dataset, or both to enhance performance, while the other concentrates on refining smaller models (ranging from 1B to 7B parameters) with high-quality data. Despite these advances, structural modifications of the transformer architecture itself have been relatively overlooked. Recent studies challenge the necessity of perpetually increasing model sizes by demonstrating that the deeper layers of LLMs may have minimal influence on predictive outcomes. In this work, we explore modifications to the decoder-only transformer architecture to address current challenges in the scalability and practical application of LLMs.
Apr-23-2024