Supernova: Achieving More with Less in Transformer Architectures

Tanase, Andrei-Valentin, Pelican, Elena

arXiv.org Artificial Intelligence 

The transformer architecture [1] has fundamentally transformed natural language processing, establishing itself as the dominant paradigm for language modeling and understanding tasks. However, the field's trajectory toward ever-larger models has created significant computational and economic challenges. Contemporary models such as OpenAI's GPT series, Anthropic's Claude, and Google's Gemini have pushed parameter counts into the hundreds of billions, resulting in unprecedented infrastructure costs that increasingly exceed the economic value these models generate in many practical applications. This scaling trajectory has reached a critical inflection point where the marginal benefits of additional parameters diminish rapidly while computational requirements grow exponentially.

Despite this economic reality, there has been surprisingly limited systematic exploration of compact, efficient transformer architectures that could deliver comparable performance at sustainable computational costs. The prevailing assumption that model quality scales monotonically with parameter count has created a significant research gap in the sub-billion parameter regime, leaving unexplored the potential for architectural innovation to compensate for reduced scale.

In this work, we challenge this scaling paradigm by presenting Supernova, a 650M parameter decoder-only transformer that demonstrates how careful architectural design and tokenization innovation can achieve performance comparable to significantly larger models while maintaining computational efficiency. Our approach is grounded in three fundamental principles: architectural efficiency through modern component integration, superior tokenization design, and dramatic improvements in data efficiency.
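As a rough illustration of the scale involved, the parameter count of a decoder-only transformer can be approximated directly from its configuration. The sketch below shows how a model in the ~650M range might be composed; the specific dimensions (vocabulary size, hidden width, layer count) are hypothetical choices for illustration, not Supernova's published configuration.

```python
# Back-of-the-envelope parameter count for a decoder-only transformer.
# All configuration values below are hypothetical, chosen only to land
# near the ~650M scale discussed in the text; they are NOT Supernova's
# actual configuration.

def decoder_param_count(vocab_size: int, d_model: int, n_layers: int) -> int:
    """Approximate parameter count, ignoring biases and norm weights."""
    # Tied input/output embedding matrix: vocab_size x d_model.
    embeddings = vocab_size * d_model
    # Per block: attention (Q, K, V, O projections) = 4 * d^2,
    # plus a 4x-expansion MLP (up + down projections) = 8 * d^2.
    per_layer = 12 * d_model * d_model
    return embeddings + n_layers * per_layer

# A hypothetical configuration landing near the 650M-parameter scale.
total = decoder_param_count(vocab_size=100_000, d_model=1536, n_layers=18)
print(f"{total / 1e6:.0f}M parameters")  # prints "663M parameters"
```

The dominant term is the 12·d² per-layer cost, which is why width and depth, rather than vocabulary size, drive most of the compute at this scale.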