DeltaFormer: Unlock the State Space of Transformer

Neural Information Processing Systems 

In recent years, large language models built on the Transformer architecture have achieved breakthrough progress in many fields. At the same time, weaknesses in these models have prompted further reflection, and the most fundamental concerns center on the Transformer architecture itself. The Transformer is highly parallel and can fully exploit the computing power of GPUs, which has allowed it to replace recurrent models such as the LSTM over the past few years. However, high parallelism does not come for free: it imposes fundamental limits on what the model can compute. In particular, the problems that a logarithmic-precision Transformer can solve are confined to the complexity class TC^0.