Wozniak, Stanislaw
RWKV: Reinventing RNNs for the Transformer Era
Peng, Bo, Alcaide, Eric, Anthony, Quentin, Albalak, Alon, Arcadinho, Samuel, Biderman, Stella, Cao, Huanqi, Cheng, Xin, Chung, Michael, Grella, Matteo, GV, Kranthi Kiran, He, Xuzheng, Hou, Haowen, Lin, Jiaju, Kazienko, Przemyslaw, Kocon, Jan, Kong, Jiaming, Koptyra, Bartlomiej, Lau, Hayden, Mantri, Krishna Sri Ipsit, Mom, Ferdinand, Saito, Atsushi, Song, Guangyu, Tang, Xiangru, Wang, Bolun, Wind, Johan S., Wozniak, Stanislaw, Zhang, Ruichong, Zhang, Zhenyuan, Zhao, Qihang, Zhou, Peng, Zhou, Qinghua, Zhu, Jian, Zhu, Rui-Jie
Transformers have revolutionized almost all natural language processing (NLP) tasks but suffer from memory and computational complexity that scales quadratically with sequence length. In contrast, recurrent neural networks (RNNs) exhibit linear scaling in memory and computational requirements but struggle to match the performance of Transformers due to limitations in parallelization and scalability. We propose a novel model architecture, Receptance Weighted Key Value (RWKV), that combines the efficient parallelizable training of Transformers with the efficient inference of RNNs. Our approach leverages a linear attention mechanism and allows us to formulate the model as either a Transformer or an RNN, thus parallelizing computations during training while maintaining constant computational and memory complexity during inference. We scale our models as large as 14 billion parameters, by far the largest dense RNN ever trained, and find RWKV performs on par with similarly sized Transformers, suggesting future work can leverage this architecture to create more efficient models. This work presents a significant step towards reconciling trade-offs between computational efficiency and model performance in sequence processing tasks.
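The constant-memory inference claimed in the abstract comes from expressing the linear attention (the "WKV" weighted average) as a recurrence over a fixed-size numerator/denominator state. The following is a minimal NumPy sketch of that recurrent form as described in the paper; the function and variable names are illustrative, and the numerical-stability rescaling used in the real RWKV implementation is omitted for brevity.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rwkv_time_mixing_step(r_t, k_t, v_t, state, w, u):
    """One recurrent step of the WKV weighted average (illustrative sketch).

    r_t, k_t, v_t : receptance, key, value vectors for the current token (shape [d]).
    state         : (a, b) running numerator/denominator accumulated over past tokens.
    w             : per-channel decay (>= 0); past keys are discounted by exp(-w) per step.
    u             : per-channel bonus applied to the current token's key.
    """
    a, b = state
    bonus = np.exp(u + k_t)
    # Weighted average over past tokens plus the bonus-weighted current token.
    wkv = (a + bonus * v_t) / (b + bonus)
    # Decay the running sums and fold in the current token for the next step.
    a = np.exp(-w) * a + np.exp(k_t) * v_t
    b = np.exp(-w) * b + np.exp(k_t)
    # The receptance gate controls how much of the aggregate reaches the output.
    return sigmoid(r_t) * wkv, (a, b)

# Usage: the state stays the same size no matter how many tokens are processed.
d = 8
rng = np.random.default_rng(0)
w, u = np.abs(rng.normal(size=d)), rng.normal(size=d)
state = (np.zeros(d), np.zeros(d))
for _ in range(16):                      # stream tokens one at a time
    r, k, v = rng.normal(size=(3, d))    # stand-ins for learned projections of the input
    out, state = rwkv_time_mixing_step(r, k, v, state, w, u)
```

During training the same quantity can instead be computed for all positions in parallel, which is what gives the architecture its Transformer-like training efficiency.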
Learning in Deep Neural Networks Using a Biologically Inspired Optimizer
Dellaferrera, Giorgia, Wozniak, Stanislaw, Indiveri, Giacomo, Pantazi, Angeliki, Eleftheriou, Evangelos
Plasticity circuits in the brain are known to be influenced by the distribution of the synaptic weights through the mechanisms of synaptic integration and local regulation of synaptic strength. However, the complex interplay of stimulation-dependent plasticity with local learning signals is disregarded by most of the artificial neural network training algorithms devised so far. Here, we propose a novel biologically inspired optimizer for artificial neural networks (ANNs) and spiking neural networks (SNNs) that incorporates key principles of synaptic integration observed in dendrites of cortical neurons: GRAPES (Group Responsibility for Adjusting the Propagation of Error Signals). GRAPES implements a weight-distribution-dependent modulation of the error signal at each node of the neural network. We show that this biologically inspired mechanism leads to a systematic improvement of the convergence rate of the network and substantially improves the classification accuracy of ANNs and SNNs with both feedforward and recurrent architectures. Furthermore, we demonstrate that GRAPES supports performance scalability for models of increasing complexity and mitigates catastrophic forgetting by enabling networks to generalize to unseen tasks based on previously acquired knowledge. The local characteristics of GRAPES minimize the required memory resources, making it optimally suited for dedicated hardware implementations. Overall, our work indicates that reconciling neurophysiology insights with machine intelligence is key to boosting the performance of neural networks.
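To make the core idea of a "weight-distribution-dependent modulation of the error signal at each node" concrete, here is a short Python sketch of one plausible reading of that mechanism. The specific normalization (sum of absolute incoming weights divided by the layer mean) and the blending parameter `strength` are assumptions chosen for illustration, not the paper's exact formulation.

```python
import numpy as np

def grapes_modulation(W, strength=1.0):
    """Per-node importance factors derived from the weight distribution (assumed form).

    Importance of an output node is taken as the sum of absolute incoming weights,
    normalized by the layer mean and blended with 1.0 via `strength`.
    """
    importance = np.abs(W).sum(axis=1)           # one value per output node
    m = importance / importance.mean()
    return 1.0 + strength * (m - 1.0)

def modulated_backward(delta_out, W, prev_activation_grad):
    """Scale the per-node error signal before propagating it to the previous layer."""
    delta_mod = delta_out * grapes_modulation(W)  # weight-distribution-dependent modulation
    return (W.T @ delta_mod) * prev_activation_grad

# Toy usage on a single hidden layer (shapes only; not a full training loop).
rng = np.random.default_rng(0)
W = rng.normal(size=(32, 64))                     # 64 inputs -> 32 output nodes
delta_out = rng.normal(size=32)                   # error arriving at the layer's outputs
prev_grad = rng.uniform(size=64)                  # derivative of the previous layer's activation
delta_prev = modulated_backward(delta_out, W, prev_grad)
```

Because the modulation factors depend only on the layer's own weights, such a scheme needs no extra global state, which is consistent with the abstract's point about low memory requirements for hardware implementations.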