RWKV
Vision-QRWKV: Exploring Quantum-Enhanced RWKV Models for Image Classification
Recent advancements in quantum machine learning have shown promise in enhancing classical neural network architectures, particularly in domains involving complex, high-dimensional data. Building upon prior work in temporal sequence modeling, this paper introduces Vision-QRWKV, a hybrid quantum-classical extension of the Receptance Weighted Key Value (RWKV) architecture, applied for the first time to image classification tasks. By integrating a variational quantum circuit (VQC) into the channel mixing component of RWKV, our model aims to improve nonlinear feature transformation and enhance the expressive capacity of visual representations. We evaluate both classical and quantum RWKV models on a diverse collection of 14 medical and standard image classification benchmarks, including MedMNIST datasets, MNIST, and FashionMNIST. Our results demonstrate that the quantum-enhanced model outperforms its classical counterpart on a majority of datasets, particularly those with subtle or noisy class distinctions (e.g., ChestMNIST, RetinaMNIST, BloodMNIST). This study represents the first systematic application of quantum-enhanced RWKV in the visual domain, offering insights into the architectural trade-offs and future potential of quantum models for lightweight and efficient vision tasks.
- Health & Medicine > Health Care Technology (0.48)
- Health & Medicine > Diagnostic Medicine (0.48)
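The paper's core idea is to splice a variational quantum circuit into RWKV's channel-mixing path. Below is a minimal sketch of what such a block could look like using PennyLane's Torch integration; the qubit count, circuit templates, placement in the key path, and the tanh squashing are my assumptions, not the paper's published design.

```python
import torch
import torch.nn as nn
import pennylane as qml

n_qubits, q_depth = 4, 2
dev = qml.device("default.qubit", wires=n_qubits)

@qml.qnode(dev, interface="torch")
def vqc(inputs, weights):
    qml.AngleEmbedding(inputs, wires=range(n_qubits))             # encode features as rotation angles
    qml.StronglyEntanglingLayers(weights, wires=range(n_qubits))  # trainable entangling layers
    return [qml.expval(qml.PauliZ(w)) for w in range(n_qubits)]

class QuantumChannelMix(nn.Module):
    """Hypothetical RWKV channel mixing with a VQC in the nonlinear key path."""
    def __init__(self, dim):
        super().__init__()
        self.mu_k = nn.Parameter(torch.rand(dim))
        self.mu_r = nn.Parameter(torch.rand(dim))
        self.key = nn.Linear(dim, n_qubits, bias=False)    # compress features to the qubit count
        self.value = nn.Linear(n_qubits, dim, bias=False)
        self.receptance = nn.Linear(dim, dim, bias=False)
        self.q = qml.qnn.TorchLayer(vqc, {"weights": (q_depth, n_qubits, 3)})

    def forward(self, x, x_prev):
        # x, x_prev: (batch, dim) features for the current and previous token/patch
        xk = self.mu_k * x + (1 - self.mu_k) * x_prev      # token shift, as in classical RWKV
        xr = self.mu_r * x + (1 - self.mu_r) * x_prev
        k = torch.square(torch.relu(self.key(xk)))          # classical squared-ReLU nonlinearity
        k = self.q(torch.tanh(k))                           # VQC feature transform on bounded angles
        r = torch.sigmoid(self.receptance(xr))
        return r * self.value(k)
```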
Exploring RWKV for Sentence Embeddings: Layer-wise Analysis and Baseline Comparison for Semantic Similarity
This paper investigates the efficacy of RWKV, a novel language model architecture known for its linear attention mechanism, for generating sentence embeddings in a zero-shot setting. I conduct a layer-wise analysis to evaluate the semantic similarity captured by embeddings from different hidden layers of a pre-trained RWKV model. The performance is assessed on the Microsoft Research Paraphrase Corpus (MRPC) dataset using Spearman correlation and compared against a GloVe-based baseline. My results indicate that while RWKV embeddings capture some semantic relatedness, they underperform compared to the GloVe baseline in terms of Spearman correlation. I also analyze the inference time and GPU memory usage, highlighting the computational trade-offs associated with RWKV embeddings. The findings suggest that while RWKV offers potential advantages in terms of linear scaling, its zero-shot sentence embedding quality for semantic similarity tasks requires further investigation and potential task-specific fine-tuning to match or exceed simpler baselines.
- North America > United States (0.14)
- Europe > Switzerland > Basel-City > Basel (0.04)
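The layer-wise probe the abstract describes can be sketched in a few lines: mean-pool each hidden layer into a sentence vector, score pairs by cosine similarity, and correlate with gold labels. The checkpoint id and the toy pairs below are placeholders (MRPC supplies the real pairs); mean pooling is an assumption about the pooling strategy.

```python
import torch
from transformers import AutoTokenizer, AutoModel
from scipy.stats import spearmanr

name = "RWKV/rwkv-4-169m-pile"   # assumed Hugging Face RWKV checkpoint
tok = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name, output_hidden_states=True).eval()

@torch.no_grad()
def layer_embeddings(text):
    """One mean-pooled sentence vector per hidden layer."""
    out = model(**tok(text, return_tensors="pt"))
    return [h.mean(dim=1).squeeze(0) for h in out.hidden_states]

pairs = [("A man is playing a guitar.", "A person plays the guitar."),
         ("The stock market fell sharply.", "A cat sleeps on the sofa.")]
labels = [1, 0]   # stand-in paraphrase labels

emb1 = [layer_embeddings(a) for a, _ in pairs]
emb2 = [layer_embeddings(b) for _, b in pairs]
for layer in range(len(emb1[0])):
    sims = [torch.cosine_similarity(e1[layer], e2[layer], dim=0).item()
            for e1, e2 in zip(emb1, emb2)]
    rho, _ = spearmanr(sims, labels)
    print(f"layer {layer:2d}: Spearman rho = {rho:.3f}")
```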
A Survey of RWKV
Li, Zhiyuan, Xia, Tingyu, Chang, Yi, Wu, Yuan
The Receptance Weighted Key Value (RWKV) model offers a novel alternative to the Transformer architecture, merging the benefits of recurrent and attention-based systems. Unlike conventional Transformers, which depend heavily on self-attention, RWKV adeptly captures long-range dependencies with minimal computational demands. By utilizing a recurrent framework, RWKV addresses some computational inefficiencies found in Transformers, particularly in tasks with long sequences. RWKV has recently drawn considerable attention for its robust performance across multiple domains. Despite its growing popularity, no systematic review of the RWKV model exists. This paper seeks to fill this gap as the first comprehensive review of the RWKV architecture, its core principles, and its varied applications, such as natural language generation, natural language understanding, and computer vision. We assess how RWKV compares to traditional Transformer models, highlighting its capability to manage long sequences efficiently and lower computational costs. Furthermore, we explore the challenges RWKV encounters and propose potential directions for future research and advancement. We consistently maintain the related open-source materials at: https://github.com/MLGroupJLU/RWKV-Survey.
- North America > United States (0.04)
- Asia > China (0.04)
- Research Report (1.00)
- Overview (1.00)
- Health & Medicine (1.00)
- Education (0.92)
- Energy (0.92)
- Information Technology > Security & Privacy (0.67)
RWKV-edge: Deeply Compressed RWKV for Resource-Constrained Devices
Choe, Wonkyo, Ji, Yangfeng, Lin, Felix Xiaozhu
To deploy LLMs on resource-constrained platforms such as mobile robots and wearables, non-Transformer LLMs have achieved major breakthroughs. Recently, a novel RNN-based LLM family, Receptance Weighted Key Value (RWKV) models, has shown promising results in text generation on resource-constrained devices thanks to its computational efficiency. However, these models remain too large to deploy on embedded devices due to their high parameter count. In this paper, we propose an efficient suite of compression techniques tailored to the RWKV architecture, including low-rank approximation, sparsity predictors, and clustering heads, designed to align with the model size. Our methods compress RWKV models by 3.8--4.95x with only a 2.95pp loss in accuracy.
- North America > United States > Virginia > Albemarle County > Charlottesville (0.04)
- North America > United States > Texas > Travis County > Austin (0.04)
- North America > United States > New York > New York County > New York City (0.04)
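Of the three compression techniques the abstract names, low-rank approximation is the most self-contained to illustrate. The sketch below uses plain truncated SVD on a weight matrix; the paper's actual rank-selection scheme and its sparsity-predictor and clustering-head components are not reproduced here.

```python
import torch

def low_rank_factorize(W: torch.Tensor, rank: int):
    """Replace a (d_out, d_in) weight with factors of shapes (d_out, r) and (r, d_in)."""
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]          # absorb singular values into the left factor
    B = Vh[:rank, :]
    return A, B                         # W ~ A @ B, parameters: r * (d_out + d_in)

W = torch.randn(2048, 2048)
A, B = low_rank_factorize(W, rank=256)
err = torch.linalg.norm(W - A @ B) / torch.linalg.norm(W)
print(f"relative error: {err:.3f}, "
      f"compression: {W.numel() / (A.numel() + B.numel()):.1f}x")
```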
Video RWKV: Video Action Recognition Based on RWKV
Yin, Zhuowen, Li, Chengru, Dong, Xingbo
To address the challenges of high computational costs and long-distance dependencies in existing video understanding methods, such as CNNs and Transformers, this work introduces RWKV to the video domain in a novel way. We propose an LSTM CrossRWKV (LCR) framework, designed for spatiotemporal representation learning to tackle the video understanding task. Specifically, the proposed linear-complexity LCR incorporates a novel Cross RWKV gate to facilitate interaction between current frame edge information and past features, enhancing the focus on the subject through edge features and globally aggregating inter-frame features over time. LCR stores long-term memory for video processing through an enhanced LSTM recurrent execution mechanism. By leveraging the Cross RWKV gate and recurrent execution, LCR effectively captures both spatial and temporal features. Additionally, the edge information serves as a forgetting gate for the LSTM, guiding long-term memory management, while a tube masking strategy reduces redundant information in the video and mitigates overfitting. These advantages enable LSTM CrossRWKV to set a new benchmark in video understanding, offering a scalable and efficient solution for comprehensive video analysis. All code and models are publicly available.
- Asia > China > Anhui Province > Hefei (0.04)
- Asia > China > Liaoning Province > Shenyang (0.04)
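A highly schematic reading of the described cell, in which edge features drive the LSTM forget gate and a Cross-RWKV-style gate mixes current-frame edge information with past features. Every module name and all of the wiring below are my assumptions; consult the authors' released code for the real design.

```python
import torch
import torch.nn as nn

class LCRCellSketch(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.edge_gate = nn.Linear(dim, dim)        # edge features -> forget gate
        self.receptance = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)
        self.input_gate = nn.Linear(2 * dim, dim)

    def forward(self, frame_feat, edge_feat, h_prev, c_prev):
        # Cross-RWKV-style gate: receptance from current edges, key/value from the past
        r = torch.sigmoid(self.receptance(edge_feat))
        cross = r * self.value(torch.relu(self.key(h_prev)))
        # LSTM-style memory: edges drive forgetting, the gated cross term drives writing
        f = torch.sigmoid(self.edge_gate(edge_feat))
        i = torch.sigmoid(self.input_gate(torch.cat([frame_feat, cross], dim=-1)))
        c = f * c_prev + i * torch.tanh(frame_feat + cross)
        h = torch.tanh(c)
        return h, c
```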
The Evolution of RWKV: Advancements in Efficient Language Modeling
This paper reviews the development of the Receptance Weighted Key Value (RWKV) architecture, emphasizing its advancements in efficient language modeling. RWKV combines the training efficiency of Transformers with the inference efficiency of RNNs through a novel linear attention mechanism. We examine its core innovations, adaptations across various domains, and performance advantages over traditional models. The paper also discusses challenges and future directions for RWKV as a versatile architecture in deep learning.
- Information Technology > Sensing and Signal Processing > Image Processing (1.00)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Sensorimotor Attention and Language-based Regressions in Shared Latent Variables for Integrating Robot Motion Learning and LLM
Suzuki, Kanata, Ogata, Tetsuya
In recent years, studies have been actively conducted on combining large language models (LLMs) and robotics; however, most have not considered end-to-end feedback in the robot-motion generation phase. Because the predictions of deep neural networks inevitably contain errors, the trained model must be updated to correspond to the real environment in order to generate robot motion adaptively. This study proposes an integration method that connects the robot-motion learning model and the LLM using shared latent variables. When generating robot motion, the proposed method updates the shared parameters based on prediction errors from both sensorimotor attention points and the task language instructions given to the robot. This allows the model to efficiently search for latent parameters appropriate for the robot task. Through simulator experiments on multiple robot tasks, we demonstrate the effectiveness of our proposed method from two perspectives: position generalization and language-instruction generalization abilities.
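The update rule the abstract describes can be sketched as run-time optimization of a single shared latent vector against two prediction errors. The networks, dimensions, and targets below are stand-ins, not the paper's models; only the pattern (errors from both modalities flowing into one latent) follows the abstract.

```python
import torch

z = torch.zeros(16, requires_grad=True)      # shared latent variables
motion_net = torch.nn.Linear(16, 8)          # predicts attended sensorimotor features
lang_net = torch.nn.Linear(16, 32)           # predicts an instruction embedding
opt = torch.optim.Adam([z], lr=1e-2)         # only z is updated online

observed_motion = torch.randn(8)             # attention-point observations (dummy)
instruction_emb = torch.randn(32)            # embedded task instruction (dummy)

for _ in range(100):
    opt.zero_grad()
    loss = (torch.nn.functional.mse_loss(motion_net(z), observed_motion)
            + torch.nn.functional.mse_loss(lang_net(z), instruction_emb))
    loss.backward()                          # errors from both modalities flow into z
    opt.step()
```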
A Unified Implicit Attention Formulation for Gated-Linear Recurrent Sequence Models
Zimerman, Itamar, Ali, Ameen, Wolf, Lior
Recent advances in efficient sequence modeling have led to attention-free layers, such as Mamba, RWKV, and various gated RNNs, all featuring sub-quadratic complexity in sequence length and excellent scaling properties, enabling the construction of a new type of foundation models. In this paper, we present a unified view of these models, formulating such layers as implicit causal self-attention layers. The formulation includes most of their sub-components and is not limited to a specific part of the architecture. The framework compares the underlying mechanisms on similar grounds for different layers and provides a direct means for applying explainability methods. Our experiments show that our attention matrices and attribution method outperform an alternative, more limited formulation that was recently proposed for Mamba. For the other architectures, for which our method is the first to provide such a view, our method is effective and competitive in the relevant metrics compared to the results obtained by state-of-the-art transformer explainability methods. Our code is publicly available.
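The core idea admits a compact worked example. For a scalar gated linear recurrence h_t = a_t * h_{t-1} + b_t * x_t, unrolling yields y_t = sum over s <= t of (prod of a_r for r in s+1..t) * b_s * x_s, i.e. an implicit causal attention matrix A with A[t, s] = (prod a) * b_s so that y = A @ x. The paper handles full Mamba/RWKV layers; this sketch shows only the one-channel core of the view.

```python
import torch

T = 6
a = torch.rand(T) * 0.9 + 0.05     # gates / decays
b = torch.rand(T)                  # input coefficients
x = torch.randn(T)

# recurrent computation
h, y_rec = torch.zeros(()), []
for t in range(T):
    h = a[t] * h + b[t] * x[t]
    y_rec.append(h)
y_rec = torch.stack(y_rec)

# materialize the implicit attention matrix
A = torch.zeros(T, T)
for t in range(T):
    for s in range(t + 1):
        A[t, s] = a[s + 1 : t + 1].prod() * b[s]   # empty product = 1 when s == t
y_attn = A @ x

print(torch.allclose(y_rec, y_attn))   # True: the two views coincide
```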
RWKV: Reinventing RNNs for the Transformer Era
Peng, Bo, Alcaide, Eric, Anthony, Quentin, Albalak, Alon, Arcadinho, Samuel, Biderman, Stella, Cao, Huanqi, Cheng, Xin, Chung, Michael, Grella, Matteo, GV, Kranthi Kiran, He, Xuzheng, Hou, Haowen, Lin, Jiaju, Kazienko, Przemyslaw, Kocon, Jan, Kong, Jiaming, Koptyra, Bartlomiej, Lau, Hayden, Mantri, Krishna Sri Ipsit, Mom, Ferdinand, Saito, Atsushi, Song, Guangyu, Tang, Xiangru, Wang, Bolun, Wind, Johan S., Wozniak, Stanislaw, Zhang, Ruichong, Zhang, Zhenyuan, Zhao, Qihang, Zhou, Peng, Zhou, Qinghua, Zhu, Jian, Zhu, Rui-Jie
Transformers have revolutionized almost all natural language processing (NLP) tasks but suffer from memory and computational complexity that scales quadratically with sequence length. In contrast, recurrent neural networks (RNNs) exhibit linear scaling in memory and computational requirements but struggle to match the performance of Transformers due to limitations in parallelization and scalability. We propose a novel model architecture, Receptance Weighted Key Value (RWKV), that combines the efficient parallelizable training of Transformers with the efficient inference of RNNs. Our approach leverages a linear attention mechanism and allows us to formulate the model as either a Transformer or an RNN, thus parallelizing computations during training and maintaining constant computational and memory complexity during inference. We scale our models as large as 14 billion parameters, by far the largest dense RNN ever trained, and find that RWKV performs on par with similarly sized Transformers, suggesting future work can leverage this architecture to create more efficient models. This work presents a significant step towards reconciling trade-offs between computational efficiency and model performance in sequence-processing tasks.
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- Europe > Belgium > Brussels-Capital Region > Brussels (0.04)
- Asia > Middle East > Jordan (0.04)
- Research Report > Promising Solution (0.48)
- Research Report > New Finding (0.46)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
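The constant-memory inference claim comes from the WKV operator's recurrent form. The sketch below follows the published RWKV-4 formula, where the output at step t is a decayed weighted average of past values plus a bonus term u for the current token; production kernels use a numerically stabilized variant, so this is the naive form only.

```python
import torch

def wkv_recurrent(k, v, w, u):
    """k, v: (T, C); w: (C,) positive decay; u: (C,) current-token bonus."""
    T, C = k.shape
    num = torch.zeros(C)              # running decayed sum of exp(k_i) * v_i
    den = torch.zeros(C)              # running decayed sum of exp(k_i)
    out = []
    for t in range(T):
        # current token enters with the bonus u instead of the decay chain
        out.append((num + torch.exp(u + k[t]) * v[t]) /
                   (den + torch.exp(u + k[t])))
        # fold the current token into the decayed history for future steps
        num = torch.exp(-w) * num + torch.exp(k[t]) * v[t]
        den = torch.exp(-w) * den + torch.exp(k[t])
    return torch.stack(out)

T, C = 8, 4
out = wkv_recurrent(torch.randn(T, C), torch.randn(T, C),
                    torch.rand(C), torch.zeros(C))
print(out.shape)   # (8, 4); state is O(C) regardless of sequence length
```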
ProSG: Using Prompt Synthetic Gradients to Alleviate Prompt Forgetting of RNN-like Language Models
Luo, Haotian, Wu, Kunming, Dai, Cheng, Ding, Sixian, Chen, Xinhao
RNN-like language models have received renewed attention from NLP researchers in recent years, and several models have made significant progress, demonstrating performance comparable to traditional Transformers. However, due to their recurrent nature, these language models can only store information in a set of fixed-length state vectors. As a consequence, despite many improvements and optimizations, they still suffer from forgetfulness when given complex instructions or prompts. As prompted generation is the main and most important function of LMs, solving the problem of forgetting during generation is of vital importance. In this paper, focusing on easing prompt forgetting during generation, we propose an architecture that teaches the model to memorize the prompt during generation via synthetic gradients. To force the model to memorize the prompt, we derive the states that encode the prompt and transform them into a model parameter modification using a low-rank gradient approximation, which temporarily hard-codes the prompt into the model parameters. We construct a dataset for our experiments, and the results demonstrate the effectiveness of our method in solving the problem of forgetfulness during prompted generation. We will release all code upon acceptance.
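A schematic sketch of the described mechanism: take the gradient a prompt induces on a weight matrix, approximate it with a low-rank factorization, and apply it as a temporary parameter modification. The actual synthetic-gradient module of ProSG is not reproduced here; the gradient below is a random stand-in.

```python
import torch

def low_rank_grad_update(W: torch.Tensor, grad: torch.Tensor,
                         rank: int = 1, lr: float = 0.1):
    """Return W minus a rank-`rank` approximation of `grad` (a temporary edit)."""
    U, S, Vh = torch.linalg.svd(grad, full_matrices=False)
    delta = (U[:, :rank] * S[:rank]) @ Vh[:rank, :]
    return W - lr * delta      # apply before generation; restore the original W afterwards

W = torch.randn(64, 64)
prompt_grad = torch.randn(64, 64)   # stand-in for the prompt-induced gradient
W_prompted = low_rank_grad_update(W, prompt_grad, rank=4)
```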