Not enough data to create a plot.
Try a different view from the menu above.
He, Zhenyu
Enhancing Auto-regressive Chain-of-Thought through Loop-Aligned Reasoning
Yu, Qifan, He, Zhenyu, Li, Sijie, Zhou, Xun, Zhang, Jun, Xu, Jingjing, He, Di
Chain-of-Thought (CoT) prompting has emerged as a powerful technique for enhancing language model's reasoning capabilities. However, generating long and correct CoT trajectories is challenging. Recent studies have demonstrated that Looped Transformers possess remarkable length generalization capabilities, but their limited generality and adaptability prevent them from serving as an alternative to auto-regressive solutions. To better leverage the strengths of Looped Transformers, we propose RELAY (REasoning through Loop Alignment iterativelY). Specifically, we align the steps of Chain-of-Thought (CoT) reasoning with loop iterations and apply intermediate supervision during the training of Looped Transformers. This additional iteration-wise supervision not only preserves the Looped Transformer's ability for length generalization but also enables it to predict CoT reasoning steps for unseen data. Therefore, we leverage this Looped Transformer to generate accurate reasoning chains for complex problems that exceed the training length, which will then be used to fine-tune an auto-regressive model. We conduct extensive experiments, and the results demonstrate the effectiveness of our approach, with significant improvements in the performance of the auto-regressive model. Code will be released at https://github.com/qifanyu/RELAY.
PRSI: Privacy-Preserving Recommendation Model Based on Vector Splitting and Interactive Protocols
Cao, Xiaokai, Mo, Wenjin, He, Zhenyu, Wang, Changdong
With the development of the internet, recommending interesting products to users has become a highly valuable research topic for businesses. Recommendation systems play a crucial role in addressing this issue. To prevent the leakage of each user's (client's) private data, Federated Recommendation Systems (FedRec) have been proposed and widely used. However, extensive research has shown that FedRec suffers from security issues such as data privacy leakage, and it is challenging to train effective models with FedRec when each client only holds interaction information for a single user. To address these two problems, this paper proposes a new privacy-preserving recommendation system (PRSI), which includes a preprocessing module and two main phases. The preprocessing module employs split vectors and fake interaction items to protect clients' interaction information and recommendation results. The two main phases are: (1) the collection of interaction information and (2) the sending of recommendation results. In the interaction information collection phase, each client uses the preprocessing module and random communication methods (according to the designed interactive protocol) to protect their ID information and IP addresses. In the recommendation results sending phase, the central server uses the preprocessing module and triplets to distribute recommendation results to each client under secure conditions, following the designed interactive protocol. Finally, we conducted multiple sets of experiments to verify the security, accuracy, and communication cost of the proposed method.
Learning Spatial-Semantic Features for Robust Video Object Segmentation
Li, Xin, Miao, Deshui, He, Zhenyu, Wang, Yaowei, Lu, Huchuan, Yang, Ming-Hsuan
Tracking and segmenting multiple similar objects with complex or separate parts in long-term videos is inherently challenging due to the ambiguity of target parts and identity confusion caused by occlusion, background clutter, and long-term variations. In this paper, we propose a robust video object segmentation framework equipped with spatial-semantic features and discriminative object queries to address the above issues. Specifically, we construct a spatial-semantic network comprising a semantic embedding block and spatial dependencies modeling block to associate the pretrained ViT features with global semantic features and local spatial features, providing a comprehensive target representation. In addition, we develop a masked cross-attention module to generate object queries that focus on the most discriminative parts of target objects during query propagation, alleviating noise accumulation and ensuring effective long-term query propagation. The experimental results show that the proposed method set a new state-of-the-art performance on multiple datasets, including the DAVIS2017 test (89.1%), YoutubeVOS 2019 (88.5%), MOSE (75.1%), LVOS test (73.0%), and LVOS val (75.1%), which demonstrate the effectiveness and generalization capacity of the proposed method. We will make all source code and trained models publicly available.
Let the Code LLM Edit Itself When You Edit the Code
He, Zhenyu, Zhang, Jun, Luo, Shengjie, Xu, Jingjing, Zhang, Zhi, He, Di
In this work, we investigate a typical scenario in code generation where a developer edits existing code in real time and requests a code assistant, e.g., a large language model, to re-predict the next token or next line on the fly. Naively, the LLM needs to re-encode the entire KV cache to provide an accurate prediction. However, this process is computationally expensive, especially when the sequence length is long. Simply encoding the edited subsequence and integrating it to the original KV cache meets the temporal confusion problem, leading to significantly worse performance. We address this efficiency and accuracy trade-off by introducing \underline{\textbf{Positional \textbf{I}ntegrity \textbf{E}ncoding} (PIE). Building upon the rotary positional encoding, PIE first removes the rotary matrices in the Key cache that introduce temporal confusion and then reapplies the correct rotary matrices. This process ensures that positional relationships between tokens are correct and requires only a single round of matrix multiplication. We validate the effectiveness of PIE through extensive experiments on the RepoBench-C-8k dataset, utilizing DeepSeek-Coder models with 1.3B, 6.7B, and 33B parameters. Our evaluation includes three real-world coding tasks: code insertion, code deletion, and multi-place code editing. Results demonstrate that PIE reduces computational overhead by over 85% compared to the standard full recomputation approach across all model sizes and tasks while well approximating the model performance.
REST: Retrieval-Based Speculative Decoding
He, Zhenyu, Zhong, Zexuan, Cai, Tianle, Lee, Jason D., He, Di
We introduce Retrieval-Based Speculative Decoding (REST), a novel algorithm designed to speed up language model generation. The key insight driving the development of REST is the observation that the process of text generation often includes certain common phases and patterns. Unlike previous methods that rely on a draft language model for speculative decoding, REST harnesses the power of retrieval to generate draft tokens. This method draws from the reservoir of existing knowledge, retrieving and employing relevant tokens based on the current context. Its plug-and-play nature allows for seamless integration and acceleration of any language models, all without necessitating additional training. When benchmarked on 7B and 13B language models in a single-batch setting, REST achieves a significant speedup of 1.62X to 2.36X on code or text generation. The code of REST is available at https://github.com/FasterDecoding/REST.
Do Efficient Transformers Really Save Computation?
Yang, Kai, Ackermann, Jan, He, Zhenyu, Feng, Guhao, Zhang, Bohang, Feng, Yunzhen, Ye, Qiwei, He, Di, Wang, Liwei
As transformer-based language models are trained on increasingly large datasets and with vast numbers of parameters, finding more efficient alternatives to the standard Transformer has become very valuable. While many efficient Transformers and Transformer alternatives have been proposed, none provide theoretical guarantees that they are a suitable replacement for the standard Transformer. This makes it challenging to identify when to use a specific model and what directions to prioritize for further investigation. In this paper, we aim to understand the capabilities and limitations of efficient Transformers, specifically the Sparse Transformer and the Linear Transformer. We focus on their reasoning capability as exhibited by Chain-of-Thought (CoT) prompts and follow previous works to model them as Dynamic Programming (DP) problems. Our results show that while these models are expressive enough to solve general DP tasks, contrary to expectations, they require a model size that scales with the problem size. Nonetheless, we identify a class of DP problems for which these models can be more efficient than the standard Transformer. We confirm our theoretical results through experiments on representative DP tasks, adding to the understanding of efficient Transformers' practical strengths and weaknesses.
Two Stones Hit One Bird: Bilevel Positional Encoding for Better Length Extrapolation
He, Zhenyu, Feng, Guhao, Luo, Shengjie, Yang, Kai, He, Di, Xu, Jingjing, Zhang, Zhi, Yang, Hongxia, Wang, Liwei
In this work, we leverage the intrinsic segmentation of language sequences and design a new positional encoding method called Bilevel Positional Encoding (BiPE). For each position, our BiPE blends an intra-segment encoding and an inter-segment encoding. The intra-segment encoding identifies the locations within a segment and helps the model capture the semantic information therein via absolute positional encoding. The inter-segment encoding specifies the segment index, models the relationships between segments, and aims to improve extrapolation capabilities via relative positional encoding. Theoretical analysis shows this disentanglement of positional information makes learning more effective. The empirical results also show that our BiPE has superior length extrapolation capabilities across a wide range of tasks in diverse text modalities.
Transferable Decoding with Visual Entities for Zero-Shot Image Captioning
Fei, Junjie, Wang, Teng, Zhang, Jinrui, He, Zhenyu, Wang, Chengjie, Zheng, Feng
Image-to-text generation aims to describe images using natural language. Recently, zero-shot image captioning based on pre-trained vision-language models (VLMs) and large language models (LLMs) has made significant progress. However, we have observed and empirically demonstrated that these methods are susceptible to modality bias induced by LLMs and tend to generate descriptions containing objects (entities) that do not actually exist in the image but frequently appear during training (i.e., object hallucination). In this paper, we propose ViECap, a transferable decoding model that leverages entity-aware decoding to generate descriptions in both seen and unseen scenarios. ViECap incorporates entity-aware hard prompts to guide LLMs' attention toward the visual entities present in the image, enabling coherent caption generation across diverse scenes. With entity-aware hard prompts, ViECap is capable of maintaining performance when transferring from in-domain to out-of-domain scenarios. Extensive experiments demonstrate that ViECap sets a new state-of-the-art cross-domain (transferable) captioning and performs competitively in-domain captioning compared to previous VLMs-based zero-shot methods. Our code is available at: https://github.com/FeiElysia/ViECap