Yang, Fan
EdgeInfinite: A Memory-Efficient Infinite-Context Transformer for Edge Devices
Chen, Jiyu, Peng, Shuang, Luo, Daxiong, Yang, Fan, Wu, Renshou, Li, Fangyuan, Chen, Xiaoxin
Transformer-based large language models (LLMs) encounter challenges in processing long sequences on edge devices due to the quadratic complexity of attention mechanisms and the growing memory demands of the Key-Value (KV) cache. Existing KV cache optimizations struggle with irreversible token eviction in long-output tasks, while alternative sequence modeling architectures prove costly to adopt within established Transformer infrastructure. We present EdgeInfinite, a memory-efficient solution for infinite contexts that integrates compressed memory into Transformer-based LLMs through a trainable memory-gating module. This approach maintains full compatibility with standard Transformer architectures, requires fine-tuning only a small subset of parameters, and enables selective activation of the memory-gating module for long and short context task routing. Experimental results show that EdgeInfinite achieves performance comparable to a baseline Transformer-based LLM on long context benchmarks while optimizing memory consumption and time to first token.
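As a rough illustration of what a memory gate can look like, the sketch below blends a compressed-memory readout with a local attention output through a learned sigmoid gate. The abstract does not specify the gate's form, so the per-channel sigmoid blend and the names `gated_combine`, `memory_out`, `local_out`, and `gate_logits` are assumptions made for illustration only.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real-valued logit to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_combine(memory_out, local_out, gate_logits):
    """Blend two per-channel outputs with a learned gate (hypothetical form).

    memory_out  -- readout from the compressed memory, one value per channel
    local_out   -- output of ordinary attention over the recent tokens
    gate_logits -- trainable logits; sigmoid(logit) is the weight on memory
    """
    if not (len(memory_out) == len(local_out) == len(gate_logits)):
        raise ValueError("all inputs must have the same number of channels")
    combined = []
    for m, l, logit in zip(memory_out, local_out, gate_logits):
        g = sigmoid(logit)
        combined.append(g * m + (1.0 - g) * l)
    return combined
```

With all logits at zero the gate weighs both sources equally; only `gate_logits` would need training in a setup like this, which is in the spirit of fine-tuning a small subset of parameters.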
ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning
Chen, Mingyang, Li, Tianpeng, Sun, Haoze, Zhou, Yijie, Zhu, Chenzheng, Wang, Haofen, Pan, Jeff Z., Zhang, Wen, Chen, Huajun, Yang, Fan, Zhou, Zenan, Chen, Weipeng
Large Language Models (LLMs) have shown remarkable capabilities in reasoning, exemplified by the success of OpenAI-o1 and DeepSeek-R1. However, integrating reasoning with external search processes remains challenging, especially for complex multi-hop questions requiring multiple retrieval steps. We propose ReSearch, a novel framework that trains LLMs to Reason with Search via reinforcement learning without using any supervised data on reasoning steps. Our approach treats search operations as integral components of the reasoning chain, where when and how to perform searches is guided by text-based thinking, and search results subsequently influence further reasoning. We train ReSearch on Qwen2.5-7B(-Instruct) and Qwen2.5-32B(-Instruct) models and conduct extensive experiments. Despite being trained on only one dataset, our models demonstrate strong generalizability across various benchmarks. Analysis reveals that ReSearch naturally elicits advanced reasoning capabilities such as reflection and self-correction during the reinforcement learning process.
A Universal Model Combining Differential Equations and Neural Networks for Ball Trajectory Prediction
Shi, Zhiwei, Zhu, Chengxi, Yang, Fan, Yan, Jun, Qin, Zheyun, Shi, Songquan, Chen, Zhumin
This paper presents a data-driven universal ball trajectory prediction method integrated with physics equations. Existing methods are designed for specific ball types and struggle to generalize. This challenge arises from three key factors. First, learning-based models require large datasets but suffer from accuracy drops in unseen scenarios. Second, physics-based models rely on complex formulas and detailed inputs, yet accurately obtaining ball states, such as spin, is often impractical. Third, integrating physical principles with neural networks to achieve high accuracy, fast inference, and strong generalization remains difficult. To address these issues, we propose an innovative approach that incorporates physics-based equations and neural networks. We first derive three generalized physical formulas. Then, using a neural network and observed trajectory points, we infer certain parameters while fitting the remaining ones. These formulas enable precise trajectory prediction with minimal training data: only a few dozen samples. Extensive experiments demonstrate our method's superiority in generalization, real-time performance, and accuracy.
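The idea of fitting physical parameters from a handful of observed trajectory points can be sketched with a much simpler stand-in model. The example below is not the paper's three generalized formulas: it assumes drag-free vertical motion with known gravity, and fits only the initial height and velocity by ordinary least squares. The function names and the constant `G` are illustrative assumptions.

```python
G = 9.81  # gravitational acceleration in m/s^2 (assumed known)


def fit_vertical_motion(times, heights):
    """Fit z(t) = z0 + v0*t - 0.5*G*t^2 to observations by least squares.

    Moving the known gravity term to the left-hand side leaves a straight
    line y = z0 + v0*t, so plain linear regression recovers (z0, v0).
    """
    ys = [z + 0.5 * G * t * t for t, z in zip(times, heights)]
    n = len(times)
    mean_t = sum(times) / n
    mean_y = sum(ys) / n
    var_t = sum((t - mean_t) ** 2 for t in times)
    if var_t == 0:
        raise ValueError("need observations at two or more distinct times")
    cov = sum((t - mean_t) * (y - mean_y) for t, y in zip(times, ys))
    v0 = cov / var_t
    z0 = mean_y - v0 * mean_t
    return z0, v0


def predict_height(z0, v0, t):
    """Evaluate the fitted drag-free model at time t."""
    return z0 + v0 * t - 0.5 * G * t * t
```

A usage pattern would be to fit on the first few tracked points and call `predict_height` for later times; a real ball would additionally need drag and spin terms, which is where the learned parameters described in the abstract come in.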
Vision-R1: Evolving Human-Free Alignment in Large Vision-Language Models via Vision-Guided Reinforcement Learning
Zhan, Yufei, Zhu, Yousong, Zheng, Shurong, Zhao, Hongyin, Yang, Fan, Tang, Ming, Wang, Jinqiao
Large Vision-Language Models (LVLMs) typically follow a two-stage training paradigm: pretraining and supervised fine-tuning. Recently, preference optimization, derived from the language domain, has emerged as an effective post-training reinforcement strategy to enhance capabilities of LVLMs. However, constructing high-quality human-annotated preference data and developing robust reward models to mimic these preferences are both costly and challenging. Motivated by this observation, we propose Vision-R1, a novel vision-guided R1-like reinforcement learning algorithm for LVLMs that rewards models with definitive vision feedback. It only leverages curated instruction data, eliminating the need for specialized reward models and handcrafted preference datasets. We incorporate a criterion-driven reward function that further integrates multi-dimensional feedback to evaluate model completions comprehensively based on the vision task logic. Furthermore, we introduce a progressive rule refinement strategy that dynamically adjusts the reward criteria during training, enabling continuous model improvement and mitigating reward hacking. Extensive experiments on both in-distribution and out-of-distribution benchmarks demonstrate that fine-tuning the 7B LVLMs with Vision-R1 achieves consistent performance gains, with improvements of up to 50%, surpassing the state-of-the-art model that is 10x larger.
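One way to picture a rule-based reward computed from vision feedback alone is an overlap score between a predicted and an annotated bounding box. The sketch below is an assumption about what such a criterion could look like, not the reward used in Vision-R1; `box_iou`, `box_reward`, and the `threshold` parameter are hypothetical names.

```python
def box_iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    area_a = max(0.0, ax2 - ax1) * max(0.0, ay2 - ay1)
    area_b = max(0.0, bx2 - bx1) * max(0.0, by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def box_reward(pred, target, threshold=0.5):
    """Return the IoU as reward, or 0 when it falls below the threshold.

    Raising `threshold` over the course of training is one simple way to
    make the criterion stricter (an illustrative choice, not the paper's).
    """
    iou = box_iou(pred, target)
    return iou if iou >= threshold else 0.0
```

Because the score is computed directly from the annotation, no learned reward model or preference pair is involved in this toy setup.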
External Large Foundation Model: How to Efficiently Serve Trillions of Parameters for Online Ads Recommendation
Liang, Mingfu, Liu, Xi, Jin, Rong, Liu, Boyang, Suo, Qiuling, Zhou, Qinghai, Zhou, Song, Chen, Laming, Zheng, Hua, Li, Zhiyuan, Jiang, Shali, Yang, Jiyan, Xia, Xiaozhen, Yang, Fan, Badr, Yasmine, Wen, Ellie, Xu, Shuyu, Chen, Hansey, Zhang, Zhengyu, Nie, Jade, Yang, Chunzhi, Zeng, Zhichen, Zhang, Weilin, Huang, Xingliang, Li, Qianru, Wang, Shiquan, Lyu, Evelyn, Lu, Wenjing, Zhang, Rui, Wang, Wenjun, Rudy, Jason, Hang, Mengyue, Wang, Kai, Ma, Yinbin, Wang, Shuaiwen, Zeng, Sihan, Tang, Tongyi, Wei, Xiaohan, Jin, Longhao, Zhang, Jamey, Chen, Marcus, Zhang, Jiayi, Huang, Angie, Zhang, Chi, Zhao, Zhengli, Yang, Jared, Jin, Qiang, Chen, Xian, Amlesahwaram, Amit Anand, Song, Lexi, Luo, Liang, Hao, Yuchen, Xiao, Nan, Yetim, Yavuz, Pan, Luoshang, Liu, Gaoxiang, Hu, Yuxi, Huang, Yuzhen, Xu, Jackie, Zhu, Rich, Zhang, Xin, Liu, Yiqun, Yin, Hang, Chen, Yuxin, Zhang, Buyun, Liu, Xiaoyi, Wang, Xingyuan, Mao, Wenguang, Li, Zhijing, Huang, Qin, Sun, Chonglin, Yu, Nancy, Gu, Shuo, Mao, Shupin, Au, Benjamin, Qin, Jingzheng, Yao, Peggy, Choi, Jae-Woo, Gao, Bin, Wang, Ernest, Zhang, Lei, Chen, Wen-Yen, Lee, Ted, Zha, Jay, Meng, Yi, Gong, Alex, Gao, Edison, Vahdatpour, Alireza, Han, Yiping, Yao, Yantao, Kureha, Toshinari, Chang, Shuo, Sultan, Musharaf, Bocharov, John, Chordia, Sagar, Gan, Xiaorui, Sun, Peng, Liu, Rocky, Long, Bo, Chen, Wenlin, Kolay, Santanu, Li, Huayu
Ads recommendation is a prominent service of online advertising systems and has been actively studied. Recent studies indicate that scaling up and advanced design of the recommendation model can bring significant performance improvement. However, with a larger model scale, such prior studies have a significantly increasing gap from industry as they often neglect two fundamental challenges in industrial-scale applications. First, training and inference budgets are restricted for the model to be served, exceeding which may incur latency and impair user experience. Second, large-volume data arrive in a streaming mode with data distributions dynamically shifting, as new users/ads join and existing users/ads leave the system. We propose the External Large Foundation Model (ExFM) framework to address the overlooked challenges. Specifically, we develop external distillation and a data augmentation system (DAS) to control the computational cost of training/inference while maintaining high performance. We design the teacher to act like a foundation model (FM) that can serve multiple students as vertical models (VMs) to amortize its building cost. We propose Auxiliary Head and Student Adapter to mitigate the data distribution gap between FM and VMs caused by the streaming data issue. Comprehensive experiments on internal industrial-scale applications and public datasets demonstrate significant performance gain by ExFM.
One-Shot Affordance Grounding of Deformable Objects in Egocentric Organizing Scenes
Jia, Wanjun, Yang, Fan, Duan, Mengfei, Chen, Xianchi, Wang, Yinxi, Jiang, Yiming, Chen, Wenrui, Yang, Kailun, Li, Zhiyong
Deformable object manipulation in robotics presents significant challenges due to uncertainties in component properties, diverse configurations, visual interference, and ambiguous prompts. These factors complicate both perception and control tasks. To address these challenges, we propose a novel method for One-Shot Affordance Grounding of Deformable Objects (OS-AGDO) in egocentric organizing scenes, enabling robots to recognize previously unseen deformable objects with varying colors and shapes using minimal samples. Specifically, we first introduce the Deformable Object Semantic Enhancement Module (DefoSEM), which enhances hierarchical understanding of the internal structure and improves the ability to accurately identify local features, even under conditions of weak component information. Next, we propose the ORB-Enhanced Keypoint Fusion Module (OEKFM), which optimizes feature extraction of key components by leveraging geometric constraints and improves adaptability to diversity and visual interference. Additionally, we propose an instance-conditional prompt based on image data and task context, which effectively mitigates the issue of region ambiguity caused by prompt words. To validate these methods, we construct a diverse real-world dataset, AGDDO15, which includes 15 common types of deformable objects and their associated organizational actions. Experimental results demonstrate that our approach significantly outperforms state-of-the-art methods, achieving improvements of 6.2%, 3.2%, and 2.9% in KLD, SIM, and NSS metrics, respectively, while exhibiting high generalization performance. Source code and benchmark dataset will be publicly available at https://github.com/Dikay1/OS-AGDO.
Asymptotic Theory of Eigenvectors for Latent Embeddings with Generalized Laplacian Matrices
Fan, Jianqing, Fan, Yingying, Lv, Jinchi, Yang, Fan, Yu, Diwen
Laplacian matrices are commonly employed in many real applications, encoding the underlying latent structural information such as graphs and manifolds. The use of the normalization terms naturally gives rise to random matrices with dependency. It is well known that dependency is a major bottleneck of new random matrix theory (RMT) developments. To this end, in this paper, we formally introduce a class of generalized (and regularized) Laplacian matrices, which contains the Laplacian matrix and the random adjacency matrix as special cases, and suggest the new framework of the asymptotic theory of eigenvectors for latent embeddings with generalized Laplacian matrices (ATE-GL). Our new theory is empowered by the tool of generalized quadratic vector equation for dealing with RMT under dependency, and delicate high-order asymptotic expansions of the empirical spiked eigenvectors and eigenvalues based on local laws. The asymptotic normalities established for both spiked eigenvectors and eigenvalues will enable us to conduct precise inference and uncertainty quantification for applications involving the generalized Laplacian matrices with flexibility. We discuss some applications of the suggested ATE-GL framework and showcase its validity through some numerical examples.
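To see why normalization introduces dependency, it helps to write out the classical symmetric normalization of an adjacency matrix, in which every entry is divided by the square roots of two node degrees. The sketch below uses that textbook form with an optional additive regularizer `tau`; it is meant only as a familiar reference point, and the paper's generalized class may be parameterized differently.

```python
import math


def normalized_adjacency(adj, tau=0.0):
    """Return D^{-1/2} A D^{-1/2}, with degrees regularized as d_i + tau.

    adj -- symmetric adjacency matrix as a list of lists
    tau -- non-negative regularizer added to every degree (assumed form)
    """
    degrees = [sum(row) + tau for row in adj]
    if any(d <= 0 for d in degrees):
        raise ValueError("every regularized degree must be positive")
    n = len(adj)
    return [
        [adj[i][j] / math.sqrt(degrees[i] * degrees[j]) for j in range(n)]
        for i in range(n)
    ]
```

Each degree is a row sum of the random matrix itself, so after scaling no two entries in the same row or column are independent, even when the original edges were.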
Multi-Keypoint Affordance Representation for Functional Dexterous Grasping
Yang, Fan, Luo, Dongsheng, Chen, Wenrui, Lin, Jiacheng, Cai, Junjie, Yang, Kailun, Li, Zhiyong, Wang, Yaonan
Functional dexterous grasping requires precise hand-object interaction, going beyond simple gripping. Existing affordance-based methods primarily predict coarse interaction regions and cannot directly constrain the grasping posture, leading to a disconnection between visual perception and manipulation. To address this issue, we propose a multi-keypoint affordance representation for functional dexterous grasping, which directly encodes task-driven grasp configurations by localizing functional contact points. Our method introduces Contact-guided Multi-Keypoint Affordance (CMKA), leveraging human grasping experience images for weak supervision combined with Large Vision Models for fine affordance feature extraction, achieving generalization while avoiding manual keypoint annotations. Additionally, we present a Keypoint-based Grasp matrix Transformation (KGT) method, ensuring spatial consistency between hand keypoints and object contact points, thus providing a direct link between visual perception and dexterous grasping actions. Experiments on public real-world FAH datasets, IsaacGym simulation, and challenging robotic tasks demonstrate that our method significantly improves affordance localization accuracy, grasp consistency, and generalization to unseen tools and tasks, bridging the gap between visual affordance learning and dexterous robotic manipulation. The source code and demo videos will be publicly available at https://github.com/PopeyePxx/MKA.
LongRoPE2: Near-Lossless LLM Context Window Scaling
Shang, Ning, Zhang, Li Lyna, Wang, Siyuan, Zhang, Gaokai, Lopez, Gilsinia, Yang, Fan, Chen, Weizhu, Yang, Mao
LongRoPE2 is a novel approach that extends the effective context window of pre-trained large language models (LLMs) to the target length, while preserving the performance on the original shorter context window. This is achieved by three contributions: (1) a hypothesis that insufficient training in higher RoPE dimensions contributes to the persistent out-of-distribution (OOD) issues observed in existing methods; (2) an effective RoPE rescaling algorithm that adopts evolutionary search guided by "needle-driven" perplexity to address the insufficient training problem; (3) a mixed context window training approach that fine-tunes model weights to adopt rescaled RoPE for long-context sequences while preserving the short-context performance with the original RoPE. Extensive experiments on LLaMA3-8B and Phi3-mini-3.8B across various benchmarks validate the hypothesis and demonstrate the effectiveness of LongRoPE2. Remarkably, LongRoPE2 extends LLaMA3-8B to achieve a 128K effective context length while retaining over 98.5% of short-context performance, using only 10B tokens -- 80x fewer than Meta's approach, which fails to reach the target effective context length. Code will be available at https://github.com/microsoft/LongRoPE.
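For readers unfamiliar with RoPE rescaling, the sketch below shows the basic mechanics: each pair of dimensions rotates at its own frequency, and dividing a frequency by a per-dimension factor slows the rotation so that longer positions stay within the angle range seen during training. The factors here are hand-picked placeholders; the evolutionary search that LongRoPE2 uses to find them is not shown, and the function names are hypothetical.

```python
def rope_frequencies(head_dim, base=10000.0):
    """Standard RoPE frequencies: base^(-2i/d) for i = 0 .. d/2 - 1."""
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]


def rescale_frequencies(freqs, factors):
    """Divide each frequency by its own rescaling factor."""
    if len(freqs) != len(factors):
        raise ValueError("one factor is required per frequency")
    return [f / s for f, s in zip(freqs, factors)]


def rotation_angles(position, freqs):
    """Rotation angle applied to each dimension pair at a token position."""
    return [position * f for f in freqs]
```

For example, with `head_dim=4` the two frequencies are 1 and 0.01; applying factors `[1.0, 4.0]` leaves the first pair untouched and makes the second pair rotate four times more slowly.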
Proving Olympiad Inequalities by Synergizing LLMs and Symbolic Reasoning
Li, Zenan, Li, Zhaoyu, Tang, Wen, Zhang, Xian, Yao, Yuan, Si, Xujie, Yang, Fan, Yang, Kaiyu, Ma, Xiaoxing
Large language models (LLMs) can prove mathematical theorems formally by generating proof steps (a.k.a. tactics) within a proof system. However, the space of possible tactics is vast and complex, while the available training data for formal proofs is limited, posing a significant challenge to LLM-based tactic generation. To address this, we introduce a neuro-symbolic tactic generator that synergizes the mathematical intuition learned by LLMs with domain-specific insights encoded by symbolic methods. The key aspect of this integration is identifying which parts of mathematical reasoning are best suited to LLMs and which to symbolic methods. While the high-level idea of neuro-symbolic integration is broadly applicable to various mathematical problems, in this paper, we focus specifically on Olympiad inequalities (Figure 1). We analyze how humans solve these problems and distill the techniques into two types of tactics: (1) scaling, handled by symbolic methods, and (2) rewriting, handled by LLMs. In addition, we combine symbolic tools with LLMs to prune and rank the proof goals for efficient proof search. We evaluate our framework on 161 challenging inequalities from multiple mathematics competitions, achieving state-of-the-art performance and significantly outperforming existing LLM and symbolic approaches without requiring additional training data.