Xiong, Jing
SAS: Simulated Attention Score
Zheng, Chuanyang, Sun, Jiankai, Gao, Yihang, Wang, Yuehao, Wang, Peihao, Xiong, Jing, Ren, Liliang, Cheng, Hao, Kulkarni, Janardhan, Shen, Yelong, Wang, Atlas, Schwager, Mac, Schneider, Anderson, Liu, Xiaodong, Gao, Jianfeng
The attention mechanism is a core component of the Transformer architecture. Various methods have been developed to compute attention scores, including multi-head attention (MHA), multi-query attention, group-query attention and so on. We further analyze the MHA and observe that its performance improves as the number of attention heads increases, provided the hidden size per head remains sufficiently large. Therefore, increasing both the head count and hidden size per head with minimal parameter overhead can lead to significant performance gains at a low cost. Motivated by this insight, we introduce Simulated Attention Score (SAS), which maintains a compact model size while simulating a larger number of attention heads and hidden feature dimension per head. This is achieved by projecting a low-dimensional head representation into a higher-dimensional space, effectively increasing attention capacity without increasing parameter count. Beyond the head representations, we further extend the simulation approach to feature dimension of the key and query embeddings, enhancing expressiveness by mimicking the behavior of a larger model while preserving the original model size. To control the parameter cost, we also propose Parameter-Efficient Attention Aggregation (PEAA). Comprehensive experiments on a variety of datasets and tasks demonstrate the effectiveness of the proposed SAS method, achieving significant improvements over different attention variants.
- North America > United States (0.14)
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Emotion and Intention Guided Multi-Modal Learning for Sticker Response Selection
Hu, Yuxuan, Chen, Jian, Wang, Yuhao, Li, Zixuan, Xiong, Jing, Jia, Pengyue, Wang, Wei, Li, Chengming, Zhao, Xiangyu
Stickers are widely used in online communication to convey emotions and implicit intentions. The Sticker Response Selection (SRS) task aims to select the most contextually appropriate sticker based on the dialogue. However, existing methods typically rely on semantic matching and model emotional and intentional cues separately, which can lead to mismatches when emotions and intentions are misaligned. To address this issue, we propose Emotion and Intention Guided Multi-Modal Learning (EIGML). This framework is the first to jointly model emotion and intention, effectively reducing the bias caused by isolated modeling and significantly improving selection accuracy. Specifically, we introduce Dual-Level Contrastive Framework to perform both intra-modality and inter-modality alignment, ensuring consistent representation of emotional and intentional features within and across modalities. In addition, we design an Intention-Emotion Guided Multi-Modal Fusion module that integrates emotional and intentional information progressively through three components: Emotion-Guided Intention Knowledge Selection, Intention-Emotion Guided Attention Fusion, and Similarity-Adjusted Matching Mechanism. This design injects rich, effective information into the model and enables a deeper understanding of the dialogue, ultimately enhancing sticker selection performance. Experimental results on two public SRS datasets show that EIGML consistently outperforms state-of-the-art baselines, achieving higher accuracy and a better understanding of emotional and intentional features. Code is provided in the supplementary materials.
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.68)
Understanding the Mixture-of-Experts with Nadaraya-Watson Kernel
Zheng, Chuanyang, Sun, Jiankai, Gao, Yihang, Xie, Enze, Wang, Yuehao, Wang, Peihao, Xu, Ting, Chang, Matthew, Ren, Liliang, Li, Jingyao, Xiong, Jing, Rasul, Kashif, Schwager, Mac, Schneider, Anderson, Wang, Zhangyang, Nevmyvaka, Yuriy
Mixture-of-Experts (MoE) has become a cornerstone in recent state-of-the-art large language models (LLMs). Traditionally, MoE relies on $\mathrm{Softmax}$ as the router score function to aggregate expert output, a designed choice that has persisted from the earliest MoE models to modern LLMs, and is now widely regarded as standard practice. However, the necessity of using $\mathrm{Softmax}$ to project router weights into a probability simplex remains an unchallenged assumption rather than a principled design choice. In this work, we first revisit the classical Nadaraya-Watson regression and observe that MoE shares the same mathematical formulation as Nadaraya-Watson regression. Furthermore, we show that both feed-forward neural network (FFN) and MoE can be interpreted as a special case of Nadaraya-Watson regression, where the kernel function corresponds to the input neurons of the output layer. Motivated by these insights, we propose the \textbf{zero-additional-cost} Kernel Inspired Router with Normalization (KERN), an FFN-style router function, as an alternative to $\mathrm{Softmax}$. We demonstrate that this router generalizes both $\mathrm{Sigmoid}$- and $\mathrm{Softmax}$-based routers. \textbf{Based on empirical observations and established practices in FFN implementation, we recommend the use of $\mathrm{ReLU}$ activation and $\ell_2$-normalization in $\mathrm{KERN}$ router function.} Comprehensive experiments in MoE and LLM validate the effectiveness of the proposed FFN-style router function \methodNorm.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Asia > Middle East > Jordan (0.04)
- Asia > Japan > Honshū > Tōhoku > Fukushima Prefecture > Fukushima (0.04)
- (7 more...)
- Information Technology > Communications > Networks (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
DemoHLM: From One Demonstration to Generalizable Humanoid Loco-Manipulation
Fu, Yuhui, Xie, Feiyang, Xu, Chaoyi, Xiong, Jing, Yuan, Haoqi, Lu, Zongqing
Loco-manipulation is a fundamental challenge for humanoid robots to achieve versatile interactions in human environments. Although recent studies have made significant progress in humanoid whole-body control, loco-manipulation remains underexplored and often relies on hard-coded task definitions or costly real-world data collection, which limits autonomy and generalization. We present DemoHLM, a framework for humanoid loco-manipulation that enables generalizable loco-manipulation on a real humanoid robot from a single demonstration in simulation. DemoHLM adopts a hierarchy that integrates a low-level universal whole-body controller with high-level manipulation policies for multiple tasks. The whole-body controller maps whole-body motion commands to joint torques and provides omnidirectional mobility for the humanoid robot. The manipulation policies, learned in simulation via our data generation and imitation learning pipeline, command the whole-body controller with closed-loop visual feedback to execute challenging loco-manipulation tasks. Experiments show a positive correlation between the amount of synthetic data and policy performance, underscoring the effectiveness of our data generation pipeline and the data efficiency of our approach. Real-world experiments on a Unitree G1 robot equipped with an RGB-D camera validate the sim-to-real transferability of DemoHLM, demonstrating robust performance under spatial variations across ten loco-manipulation tasks.
- Leisure & Entertainment > Games > Computer Games (0.34)
- Energy (0.34)
Find Your Optimal Teacher: Personalized Data Synthesis via Router-Guided Multi-Teacher Distillation
Zhang, Hengyuan, Yang, Shiping, Liang, Xiao, Shang, Chenming, Jiang, Yuxuan, Tao, Chaofan, Xiong, Jing, So, Hayden Kwok-Hay, Xie, Ruobing, Chang, Angel X., Wong, Ngai
Training student models on synthetic data generated by strong teacher models is a promising way to distilling the capabilities of teachers. However, recent studies show that stronger models are not always optimal teachers, revealing a mismatch between teacher outputs and student learnability. To address this issue, we propose PerSyn (Personalized data Synthesis), a novel synthesis strategy that operates under a new ``Route then Generate'' paradigm to create data tailored to each student model, enabling it to learn more effectively. Specifically, PerSyn first assigns each prompt to its optimal teacher via a query-level router that jointly considers student learnability and teacher response quality. Each teacher then synthesizes data only for its assigned prompts, making the process more efficient than the conventional ``Generate then Select'' paradigm, where all teachers must generate parallel responses for the entire prompt set before constructing the final dataset. Extensive experiments across different model families and scales demonstrate that PerSyn consistently achieves superior or comparable performance to all baselines in instruct tuning and math reasoning settings. Further analysis verifies the effectiveness of PerSyn and offers extra insights to propel future research.
- Europe > Austria > Vienna (0.14)
- North America > United States > California > Los Angeles County > Los Angeles (0.14)
- Asia > Myanmar > Tanintharyi Region > Dawei (0.04)
- (7 more...)
SRPO: Enhancing Multimodal LLM Reasoning via Reflection-Aware Reinforcement Learning
Wan, Zhongwei, Dou, Zhihao, Liu, Che, Zhang, Yu, Cui, Dongfei, Zhao, Qinjian, Shen, Hui, Xiong, Jing, Xin, Yi, Jiang, Yifan, Tao, Chaofan, He, Yangfan, Zhang, Mi, Yan, Shen
Multimodal large language models (MLLMs) have shown promising capabilities in reasoning tasks, yet still struggle with complex problems requiring explicit self-reflection and self-correction, especially compared to their unimodal text-based counterparts. Existing reflection methods are simplistic and struggle to generate meaningful and instructive feedback, as the reasoning ability and knowledge limits of pre-trained models are largely fixed during initial training. To overcome these challenges, we propose Multimodal Self-Reflection enhanced reasoning with Group Relative Policy Optimization (SRPO), a two-stage reflection-aware reinforcement learning (RL) framework explicitly designed to enhance multimodal LLM reasoning. In the first stage, we construct a high-quality, reflection-focused dataset under the guidance of an advanced MLLM, which generates reflections based on initial responses to help the policy model learn both reasoning and self-reflection. In the second stage, we introduce a novel reward mechanism within the GRPO framework that encourages concise and cognitively meaningful reflection while avoiding redundancy. Extensive experiments across multiple multimodal reasoning benchmarks, including MathVista, MathVision, MathVerse, and MMMU-Pro, using Qwen-2.5-VL-7B and Qwen-2.5-VL-32B demonstrate that SRPO significantly outperforms state-of-the-art models, achieving notable improvements in both reasoning accuracy and reflection quality.
- North America > United States > California (0.14)
- North America > United States > Ohio (0.04)
- North America > United States > Minnesota (0.04)
- (4 more...)
ATTS: Asynchronous Test-Time Scaling via Conformal Prediction
Xiong, Jing, Chen, Qiujiang, Ye, Fanghua, Wan, Zhongwei, Zheng, Chuanyang, Zhao, Chenyang, Shen, Hui, Li, Alexander Hanbo, Tao, Chaofan, Tan, Haochen, Bai, Haoli, Shang, Lifeng, Kong, Lingpeng, Wong, Ngai
Large language models (LLMs) benefit from test-time scaling but are often hampered by high inference latency. Speculative decoding is a natural way to accelerate the scaling process; however, scaling along both the parallel and sequential dimensions poses significant challenges, including substantial memory-bound execution and synchronization overhead. We introduce ATTS (Asynchronous Test-Time Scaling), a statistically guaranteed adaptive scaling framework that follows the hypothesis testing process to address these challenges. By revisiting arithmetic intensity, ATTS identifies synchronization as the primary bottleneck. It enables asynchronous inference through online calibration and proposes an ordinal classification algorithm that supports a three-stage rejection sampling pipeline, scaling along both the sequential and parallel axes. Across experiments on the MA TH, AMC23, AIME24, and AIME25 datasets and across multiple draft-target model families, we show that ATTS delivers up to 56.7x speedup in test-time scaling and a 4.14x throughput improvement, while maintaining accurate control of the rejection rate, reducing latency and memory overhead, and incurring no accuracy loss. By scaling both in parallel and sequential dimensions, we enable the 1.5B/70B draft/target model combination to achieve the performance of the state-of-the-art reasoning model o3-mini (high) on the AIME dataset. With the rapid advances in large language models (LLMs), attention is increasingly turning to reasoning models (Guo et al., 2025; Muennighoff et al., 2025; McCoy et al., 2024; Shao et al., 2024)--systems that transcend next-token prediction in order to emulate human-like reasoning behaviors.
- North America > United States (0.04)
- Asia > Myanmar > Tanintharyi Region > Dawei (0.04)
- Asia > Middle East > Jordan (0.04)
- Asia > China > Hong Kong (0.04)
CTR-Sink: Attention Sink for Language Models in Click-Through Rate Prediction
Li, Zixuan, Geng, Binzong, Xiong, Jing, He, Yong, Hu, Yuxuan, Chen, Jian, Chen, Dingwei, Chang, Xiyu, Zhang, Liang, Mo, Linjian, Li, Chengming, Yuan, Chuan, Sun, Zhenan
Click-Through Rate (CTR) prediction, a core task in recommendation systems, estimates user click likelihood using historical behavioral data. Modeling user behavior sequences as text to leverage Language Models (LMs) for this task has gained traction, owing to LMs' strong semantic understanding and contextual modeling capabilities. However, a critical structural gap exists: user behavior sequences consist of discrete actions connected by semantically empty separators, differing fundamentally from the coherent natural language in LM pre-training. This mismatch causes semantic fragmentation, where LM attention scatters across irrelevant tokens instead of focusing on meaningful behavior boundaries and inter-behavior relationships, degrading prediction performance. To address this, we propose $\textit{CTR-Sink}$, a novel framework introducing behavior-level attention sinks tailored for recommendation scenarios. Inspired by attention sink theory, it constructs attention focus sinks and dynamically regulates attention aggregation via external information. Specifically, we insert sink tokens between consecutive behaviors, incorporating recommendation-specific signals such as temporal distance to serve as stable attention sinks. To enhance generality, we design a two-stage training strategy that explicitly guides LM attention toward sink tokens and a attention sink mechanism that amplifies inter-sink dependencies to better capture behavioral correlations. Experiments on one industrial dataset and two open-source datasets (MovieLens, Kuairec), alongside visualization results, validate the method's effectiveness across scenarios.
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.70)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Personal Assistant Systems (0.49)
- Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.47)
Enhancing Test-Time Scaling of Large Language Models with Hierarchical Retrieval-Augmented MCTS
Dou, Alex ZH, Wan, Zhongwei, Cui, Dongfei, Wang, Xin, Xiong, Jing, Lin, Haokun, Tao, Chaofan, Yan, Shen, Zhang, Mi
Test-time scaling has emerged as a promising paradigm in language modeling, leveraging additional computational resources at inference time to enhance model performance. In this work, we introduce R2-LLMs, a novel and versatile hierarchical retrieval-augmented reasoning framework designed to improve test-time scaling in large language models (LLMs) without requiring distillation from more advanced models to obtain chain-of-thought (CoT) training data. R2-LLMs enhances inference-time generalization by integrating dual-level retrieval-based in-context learning: (1) At the coarse level, our approach extracts abstract templates from complex reasoning problems and retrieves similar problem-answer pairs to facilitate high-level in-context learning; (2) At the fine level, during Monte Carlo Tree Search (MCTS), R2-LLMs efficiently retrieves analogous intermediate solution steps from reference mathematical problem datasets, refining step-wise reasoning with the aid of a process reward model (PRM) for scoring. R2-LLMs is a robust hierarchical reasoning-augmentation method that enhances in-context-level reasoning while seamlessly integrating with step-level tree search methods. Utilizing PRM, it refines both candidate generation and decision-making for improved reasoning accuracy. Empirical evaluations on the MATH500, GSM8K, and OlympiadBench-TO datasets achieve substantial relative improvement with an increase of up to 16% using LLaMA-3.1-8B compared to the baselines, showcasing the effectiveness of our approach in complex reasoning tasks.
- Asia > Middle East > Jordan (0.04)
- Asia > China > Hong Kong (0.04)
- North America > United States > Ohio (0.04)
Mathesis: Towards Formal Theorem Proving from Natural Languages
Xuejun, Yu, Zhong, Jianyuan, Feng, Zijin, Zhai, Pengyi, Yousefzadeh, Roozbeh, Ng, Wei Chong, Liu, Haoxiong, Shou, Ziyi, Xiong, Jing, Zhou, Yudong, Ong, Claudia Beth, Sugiarto, Austen Jeremy, Zhang, Yaoxi, Tai, Wai Ming, Cao, Huan, Lu, Dongcai, Sun, Jiacheng, Xu, Qiang, Xin, Shen, Li, Zhenguo
Recent advances in large language models show strong promise for formal reasoning. However, most LLM-based theorem provers have long been constrained by the need for expert-written formal statements as inputs, limiting their applicability to real-world problems expressed in natural language. We tackle this gap with Mathesis, the first end-to-end theorem proving pipeline processing informal problem statements. It contributes Mathesis-Autoformalizer, the first autoformalizer using reinforcement learning to enhance the formalization ability of natural language problems, aided by our novel LeanScorer framework for nuanced formalization quality assessment. It also proposes a Mathesis-Prover, which generates formal proofs from the formalized statements. To evaluate the real-world applicability of end-to-end formal theorem proving, we introduce Gaokao-Formal, a benchmark of 488 complex problems from China's national college entrance exam. Our approach is carefully designed, with a thorough study of each component. Experiments demonstrate Mathesis's effectiveness, with the autoformalizer outperforming the best baseline by 22% in pass-rate on Gaokao-Formal. The full system surpasses other model combinations, achieving 64% accuracy on MiniF2F with pass@32 and a state-of-the-art 18% on Gaokao-Formal.