Li, Juntao
Stick to Facts: Towards Fidelity-oriented Product Description Generation
Chan, Zhangming, Chen, Xiuying, Wang, Yongliang, Li, Juntao, Zhang, Zhiqiang, Gai, Kun, Zhao, Dongyan, Yan, Rui
Different from other text generation tasks, in product description generation, it is of vital importance to generate faithful descriptions that stick to the product attribute information. However, little attention has been paid to this problem. To bridge this gap, we propose a model named Fidelity-oriented Product Description Generator (FPDG). FPDG takes the entity label of each word into account, since the product attribute information is always conveyed by entity words. Specifically, we first propose a Recurrent Neural Network (RNN) decoder based on the Entity-label-guided Long Short-Term Memory (ELSTM) cell, taking both the embedding and the entity label of each word as input. Second, we establish a keyword memory that stores the entity labels as keys and keywords as values, allowing FPDG to attend to keywords by attending to their entity labels. Experiments conducted on a large-scale real-world product description dataset show that our model achieves state-of-the-art performance in terms of both traditional generation metrics and human evaluations. Specifically, FPDG increases the fidelity of the generated descriptions by 25%.
XIFBench: Evaluating Large Language Models on Multilingual Instruction Following
Li, Zhenyu, Chen, Kehai, Long, Yunfei, Bai, Xuefeng, Zhang, Yaoyin, Wei, Xuchen, Li, Juntao, Zhang, Min
Large Language Models (LLMs) have demonstrated remarkable instruction-following capabilities across various applications. However, their performance in multilingual settings remains poorly understood, as existing evaluations lack fine-grained constraint analysis. We introduce XIFBench, a comprehensive constraint-based benchmark for assessing multilingual instruction-following abilities of LLMs, featuring a novel taxonomy of five constraint categories and 465 parallel instructions across six languages spanning different resource levels. To ensure consistent cross-lingual evaluation, we develop a requirement-based protocol that leverages English requirements as semantic anchors. These requirements are then used to validate the translations across languages. Extensive experiments with various LLMs reveal notable variations in instruction-following performance across resource levels, identifying key influencing factors such as constraint categories, instruction complexity, and cultural specificity.
The Power of Personality: A Human Simulation Perspective to Investigate Large Language Model Agents
Duan, Yifan, Tang, Yihong, Bai, Xuefeng, Chen, Kehai, Li, Juntao, Zhang, Min
Large language models (LLMs) excel in both closed tasks (including problem-solving, and code generation) and open tasks (including creative writing), yet existing explanations for their capabilities lack connections to real-world human intelligence. To fill this gap, this paper systematically investigates LLM intelligence through the lens of ``human simulation'', addressing three core questions: (1) How do personality traits affect problem-solving in closed tasks? (2) How do traits shape creativity in open tasks? (3) How does single-agent performance influence multi-agent collaboration? By assigning Big Five personality traits to LLM agents and evaluating their performance in single- and multi-agent settings, we reveal that specific traits significantly influence reasoning accuracy (closed tasks) and creative output (open tasks). Furthermore, multi-agent systems exhibit collective intelligence distinct from individual capabilities, driven by distinguishing combinations of personalities. We demonstrate that LLMs inherently simulate human behavior through next-token prediction, mirroring human language, decision-making, and collaborative dynamics.
ASurvey: Spatiotemporal Consistency in Video Generation
Yin, Zhiyu, Chen, Kehai, Bai, Xuefeng, Jiang, Ruili, Li, Juntao, Li, Hongdong, Liu, Jin, Xiang, Yang, Yu, Jun, Zhang, Min
Video generation, by leveraging a dynamic visual generation method, pushes the boundaries of Artificial Intelligence Generated Content (AIGC). Video generation presents unique challenges beyond static image generation, requiring both high-quality individual frames and temporal coherence to maintain consistency across the spatiotemporal sequence. Recent works have aimed at addressing the spatiotemporal consistency issue in video generation, while few literature review has been organized from this perspective. This gap hinders a deeper understanding of the underlying mechanisms for high-quality video generation. In this survey, we systematically review the recent advances in video generation, covering five key aspects: foundation models, information representations, generation schemes, post-processing techniques, and evaluation metrics. We particularly focus on their contributions to maintaining spatiotemporal consistency. Finally, we discuss the future directions and challenges in this field, hoping to inspire further efforts to advance the development of video generation.
Revealing and Mitigating Over-Attention in Knowledge Editing
Wang, Pinzheng, Tang, Zecheng, Zhou, Keyan, Li, Juntao, Zhu, Qiaoming, Zhang, Min
Large Language Models have demonstrated superior performance across a wide range of tasks, but they still exhibit undesirable errors due to incorrect knowledge learned from the training data. To avoid this, knowledge editing methods emerged to precisely edit the specific model knowledge via efficiently modifying a very small percentage of parameters. % However, those methods can lead to the problem of Specificity Failure: when the content related to the edited knowledge occurs in the context, it can inadvertently corrupt other pre-existing knowledge. However, those methods can lead to the problem of Specificity Failure, where the existing knowledge and capabilities are severely degraded due to editing. Our preliminary indicates that Specificity Failure primarily stems from the model's attention heads assigning excessive attention scores to entities related to the edited knowledge, thereby unduly focusing on specific snippets within the context, which we denote as the Attention Drift phenomenon. To mitigate such Attention Drift issue, we introduce a simple yet effective method Selective Attention Drift Restriction}(SADR), which introduces an additional regularization term during the knowledge editing process to restrict changes in the attention weight distribution, thereby preventing undue focus on the edited entity. Experiments on five frequently used strong LLMs demonstrate the effectiveness of our method, where SADR can significantly mitigate Specificity Failure in the predominant knowledge editing tasks.
Thoughts Are All Over the Place: On the Underthinking of o1-Like LLMs
Wang, Yue, Liu, Qiuzhi, Xu, Jiahao, Liang, Tian, Chen, Xingyu, He, Zhiwei, Song, Linfeng, Yu, Dian, Li, Juntao, Zhang, Zhuosheng, Wang, Rui, Tu, Zhaopeng, Mi, Haitao, Yu, Dong
Large language models (LLMs) such as OpenAI's o1 have demonstrated remarkable abilities in complex reasoning tasks by scaling test-time compute and exhibiting human-like deep thinking. However, we identify a phenomenon we term underthinking, where o1-like LLMs frequently switch between different reasoning thoughts without sufficiently exploring promising paths to reach a correct solution. This behavior leads to inadequate depth of reasoning and decreased performance, particularly on challenging mathematical problems. To systematically analyze this issue, we conduct experiments on three challenging test sets and two representative open-source o1-like models, revealing that frequent thought switching correlates with incorrect responses. We introduce a novel metric to quantify underthinking by measuring token efficiency in incorrect answers. To address underthinking, we propose a decoding strategy with thought switching penalty TIP that discourages premature transitions between thoughts, encouraging deeper exploration of each reasoning path. Experimental results demonstrate that our approach improves accuracy across challenging datasets without requiring model fine-tuning. Our findings contribute to understanding reasoning inefficiencies in o1-like LLMs and offer a practical solution to enhance their problem-solving capabilities.
The Convergence of Dynamic Routing between Capsules
Ye, Daoyuan, Li, Juntao, Shen, Yiting
Capsule networks(CapsNet) are recently proposed neural network models with new processing layers, specifically for entity representation and discovery of images. It is well known that CapsNet have some advantages over traditional neural networks, especially in generalization capability. At the same time, some studies report negative experimental results. The causes of this contradiction have not been thoroughly analyzed. The preliminary experimental results show that the behavior of routing algorithms does not always produce good results as expected, and in most cases, different routing algorithms do not change the classification results, but simply polarize the link strength, especially when they continue to repeat without stopping. To realize the true potential of the CapsNet, deep mathematical analysis of the routing algorithms is crucial. In this paper, we will give the objective function that is minimized by the dynamic routing algorithm, which is a concave function. The dynamic routing algorithm can be regarded as nonlinear gradient method to solving an optimization algorithm under linear constraints, and its convergence can be strictly proved mathematically. Furthermore, the mathematically rigorous proof of the convergence is given for this class of iterative routing procedures. We analyze the relation between the objective function and the constraints solved by the dynamic routing algorithm in detail, and perform the corresponding routing experiment to analyze the effect of our convergence proof.
Test-time Computing: from System-1 Thinking to System-2 Thinking
Ji, Yixin, Li, Juntao, Ye, Hai, Wu, Kaixin, Xu, Jia, Mo, Linjian, Zhang, Min
The remarkable performance of the o1 model in complex reasoning demonstrates that test-time computing scaling can further unlock the model's potential, enabling powerful System-2 thinking. However, there is still a lack of comprehensive surveys for test-time computing scaling. We trace the concept of test-time computing back to System-1 models. In System-1 models, test-time computing addresses distribution shifts and improves robustness and generalization through parameter updating, input modification, representation editing, and output calibration. In System-2 models, it enhances the model's reasoning ability to solve complex problems through repeated sampling, self-correction, and tree search. We organize this survey according to the trend of System-1 to System-2 thinking, highlighting the key role of test-time computing in the transition from System-1 models to weak System-2 models, and then to strong System-2 models. We also point out a few possible future directions.
LOGO -- Long cOntext aliGnment via efficient preference Optimization
Tang, Zecheng, Sun, Zechen, Li, Juntao, Zhu, Qiaoming, Zhang, Min
Long-context models (LCMs) have shown great potential in processing long input sequences (even more than 100M tokens) conveniently and effectively. With significant progress, recent research has pointed out that LCMs can accurately locate token-level salient information within the context. Yet, the generation performance of these LCMs is far from satisfactory and might result in misaligned responses, such as hallucinations. To enhance the generation capability of LCMs, existing works have investigated the effects of data size and quality for both pretraining and instruction tuning. Though achieving meaningful improvement, previous methods fall short in either effectiveness or efficiency. In this paper, we introduce LOGO (Long cOntext aliGnment via efficient preference Optimization), a training strategy that first introduces preference optimization for long-context alignment. To overcome the GPU memory-bound issue caused by the long sequence, LOGO employs a reference-free preference optimization strategy and adopts a position synthesis method to construct the training data. By training with only 0.3B data on a single 8 A800 GPU machine for 16 hours, LOGO allows the Llama-3-8B-Instruct-80K model to achieve comparable performance with GPT-4 in real-world long-context tasks while preserving the model's original capabilities on other tasks, e.g., language modeling and MMLU. Moreover, LOGO can extend the model's context window size while enhancing its generation performance. Figure 1: (a) Performance of LCMs on real-world long-context tasks; (b) Retrieval score (longcontext understanding ability) and recall score (generation ability) of LCMs on the synthetic retrieval long-context task (multi-value NIAH); (c) Long-context (pre-)training data size for each LCM.
Unleashing Reasoning Capability of LLMs via Scalable Question Synthesis from Scratch
Ding, Yuyang, Shi, Xinyu, Liang, Xiaobo, Li, Juntao, Zhu, Qiaoming, Zhang, Min
The availability of high-quality data is one of the most important factors in improving the reasoning capability of LLMs. Existing works have demonstrated the effectiveness of creating more instruction data from seed questions or knowledge bases. Recent research indicates that continually scaling up data synthesis from strong models (e.g., GPT-4) can further elicit reasoning performance. Though promising, the open-sourced community still lacks high-quality data at scale and scalable data synthesis methods with affordable costs. To address this, we introduce ScaleQuest, a scalable and novel data synthesis method that utilizes "smallsize" (e.g., 7B) open-source models to generate questions from scratch without the need for seed data with complex augmentation constraints. With the efficient ScaleQuest, we automatically constructed a mathematical reasoning dataset consisting of 1 million problem-solution pairs, which are more effective than existing open-sourced datasets. It can universally increase the performance of mainstream open-source models (i.e., Mistral, Llama3, DeepSeekMath, and Qwen2-Math) by achieving 29.2% to 46.4% gains on MATH. Notably, simply fine-tuning the Qwen2-Math-7B-Base model with our dataset can even surpass Qwen2-Math-7B-Instruct, a strong and well-aligned model on closed-source data, and proprietary models such as GPT-4-Turbo and Claude-3.5 Right: Results of Llama3-8B fine-tuned on publicly available datasets constructed by different methods. Juntao Li is the corresponding author. How to improve the reasoning capabilities of Large Language Models (LLMs) has attracted significant attention. The success of recent advanced models, such as OpenAI o1 and Claude-3.5, However, the proprietary nature of the data presents a significant barrier to the open-source community.