Liu, Yongfei
FullStack Bench: Evaluating LLMs as Full Stack Coders
Bytedance-Seed-Foundation-Code-Team: Cheng, Yao, Chen, Jianfeng, Chen, Jie, Chen, Li, Chen, Liyu, Chen, Wentao, Chen, Zhengyu, Geng, Shijie, Li, Aoyan, Li, Bo, Li, Bowen, Li, Linyi, Liu, Boyi, Liu, Jerry, Liu, Kaibo, Liu, Qi, Liu, Shukai, Liu, Siyao, Liu, Tianyi, Liu, Tingkai, Liu, Yongfei, Long, Rui, Mai, Jing, Ning, Guanghan, Peng, Z. Y., Shen, Kai, Su, Jiahao, Su, Jing, Sun, Tao, Sun, Yifan, Tao, Yunzhe, Wang, Guoyin, Wang, Siwei, Wang, Xuwu, Wang, Yite, Wang, Zihan, Xia, Jinxiang, Xiang, Liang, Xiao, Xia, Xiao, Yongsheng, Xi, Chenguang, Xin, Shulin, Xu, Jingjing, Xu, Shikun, Yang, Hongxia, Yang, Jack, Yang, Yingxiang, Yuan, Jianbo, Zhang, Jun, Zhang, Yufeng, Zhang, Yuyu, Zheng, Shen, Zhu, He, Zhu, Ming
As the capabilities of code large language models (LLMs) continue to expand, their applications across diverse code intelligence domains are growing rapidly. However, most existing datasets evaluate only a limited set of application domains. To address this gap, we develop FullStack Bench, a comprehensive code evaluation dataset focused on full-stack programming that spans a wide range of application domains (e.g., basic programming, data analysis, software engineering, mathematics, and machine learning). To further assess multilingual programming capabilities, FullStack Bench provides real-world instructions and corresponding unit test cases in 16 widely used programming languages, designed to reflect real-world usage scenarios rather than simple translations. We also release SandboxFusion, an efficient code sandbox execution tool that supports a variety of programming languages and packages and enables efficient evaluation on FullStack Bench. Comprehensive experimental results demonstrate the necessity and effectiveness of FullStack Bench and SandboxFusion.
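A minimal sketch of the evaluation loop such a benchmark implies: execute each generated solution against its unit tests in an isolated subprocess and aggregate pass@1. The `run_in_sandbox` helper and the single-sample pass@1 aggregation are illustrative assumptions, not the actual SandboxFusion API.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_in_sandbox(solution_code: str, test_code: str, timeout: float = 10.0) -> bool:
    """Hypothetical helper: run a candidate solution plus its unit tests in a
    separate Python subprocess and report whether all tests passed. A real
    sandbox (e.g., SandboxFusion) adds isolation, resource limits, and
    multi-language support; this sketch only captures the control flow."""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate.py"
        script.write_text(solution_code + "\n\n" + test_code)
        try:
            proc = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return False
        return proc.returncode == 0


def pass_at_1(samples: list[tuple[str, str]]) -> float:
    """pass@1 with one sample per problem: the fraction of problems whose
    single generated solution passes all of its unit tests."""
    results = [run_in_sandbox(code, tests) for code, tests in samples]
    return sum(results) / max(len(results), 1)


if __name__ == "__main__":
    demo = [("def add(a, b):\n    return a + b", "assert add(2, 3) == 5")]
    print(f"pass@1 = {pass_at_1(demo):.2f}")
```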
DSTC: Direct Preference Learning with Only Self-Generated Tests and Code to Improve Code LMs
Liu, Zhihan, Zhang, Shenao, Liu, Yongfei, Liu, Boyi, Yang, Yingxiang, Wang, Zhaoran
Direct preference learning offers a promising and computation-efficient alternative to supervised fine-tuning (SFT) for improving code generation in coding large language models (LMs). However, the scarcity of reliable preference data remains a bottleneck that limits how much direct preference learning can improve the coding accuracy of code LMs. In this paper, we introduce \underline{\textbf{D}}irect Preference Learning with Only \underline{\textbf{S}}elf-Generated \underline{\textbf{T}}ests and \underline{\textbf{C}}ode (DSTC), a framework that leverages only self-generated code snippets and tests to construct reliable preference pairs, so that direct preference learning can improve LM coding accuracy without external annotations. DSTC combines a minimax selection process with test-code concatenation to improve preference pair quality, reducing the influence of incorrect self-generated tests and enhancing model performance without the need for costly reward models. When applied with direct preference learning methods such as Direct Preference Optimization (DPO) and Kahneman-Tversky Optimization (KTO), DSTC yields stable improvements in coding accuracy (pass@1 score) across diverse coding benchmarks, including HumanEval, MBPP, and BigCodeBench, demonstrating both its effectiveness and scalability for models of various sizes. This approach autonomously enhances code generation accuracy across LLMs of varying sizes, reducing reliance on expensive annotated coding datasets.
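A rough sketch of the kind of selection over self-generated artifacts the abstract describes: each candidate snippet is scored by how many self-generated tests it passes, the best-scoring snippet becomes the chosen sample and the worst the rejected one, and each is concatenated with the tests. The `passes` execution oracle and this simplified score-based selection are illustrative stand-ins; the actual minimax construction follows the DSTC paper.

```python
def passes(code: str, test: str) -> bool:
    """Hypothetical execution oracle: run a self-generated test against a
    self-generated snippet and report success. Stubbed with exec() purely for
    illustration; a real pipeline would run this inside a sandbox."""
    env: dict = {}
    try:
        exec(code, env)   # define the candidate function(s)
        exec(test, env)   # run the self-generated assertion(s)
        return True
    except Exception:
        return False


def build_preference_pair(codes: list[str], tests: list[str]) -> tuple[str, str]:
    """Score each snippet by the number of self-generated tests it passes,
    take the best as 'chosen' and the worst as 'rejected', and concatenate
    each with the test block to form a preference pair."""
    scores = [sum(passes(c, t) for t in tests) for c in codes]
    chosen = codes[max(range(len(codes)), key=scores.__getitem__)]
    rejected = codes[min(range(len(codes)), key=scores.__getitem__)]
    test_block = "\n".join(tests)
    return chosen + "\n" + test_block, rejected + "\n" + test_block


if __name__ == "__main__":
    codes = ["def inc(x):\n    return x + 1", "def inc(x):\n    return x - 1"]
    tests = ["assert inc(1) == 2", "assert inc(0) == 1"]
    chosen, rejected = build_preference_pair(codes, tests)
    print("chosen sample:\n" + chosen)
```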
Just Say What You Want: Only-prompting Self-rewarding Online Preference Optimization
Xu, Ruijie, Liu, Zhihan, Liu, Yongfei, Yan, Shipeng, Wang, Zhaoran, Zhang, Zhi, He, Xuming
We address the challenge of online Reinforcement Learning from Human Feedback (RLHF), focusing on self-rewarding alignment methods. In online RLHF, obtaining feedback requires interaction with the environment, which can be costly when additional reward models or the GPT-4 API are used. Current self-rewarding approaches rely heavily on the discriminator's judgment capabilities, which are effective for large-scale models but difficult to transfer to smaller ones. To address these limitations, we propose a novel, only-prompting self-rewarding online algorithm that generates preference datasets without relying on judgment capabilities. Additionally, we employ fine-grained arithmetic control over the optimality gap between positive and negative examples, generating more hard negatives in the later stages of training to help the model better capture subtle human preferences. Finally, we conduct extensive experiments on two base models, Mistral-7B and Mistral-Instruct-7B, which significantly bootstrap the performance of the reference model, achieving 34.5% in the Length-controlled Win Rates of AlpacaEval 2.0.
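A hedged sketch of how an only-prompting self-rewarding loop might construct preference pairs without a separate judge, assuming a generic `generate` callable standing in for the policy model. The prompt wording and the gap schedule below are illustrative assumptions, not the paper's exact templates or control formula.

```python
from typing import Callable

PolicyLM = Callable[[str], str]  # prompt -> completion; stands in for the policy model


def make_preference_pair(generate: PolicyLM, instruction: str, gap: float) -> dict:
    """Build a (chosen, rejected) pair purely by prompting the model itself,
    with no reward model or external judge. `gap` in [0, 1] controls how far
    the negative should fall below the positive: a large gap asks for a clearly
    worse answer, a small gap asks for a subtly flawed (harder) negative.
    The prompt phrasings here are illustrative."""
    chosen = generate(f"Respond as helpfully and accurately as possible.\n\n{instruction}")
    if gap > 0.5:
        neg_hint = "Give a brief, low-effort answer that misses key details."
    else:
        neg_hint = "Give an answer that looks plausible but contains a subtle flaw."
    rejected = generate(f"{neg_hint}\n\n{instruction}")
    return {"prompt": instruction, "chosen": chosen, "rejected": rejected}


def gap_schedule(step: int, total_steps: int) -> float:
    """Anneal the optimality gap from large to small so that harder negatives
    appear in the later stages of training."""
    return max(0.1, 1.0 - step / max(total_steps, 1))


if __name__ == "__main__":
    echo: PolicyLM = lambda p: f"<completion for: {p.splitlines()[-1]}>"  # placeholder model
    pair = make_preference_pair(echo, "Explain what DPO optimizes.", gap_schedule(90, 100))
    print(pair["rejected"])
```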
Exploring the Reasoning Abilities of Multimodal Large Language Models (MLLMs): A Comprehensive Survey on Emerging Trends in Multimodal Reasoning
Wang, Yiqi, Chen, Wentao, Han, Xiaotian, Lin, Xudong, Zhao, Haiteng, Liu, Yongfei, Zhai, Bohan, Yuan, Jianbo, You, Quanzeng, Yang, Hongxia
Strong Artificial Intelligence (Strong AI) or Artificial General Intelligence (AGI) with abstract reasoning ability is the goal of next-generation AI. Recent advancements in Large Language Models (LLMs), along with the emerging field of Multimodal Large Language Models (MLLMs), have demonstrated impressive capabilities across a wide range of multimodal tasks and applications. In particular, various MLLMs, each with distinct model architectures, training data, and training stages, have been evaluated across a broad range of MLLM benchmarks. These studies have, to varying degrees, revealed different aspects of the current capabilities of MLLMs. However, the reasoning abilities of MLLMs have not been systematically investigated. In this survey, we comprehensively review the existing evaluation protocols for multimodal reasoning, categorize and illustrate the frontiers of MLLMs, introduce recent trends in the application of MLLMs to reasoning-intensive tasks, and finally discuss current practices and future directions. We believe our survey establishes a solid base and sheds light on this important topic of multimodal reasoning.
Reason out Your Layout: Evoking the Layout Master from Large Language Models for Text-to-Image Synthesis
Chen, Xiaohui, Liu, Yongfei, Yang, Yingxiang, Yuan, Jianbo, You, Quanzeng, Liu, Li-Ping, Yang, Hongxia
Recent advancements in text-to-image (T2I) generative models have shown remarkable capabilities in producing diverse and imaginative visuals from text prompts. Despite these advances, diffusion models sometimes struggle to fully translate the semantic content of the text into images. While conditioning on layout has been shown to improve the compositional ability of T2I diffusion models, it typically requires manual layout input. In this work, we introduce a novel approach that improves T2I diffusion models by using Large Language Models (LLMs) as layout generators. Our method leverages Chain-of-Thought prompting of LLMs to interpret the text and generate spatially reasonable object layouts. The generated layout is then used to improve the composition and spatial accuracy of the generated images. Moreover, we propose an efficient adapter based on a cross-attention mechanism that explicitly integrates the layout information into stable diffusion models. Our experiments demonstrate significant improvements in image quality and layout accuracy, showcasing the potential of LLMs for augmenting generative image models.
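A minimal sketch of the layout-as-intermediate idea the abstract describes: prompt an LLM to reason about a caption and emit object bounding boxes, then hand the parsed layout to a layout-conditioned generator. The `call_llm` callable and the prompt template are hypothetical; the paper's cross-attention adapter operates inside the diffusion model and is not reproduced here.

```python
import json
from typing import Callable

LLM = Callable[[str], str]  # prompt -> text; stands in for any chat-completion API

LAYOUT_PROMPT = """You are a layout planner. For the caption below, reason step by step
about which objects appear and where they should go, then output ONLY a JSON list of
{{"object": name, "bbox": [x0, y0, x1, y1]}} with coordinates normalized to [0, 1].

Caption: {caption}
"""


def plan_layout(call_llm: LLM, caption: str) -> list[dict]:
    """Ask the LLM for a chain-of-thought layout plan and parse the JSON tail.
    Assumes the model ends its reply with the JSON list; a real system would add
    retries and validation (boxes inside the canvas, non-empty list, etc.)."""
    reply = call_llm(LAYOUT_PROMPT.format(caption=caption))
    boxes = json.loads(reply[reply.find("["):])
    return [b for b in boxes if len(b.get("bbox", [])) == 4]


if __name__ == "__main__":
    # Placeholder model that returns a canned chain-of-thought plus a JSON layout.
    fake_llm: LLM = lambda p: (
        'The cat sits left of the dog. '
        '[{"object": "cat", "bbox": [0.05, 0.4, 0.45, 0.9]}, '
        '{"object": "dog", "bbox": [0.55, 0.35, 0.95, 0.9]}]'
    )
    layout = plan_layout(fake_llm, "a cat and a dog sitting on a lawn")
    # A layout-conditioned generator (e.g., a GLIGEN-style pipeline or the paper's
    # adapter) would consume `layout` together with the caption at this point.
    print(layout)
```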
Grounded Image Text Matching with Mismatched Relation Reasoning
Wu, Yu, Wei, Yana, Wang, Haozhe, Liu, Yongfei, Yang, Sibei, He, Xuming
This paper introduces Grounded Image Text Matching with Mismatched Relation (GITM-MR), a novel visual-linguistic joint task that evaluates the relation understanding capabilities of transformer-based pre-trained models. GITM-MR requires a model to first determine if an expression describes an image, then localize referred objects or ground the mismatched parts of the text. We provide a benchmark for evaluating pre-trained models on this task, with a focus on the challenging settings of limited data and out-of-distribution sentence lengths. Our evaluation demonstrates that pre-trained models lack data efficiency and length generalization ability. To address this, we propose the Relation-sensitive Correspondence Reasoning Network (RCRN), which incorporates relation-aware reasoning via bi-directional message propagation guided by language structure. RCRN can be interpreted as a modular program and delivers strong performance in both length generalization and data efficiency.
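A hedged sketch of bi-directional message propagation over a language-structure graph, the general mechanism the abstract attributes to RCRN. The graph construction, feature dimensions, and gated update rule below are illustrative stand-ins, not the paper's architecture.

```python
import torch
import torch.nn as nn


class BiDirectionalMessagePassing(nn.Module):
    """Toy relation-aware message passing: nodes are phrase embeddings, edges
    follow a (hypothetical) language parse, and messages flow both along and
    against edge direction before a gated node update."""

    def __init__(self, dim: int = 256, steps: int = 3):
        super().__init__()
        self.steps = steps
        self.fwd_msg = nn.Linear(dim, dim)   # messages along parse edges
        self.bwd_msg = nn.Linear(dim, dim)   # messages against parse edges
        self.update = nn.GRUCell(dim, dim)   # gated node update

    def forward(self, nodes: torch.Tensor, edges: list[tuple[int, int]]) -> torch.Tensor:
        # nodes: (num_nodes, dim); edges: (src, dst) pairs from the language structure
        for _ in range(self.steps):
            agg = torch.zeros_like(nodes)
            for src, dst in edges:
                agg[dst] = agg[dst] + self.fwd_msg(nodes[src])  # forward direction
                agg[src] = agg[src] + self.bwd_msg(nodes[dst])  # backward direction
            nodes = self.update(agg, nodes)
        return nodes


if __name__ == "__main__":
    phrases = torch.randn(4, 256)            # e.g., "man", "holding", "umbrella", "red"
    parse_edges = [(0, 1), (1, 2), (3, 2)]   # subject->relation->object, modifier->object
    refined = BiDirectionalMessagePassing()(phrases, parse_edges)
    print(refined.shape)  # torch.Size([4, 256])
```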