Goto

Collaborating Authors

 Bu, Xingyuan


SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines

arXiv.org Artificial Intelligence

Large language models (LLMs) have demonstrated remarkable proficiency in mainstream academic disciplines such as mathematics, physics, and computer science. However, human knowledge encompasses over 200 specialized disciplines, far exceeding the scope of existing benchmarks. The capabilities of LLMs in many of these specialized fields, particularly in light industry, agriculture, and service-oriented disciplines, remain inadequately evaluated. To address this gap, we present SuperGPQA, a comprehensive benchmark that evaluates graduate-level knowledge and reasoning capabilities across 285 disciplines. Our benchmark employs a novel Human-LLM collaborative filtering mechanism to eliminate trivial or ambiguous questions through iterative refinement based on both LLM responses and expert feedback. Our experimental results reveal significant room for improvement in the performance of current state-of-the-art LLMs across diverse knowledge domains (e.g., the reasoning-focused model DeepSeek-R1 achieved the highest accuracy of 61.82% on SuperGPQA), highlighting the considerable gap between current model capabilities and artificial general intelligence. Additionally, we present comprehensive insights from our management of a large-scale annotation process, involving over 80 expert annotators and an interactive Human-LLM collaborative system, offering valuable methodological guidance for future research initiatives of comparable scope.
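A minimal sketch of what one pass of such a Human-LLM collaborative filtering loop might look like, assuming each candidate question stores the answers several screening LLMs gave it: questions nearly every model solves are flagged as trivial, and questions where model answers scatter without matching the answer key are routed back to experts as possibly ambiguous. The class and function names, and the screening heuristics, are hypothetical illustrations, not SuperGPQA's actual pipeline.

```python
# Hypothetical sketch of one Human-LLM collaborative filtering pass.
from dataclasses import dataclass, field

@dataclass
class Question:
    text: str
    gold: str                      # expert-provided answer key
    llm_answers: list = field(default_factory=list)

def needs_expert_review(q: Question, easy_threshold: float = 0.9) -> str | None:
    """Return a review reason, or None if the question is kept as-is."""
    if not q.llm_answers:
        return None
    correct = sum(a == q.gold for a in q.llm_answers) / len(q.llm_answers)
    if correct >= easy_threshold:
        return "trivial"             # nearly every screening model solves it
    if len(set(q.llm_answers)) == len(q.llm_answers) and correct == 0.0:
        return "possibly ambiguous"  # answers scatter and none match the key
    return None

def screen_round(questions: list[Question]) -> dict[str, list[Question]]:
    """One filtering pass: keep questions or route them back to experts."""
    buckets: dict[str, list[Question]] = {"keep": [], "trivial": [], "possibly ambiguous": []}
    for q in questions:
        reason = needs_expert_review(q)
        buckets[reason or "keep"].append(q)
    return buckets
```

Flagged questions would then be revised by experts and re-screened in the next iteration, which is where the iterative refinement in the abstract comes from.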


Can Large Language Models Detect Errors in Long Chain-of-Thought Reasoning?

arXiv.org Artificial Intelligence

Recently, o1-like models have drawn significant attention; these models produce long Chain-of-Thought (CoT) reasoning steps to improve the reasoning abilities of existing Large Language Models (LLMs). In this paper, to understand the quality of these long CoTs and measure the critique abilities of existing LLMs on them, we introduce DeltaBench, a benchmark of long CoTs generated by different o1-like models (e.g., QwQ, DeepSeek-R1) for different reasoning tasks (e.g., math, code, general reasoning), designed to measure the ability to detect errors in long CoT reasoning. Based on DeltaBench, we first perform a fine-grained analysis of the generated long CoTs to examine the effectiveness and efficiency of different o1-like models. We then conduct extensive evaluations of existing process reward models (PRMs) and critic models on detecting the errors in each annotated reasoning process, aiming to probe the boundaries and limitations of existing PRMs and critic models. Finally, we hope that DeltaBench can guide developers to better understand the long CoT reasoning abilities of their models.
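As a rough illustration of how error detection on annotated long CoTs can be scored, the sketch below computes section-level precision, recall, and F1 for the sections a critic flags against the sections annotators marked as erroneous. Representing sections by integer indices is an assumption for illustration, not the DeltaBench schema.

```python
# Hypothetical section-level scoring of a critic's error detections.
def detection_scores(predicted_error_sections: set[int],
                     annotated_error_sections: set[int]) -> dict[str, float]:
    """Precision/recall/F1 of a critic's flagged CoT sections."""
    tp = len(predicted_error_sections & annotated_error_sections)
    precision = tp / len(predicted_error_sections) if predicted_error_sections else 0.0
    recall = tp / len(annotated_error_sections) if annotated_error_sections else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Example: the critic flags sections 2 and 5, annotators marked 2 and 7.
print(detection_scores({2, 5}, {2, 7}))  # {'precision': 0.5, 'recall': 0.5, 'f1': 0.5}
```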


Equilibrate RLHF: Towards Balancing Helpfulness-Safety Trade-off in Large Language Models

arXiv.org Artificial Intelligence

Fine-tuning large language models (LLMs) based on human preferences, commonly achieved through reinforcement learning from human feedback (RLHF), has been effective in improving their performance. However, maintaining LLM safety throughout the fine-tuning process remains a significant challenge, as resolving conflicts between safety and helpfulness can be non-trivial. Typically, the safety alignment of LLMs is trained on data with safety-related categories. However, our experiments find that naively increasing the scale of safety training data usually leads LLMs to an "overly safe" state rather than a "truly safe" one: the refusal rate rises with extensive safety-aligned data without the model genuinely understanding what a safe response requires. Such an approach can inadvertently diminish the models' helpfulness. To understand this phenomenon, we first investigate the role of safety data by categorizing it into three groups and observe that each group behaves differently as training data scales up. To improve the balance between safety and helpfulness, we propose an Equilibrate RLHF framework comprising a Fine-grained Data-centric (FDC) approach, which achieves better safety alignment with less training data, and an Adaptive Message-wise Alignment (AMA) approach, which selectively highlights key segments through a gradient masking strategy. Extensive experimental results demonstrate that our approach significantly enhances the safety alignment of LLMs while balancing safety and helpfulness.
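The gradient masking idea behind selectively highlighting key segments can be sketched as a per-token loss that is zeroed outside those segments, so only the key spans contribute gradient. This is a generic PyTorch illustration under assumed tensor shapes, not the paper's exact AMA formulation.

```python
# Illustrative gradient masking over token spans (not the paper's exact AMA loss).
import torch
import torch.nn.functional as F

def masked_alignment_loss(logits: torch.Tensor,          # (batch, seq, vocab)
                          targets: torch.Tensor,          # (batch, seq) token ids
                          key_segment_mask: torch.Tensor  # (batch, seq), 1 = key-segment token
                          ) -> torch.Tensor:
    """Cross-entropy restricted to key-segment tokens; other tokens get no gradient."""
    mask = key_segment_mask.float()
    per_token = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        reduction="none",
    ).reshape(targets.shape)
    return (per_token * mask).sum() / mask.sum().clamp(min=1.0)
```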


Adaptive Dense Reward: Understanding the Gap Between Action and Reward Space in Alignment

arXiv.org Artificial Intelligence

Reinforcement Learning from Human Feedback (RLHF) has proven highly effective in aligning Large Language Models (LLMs) with human preferences. However, the original RLHF typically optimizes under an overall reward, which can lead to a suboptimal learning process. This limitation stems from RLHF's lack of awareness regarding which specific tokens should be reinforced or suppressed. Moreover, conflicts in supervision can arise, for instance, when a chosen response includes erroneous tokens while a rejected response contains accurate elements. To rectify these shortcomings, a growing number of dense reward methods, such as step-wise and token-wise RLHF, have been proposed. However, these existing methods are limited to specific tasks (like mathematics). In this paper, we propose the "Adaptive Message-wise RLHF" method, which applies robustly across various tasks. By defining pivot tokens as key indicators, our approach adaptively identifies essential information and converts sequence-level supervision into fine-grained, subsequence-level supervision. Experiments demonstrate that our method can be integrated into various training methods, significantly mitigating hallucination and catastrophic forgetting problems, while outperforming other methods on multiple evaluation metrics. Our method improves the success rate on adversarial samples by 10% compared to the sample-wise approach, and achieves a 1.3% improvement on evaluation benchmarks such as MMLU, GSM8K, and HumanEval. In recent years, generative AI models have made significant progress, with preference alignment via reinforcement learning playing an essential role (Ouyang et al., 2022; Touvron et al., 2023; Rafailov et al., 2024; Dubey et al., 2024; Yang et al., 2024a; OpenAI et al., 2024).
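To make the idea of converting sequence-level preferences into subsequence-level supervision concrete, the sketch below uses one simple heuristic: take the first token where the chosen and rejected responses diverge as the pivot and supervise only the differing tails. This pivot rule is an assumption for illustration, not the paper's adaptive selection mechanism.

```python
# Hypothetical pivot-based reduction of a sequence-level preference pair.
def subsequence_supervision(chosen_ids: list[int], rejected_ids: list[int]):
    """Return (pivot_index, chosen_tail, rejected_tail) for fine-grained supervision."""
    pivot = 0
    for a, b in zip(chosen_ids, rejected_ids):
        if a != b:
            break
        pivot += 1
    return pivot, chosen_ids[pivot:], rejected_ids[pivot:]

# Example: shared prefix of length 3, supervision applies only to the differing tails.
print(subsequence_supervision([5, 9, 2, 7, 7], [5, 9, 2, 4]))  # (3, [7, 7], [4])
```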


Chinese SimpleQA: A Chinese Factuality Evaluation for Large Language Models

arXiv.org Artificial Intelligence

New LLM evaluation benchmarks are important to keep pace with the rapid development of Large Language Models (LLMs). In this work, we present Chinese SimpleQA, the first comprehensive Chinese benchmark for evaluating the factuality of language models when answering short questions. Chinese SimpleQA has five main properties (i.e., Chinese, Diverse, High-quality, Static, Easy-to-evaluate). Specifically, first, we focus on the Chinese language, covering 6 major topics with 99 diverse subtopics. Second, we conduct a comprehensive quality-control process to obtain high-quality questions and answers, where the reference answers are static and do not change over time. Third, following SimpleQA, the questions and answers are very short, and the grading process is easy to carry out via the OpenAI API. Based on Chinese SimpleQA, we perform a comprehensive evaluation of the factuality abilities of existing LLMs. Finally, we hope that Chinese SimpleQA can guide developers to better understand the Chinese factuality abilities of their models and facilitate the growth of foundation models.
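As an illustration of the SimpleQA-style grading loop mentioned above, here is a minimal sketch that asks an OpenAI chat model to judge a short prediction against a short reference answer. The prompt wording, the model name, and the CORRECT / INCORRECT / NOT_ATTEMPTED labels are assumptions, not the paper's exact grading setup (it requires the openai Python package and an OPENAI_API_KEY in the environment).

```python
# Hypothetical short-answer grading via the OpenAI chat API.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def grade_answer(question: str, reference: str, prediction: str,
                 model: str = "gpt-4o-mini") -> str:
    prompt = (
        "Grade the predicted answer against the reference answer.\n"
        f"Question: {question}\nReference: {reference}\nPrediction: {prediction}\n"
        "Reply with exactly one word: CORRECT, INCORRECT, or NOT_ATTEMPTED."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()
```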


2D-DPO: Scaling Direct Preference Optimization with 2-Dimensional Supervision

arXiv.org Artificial Intelligence

Recent advancements in Direct Preference Optimization (DPO) have significantly enhanced the alignment of Large Language Models (LLMs) with human preferences, owing to the method's simplicity and effectiveness. However, existing methods typically optimize a scalar score or ranking reward, thereby overlooking the multi-dimensional nature of human preferences. In this work, we propose to extend the preference signal of DPO to two dimensions: segments and aspects. We first introduce a 2D supervision dataset called HelpSteer-2D. For the segment dimension, we divide the response into sentences and assign a score to each segment. For the aspect dimension, we meticulously design several criteria covering the response quality rubrics. With the 2-dimensional signals as feedback, we develop a 2D-DPO framework that decomposes the overall objective into multi-segment and multi-aspect objectives. Extensive experiments on popular benchmarks demonstrate that 2D-DPO performs better than methods that optimize for scalar or 1-dimensional preferences.
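One way to picture a segment-and-aspect decomposition of a DPO-style loss is sketched below: aspect scores are collapsed into one weight per segment, and the usual negative log-sigmoid contrast is applied to the weighted sum of per-segment policy/reference log-ratios. This is an illustrative construction under assumed tensor shapes, not the paper's exact 2D-DPO objective.

```python
# Illustrative segment- and aspect-weighted DPO-style loss (not the paper's exact objective).
import torch
import torch.nn.functional as F

def two_dim_dpo_loss(logratio_chosen_seg: torch.Tensor,    # (batch, n_segments) policy/ref log-ratios
                     logratio_rejected_seg: torch.Tensor,  # (batch, n_segments)
                     aspect_scores_chosen: torch.Tensor,   # (batch, n_segments, n_aspects) in [0, 1]
                     aspect_scores_rejected: torch.Tensor,
                     beta: float = 0.1) -> torch.Tensor:
    # Collapse the aspect dimension into one weight per segment (mean over aspects),
    # then contrast weighted chosen vs. rejected segments with -log sigmoid.
    w_chosen = aspect_scores_chosen.mean(dim=-1)
    w_rejected = aspect_scores_rejected.mean(dim=-1)
    margin = (w_chosen * logratio_chosen_seg - w_rejected * logratio_rejected_seg).sum(dim=-1)
    return -F.logsigmoid(beta * margin).mean()
```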


MT-Bench-101: A Fine-Grained Benchmark for Evaluating Large Language Models in Multi-Turn Dialogues

arXiv.org Artificial Intelligence

The advent of Large Language Models (LLMs) has drastically enhanced dialogue systems. However, comprehensively evaluating the dialogue abilities of LLMs remains a challenge. Previous benchmarks have primarily focused on single-turn dialogues or provided coarse-grained and incomplete assessments of multi-turn dialogues, overlooking the complexity and fine-grained nuances of real-life dialogues. To address this issue, we introduce MT-Bench-101, specifically designed to evaluate the fine-grained abilities of LLMs in multi-turn dialogues. By conducting a detailed analysis of real multi-turn dialogue data, we construct a three-tier hierarchical ability taxonomy comprising 4208 turns across 1388 multi-turn dialogues in 13 distinct tasks. We then evaluate 21 popular LLMs on MT-Bench-101, conducting comprehensive analyses from both ability and task perspectives and observing differing trends in LLM performance across dialogue turns within various tasks. Further analysis indicates that neither common alignment techniques nor chat-specific designs have led to obvious improvements in the multi-turn abilities of LLMs. Extensive case studies suggest that our designed tasks accurately assess the corresponding multi-turn abilities. The data and code are available at https://github.com/mtbench101/mt-bench-101.
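A benchmark organized by tasks and abilities ultimately reports per-task averages of judge scores; the sketch below shows that aggregation step under an assumed record format (the field names are not the MT-Bench-101 data schema).

```python
# Hypothetical per-task aggregation of judge scores.
from collections import defaultdict

def per_task_scores(judged_turns: list[dict]) -> dict[str, float]:
    """judged_turns: [{'task': str, 'score': float}, ...] -> mean score per task."""
    totals: dict[str, list[float]] = defaultdict(list)
    for turn in judged_turns:
        totals[turn["task"]].append(turn["score"])
    return {task: sum(scores) / len(scores) for task, scores in totals.items()}

# Example with two tasks:
print(per_task_scores([
    {"task": "context memory", "score": 8.0},
    {"task": "context memory", "score": 6.0},
    {"task": "self-correction", "score": 9.0},
]))  # {'context memory': 7.0, 'self-correction': 9.0}
```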


Iterative Length-Regularized Direct Preference Optimization: A Case Study on Improving 7B Language Models to GPT-4 Level

arXiv.org Artificial Intelligence

Direct Preference Optimization (DPO), a standard method for aligning language models with human preferences, is traditionally applied to offline preferences. Recent studies show that DPO benefits from iterative training with online preferences labeled by a trained reward model. In this work, we identify a pitfall of vanilla iterative DPO: improved response quality can lead to increased verbosity. To address this, we introduce iterative length-regularized DPO (iLR-DPO) to penalize response length. Our empirical results show that iLR-DPO can enhance a 7B model to perform on par with GPT-4 without increasing verbosity. Specifically, our 7B model achieves a 50.5% length-controlled win rate against GPT-4 Preview on AlpacaEval 2.0, and excels across standard benchmarks including MT-Bench, Arena-Hard, and the OpenLLM Leaderboard. These results demonstrate the effectiveness of iterative DPO in aligning language models with human feedback.
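A length-regularized DPO-style loss can be sketched as the standard preference margin minus a penalty proportional to the chosen-minus-rejected length difference. The exact regularizer used by iLR-DPO may differ; the alpha coefficient below is an assumed knob for illustration.

```python
# Illustrative length-regularized DPO-style loss (the exact iLR-DPO form may differ).
import torch
import torch.nn.functional as F

def length_regularized_dpo_loss(logratio_chosen: torch.Tensor,    # (batch,) policy/ref log-ratio, chosen
                                logratio_rejected: torch.Tensor,  # (batch,) policy/ref log-ratio, rejected
                                len_chosen: torch.Tensor,         # (batch,) token counts
                                len_rejected: torch.Tensor,
                                beta: float = 0.1,
                                alpha: float = 0.01) -> torch.Tensor:
    margin = beta * (logratio_chosen - logratio_rejected)
    length_penalty = alpha * (len_chosen - len_rejected).float()  # discourages verbose wins
    return -F.logsigmoid(margin - length_penalty).mean()
```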


ConceptMath: A Bilingual Concept-wise Benchmark for Measuring Mathematical Reasoning of Large Language Models

arXiv.org Artificial Intelligence

This paper introduces ConceptMath, a bilingual (English and Chinese), fine-grained benchmark that evaluates concept-wise mathematical reasoning of Large Language Models (LLMs). Unlike traditional benchmarks that evaluate general mathematical reasoning with an average accuracy, ConceptMath systematically organizes math problems under a hierarchy of math concepts, so that mathematical reasoning can be evaluated at different granularities with concept-wise accuracies. Based on ConceptMath, we evaluate a broad range of LLMs and observe that existing LLMs, though achieving high average accuracies on traditional benchmarks, exhibit significant performance variations across different math concepts and may even fail catastrophically on the most basic ones. In addition, we introduce an efficient fine-tuning strategy to address the weaknesses of existing LLMs. Finally, we hope ConceptMath can guide developers to understand the fine-grained mathematical abilities of their models and facilitate the growth of foundation models.
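The sketch below shows one way concept-wise accuracy over a hierarchy can be tabulated: each problem carries a concept path, and every ancestor node on the path accumulates the result, so accuracy can be read off at any granularity. The path format is an assumption for illustration, not ConceptMath's actual annotation scheme.

```python
# Hypothetical concept-wise accuracy over a concept hierarchy.
from collections import defaultdict

def concept_accuracies(results: list[tuple[str, bool]]) -> dict[str, float]:
    """results: [(concept_path, is_correct), ...] -> accuracy per concept node."""
    counts: dict[str, list[int]] = defaultdict(lambda: [0, 0])  # node -> [correct, total]
    for path, correct in results:
        parts = path.split("/")
        # Credit every ancestor node on the path, so both coarse and fine
        # granularities can be read off the same table.
        for depth in range(1, len(parts) + 1):
            node = "/".join(parts[:depth])
            counts[node][0] += int(correct)
            counts[node][1] += 1
    return {node: c / t for node, (c, t) in counts.items()}

# Example: one algebra item wrong, one right; one geometry item right.
print(concept_accuracies([
    ("algebra/linear_equations", False),
    ("algebra/quadratics", True),
    ("geometry/triangles", True),
]))
```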