Chen, Weipeng
ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning
Chen, Mingyang, Li, Tianpeng, Sun, Haoze, Zhou, Yijie, Zhu, Chenzheng, Wang, Haofen, Pan, Jeff Z., Zhang, Wen, Chen, Huajun, Yang, Fan, Zhou, Zenan, Chen, Weipeng
Large Language Models (LLMs) have shown remarkable capabilities in reasoning, exemplified by the success of OpenAI-o1 and DeepSeek-R1. However, integrating reasoning with external search processes remains challenging, especially for complex multi-hop questions requiring multiple retrieval steps. We propose ReSearch, a novel framework that trains LLMs to Reason with Search via reinforcement learning without using any supervised data on reasoning steps. Our approach treats search operations as integral components of the reasoning chain: text-based thinking guides when and how to perform searches, and search results in turn influence further reasoning. We train ReSearch on Qwen2.5-7B(-Instruct) and Qwen2.5-32B(-Instruct) models and conduct extensive experiments. Despite being trained on only one dataset, our models demonstrate strong generalizability across various benchmarks. Analysis reveals that ReSearch naturally elicits advanced reasoning capabilities such as reflection and self-correction during the reinforcement learning process.
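To make the interleaving of thinking and search described in the ReSearch abstract concrete, here is a minimal sketch of an inference-time loop. The tag names `<search>` and `<result>`, the function `reason_with_search`, and the stand-ins `toy_generate` and `toy_search` are illustrative assumptions, not the paper's actual format or API. The reinforcement learning training itself is not shown.

```python
import re
from typing import Callable

# Hypothetical tag format; the abstract does not specify one.
SEARCH_RE = re.compile(r"<search>(.*?)</search>", re.DOTALL)


def reason_with_search(question: str,
                       generate: Callable[[str], str],
                       search: Callable[[str], str],
                       max_steps: int = 4) -> str:
    """Alternate generation and retrieval until a step requests no search."""
    context = question
    for _ in range(max_steps):
        step = generate(context)           # text-based thinking
        context += "\n" + step
        match = SEARCH_RE.search(step)
        if match is None:                  # no query issued: treat as final
            break
        evidence = search(match.group(1).strip())
        # Retrieved text is appended so it can shape the next step.
        context += "\n<result>" + evidence + "</result>"
    return context


def toy_generate(context: str) -> str:
    """Stand-in for an LLM: searches once, then answers."""
    if "<result>" not in context:
        return "I should look this up. <search>capital of France</search>"
    return "The retrieved text says Paris. Answer: Paris"


def toy_search(query: str) -> str:
    """Stand-in for a retriever backed by a one-entry corpus."""
    corpus = {"capital of France": "Paris is the capital of France."}
    return corpus.get(query, "no result")


trace = reason_with_search("What is the capital of France?",
                           toy_generate, toy_search)
print(trace)
```

In this sketch the model, not the caller, decides whether to search at each step, which is the property the abstract emphasizes.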
DualToken: Towards Unifying Visual Understanding and Generation with Dual Visual Vocabularies
Song, Wei, Wang, Yuran, Song, Zijia, Li, Yadong, Sun, Haoze, Chen, Weipeng, Zhou, Zenan, Xu, Jianhua, Wang, Jiaqi, Yu, Kaicheng
The differing representation spaces required for visual understanding and generation pose a challenge in unifying them within the autoregressive paradigm of large language models. A vision tokenizer trained for reconstruction excels at capturing low-level perceptual details, making it well-suited for visual generation but lacking high-level semantic representations for understanding tasks. Conversely, a vision encoder trained via contrastive learning aligns well with language but struggles to decode back into the pixel space for generation tasks. To bridge this gap, we propose DualToken, a method that unifies representations for both understanding and generation within a single tokenizer. However, directly integrating reconstruction and semantic objectives in a single tokenizer creates conflicts, leading to degraded performance in both reconstruction quality and semantic performance. Instead of forcing a single codebook to handle both semantic and perceptual information, DualToken disentangles them by introducing separate codebooks for high and low-level features, effectively transforming their inherent conflict into a synergistic relationship. As a result, DualToken achieves state-of-the-art performance in both reconstruction and semantic tasks while demonstrating remarkable effectiveness in downstream MLLM understanding and generation tasks. Notably, we also show that DualToken, as a unified tokenizer, surpasses the naive combination of two distinct types of vision encoders, providing superior performance within a unified MLLM.
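The idea of separate codebooks for low-level and high-level features can be illustrated with plain nearest-neighbor vector quantization. This is only a sketch under assumptions: the names `nearest_code` and `dual_tokenize`, the toy 2-D features, and the toy codebooks are hypothetical, and how DualToken actually extracts features or learns its codebooks is not described in the abstract.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def nearest_code(vec: Vector, codebook: List[Vector]) -> int:
    """Return the index of the codebook entry closest to vec (squared L2)."""
    def sq_dist(code: Vector) -> float:
        return sum((a - b) ** 2 for a, b in zip(vec, code))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


def dual_tokenize(low_feats: List[Vector],
                  high_feats: List[Vector],
                  pixel_codebook: List[Vector],
                  semantic_codebook: List[Vector]) -> List[Tuple[int, int]]:
    """Quantize each position twice, once per codebook.

    Low-level (perceptual) features use the pixel codebook and high-level
    (semantic) features use the semantic codebook, so neither codebook
    has to represent both kinds of information.
    """
    return [
        (nearest_code(low, pixel_codebook), nearest_code(high, semantic_codebook))
        for low, high in zip(low_feats, high_feats)
    ]


# Toy data (assumed for illustration only).
pixel_codebook = [(0.0, 0.0), (1.0, 1.0)]
semantic_codebook = [(0.0, 1.0), (1.0, 0.0), (0.5, 0.5)]
low_feats = [(0.1, 0.2), (0.9, 0.8)]
high_feats = [(0.9, 0.1), (0.4, 0.6)]

tokens = dual_tokenize(low_feats, high_feats, pixel_codebook, semantic_codebook)
print(tokens)  # [(0, 1), (1, 2)]
```

Each position yields a pair of token ids, one per vocabulary, which is one simple way to read "dual visual vocabularies".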
Baichuan-Audio: A Unified Framework for End-to-End Speech Interaction
Li, Tianpeng, Liu, Jun, Zhang, Tao, Fang, Yuanbo, Pan, Da, Wang, Mingrui, Liang, Zheng, Li, Zehuan, Lin, Mingan, Dong, Guosheng, Xu, Jianhua, Sun, Haoze, Zhou, Zenan, Chen, Weipeng
We introduce Baichuan-Audio, an end-to-end audio large language model that seamlessly integrates audio understanding and generation. It features a text-guided aligned speech generation mechanism, enabling real-time speech interaction with both comprehension and generation capabilities. Baichuan-Audio leverages a pre-trained ASR model, followed by multi-codebook discretization of speech at a frame rate of 12.5 Hz. This multi-codebook setup ensures that speech tokens retain both semantic and acoustic information. To further enhance modeling, an independent audio head is employed to process audio tokens, effectively capturing their unique characteristics. To mitigate the loss of intelligence during pre-training and preserve the original capabilities of the LLM, we propose a two-stage pre-training strategy that maintains language understanding while enhancing audio modeling. Following alignment, the model excels in real-time speech-based conversation and exhibits outstanding question-answering capabilities, demonstrating its versatility and efficiency. Our code, model and training data are available at https://github.com/baichuan-inc/Baichuan-Audio
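A quick piece of arithmetic shows what a 12.5 Hz frame rate implies for sequence length. Only the frame rate comes from the abstract. The number of codebooks is not stated there, so `num_codebooks` is a caller-supplied assumption, as is the convention of one token per codebook per frame.

```python
import math

FRAME_RATE_HZ = 12.5  # stated in the abstract


def frame_count(duration_s: float) -> int:
    """Number of speech frames needed to cover duration_s seconds."""
    return math.ceil(duration_s * FRAME_RATE_HZ)


def token_count(duration_s: float, num_codebooks: int) -> int:
    """Total discrete tokens, assuming one token per codebook per frame."""
    return frame_count(duration_s) * num_codebooks


frame_ms = 1000.0 / FRAME_RATE_HZ   # duration of one frame in milliseconds
print(frame_ms)                     # 80.0
print(frame_count(10.0))            # 125
print(token_count(10.0, 4))         # 500, with a hypothetical 4 codebooks
```

So each frame covers 80 ms of audio, and ten seconds of speech become 125 frames regardless of how many codebooks are used.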
Baichuan-M1: Pushing the Medical Capability of Large Language Models
Wang, Bingning, Zhao, Haizhou, Zhou, Huozhi, Song, Liang, Xu, Mingyu, Cheng, Wei, Zeng, Xiangrong, Zhang, Yupeng, Huo, Yuqi, Wang, Zecheng, Zhao, Zhengyun, Pan, Da, Yang, Fan, Kou, Fei, Li, Fei, Chen, Fuzhong, Dong, Guosheng, Liu, Han, Zhang, Hongda, He, Jin, Yang, Jinjie, Wu, Kangxi, Wu, Kegeng, Su, Lei, Niu, Linlin, Sun, Linzhuang, Wang, Mang, Fan, Pengcheng, Shen, Qianli, Xin, Rihui, Dang, Shunya, Zhou, Songchi, Chen, Weipeng, Luo, Wenjing, Chen, Xin, Men, Xin, Lin, Xionghai, Dong, Xuezhen, Zhang, Yan, Duan, Yifei, Zhou, Yuyan, Ma, Zhi, Wu, Zhiying
The current generation of large language models (LLMs) is typically designed for broad, general-purpose applications, while domain-specific LLMs, especially in vertical fields like medicine, remain relatively scarce. In particular, the development of highly efficient and practical LLMs for the medical domain is challenging due to the complexity of medical knowledge and the limited availability of high-quality data. To bridge this gap, we introduce Baichuan-M1, a series of large language models specifically optimized for medical applications. Unlike traditional approaches that simply continue pretraining on existing models or apply post-training to a general base model, Baichuan-M1 is trained from scratch with a dedicated focus on enhancing medical capabilities. Our model is trained on 20 trillion tokens and incorporates a range of effective training methods that strike a balance between general capabilities and medical expertise. As a result, Baichuan-M1 not only performs strongly across general domains such as mathematics and coding but also excels in specialized medical fields. We have open-sourced Baichuan-M1-14B, a mini version of our model, which can be accessed through the following links.
LongReD: Mitigating Short-Text Degradation of Long-Context Large Language Models via Restoration Distillation
Dong, Zican, Li, Junyi, Jiang, Jinhao, Xu, Mingyu, Zhao, Wayne Xin, Wang, Bingning, Chen, Weipeng
Large language models (LLMs) have gained extended context windows through scaling positional encodings and lightweight continual pre-training. However, this often leads to degraded performance on short-text tasks, and the reasons for this degradation remain insufficiently explored. In this work, we identify two primary factors contributing to this issue: distribution drift in hidden states and attention scores, and catastrophic forgetting during continual pre-training. To address these challenges, we propose Long Context Pre-training with Restoration Distillation (LongReD), a novel approach designed to mitigate short-text performance degradation by minimizing the distribution discrepancy between the extended and original models. Besides training on long texts, LongReD distills the hidden states of selected layers from the original model on short texts. LongReD also introduces a short-to-long distillation, aligning the output distribution on short texts with that on long texts by leveraging skipped positional indices. Experiments on common text benchmarks demonstrate that LongReD effectively preserves the model's short-text performance while maintaining comparable or even better capacity to handle long texts than baselines.
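The two distillation ingredients named in the LongReD abstract can be sketched as follows. Both functions are illustrative assumptions: the abstract does not give the exact skipping scheme or the distance used for hidden states, so `skipped_positions` uses one simple head/tail split and `hidden_state_mse` uses mean squared error as a stand-in.

```python
from typing import List, Sequence


def skipped_positions(short_len: int, target_len: int, head: int) -> List[int]:
    """Assign a short text position ids that span a longer window.

    The first `head` tokens keep positions 0..head-1; the remaining tokens
    are shifted so the last one sits at target_len - 1. The gap in between
    is the "skip". This particular split is an assumed scheme.
    """
    if not 0 <= head <= short_len <= target_len:
        raise ValueError("need 0 <= head <= short_len <= target_len")
    tail = short_len - head
    return list(range(head)) + list(range(target_len - tail, target_len))


def hidden_state_mse(student: Sequence[Sequence[float]],
                     teacher: Sequence[Sequence[float]]) -> float:
    """Mean squared error between hidden states of the selected layers.

    Each argument holds one flat vector per selected layer; the extended
    model plays the student and the original model the teacher.
    """
    total, count = 0.0, 0
    for s_layer, t_layer in zip(student, teacher):
        for s, t in zip(s_layer, t_layer):
            total += (s - t) ** 2
            count += 1
    if count == 0:
        raise ValueError("no hidden states to compare")
    return total / count


print(skipped_positions(6, 100, 3))                # [0, 1, 2, 97, 98, 99]
print(hidden_state_mse([[1.0, 2.0]], [[1.0, 4.0]]))  # 2.0
```

With skipped indices, a six-token input exercises position ids near the end of a 100-position window without the cost of a 100-token sequence.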
Baichuan-Omni-1.5 Technical Report
Li, Yadong, Liu, Jun, Zhang, Tao, Zhang, Tao, Chen, Song, Li, Tianpeng, Li, Zehuan, Liu, Lijun, Ming, Lingfeng, Dong, Guosheng, Pan, Da, Li, Chong, Fang, Yuanbo, Kuang, Dongdong, Wang, Mingrui, Zhu, Chenglin, Zhang, Youwei, Guo, Hongyu, Zhang, Fengyu, Wang, Yuran, Ding, Bowen, Song, Wei, Li, Xu, Huo, Yuqi, Liang, Zheng, Zhang, Shusen, Wu, Xin, Zhao, Shuai, Xiong, Linchu, Wu, Yozhen, Ye, Jiahui, Lu, Wenhao, Li, Bowen, Zhang, Yan, Zhou, Yaqi, Chen, Xin, Su, Lei, Zhang, Hongda, Chen, Fuzhong, Dong, Xuezhen, Nie, Na, Wu, Zhiying, Xiao, Bin, Li, Ting, Dang, Shunya, Zhang, Ping, Sun, Yijia, Wu, Jincheng, Yang, Jinjie, Lin, Xionghai, Ma, Zhi, Wu, Kegeng, Li, Jia, Yang, Aiyuan, Liu, Hui, Zhang, Jianqiang, Chen, Xiaoxi, Ai, Guangwei, Zhang, Wentao, Chen, Yicong, Huang, Xiaoqin, Li, Kun, Luo, Wenjing, Duan, Yifei, Zhu, Lingling, Xiao, Ran, Su, Zhe, Pu, Jiani, Wang, Dian, Jia, Xu, Zhang, Tianyu, Ai, Mengyu, Wang, Mang, Qiao, Yujing, Zhang, Lei, Shen, Yanjun, Yang, Fan, Zhen, Miao, Zhou, Yijie, Chen, Mingyang, Li, Fei, Zhu, Chenzheng, Lu, Keer, Zhao, Yaqi, Liang, Hao, Li, Youquan, Qin, Yanzhao, Sun, Linzhuang, Xu, Jianhua, Sun, Haoze, Lin, Mingan, Zhou, Zenan, Chen, Weipeng
We introduce Baichuan-Omni-1.5, an omni-modal model that not only has omni-modal understanding capabilities but also provides end-to-end audio generation capabilities. To achieve fluent and high-quality interaction across modalities without compromising the capabilities of any modality, we prioritize optimizing three key aspects. First, we establish a comprehensive data cleaning and synthesis pipeline for multimodal data, obtaining about 500B high-quality data (text, audio, and vision). Second, we design an audio tokenizer (Baichuan-Audio-Tokenizer) that captures both semantic and acoustic information from audio, enabling seamless integration and enhanced compatibility with the MLLM. Lastly, we design a multi-stage training strategy that progressively integrates multimodal alignment and multitask fine-tuning, ensuring effective synergy across all modalities. Baichuan-Omni-1.5 leads contemporary models (including GPT4o-mini and MiniCPM-o 2.6) in terms of comprehensive omni-modal capabilities. Notably, it achieves results comparable to leading models such as Qwen2-VL-72B across various multimodal medical benchmarks.
Med-R$^2$: Crafting Trustworthy LLM Physicians through Retrieval and Reasoning of Evidence-Based Medicine
Lu, Keer, Liang, Zheng, Pan, Da, Zhang, Shusen, Wu, Xin, Chen, Weipeng, Zhou, Zenan, Dong, Guosheng, Cui, Bin, Zhang, Wentao
In recent years, Large Language Models (LLMs) have exhibited remarkable capabilities in clinical scenarios. However, despite their potential, existing works face challenges when applying LLMs to medical settings. Strategies relying on training with medical datasets are highly cost-intensive and may suffer from outdated training data. Leveraging external knowledge bases is a suitable alternative, yet it faces obstacles such as limited retrieval precision and poor effectiveness in answer extraction. These issues collectively prevent LLMs from demonstrating the expected level of proficiency in mastering medical expertise. To address these challenges, we introduce Med-R^2, a novel LLM physician framework that adheres to the Evidence-Based Medicine (EBM) process, efficiently integrating retrieval mechanisms as well as the selection and reasoning processes of evidence, thereby enhancing the problem-solving capabilities of LLMs in healthcare scenarios and fostering a trustworthy LLM physician. Our comprehensive experiments indicate that Med-R^2 achieves a 14.87% improvement over vanilla RAG methods and even a 3.59% enhancement compared to fine-tuning strategies, without incurring additional training costs.
Virgo: A Preliminary Exploration on Reproducing o1-like MLLM
Du, Yifan, Liu, Zikang, Li, Yifan, Zhao, Wayne Xin, Huo, Yuqi, Wang, Bingning, Chen, Weipeng, Liu, Zheng, Wang, Zhongyuan, Wen, Ji-Rong
Recently, slow-thinking reasoning systems, built upon large language models (LLMs), have garnered widespread attention by scaling the thinking time during inference. There is also growing interest in adapting this capability to multimodal large language models (MLLMs). Given that MLLMs handle more complex data semantics across different modalities, it is intuitively more challenging to implement multimodal slow-thinking systems. To address this issue, in this paper, we explore a straightforward approach by fine-tuning a capable MLLM with a small amount of textual long-form thought data, resulting in a multimodal slow-thinking system, Virgo (Visual reasoning with long thought). We find that these long-form reasoning processes, expressed in natural language, can be effectively transferred to MLLMs. Moreover, it seems that such textual reasoning data can be even more effective than visual reasoning data in eliciting the slow-thinking capacities of MLLMs. While this work is preliminary, it demonstrates that slow-thinking capacities are fundamentally associated with the language model component, which can be transferred across modalities or domains. This finding can be leveraged to guide the development of more powerful slow-thinking reasoning systems. We release our resources at https://github.com/RUCAIBox/Virgo.
Align Anything: Training All-Modality Models to Follow Instructions with Language Feedback
Ji, Jiaming, Zhou, Jiayi, Lou, Hantao, Chen, Boyuan, Hong, Donghai, Wang, Xuyao, Chen, Wenqi, Wang, Kaile, Pan, Rui, Li, Jiahao, Wang, Mohan, Dai, Josef, Qiu, Tianyi, Xu, Hua, Li, Dong, Chen, Weipeng, Song, Jun, Zheng, Bo, Yang, Yaodong
Reinforcement learning from human feedback (RLHF) has proven effective in enhancing the instruction-following capabilities of large language models; however, it remains underexplored in the cross-modality domain. As the number of modalities increases, aligning all-modality models with human intentions -- such as instruction following -- becomes a pressing challenge. In this work, we make the first attempt to fine-tune all-modality models (i.e., input and output with any modality, also named any-to-any models) using human preference data across all modalities (including text, image, audio, and video), ensuring their behavior aligns with human intentions. This endeavor presents several challenges. First, there is no large-scale all-modality human preference data in existing open-source resources, as most datasets are limited to specific modalities, predominantly text and image. Second, the effectiveness of binary preferences in RLHF for post-training alignment in complex all-modality scenarios remains an unexplored area. Finally, there is a lack of a systematic framework to evaluate the capabilities of all-modality models, particularly regarding modality selection and synergy. To address these challenges, we propose the align-anything framework, which includes meticulously annotated 200k all-modality human preference data. Then, we introduce an alignment method that learns from unified language feedback, effectively capturing complex modality-specific human preferences and enhancing the model's instruction-following capabilities. Furthermore, to assess performance improvements in all-modality models after post-training alignment, we construct a challenging all-modality capability evaluation framework -- eval-anything. All data, models, and code frameworks have been open-sourced for the community. For more details, please refer to https://github.com/PKU-Alignment/align-anything.
Baichuan-Omni Technical Report
Li, Yadong, Sun, Haoze, Lin, Mingan, Li, Tianpeng, Dong, Guosheng, Zhang, Tao, Ding, Bowen, Song, Wei, Cheng, Zhenglin, Huo, Yuqi, Chen, Song, Li, Xu, Pan, Da, Zhang, Shusen, Wu, Xin, Liang, Zheng, Liu, Jun, Zhang, Tao, Lu, Keer, Zhao, Yaqi, Shen, Yanjun, Yang, Fan, Yu, Kaicheng, Lin, Tao, Xu, Jianhua, Zhou, Zenan, Chen, Weipeng
The salient multimodal capabilities and interactive experience of GPT-4o highlight its critical role in practical applications, yet it lacks a high-performing open-source counterpart. In this paper, we introduce Baichuan-Omni, the first open-source 7B Multimodal Large Language Model (MLLM) adept at concurrently processing and analyzing modalities of image, video, audio, and text, while delivering an advanced multimodal interactive experience and strong performance. We propose an effective multimodal training schema starting with a 7B model and proceeding through two stages of multimodal alignment and multitask fine-tuning across the audio, image, video, and text modalities. This approach equips the language model with the ability to handle visual and audio data effectively. Demonstrating strong performance across various omni-modal and multimodal benchmarks, we aim for this contribution to serve as a competitive baseline for the open-source community in advancing multimodal understanding and real-time interaction.