Baichuan-Omni-1.5 Technical Report
Li, Yadong, Liu, Jun, Zhang, Tao, Zhang, Tao, Chen, Song, Li, Tianpeng, Li, Zehuan, Liu, Lijun, Ming, Lingfeng, Dong, Guosheng, Pan, Da, Li, Chong, Fang, Yuanbo, Kuang, Dongdong, Wang, Mingrui, Zhu, Chenglin, Zhang, Youwei, Guo, Hongyu, Zhang, Fengyu, Wang, Yuran, Ding, Bowen, Song, Wei, Li, Xu, Huo, Yuqi, Liang, Zheng, Zhang, Shusen, Wu, Xin, Zhao, Shuai, Xiong, Linchu, Wu, Yozhen, Ye, Jiahui, Lu, Wenhao, Li, Bowen, Zhang, Yan, Zhou, Yaqi, Chen, Xin, Su, Lei, Zhang, Hongda, Chen, Fuzhong, Dong, Xuezhen, Nie, Na, Wu, Zhiying, Xiao, Bin, Li, Ting, Dang, Shunya, Zhang, Ping, Sun, Yijia, Wu, Jincheng, Yang, Jinjie, Lin, Xionghai, Ma, Zhi, Wu, Kegeng, Li, Jia, Yang, Aiyuan, Liu, Hui, Zhang, Jianqiang, Chen, Xiaoxi, Ai, Guangwei, Zhang, Wentao, Chen, Yicong, Huang, Xiaoqin, Li, Kun, Luo, Wenjing, Duan, Yifei, Zhu, Lingling, Xiao, Ran, Su, Zhe, Pu, Jiani, Wang, Dian, Jia, Xu, Zhang, Tianyu, Ai, Mengyu, Wang, Mang, Qiao, Yujing, Zhang, Lei, Shen, Yanjun, Yang, Fan, Zhen, Miao, Zhou, Yijie, Chen, Mingyang, Li, Fei, Zhu, Chenzheng, Lu, Keer, Zhao, Yaqi, Liang, Hao, Li, Youquan, Qin, Yanzhao, Sun, Linzhuang, Xu, Jianhua, Sun, Haoze, Lin, Mingan, Zhou, Zenan, Chen, Weipeng
arXiv.org Artificial Intelligence
We introduce Baichuan-Omni-1.5, an omni-modal model that combines omni-modal understanding with end-to-end audio generation. To achieve fluent, high-quality interaction across modalities without compromising the capabilities of any single modality, we focus on three key aspects. First, we establish a comprehensive data cleaning and synthesis pipeline for multimodal data, obtaining about 500B of high-quality data (text, audio, and vision). Second, we design an audio tokenizer (Baichuan-Audio-Tokenizer) that captures both semantic and acoustic information from audio, enabling seamless integration and enhanced compatibility with the multimodal large language model (MLLM). Third, we design a multi-stage training strategy that progressively integrates multimodal alignment and multitask fine-tuning, ensuring effective synergy across all modalities. Baichuan-Omni-1.5 leads contemporary models (including GPT4o-mini and MiniCPM-o 2.6) in comprehensive omni-modal capabilities. Notably, it achieves results comparable to leading models such as Qwen2-VL-72B across various multimodal medical benchmarks.
Jan-25-2025
- Country:
  - Asia > Thailand (0.14)
  - Europe > Switzerland (0.14)
- Genre:
  - Research Report (0.63)
- Industry:
  - Education (0.93)
  - Health & Medicine
    - Diagnostic Medicine > Imaging (0.68)
    - Therapeutic Area (1.00)
- Technology:
  - Information Technology
    - Artificial Intelligence
      - Machine Learning > Neural Networks
        - Deep Learning (1.00)
      - Natural Language
        - Chatbot (1.00)
        - Large Language Model (1.00)
      - Representation & Reasoning (1.00)
      - Speech > Speech Recognition (0.93)
      - Vision (1.00)
    - Data Science (0.85)
    - Sensing and Signal Processing > Image Processing (0.93)