Liang, Yiming
MuPT: A Generative Symbolic Music Pretrained Transformer
Qu, Xingwei, Bai, Yuelin, Ma, Yinghao, Zhou, Ziya, Lo, Ka Man, Liu, Jiaheng, Yuan, Ruibin, Min, Lejun, Liu, Xueling, Zhang, Tianyu, Du, Xinrun, Guo, Shuyue, Liang, Yiming, Li, Yizhi, Wu, Shangda, Zhou, Junting, Zheng, Tianyu, Ma, Ziyang, Han, Fengze, Xue, Wei, Xia, Gus, Benetos, Emmanouil, Yue, Xiang, Lin, Chenghua, Tan, Xu, Huang, Stephen W., Chen, Wenhu, Fu, Jie, Zhang, Ge
In this paper, we explore the application of Large Language Models (LLMs) to the pre-training of music. While the prevalent use of MIDI in music modeling is well-established, our findings suggest that LLMs are inherently more compatible with ABC Notation, which aligns more closely with their design and strengths, thereby enhancing the model's performance in musical composition. To address the challenges associated with misaligned measures from different tracks during generation, we propose the development of a Synchronized Multi-Track ABC Notation (SMT-ABC Notation), which aims to preserve coherence across multiple musical tracks. Our contributions include a series of models capable of handling up to 8192 tokens, covering 90% of the symbolic music data in our training set. Furthermore, we explore the implications of the Symbolic Music Scaling Law (SMS Law) on model performance. The results indicate a promising direction for future research in music generation, offering extensive resources for community-led research through our open-source contributions.
COIG-CQIA: Quality is All You Need for Chinese Instruction Fine-tuning
Bai, Yuelin, Du, Xinrun, Liang, Yiming, Jin, Yonggang, Liu, Ziqiang, Zhou, Junting, Zheng, Tianyu, Zhang, Xincheng, Ma, Nuo, Wang, Zekun, Yuan, Ruibin, Wu, Haihong, Lin, Hongquan, Huang, Wenhao, Zhang, Jiajun, Chen, Wenhu, Lin, Chenghua, Fu, Jie, Yang, Min, Ni, Shiwen, Zhang, Ge
Recently, there have been significant advancements in large language models (LLMs), particularly focused on the English language. These advancements have enabled these LLMs to understand and execute complex instructions with unprecedented accuracy and fluency. However, despite these advancements, there remains a noticeable gap in the development of Chinese instruction tuning. The unique linguistic features and cultural depth of the Chinese language pose challenges for instruction tuning tasks. Existing datasets are either derived from English-centric LLMs or are ill-suited for aligning with the interaction patterns of real-world Chinese users. To bridge this gap, we introduce COIG-CQIA, a high-quality Chinese instruction tuning dataset. Our aim is to build a diverse, wide-ranging instruction-tuning dataset to better align model behavior with human interactions. To this end, we collect a high-quality human-written corpus from various sources on the Chinese Internet, including Q&A communities, Wikis, examinations, and existing NLP datasets. This corpus was rigorously filtered and carefully processed to form the COIG-CQIA dataset. Furthermore, we train models of various scales on different subsets of CQIA, following in-depth evaluation and analyses. The findings from our experiments offer valuable insights for selecting and developing Chinese instruction-tuning datasets. We also find that models trained on CQIA-Subset achieve competitive results in human assessment as well as knowledge and security benchmarks. Data are available at https://huggingface.co/datasets/m-a-p/COIG-CQIA
ChatMusician: Understanding and Generating Music Intrinsically with LLM
Yuan, Ruibin, Lin, Hanfeng, Wang, Yi, Tian, Zeyue, Wu, Shangda, Shen, Tianhao, Zhang, Ge, Wu, Yuhang, Liu, Cong, Zhou, Ziya, Ma, Ziyang, Xue, Liumeng, Wang, Ziyu, Liu, Qin, Zheng, Tianyu, Li, Yizhi, Ma, Yinghao, Liang, Yiming, Chi, Xiaowei, Liu, Ruibo, Wang, Zili, Li, Pengfei, Wu, Jingcheng, Lin, Chenghua, Liu, Qifeng, Jiang, Tao, Huang, Wenhao, Chen, Wenhu, Benetos, Emmanouil, Fu, Jie, Xia, Gus, Dannenberg, Roger, Xue, Wei, Kang, Shiyin, Guo, Yike
While Large Language Models (LLMs) demonstrate impressive capabilities in text generation, we find that their ability has yet to be generalized to music, humanity's creative language. We introduce ChatMusician, an open-source LLM that integrates intrinsic musical abilities. It is based on continual pre-training and finetuning LLaMA2 on a text-compatible music representation, ABC notation, and the music is treated as a second language. ChatMusician can understand and generate music with a pure text tokenizer without any external multi-modal neural structures or tokenizers. Interestingly, endowing musical abilities does not harm language abilities, even achieving a slightly higher MMLU score. Our model is capable of composing well-structured, full-length music, conditioned on texts, chords, melodies, motifs, musical forms, etc, surpassing GPT-4 baseline. On our meticulously curated college-level music understanding benchmark, MusicTheoryBench, ChatMusician surpasses LLaMA2 and GPT-3.5 on zero-shot setting by a noticeable margin. Our work reveals that LLMs can be an excellent compressor for music, but there remains significant territory to be conquered. We release our 4B token music-language corpora MusicPile, the collected MusicTheoryBench, code, model and demo in GitHub.
SciMMIR: Benchmarking Scientific Multi-modal Information Retrieval
Wu, Siwei, Li, Yizhi, Zhu, Kang, Zhang, Ge, Liang, Yiming, Ma, Kaijing, Xiao, Chenghao, Zhang, Haoran, Yang, Bohao, Chen, Wenhu, Huang, Wenhao, Moubayed, Noura Al, Fu, Jie, Lin, Chenghua
Multi-modal information retrieval (MMIR) is a rapidly evolving field, where significant progress, particularly in image-text pairing, has been made through advanced representation learning and cross-modality alignment research. However, current benchmarks for evaluating MMIR performance in image-text pairing within the scientific domain show a notable gap, where chart and table images described in scholarly language usually do not play a significant role. To bridge this gap, we develop a specialised scientific MMIR (SciMMIR) benchmark by leveraging open-access paper collections to extract data relevant to the scientific domain. This benchmark comprises 530K meticulously curated image-text pairs, extracted from figures and tables with detailed captions in scientific documents. We further annotate the image-text pairs with two-level subset-subcategory hierarchy annotations to facilitate a more comprehensive evaluation of the baselines. We conducted zero-shot and fine-tuning evaluations on prominent multi-modal image-captioning and visual language models, such as CLIP and BLIP. Our analysis offers critical insights for MMIR in the scientific domain, including the impact of pre-training and fine-tuning settings and the influence of the visual and textual encoders. All our data and checkpoints are publicly available at https://github.com/Wusiwei0410/SciMMIR.
CMMMU: A Chinese Massive Multi-discipline Multimodal Understanding Benchmark
Zhang, Ge, Du, Xinrun, Chen, Bei, Liang, Yiming, Luo, Tongxu, Zheng, Tianyu, Zhu, Kang, Cheng, Yuyang, Xu, Chunpu, Guo, Shuyue, Zhang, Haoran, Qu, Xingwei, Wang, Junjie, Yuan, Ruibin, Li, Yizhi, Wang, Zekun, Liu, Yudong, Tsai, Yu-Hsuan, Zhang, Fengji, Lin, Chenghua, Huang, Wenhao, Chen, Wenhu, Fu, Jie
As the capabilities of large multimodal models (LMMs) continue to advance, evaluating the performance of LMMs emerges as an increasing need. Additionally, there is an even larger gap in evaluating the advanced knowledge and reasoning abilities of LMMs in non-English contexts such as Chinese. We introduce CMMMU, a new Chinese Massive Multi-discipline Multimodal Understanding benchmark designed to evaluate LMMs on tasks demanding college-level subject knowledge and deliberate reasoning in a Chinese context. CMMMU is inspired by and strictly follows the annotation and analysis pattern of MMMU. CMMMU includes 12k manually collected multimodal questions from college exams, quizzes, and textbooks, covering six core disciplines: Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, and Tech & Engineering, like its companion, MMMU. These questions span 30 subjects and comprise 39 highly heterogeneous image types, such as charts, diagrams, maps, tables, music sheets, and chemical structures. CMMMU focuses on complex perception and reasoning with domain-specific knowledge in the Chinese context. We evaluate 11 open-source LLMs and one proprietary GPT-4V(ision). Even GPT-4V only achieves accuracies of 42%, indicating a large space for improvement. CMMMU will boost the community to build the next-generation LMMs towards expert artificial intelligence and promote the democratization of LMMs by providing diverse language contexts.
Towards Practical Autonomous Flight Simulation for Flapping Wing Biomimetic Robots with Experimental Validation
Qian, Chen, Fang, Yongchun, jia, Fan, Yan, Jifu, Liang, Yiming, Li, Tiefeng
Tried-and-true flapping wing robot simulation is essential in developing flapping wing mechanisms and algorithms. This paper presents a novel application-oriented flapping wing platform, highly compatible with various mechanical designs and adaptable to different robotic tasks. First, the blade element theory and the quasi-steady model are put forward to compute the flapping wing aerodynamics based on wing kinematics. Translational lift, translational drag, rotational lift, and added mass force are all considered in the computation. Then we use the proposed simulation platform to investigate the passive wing rotation and the wing-tail interaction phenomena of a particular flapping-wing robot. With the help of the simulation tool and a novel statistic based on dynamic differences from the averaged system, several behaviors display their essence by investigating the flapping wing robot dynamic characteristics. After that, the attitude tracking control problem and the positional trajectory tracking problem are both overcome by robust control techniques. Further comparison simulations reveal that the proposed control algorithms compared with other existing ones show apparent superiority. What is more, with the same control algorithm and parameters tuned in simulation, we conduct real flight experiments on a self-made flapping wing robot, and obtain similar results from the proposed simulation platform. In contrast to existing simulation tools, the proposed one is compatible with most existing flapping wing robots, and can inherently drill into each subtle behavior in corresponding applications by observing aerodynamic forces and torques on each blade element.