Zhang, Xinyue
YuE: Scaling Open Foundation Models for Long-Form Music Generation
Yuan, Ruibin, Lin, Hanfeng, Guo, Shuyue, Zhang, Ge, Pan, Jiahao, Zang, Yongyi, Liu, Haohe, Liang, Yiming, Ma, Wenye, Du, Xingjian, Du, Xinrun, Ye, Zhen, Zheng, Tianyu, Ma, Yinghao, Liu, Minghao, Tian, Zeyue, Zhou, Ziya, Xue, Liumeng, Qu, Xingwei, Li, Yizhi, Wu, Shangda, Shen, Tianhao, Ma, Ziyang, Zhan, Jun, Wang, Chunhui, Wang, Yatian, Chi, Xiaowei, Zhang, Xinyue, Yang, Zhenzhu, Wang, Xiangzhou, Liu, Shansong, Mei, Lingrui, Li, Peng, Wang, Junjie, Yu, Jianwei, Pang, Guojian, Li, Xu, Wang, Zihao, Zhou, Xiaohuan, Yu, Lijun, Benetos, Emmanouil, Chen, Yong, Lin, Chenghua, Chen, Xie, Xia, Gus, Zhang, Zhaoxiang, Zhang, Chao, Chen, Wenhu, Zhou, Xinyu, Qiu, Xipeng, Dannenberg, Roger, Liu, Jiaheng, Yang, Jian, Huang, Wenhao, Xue, Wei, Tan, Xu, Guo, Yike
We tackle the task of long-form music generation--particularly the challenging lyrics-to-song problem--by introducing YuE, a family of open foundation models based on the LLaMA2 architecture. Specifically, YuE scales to trillions of tokens and generates up to five minutes of music while maintaining lyrical alignment, coherent musical structure, and engaging vocal melodies with appropriate accompaniment. It achieves this through (1) track-decoupled next-token prediction to overcome dense mixture signals, (2) structural progressive conditioning for long-context lyrical alignment, and (3) a multitask, multiphase pre-training recipe that aids convergence and generalization. In addition, we redesign the in-context learning technique for music generation, enabling versatile style transfer (e.g., converting Japanese city pop into an English rap while preserving the original accompaniment) and bidirectional generation. Through extensive evaluation, we demonstrate that YuE matches or even surpasses some proprietary systems in musicality and vocal agility. Moreover, fine-tuning YuE enables additional controls and enhanced support for tail languages. Beyond generation, we show that YuE's learned representations perform well on music understanding tasks, matching or exceeding state-of-the-art methods on the MARBLE benchmark. Keywords: lyrics2song, song generation, long-form, foundation model, music generation
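To make the track-decoupled idea concrete, here is a minimal sketch (illustrative names only, not the authors' implementation): per-frame vocal and accompaniment codec tokens are interleaved into one stream so that an ordinary next-token objective applies without ever predicting tokens of the dense mixed signal directly.

```python
# Minimal sketch (illustrative only, not the YuE implementation) of
# track-decoupled next-token prediction: vocal and accompaniment codec tokens
# are interleaved frame by frame into a single autoregressive stream.
from typing import Iterator, List, Tuple

def interleave_tracks(vocal: List[int], accomp: List[int]) -> List[int]:
    """Interleave two frame-aligned token tracks as v0, a0, v1, a1, ..."""
    assert len(vocal) == len(accomp), "tracks must be frame-aligned"
    stream: List[int] = []
    for v, a in zip(vocal, accomp):
        stream.extend((v, a))
    return stream

def next_token_pairs(stream: List[int]) -> Iterator[Tuple[List[int], int]]:
    """Yield (context, target) pairs for ordinary next-token prediction."""
    for t in range(1, len(stream)):
        yield stream[:t], stream[t]

# Toy usage with hypothetical codec token ids for four frames per track.
vocal_ids, accomp_ids = [11, 12, 13, 14], [51, 52, 53, 54]
for context, target in next_token_pairs(interleave_tracks(vocal_ids, accomp_ids)):
    pass  # a language model would be trained to predict `target` from `context`
```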
Semantic Web and Creative AI -- A Technical Report from ISWS 2023
Ahmad, Raia Abu, Alharbi, Reham, Barile, Roberto, Böckling, Martin, Bolanos, Francisco, Bonfitto, Sara, Bruns, Oleksandra, Celino, Irene, Chudasama, Yashrajsinh, Critelli, Martin, d'Amato, Claudia, D'Ippolito, Giada, Dasoulas, Ioannis, De Giorgis, Stefano, De Leo, Vincenzo, Di Bonaventura, Chiara, Di Panfilo, Marco, Dobriy, Daniil, Domingue, John, Duan, Xuemin, Dumontier, Michel, Efeoglu, Sefika, Eschauzier, Ruben, Ginwa, Fakih, Ferranti, Nicolas, Graciotti, Arianna, Hanisch, Philipp, Hannah, George, Heidari, Golsa, Hogan, Aidan, Hussein, Hassan, Jouglar, Alexane, Kalo, Jan-Christoph, Kieffer, Manoé, Klironomos, Antonis, Koch, Inês, Lajewska, Weronika, Lazzari, Nicolas, Lindekrans, Mikael, Lippolis, Anna Sofia, Llugiqi, Majlinda, Mancini, Eleonora, Marzi, Eleonora, Menotti, Laura, Flores, Daniela Milon, Nagowah, Soulakshmee, Neubert, Kerstin, Niazmand, Emetis, Norouzi, Ebrahim, Martinez, Beatriz Olarte, Oudshoorn, Anouk Michelle, Poltronieri, Andrea, Presutti, Valentina, Purohit, Disha, Raoufi, Ensiyeh, Ringwald, Celian, Rockstroh, Johanna, Rudolph, Sebastian, Sack, Harald, Saeed, Zafar, Saeedizade, Mohammad Javad, Sahbi, Aya, Santini, Cristian, Simic, Aleksandra, Sommer, Dennis, Sousa, Rita, Tan, Mary Ann, Tarikere, Vidyashree, Tietz, Tabea, Tirpitz, Liam, Tomasino, Arnaldo, van Harmelen, Frank, Vissoci, Joao, Woods, Caitlin, Zhang, Bohui, Zhang, Xinyue, Zheng, Heng
The International Semantic Web Research School (ISWS) is a week-long intensive program designed to immerse participants in the field. This document reports the collaborative effort of ten teams of students attending ISWS 2023, each guided by a senior researcher as their mentor. Each team brought a different perspective to the topic of creative AI, substantiated by a set of research questions as the main subject of their investigation. The 2023 edition of ISWS focused on the intersection of Semantic Web technologies and Creative AI, exploring various points where the two fields meet. A key area of focus was the potential of LLMs as support tools for knowledge engineering. Participants also delved into the multifaceted applications of LLMs, including legal aspects of creative content production, humans in the loop, decentralised approaches to multimodal generative AI models, nanopublications and AI for personal scientific knowledge graphs, commonsense knowledge in automatic story and narrative completion, generative AI for art critique, prompt engineering, automatic music composition, commonsense prototyping and conceptual blending, and elicitation of tacit knowledge. As Large Language Models and semantic technologies continue to evolve, exciting new prospects are emerging: a future where the boundaries between creative expression and factual knowledge become increasingly permeable, leading to a world of knowledge that is both informative and inspiring.
A Machine Learning Approach for Emergency Detection in Medical Scenarios Using Large Language Models
Akaybicen, Ferit, Cummings, Aaron, Iwuagwu, Lota, Zhang, Xinyue, Adewuyi, Modupe
The rapid identification of medical emergencies through digital communication channels remains a critical challenge in modern healthcare delivery, particularly with the increasing prevalence of telemedicine. This paper presents a novel approach leveraging large language models (LLMs) and prompt engineering techniques for automated emergency detection in medical communications. We developed and evaluated a comprehensive system using multiple LLaMA model variants (1B, 3B, and 7B parameters) to classify medical scenarios as emergency or non-emergency situations. Our methodology incorporated both system prompts and in-prompt training approaches, evaluated across different hardware configurations. The results demonstrate exceptional performance, with the LLaMA 2 (7B) model achieving 99.7% accuracy and the LLaMA 3.2 (3B) model reaching 99.6% accuracy with optimal prompt engineering. Through systematic testing of training examples within the prompts, we identified that including 10 example scenarios in the model prompts yielded optimal classification performance. Processing speeds varied significantly between platforms, ranging from 0.05 to 2.2 seconds per request. The system showed particular strength in minimizing high-risk false negatives in emergency scenarios, which is crucial for patient safety. The code implementation and evaluation framework are publicly available on GitHub, facilitating further research and development in this crucial area of healthcare technology.
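As a concrete illustration of the in-prompt (few-shot) classification setup described above, the following hedged sketch builds a prompt from a handful of labeled scenarios and parses the model's label. The prompt wording, example scenarios, and the `call_llm` hook are placeholders; the paper's exact prompts and LLaMA inference stack may differ.

```python
# Hedged sketch of few-shot emergency classification via prompting.
# `call_llm` stands in for whatever text-completion backend is used.
from typing import Callable, List, Tuple

def build_prompt(examples: List[Tuple[str, str]], message: str) -> str:
    lines = [
        "You are a medical triage assistant.",
        "Classify the final message as EMERGENCY or NON-EMERGENCY.",
        "",
    ]
    for text, label in examples:  # e.g., the ~10 labeled scenarios found optimal
        lines.append(f"Message: {text}\nLabel: {label}\n")
    lines.append(f"Message: {message}\nLabel:")
    return "\n".join(lines)

def classify(message: str,
             examples: List[Tuple[str, str]],
             call_llm: Callable[[str], str]) -> str:
    """Send the few-shot prompt to the backend and parse the predicted label."""
    reply = call_llm(build_prompt(examples, message)).strip().upper()
    # Defaulting ambiguous replies to EMERGENCY keeps false negatives low.
    return "NON-EMERGENCY" if reply.startswith("NON") else "EMERGENCY"
```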
Entire-Space Variational Information Exploitation for Post-Click Conversion Rate Prediction
Fei, Ke, Zhang, Xinyue, Li, Jingjing
In recommender systems, post-click conversion rate (CVR) estimation is an essential task for modeling user preferences for items and estimating the value of recommendations. Sample selection bias (SSB) and data sparsity (DS) are two persistent challenges for CVR estimation. Currently, entire-space approaches that exploit unclicked samples through knowledge distillation are promising for mitigating SSB and DS simultaneously. Existing methods use non-conversion, conversion, or adaptive conversion predictors to generate pseudo labels for unclicked samples. However, they fail to consider the unbiasedness and information limitations of these pseudo labels. Motivated by this analysis, we propose an entire-space variational information exploitation framework (EVI) for CVR prediction. First, EVI uses a conditional entire-space CVR teacher to generate unbiased pseudo labels. Then, it applies variational information exploitation and logit distillation to transfer non-click-space information to the target CVR estimator. We conduct extensive offline experiments on six large-scale datasets, where EVI achieves an average improvement of 2.25% over state-of-the-art baselines.
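The following is an assumption-laden simplification, not the paper's exact EVI objective: it sketches how a teacher's pseudo labels on unclicked impressions could be distilled into a student CVR estimator alongside the supervised loss on clicked impressions, which is the general pattern the abstract describes.

```python
# Illustrative sketch of distilling teacher pseudo labels on the non-click
# space into a student CVR model; the weighting and exact losses are assumptions.
import torch
import torch.nn.functional as F

def cvr_distillation_loss(student_logits: torch.Tensor,
                          teacher_logits: torch.Tensor,
                          labels: torch.Tensor,
                          clicked: torch.Tensor,
                          alpha: float = 0.5) -> torch.Tensor:
    """clicked == 1: supervised CVR loss; clicked == 0: distill teacher pseudo labels."""
    clicked = clicked.float()
    labels = labels.float()
    # Supervised loss on the click space, where true conversion labels exist.
    sup = F.binary_cross_entropy_with_logits(student_logits, labels, reduction="none")
    sup = (sup * clicked).sum() / clicked.sum().clamp(min=1.0)
    # Logit distillation on the non-click space, using teacher pseudo labels.
    distill = F.binary_cross_entropy_with_logits(
        student_logits, torch.sigmoid(teacher_logits), reduction="none")
    distill = (distill * (1.0 - clicked)).sum() / (1.0 - clicked).sum().clamp(min=1.0)
    return sup + alpha * distill
```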
InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions
Zhang, Pan, Dong, Xiaoyi, Cao, Yuhang, Zang, Yuhang, Qian, Rui, Wei, Xilin, Chen, Lin, Li, Yifei, Niu, Junbo, Ding, Shuangrui, Guo, Qipeng, Duan, Haodong, Chen, Xin, Lv, Han, Nie, Zheng, Zhang, Min, Wang, Bin, Zhang, Wenwei, Zhang, Xinyue, Ge, Jiaye, Li, Wei, Li, Jingwen, Tu, Zhongying, He, Conghui, Zhang, Xingcheng, Chen, Kai, Qiao, Yu, Lin, Dahua, Wang, Jiaqi
Creating AI systems that can interact with environments over long periods, similar to human cognition, has been a longstanding research goal. Recent advancements in multimodal large language models (MLLMs) have made significant strides in open-world understanding. However, the challenge of continuous and simultaneous streaming perception, memory, and reasoning remains largely unexplored. Current MLLMs are constrained by their sequence-to-sequence architecture, which limits their ability to process inputs and generate responses simultaneously, akin to being unable to think while perceiving. Furthermore, relying on long contexts to store historical data is impractical for long-term interactions, as retaining all information becomes costly and inefficient. Therefore, rather than relying on a single foundation model to perform all functions, this project draws inspiration from the concept of the Specialized Generalist AI and introduces disentangled streaming perception, reasoning, and memory mechanisms, enabling real-time interaction with streaming video and audio input. The proposed framework InternLM-XComposer2.5-OmniLive (IXC2.5-OL) consists of three key modules: (1) Streaming Perception Module: Processes multimodal information in real-time, storing key details in memory and triggering reasoning in response to user queries. (2) Multi-modal Long Memory Module: Integrates short-term and long-term memory, compressing short-term memories into long-term ones for efficient retrieval and improved accuracy. (3) Reasoning Module: Responds to queries and executes reasoning tasks, coordinating with the perception and memory modules. This project simulates human-like cognition, enabling multimodal large language models to provide continuous and adaptive service over time.
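The disentangled three-module design can be pictured with the minimal structural sketch below. Class and method names are illustrative, not the IXC2.5-OL API: perception runs continuously and writes compressed details into memory, while reasoning is triggered only by a query and reads from memory rather than from the full raw stream.

```python
# Structural sketch of decoupled perception / memory / reasoning (names invented).
from collections import deque

class StreamingPerception:
    def observe(self, frame) -> str:
        # A real system would run multimodal encoders here; we fake a "key detail".
        return f"detail({frame})"

class LongMemory:
    def __init__(self, short_capacity: int = 4):
        self.short = deque(maxlen=short_capacity)
        self.long: list[str] = []
    def write(self, detail: str) -> None:
        self.short.append(detail)
        if len(self.short) == self.short.maxlen:
            # Compress short-term entries into a single long-term summary.
            self.long.append(" | ".join(self.short))
            self.short.clear()
    def retrieve(self, query: str) -> list[str]:
        return [entry for entry in self.long if query in entry] or list(self.short)

class Reasoner:
    def answer(self, query: str, evidence: list[str]) -> str:
        return f"answer to {query!r} grounded in {len(evidence)} memory entries"

perception, memory, reasoner = StreamingPerception(), LongMemory(), Reasoner()
for frame in range(10):  # stand-in for a streaming video/audio feed
    memory.write(perception.observe(frame))
print(reasoner.answer("detail", memory.retrieve("detail")))
```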
Mobility-LLM: Learning Visiting Intentions and Travel Preferences from Human Mobility Data with Large Language Models
Gong, Letian, Lin, Yan, Zhang, Xinyue, Lu, Yiwen, Han, Xuedi, Liu, Yichen, Guo, Shengnan, Lin, Youfang, Wan, Huaiyu
Location-based services (LBS) have accumulated extensive human mobility data on diverse behaviors through check-in sequences. These sequences offer valuable insights into users' intentions and preferences. Yet, existing models analyzing check-in sequences fail to consider the semantics contained in these sequences, which closely reflect human visiting intentions and travel preferences, leading to incomplete comprehension. Drawing inspiration from the exceptional semantic understanding and contextual information processing capabilities of large language models (LLMs) across various domains, we present Mobility-LLM, a novel framework that leverages LLMs to analyze check-in sequences for multiple tasks. Since LLMs cannot directly interpret check-ins, we reprogram these sequences to help LLMs comprehensively understand the semantics of human visiting intentions and travel preferences. Specifically, we introduce a visiting intention memory network (VIMN) to capture the visiting intentions at each record, along with a shared pool of human travel preference prompts (HTPP) to guide the LLM in understanding users' travel preferences. These components enhance the model's ability to extract and leverage semantic information from human mobility data effectively. Extensive experiments on four benchmark datasets and three downstream tasks demonstrate that our approach significantly outperforms existing models, underscoring the effectiveness of Mobility-LLM in advancing our understanding of human mobility data within LBS contexts.
OmniBench: Towards The Future of Universal Omni-Language Models
Li, Yizhi, Zhang, Ge, Ma, Yinghao, Yuan, Ruibin, Zhu, Kang, Guo, Hangyu, Liang, Yiming, Liu, Jiaheng, Wang, Zekun, Yang, Jian, Wu, Siwei, Qu, Xingwei, Shi, Jinjie, Zhang, Xinyue, Yang, Zhenzhu, Wang, Xiangzhou, Zhang, Zhaoxiang, Liu, Zachary, Benetos, Emmanouil, Huang, Wenhao, Lin, Chenghua
Recent advancements in multimodal large language models (MLLMs) have aimed to integrate and interpret data across diverse modalities. However, the capacity of these models to concurrently process and reason about multiple modalities remains inadequately explored, partly due to the lack of comprehensive modality-wise benchmarks. We introduce OmniBench, a novel benchmark designed to rigorously evaluate models' ability to recognize, interpret, and reason across visual, acoustic, and textual inputs simultaneously. We define models capable of such tri-modal processing as omni-language models (OLMs). OmniBench is distinguished by high-quality human annotations, ensuring that accurate responses require integrated understanding and reasoning across all three modalities. Our main findings reveal that: i) most OLMs exhibit critical limitations in instruction-following and reasoning capabilities within tri-modal contexts; and ii) most baseline models perform poorly (below 50% accuracy) even when provided with alternative textual representations of images and/or audio. These results suggest that the ability to construct a consistent context from text, image, and audio is often overlooked in existing MLLM training paradigms. To address this gap, we curate OmniInstruct, an instruction-tuning dataset of 84.5K training samples for adapting OLMs to multimodal contexts. We advocate for future research to focus on developing more robust tri-modal integration techniques and training strategies to enhance OLM performance across diverse modalities. The code and live leaderboard can be found at https://m-a-p.ai/OmniBench.
MAP-Neo: Highly Capable and Transparent Bilingual Large Language Model Series
Zhang, Ge, Qu, Scott, Liu, Jiaheng, Zhang, Chenchen, Lin, Chenghua, Yu, Chou Leuang, Pan, Danny, Cheng, Esther, Liu, Jie, Lin, Qunshu, Yuan, Raven, Zheng, Tuney, Pang, Wei, Du, Xinrun, Liang, Yiming, Ma, Yinghao, Li, Yizhi, Ma, Ziyang, Lin, Bill, Benetos, Emmanouil, Yang, Huan, Zhou, Junting, Ma, Kaijing, Liu, Minghao, Niu, Morry, Wang, Noah, Que, Quehry, Liu, Ruibo, Liu, Sine, Guo, Shawn, Gao, Soren, Zhou, Wangchunshu, Zhang, Xinyue, Zhou, Yizhi, Wang, Yubo, Bai, Yuelin, Zhang, Yuhan, Zhang, Yuxiang, Wang, Zenith, Yang, Zhenzhu, Zhao, Zijian, Zhang, Jiajun, Ouyang, Wanli, Huang, Wenhao, Chen, Wenhu
Large Language Models (LLMs) have made great strides in recent years, achieving unprecedented performance across different tasks. However, due to commercial interests, the most competitive models like GPT, Gemini, and Claude have been gated behind proprietary interfaces without disclosure of their training details. Recently, many institutions have open-sourced several strong LLMs, such as LLaMA-3, that are comparable to existing closed-source LLMs. However, only the model weights are provided, with most details (e.g., intermediate checkpoints, pre-training corpus, and training code) left undisclosed. To improve the transparency of LLMs, the research community has organized efforts to release truly open LLMs (e.g., Pythia, Amber, OLMo), for which more details (e.g., pre-training corpus and training code) are provided. These models have greatly advanced the scientific study of large models, including their strengths, weaknesses, biases, and risks. However, we observe that existing truly open LLMs are still inferior to state-of-the-art LLMs of similar size on reasoning, knowledge, and coding tasks. To this end, we open-source MAP-Neo, a highly capable and transparent bilingual language model with 7B parameters trained from scratch on 4.5T high-quality tokens. MAP-Neo is the first fully open-sourced bilingual LLM with performance comparable to existing state-of-the-art LLMs. Moreover, we open-source all details needed to reproduce MAP-Neo, including the cleaned pre-training corpus, data cleaning pipeline, checkpoints, and a well-optimized training/evaluation framework. Finally, we hope MAP-Neo will enhance and strengthen the open research community and inspire further innovation and creativity to facilitate continued improvements of LLMs.
InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output
Zhang, Pan, Dong, Xiaoyi, Zang, Yuhang, Cao, Yuhang, Qian, Rui, Chen, Lin, Guo, Qipeng, Duan, Haodong, Wang, Bin, Ouyang, Linke, Zhang, Songyang, Zhang, Wenwei, Li, Yining, Gao, Yang, Sun, Peng, Zhang, Xinyue, Li, Wei, Li, Jingwen, Wang, Wenhai, Yan, Hang, He, Conghui, Zhang, Xingcheng, Chen, Kai, Dai, Jifeng, Qiao, Yu, Lin, Dahua, Wang, Jiaqi
We present InternLM-XComposer-2.5 (IXC-2.5), a versatile large vision-language model that supports long-contextual input and output. IXC-2.5 excels in various text-image comprehension and composition applications, achieving GPT-4V level capabilities with a mere 7B LLM backend. Trained with 24K interleaved image-text contexts, it can seamlessly extend to 96K long contexts via RoPE extrapolation. This long-context capability allows IXC-2.5 to excel in tasks requiring extensive input and output contexts. Compared to its previous 2.0 version, InternLM-XComposer-2.5 features three major upgrades in vision-language comprehension: (1) Ultra-High Resolution Understanding, (2) Fine-Grained Video Understanding, and (3) Multi-Turn Multi-Image Dialogue. In addition to comprehension, IXC-2.5 extends to two compelling applications using extra LoRA parameters for text-image composition: (1) Crafting Webpages and (2) Composing High-Quality Text-Image Articles. IXC-2.5 has been evaluated on 28 benchmarks, outperforming existing open-source state-of-the-art models on 16 of them. It also surpasses or competes closely with GPT-4V and Gemini Pro on 16 key tasks. InternLM-XComposer-2.5 is publicly available at https://github.com/InternLM/InternLM-XComposer.
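The abstract only says the 24K-trained model extends to 96K contexts "via RoPE extrapolation", without specifying the scheme. One common recipe is position interpolation, scaling positions by the ratio of training to target length so rotations at 96K positions stay within the range seen during 24K training; the sketch below shows that generic idea, and IXC-2.5's actual method may differ.

```python
# Hedged sketch of RoPE position interpolation (a generic illustration,
# not necessarily the scheme used by IXC-2.5).
import numpy as np

def rope_angles(positions, dim: int = 64, base: float = 10000.0,
                train_len: int = 24_000, target_len: int = 96_000) -> np.ndarray:
    scale = train_len / target_len                    # 0.25 for 24K -> 96K
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)  # standard RoPE frequencies
    scaled_pos = np.asarray(positions, dtype=np.float64) * scale
    return np.outer(scaled_pos, inv_freq)             # angles fed into sin/cos

angles = rope_angles(range(96_000))
print(angles.shape)  # (96000, 32); the largest position now maps into the trained range
```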
InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD
Dong, Xiaoyi, Zhang, Pan, Zang, Yuhang, Cao, Yuhang, Wang, Bin, Ouyang, Linke, Zhang, Songyang, Duan, Haodong, Zhang, Wenwei, Li, Yining, Yan, Hang, Gao, Yang, Chen, Zhe, Zhang, Xinyue, Li, Wei, Li, Jingwen, Wang, Wenhai, Chen, Kai, He, Conghui, Zhang, Xingcheng, Dai, Jifeng, Qiao, Yu, Lin, Dahua, Wang, Jiaqi
The Large Vision-Language Model (LVLM) field has seen significant advancements, yet its progression has been hindered by challenges in comprehending fine-grained visual content due to limited resolution. Recent efforts have aimed to enhance the high-resolution understanding capabilities of LVLMs, yet they remain capped at approximately 1500 x 1500 pixels and constrained to a relatively narrow resolution range. This paper presents InternLM-XComposer2-4KHD, a groundbreaking exploration into elevating LVLM resolution capabilities up to 4K HD (3840 x 1600) and beyond. Concurrently, considering that ultra-high resolution may not be necessary in all scenarios, it supports a wide range of resolutions from 336 pixels to the 4K standard, significantly broadening its scope of applicability. Specifically, this research advances the patch division paradigm with a novel extension: dynamic resolution with automatic patch configuration. It maintains the training image aspect ratios while automatically varying patch counts and configuring layouts based on a pre-trained Vision Transformer (ViT) (336 x 336), leading to dynamic training resolutions from 336 pixels up to the 4K standard. Our research demonstrates that scaling training resolution up to 4K HD leads to consistent performance enhancements without hitting the ceiling of potential improvements. InternLM-XComposer2-4KHD shows superb capability that matches or even surpasses GPT-4V and Gemini Pro in 10 of the 16 benchmarks. The InternLM-XComposer2-4KHD model series with 7B parameters are publicly available at https://github.com/InternLM/InternLM-XComposer.
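To illustrate dynamic resolution with automatic patch configuration, the sketch below picks a rows x cols layout of 336 x 336 ViT tiles whose aspect ratio best matches the input image under a maximum tile budget, then resizes the image to fill that layout. The selection rule and the 55-tile budget are assumptions for illustration, not the paper's exact algorithm.

```python
# Illustrative sketch: choose a tile grid matching the image aspect ratio
# under a hypothetical tile budget (not the paper's exact procedure).
def choose_patch_grid(width: int, height: int, tile: int = 336, max_tiles: int = 55):
    target_ratio = width / height
    best, best_err = (1, 1), float("inf")
    for rows in range(1, max_tiles + 1):
        for cols in range(1, max_tiles + 1):
            if rows * cols > max_tiles:
                continue
            err = abs(cols / rows - target_ratio)
            if err < best_err:
                best, best_err = (rows, cols), err
    rows, cols = best
    return rows, cols, (cols * tile, rows * tile)  # (rows, cols, resize target W x H)

# A 3840 x 1600 "4K HD" input under the hypothetical 55-tile budget:
print(choose_patch_grid(3840, 1600))
```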