Zhang, Xinyue
YuE: Scaling Open Foundation Models for Long-Form Music Generation
Yuan, Ruibin, Lin, Hanfeng, Guo, Shuyue, Zhang, Ge, Pan, Jiahao, Zang, Yongyi, Liu, Haohe, Liang, Yiming, Ma, Wenye, Du, Xingjian, Du, Xinrun, Ye, Zhen, Zheng, Tianyu, Ma, Yinghao, Liu, Minghao, Tian, Zeyue, Zhou, Ziya, Xue, Liumeng, Qu, Xingwei, Li, Yizhi, Wu, Shangda, Shen, Tianhao, Ma, Ziyang, Zhan, Jun, Wang, Chunhui, Wang, Yatian, Chi, Xiaowei, Zhang, Xinyue, Yang, Zhenzhu, Wang, Xiangzhou, Liu, Shansong, Mei, Lingrui, Li, Peng, Wang, Junjie, Yu, Jianwei, Pang, Guojian, Li, Xu, Wang, Zihao, Zhou, Xiaohuan, Yu, Lijun, Benetos, Emmanouil, Chen, Yong, Lin, Chenghua, Chen, Xie, Xia, Gus, Zhang, Zhaoxiang, Zhang, Chao, Chen, Wenhu, Zhou, Xinyu, Qiu, Xipeng, Dannenberg, Roger, Liu, Jiaheng, Yang, Jian, Huang, Wenhao, Xue, Wei, Tan, Xu, Guo, Yike
We tackle the task of long-form music generation--particularly the challenging lyrics-to-song problem--by introducing YuE, a family of open foundation models based on the LLaMA2 architecture. Specifically, YuE scales to trillions of tokens and generates up to five minutes of music while maintaining lyrical alignment, coherent musical structure, and engaging vocal melodies with appropriate accompaniment. It achieves this through (1) track-decoupled next-token prediction to overcome dense mixture signals, (2) structural progressive conditioning for long-context lyrical alignment, and (3) a multitask, multiphase pre-training recipe that aids convergence and generalization. In addition, we redesign the in-context learning technique for music generation, enabling versatile style transfer (e.g., converting Japanese city pop into an English rap while preserving the original accompaniment) and bidirectional generation. Through extensive evaluation, we demonstrate that YuE matches or even surpasses some proprietary systems in musicality and vocal agility. Moreover, fine-tuning YuE enables additional controls and enhanced support for tail languages. Beyond generation, we show that YuE's learned representations perform well on music understanding tasks, matching or exceeding state-of-the-art methods on the MARBLE benchmark. Keywords: lyrics2song, song generation, long-form, foundation model, music generation
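To make the track-decoupled idea concrete, here is a minimal sketch (illustrative names only, not the authors' implementation): per-frame vocal and accompaniment codec tokens are interleaved into one stream so that an ordinary next-token objective applies without ever predicting tokens of the dense mixed signal directly.

```python
# Minimal sketch (illustrative only, not the YuE implementation) of
# track-decoupled next-token prediction: vocal and accompaniment codec tokens
# are interleaved frame by frame into a single autoregressive stream.
from typing import Iterator, List, Tuple

def interleave_tracks(vocal: List[int], accomp: List[int]) -> List[int]:
    """Interleave two frame-aligned token tracks as v0, a0, v1, a1, ..."""
    assert len(vocal) == len(accomp), "tracks must be frame-aligned"
    stream: List[int] = []
    for v, a in zip(vocal, accomp):
        stream.extend((v, a))
    return stream

def next_token_pairs(stream: List[int]) -> Iterator[Tuple[List[int], int]]:
    """Yield (context, target) pairs for ordinary next-token prediction."""
    for t in range(1, len(stream)):
        yield stream[:t], stream[t]

# Toy usage with hypothetical codec token ids for four frames per track.
vocal_ids, accomp_ids = [11, 12, 13, 14], [51, 52, 53, 54]
for context, target in next_token_pairs(interleave_tracks(vocal_ids, accomp_ids)):
    pass  # a language model would be trained to predict `target` from `context`
```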
Semantic Web and Creative AI -- A Technical Report from ISWS 2023
Ahmad, Raia Abu, Alharbi, Reham, Barile, Roberto, Böckling, Martin, Bolanos, Francisco, Bonfitto, Sara, Bruns, Oleksandra, Celino, Irene, Chudasama, Yashrajsinh, Critelli, Martin, d'Amato, Claudia, D'Ippolito, Giada, Dasoulas, Ioannis, De Giorgis, Stefano, De Leo, Vincenzo, Di Bonaventura, Chiara, Di Panfilo, Marco, Dobriy, Daniil, Domingue, John, Duan, Xuemin, Dumontier, Michel, Efeoglu, Sefika, Eschauzier, Ruben, Ginwa, Fakih, Ferranti, Nicolas, Graciotti, Arianna, Hanisch, Philipp, Hannah, George, Heidari, Golsa, Hogan, Aidan, Hussein, Hassan, Jouglar, Alexane, Kalo, Jan-Christoph, Kieffer, Manoé, Klironomos, Antonis, Koch, Inês, Lajewska, Weronika, Lazzari, Nicolas, Lindekrans, Mikael, Lippolis, Anna Sofia, Llugiqi, Majlinda, Mancini, Eleonora, Marzi, Eleonora, Menotti, Laura, Flores, Daniela Milon, Nagowah, Soulakshmee, Neubert, Kerstin, Niazmand, Emetis, Norouzi, Ebrahim, Martinez, Beatriz Olarte, Oudshoorn, Anouk Michelle, Poltronieri, Andrea, Presutti, Valentina, Purohit, Disha, Raoufi, Ensiyeh, Ringwald, Celian, Rockstroh, Johanna, Rudolph, Sebastian, Sack, Harald, Saeed, Zafar, Saeedizade, Mohammad Javad, Sahbi, Aya, Santini, Cristian, Simic, Aleksandra, Sommer, Dennis, Sousa, Rita, Tan, Mary Ann, Tarikere, Vidyashree, Tietz, Tabea, Tirpitz, Liam, Tomasino, Arnaldo, van Harmelen, Frank, Vissoci, Joao, Woods, Caitlin, Zhang, Bohui, Zhang, Xinyue, Zheng, Heng
The International Semantic Web Research School (ISWS) is a week-long intensive program designed to immerse participants in the field. This document reports the collaborative effort of ten teams of students attending ISWS 2023, each guided by a senior researcher as their mentor. Each team brought a different perspective to the topic of creative AI, substantiated by a set of research questions as the main subject of their investigation. The 2023 edition of ISWS focused on the intersection of Semantic Web technologies and Creative AI, exploring various points where the two fields meet. A key area of focus was the potential of LLMs as support tools for knowledge engineering. Participants also delved into the multifaceted applications of LLMs, including legal aspects of creative content production, humans in the loop, decentralised approaches to multimodal generative AI models, nanopublications and AI for personal scientific knowledge graphs, commonsense knowledge in automatic story and narrative completion, generative AI for art critique, prompt engineering, automatic music composition, commonsense prototyping and conceptual blending, and elicitation of tacit knowledge. As Large Language Models and semantic technologies continue to evolve, exciting new prospects are emerging: a future where the boundaries between creative expression and factual knowledge become increasingly permeable, leading to a world of knowledge that is both informative and inspiring.
A Machine Learning Approach for Emergency Detection in Medical Scenarios Using Large Language Models
Akaybicen, Ferit, Cummings, Aaron, Iwuagwu, Lota, Zhang, Xinyue, Adewuyi, Modupe
The rapid identification of medical emergencies through digital communication channels remains a critical challenge in modern healthcare delivery, particularly with the increasing prevalence of telemedicine. This paper presents a novel approach leveraging large language models (LLMs) and prompt engineering techniques for automated emergency detection in medical communications. We developed and evaluated a comprehensive system using multiple LLaMA model variants (1B, 3B, and 7B parameters) to classify medical scenarios as emergency or non-emergency situations. Our methodology incorporated both system prompts and in-prompt training approaches, evaluated across different hardware configurations. The results demonstrate exceptional performance, with the LLaMA 2 (7B) model achieving 99.7% accuracy and the LLaMA 3.2 (3B) model reaching 99.6% accuracy with optimal prompt engineering. Through systematic testing of training examples within the prompts, we identified that including 10 example scenarios in the model prompts yielded optimal classification performance. Processing speeds varied significantly between platforms, ranging from 0.05 to 2.2 seconds per request. The system showed particular strength in minimizing high-risk false negatives in emergency scenarios, which is crucial for patient safety. The code implementation and evaluation framework are publicly available on GitHub, facilitating further research and development in this crucial area of healthcare technology.
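As a concrete illustration of the in-prompt (few-shot) classification setup described above, the following hedged sketch builds a prompt from a handful of labeled scenarios and parses the model's label. The prompt wording, example scenarios, and the `call_llm` hook are placeholders; the paper's exact prompts and LLaMA inference stack may differ.

```python
# Hedged sketch of few-shot emergency classification via prompting.
# `call_llm` stands in for whatever text-completion backend is used.
from typing import Callable, List, Tuple

def build_prompt(examples: List[Tuple[str, str]], message: str) -> str:
    lines = [
        "You are a medical triage assistant.",
        "Classify the final message as EMERGENCY or NON-EMERGENCY.",
        "",
    ]
    for text, label in examples:  # e.g., the ~10 labeled scenarios found optimal
        lines.append(f"Message: {text}\nLabel: {label}\n")
    lines.append(f"Message: {message}\nLabel:")
    return "\n".join(lines)

def classify(message: str,
             examples: List[Tuple[str, str]],
             call_llm: Callable[[str], str]) -> str:
    """Send the few-shot prompt to the backend and parse the predicted label."""
    reply = call_llm(build_prompt(examples, message)).strip().upper()
    # Defaulting ambiguous replies to EMERGENCY keeps false negatives low.
    return "NON-EMERGENCY" if reply.startswith("NON") else "EMERGENCY"
```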
Entire-Space Variational Information Exploitation for Post-Click Conversion Rate Prediction
Fei, Ke, Zhang, Xinyue, Li, Jingjing
In recommender systems, post-click conversion rate (CVR) estimation is an essential task for modeling user preferences for items and estimating the value of recommendations. Sample selection bias (SSB) and data sparsity (DS) are two persistent challenges for CVR estimation. Currently, entire-space approaches that exploit unclicked samples through knowledge distillation are promising for mitigating SSB and DS simultaneously. Existing methods use non-conversion, conversion, or adaptive conversion predictors to generate pseudo labels for unclicked samples. However, they fail to consider the unbiasedness and information limitations of these pseudo labels. Motivated by this analysis, we propose an entire-space variational information exploitation framework (EVI) for CVR prediction. First, EVI uses a conditional entire-space CVR teacher to generate unbiased pseudo labels. Then, it applies variational information exploitation and logit distillation to transfer non-click-space information to the target CVR estimator. We conduct extensive offline experiments on six large-scale datasets, where EVI achieves an average improvement of 2.25% over state-of-the-art baselines.
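The following is an assumption-laden simplification, not the paper's exact EVI objective: it sketches how a teacher's pseudo labels on unclicked impressions could be distilled into a student CVR estimator alongside the supervised loss on clicked impressions, which is the general pattern the abstract describes.

```python
# Illustrative sketch of distilling teacher pseudo labels on the non-click
# space into a student CVR model; the weighting and exact losses are assumptions.
import torch
import torch.nn.functional as F

def cvr_distillation_loss(student_logits: torch.Tensor,
                          teacher_logits: torch.Tensor,
                          labels: torch.Tensor,
                          clicked: torch.Tensor,
                          alpha: float = 0.5) -> torch.Tensor:
    """clicked == 1: supervised CVR loss; clicked == 0: distill teacher pseudo labels."""
    clicked = clicked.float()
    labels = labels.float()
    # Supervised loss on the click space, where true conversion labels exist.
    sup = F.binary_cross_entropy_with_logits(student_logits, labels, reduction="none")
    sup = (sup * clicked).sum() / clicked.sum().clamp(min=1.0)
    # Logit distillation on the non-click space, using teacher pseudo labels.
    distill = F.binary_cross_entropy_with_logits(
        student_logits, torch.sigmoid(teacher_logits), reduction="none")
    distill = (distill * (1.0 - clicked)).sum() / (1.0 - clicked).sum().clamp(min=1.0)
    return sup + alpha * distill
```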
InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions
Zhang, Pan, Dong, Xiaoyi, Cao, Yuhang, Zang, Yuhang, Qian, Rui, Wei, Xilin, Chen, Lin, Li, Yifei, Niu, Junbo, Ding, Shuangrui, Guo, Qipeng, Duan, Haodong, Chen, Xin, Lv, Han, Nie, Zheng, Zhang, Min, Wang, Bin, Zhang, Wenwei, Zhang, Xinyue, Ge, Jiaye, Li, Wei, Li, Jingwen, Tu, Zhongying, He, Conghui, Zhang, Xingcheng, Chen, Kai, Qiao, Yu, Lin, Dahua, Wang, Jiaqi
Creating AI systems that can interact with environments over long periods, similar to human cognition, has been a longstanding research goal. Recent advancements in multimodal large language models (MLLMs) have made significant strides in open-world understanding. However, the challenge of continuous and simultaneous streaming perception, memory, and reasoning remains largely unexplored. Current MLLMs are constrained by their sequence-to-sequence architecture, which limits their ability to process inputs and generate responses simultaneously, akin to being unable to think while perceiving. Furthermore, relying on long contexts to store historical data is impractical for long-term interactions, as retaining all information becomes costly and inefficient. Therefore, rather than relying on a single foundation model to perform all functions, this project draws inspiration from the concept of the Specialized Generalist AI and introduces disentangled streaming perception, reasoning, and memory mechanisms, enabling real-time interaction with streaming video and audio input. The proposed framework InternLM-XComposer2.5-OmniLive (IXC2.5-OL) consists of three key modules: (1) Streaming Perception Module: Processes multimodal information in real-time, storing key details in memory and triggering reasoning in response to user queries. (2) Multi-modal Long Memory Module: Integrates short-term and long-term memory, compressing short-term memories into long-term ones for efficient retrieval and improved accuracy. (3) Reasoning Module: Responds to queries and executes reasoning tasks, coordinating with the perception and memory modules. This project simulates human-like cognition, enabling multimodal large language models to provide continuous and adaptive service over time.
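The disentangled three-module design can be pictured with the minimal structural sketch below. Class and method names are illustrative, not the IXC2.5-OL API: perception runs continuously and writes compressed details into memory, while reasoning is triggered only by a query and reads from memory rather than from the full raw stream.

```python
# Structural sketch of decoupled perception / memory / reasoning (names invented).
from collections import deque

class StreamingPerception:
    def observe(self, frame) -> str:
        # A real system would run multimodal encoders here; we fake a "key detail".
        return f"detail({frame})"

class LongMemory:
    def __init__(self, short_capacity: int = 4):
        self.short = deque(maxlen=short_capacity)
        self.long: list[str] = []
    def write(self, detail: str) -> None:
        self.short.append(detail)
        if len(self.short) == self.short.maxlen:
            # Compress short-term entries into a single long-term summary.
            self.long.append(" | ".join(self.short))
            self.short.clear()
    def retrieve(self, query: str) -> list[str]:
        return [entry for entry in self.long if query in entry] or list(self.short)

class Reasoner:
    def answer(self, query: str, evidence: list[str]) -> str:
        return f"answer to {query!r} grounded in {len(evidence)} memory entries"

perception, memory, reasoner = StreamingPerception(), LongMemory(), Reasoner()
for frame in range(10):  # stand-in for a streaming video/audio feed
    memory.write(perception.observe(frame))
print(reasoner.answer("detail", memory.retrieve("detail")))
```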
Mobility-LLM: Learning Visiting Intentions and Travel Preferences from Human Mobility Data with Large Language Models
Gong, Letian, Lin, Yan, Zhang, Xinyue, Lu, Yiwen, Han, Xuedi, Liu, Yichen, Guo, Shengnan, Lin, Youfang, Wan, Huaiyu
Location-based services (LBS) have accumulated extensive human mobility data on diverse behaviors through check-in sequences. These sequences offer valuable insights into users' intentions and preferences. Yet, existing models analyzing check-in sequences fail to consider the semantics contained in these sequences, which closely reflect human visiting intentions and travel preferences, leading to incomplete comprehension. Drawing inspiration from the exceptional semantic understanding and contextual information processing capabilities of large language models (LLMs) across various domains, we present Mobility-LLM, a novel framework that leverages LLMs to analyze check-in sequences for multiple tasks. Since LLMs cannot directly interpret check-ins, we reprogram these sequences to help LLMs comprehensively understand the semantics of human visiting intentions and travel preferences. Specifically, we introduce a visiting intention memory network (VIMN) to capture the visiting intentions at each record, along with a shared pool of human travel preference prompts (HTPP) to guide the LLM in understanding users' travel preferences. These components enhance the model's ability to extract and leverage semantic information from human mobility data effectively. Extensive experiments on four benchmark datasets and three downstream tasks demonstrate that our approach significantly outperforms existing models, underscoring the effectiveness of Mobility-LLM in advancing our understanding of human mobility data within LBS contexts.
OmniBench: Towards The Future of Universal Omni-Language Models
Li, Yizhi, Zhang, Ge, Ma, Yinghao, Yuan, Ruibin, Zhu, Kang, Guo, Hangyu, Liang, Yiming, Liu, Jiaheng, Wang, Zekun, Yang, Jian, Wu, Siwei, Qu, Xingwei, Shi, Jinjie, Zhang, Xinyue, Yang, Zhenzhu, Wang, Xiangzhou, Zhang, Zhaoxiang, Liu, Zachary, Benetos, Emmanouil, Huang, Wenhao, Lin, Chenghua
Recent advancements in multimodal large language models (MLLMs) have aimed to integrate and interpret data across diverse modalities. However, the capacity of these models to concurrently process and reason about multiple modalities remains inadequately explored, partly due to the lack of comprehensive modality-wise benchmarks. We introduce OmniBench, a novel benchmark designed to rigorously evaluate models' ability to recognize, interpret, and reason across visual, acoustic, and textual inputs simultaneously. We define models capable of such tri-modal processing as omni-language models (OLMs). OmniBench is distinguished by high-quality human annotations, ensuring that accurate responses require integrated understanding and reasoning across all three modalities. Our main findings reveal that: i) most OLMs exhibit critical limitations in instruction-following and reasoning capabilities within tri-modal contexts; and ii) most baseline models perform poorly (below 50% accuracy) even when provided with alternative textual representations of images and/or audio. These results suggest that the ability to construct a consistent context from text, image, and audio is often overlooked in existing MLLM training paradigms. To address this gap, we curate OmniInstruct, an instruction-tuning dataset of 84.5K training samples for adapting OLMs to multimodal contexts. We advocate for future research to focus on developing more robust tri-modal integration techniques and training strategies to enhance OLM performance across diverse modalities. The code and live leaderboard can be found at https://m-a-p.ai/OmniBench.
MAP-Neo: Highly Capable and Transparent Bilingual Large Language Model Series
Zhang, Ge, Qu, Scott, Liu, Jiaheng, Zhang, Chenchen, Lin, Chenghua, Yu, Chou Leuang, Pan, Danny, Cheng, Esther, Liu, Jie, Lin, Qunshu, Yuan, Raven, Zheng, Tuney, Pang, Wei, Du, Xinrun, Liang, Yiming, Ma, Yinghao, Li, Yizhi, Ma, Ziyang, Lin, Bill, Benetos, Emmanouil, Yang, Huan, Zhou, Junting, Ma, Kaijing, Liu, Minghao, Niu, Morry, Wang, Noah, Que, Quehry, Liu, Ruibo, Liu, Sine, Guo, Shawn, Gao, Soren, Zhou, Wangchunshu, Zhang, Xinyue, Zhou, Yizhi, Wang, Yubo, Bai, Yuelin, Zhang, Yuhan, Zhang, Yuxiang, Wang, Zenith, Yang, Zhenzhu, Zhao, Zijian, Zhang, Jiajun, Ouyang, Wanli, Huang, Wenhao, Chen, Wenhu
Large Language Models (LLMs) have made great strides in recent years, achieving unprecedented performance across different tasks. However, due to commercial interests, the most competitive models like GPT, Gemini, and Claude have been gated behind proprietary interfaces without disclosure of their training details. Recently, many institutions have open-sourced several strong LLMs, such as LLaMA-3, that are comparable to existing closed-source LLMs. However, only the model weights are provided, with most details (e.g., intermediate checkpoints, pre-training corpus, and training code) left undisclosed. To improve the transparency of LLMs, the research community has organized efforts to release truly open LLMs (e.g., Pythia, Amber, OLMo), for which more details (e.g., pre-training corpus and training code) are provided. These models have greatly advanced the scientific study of large models, including their strengths, weaknesses, biases, and risks. However, we observe that existing truly open LLMs are still inferior to state-of-the-art LLMs of similar size on reasoning, knowledge, and coding tasks. To this end, we open-source MAP-Neo, a highly capable and transparent bilingual language model with 7B parameters trained from scratch on 4.5T high-quality tokens. MAP-Neo is the first fully open-sourced bilingual LLM with performance comparable to existing state-of-the-art LLMs. Moreover, we open-source all details needed to reproduce MAP-Neo, including the cleaned pre-training corpus, data cleaning pipeline, checkpoints, and a well-optimized training/evaluation framework. Finally, we hope MAP-Neo will enhance and strengthen the open research community and inspire further innovation and creativity to facilitate continued improvements of LLMs.
InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output
Zhang, Pan, Dong, Xiaoyi, Zang, Yuhang, Cao, Yuhang, Qian, Rui, Chen, Lin, Guo, Qipeng, Duan, Haodong, Wang, Bin, Ouyang, Linke, Zhang, Songyang, Zhang, Wenwei, Li, Yining, Gao, Yang, Sun, Peng, Zhang, Xinyue, Li, Wei, Li, Jingwen, Wang, Wenhai, Yan, Hang, He, Conghui, Zhang, Xingcheng, Chen, Kai, Dai, Jifeng, Qiao, Yu, Lin, Dahua, Wang, Jiaqi
We present InternLM-XComposer-2.5 (IXC-2.5), a versatile large vision-language model that supports long-contextual input and output. IXC-2.5 excels in various text-image comprehension and composition applications, achieving GPT-4V level capabilities with a mere 7B LLM backend. Trained with 24K interleaved image-text contexts, it can seamlessly extend to 96K long contexts via RoPE extrapolation. This long-context capability allows IXC-2.5 to excel in tasks requiring extensive input and output contexts. Compared to its previous 2.0 version, InternLM-XComposer-2.5 features three major upgrades in vision-language comprehension: (1) Ultra-High Resolution Understanding, (2) Fine-Grained Video Understanding, and (3) Multi-Turn Multi-Image Dialogue. In addition to comprehension, IXC-2.5 extends to two compelling applications using extra LoRA parameters for text-image composition: (1) Crafting Webpages and (2) Composing High-Quality Text-Image Articles. IXC-2.5 has been evaluated on 28 benchmarks, outperforming existing open-source state-of-the-art models on 16 of them. It also surpasses or competes closely with GPT-4V and Gemini Pro on 16 key tasks. InternLM-XComposer-2.5 is publicly available at https://github.com/InternLM/InternLM-XComposer.
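The abstract only says the 24K-trained model extends to 96K contexts "via RoPE extrapolation", without specifying the scheme. One common recipe is position interpolation, scaling positions by the ratio of training to target length so rotations at 96K positions stay within the range seen during 24K training; the sketch below shows that generic idea, and IXC-2.5's actual method may differ.

```python
# Hedged sketch of RoPE position interpolation (a generic illustration,
# not necessarily the scheme used by IXC-2.5).
import numpy as np

def rope_angles(positions, dim: int = 64, base: float = 10000.0,
                train_len: int = 24_000, target_len: int = 96_000) -> np.ndarray:
    scale = train_len / target_len                    # 0.25 for 24K -> 96K
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)  # standard RoPE frequencies
    scaled_pos = np.asarray(positions, dtype=np.float64) * scale
    return np.outer(scaled_pos, inv_freq)             # angles fed into sin/cos

angles = rope_angles(range(96_000))
print(angles.shape)  # (96000, 32); the largest position now maps into the trained range
```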
InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD
Dong, Xiaoyi, Zhang, Pan, Zang, Yuhang, Cao, Yuhang, Wang, Bin, Ouyang, Linke, Zhang, Songyang, Duan, Haodong, Zhang, Wenwei, Li, Yining, Yan, Hang, Gao, Yang, Chen, Zhe, Zhang, Xinyue, Li, Wei, Li, Jingwen, Wang, Wenhai, Chen, Kai, He, Conghui, Zhang, Xingcheng, Dai, Jifeng, Qiao, Yu, Lin, Dahua, Wang, Jiaqi
The Large Vision-Language Model (LVLM) field has seen significant advancements, yet its progression has been hindered by challenges in comprehending fine-grained visual content due to limited resolution. Recent efforts have aimed to enhance the high-resolution understanding capabilities of LVLMs, yet they remain capped at approximately 1500 x 1500 pixels and constrained to a relatively narrow resolution range. This paper presents InternLM-XComposer2-4KHD, a groundbreaking exploration into elevating LVLM resolution capabilities up to 4K HD (3840 x 1600) and beyond. Concurrently, considering that ultra-high resolution may not be necessary in all scenarios, it supports a wide range of resolutions from 336 pixels to the 4K standard, significantly broadening its scope of applicability. Specifically, this research advances the patch division paradigm with a novel extension: dynamic resolution with automatic patch configuration. It maintains the training image aspect ratios while automatically varying patch counts and configuring layouts based on a pre-trained Vision Transformer (ViT) (336 x 336), leading to dynamic training resolutions from 336 pixels up to the 4K standard. Our research demonstrates that scaling training resolution up to 4K HD leads to consistent performance enhancements without hitting the ceiling of potential improvements. InternLM-XComposer2-4KHD shows superb capability that matches or even surpasses GPT-4V and Gemini Pro in 10 of the 16 benchmarks. The InternLM-XComposer2-4KHD model series with 7B parameters are publicly available at https://github.com/InternLM/InternLM-XComposer.
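To illustrate dynamic resolution with automatic patch configuration, the sketch below picks a rows x cols layout of 336 x 336 ViT tiles whose aspect ratio best matches the input image under a maximum tile budget, then resizes the image to fill that layout. The selection rule and the 55-tile budget are assumptions for illustration, not the paper's exact algorithm.

```python
# Illustrative sketch: choose a tile grid matching the image aspect ratio
# under a hypothetical tile budget (not the paper's exact procedure).
def choose_patch_grid(width: int, height: int, tile: int = 336, max_tiles: int = 55):
    target_ratio = width / height
    best, best_err = (1, 1), float("inf")
    for rows in range(1, max_tiles + 1):
        for cols in range(1, max_tiles + 1):
            if rows * cols > max_tiles:
                continue
            err = abs(cols / rows - target_ratio)
            if err < best_err:
                best, best_err = (rows, cols), err
    rows, cols = best
    return rows, cols, (cols * tile, rows * tile)  # (rows, cols, resize target W x H)

# A 3840 x 1600 "4K HD" input under the hypothetical 55-tile budget:
print(choose_patch_grid(3840, 1600))
```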