ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities

Wang, Peng, Wang, Shijie, Lin, Junyang, Bai, Shuai, Zhou, Xiaohuan, Zhou, Jingren, Wang, Xinggang, Zhou, Chang

May-18-2023–arXiv.org Artificial Intelligence

In this work, we explore a scalable way for building a general representation model toward unlimited modalities. We release ONE-PEACE, a highly extensible model with 4B parameters that can seamlessly align and integrate representations across vision, audio, and language modalities. The architecture of ONE-PEACE comprises modality adapters, shared self-attention layers, and modality FFNs. This design allows for the easy extension of new modalities by adding adapters and FFNs, while also enabling multi-modal fusion through self-attention layers. To pretrain ONE-PEACE, we develop two modality-agnostic pretraining tasks, cross-modal aligning contrast and intra-modal denoising contrast, which align the semantic space of different modalities and capture fine-grained details within modalities concurrently. With the scaling-friendly architecture and pretraining tasks, ONE-PEACE has the potential to expand to unlimited modalities. Without using any vision or language pretrained model for initialization, ONE-PEACE achieves leading results on a wide range of uni-modal and multi-modal tasks, including image classification (ImageNet), semantic segmentation (ADE20K), audio-text retrieval (AudioCaps, Clotho), audio classification (ESC-50, FSD50K, VGGSound), audio question answering (AVQA), image-text retrieval (MSCOCO, Flickr30K), and visual grounding (RefCOCO/+/g). Code is available at https://github.com/OFA-Sys/ONE-PEACE.

large language model, machine learning, natural language, (22 more...)

arXiv.org Artificial Intelligence

May-18-2023

arXiv.org PDF

Add feedback

Country:
- Asia > Vietnam > Long An Province > Tân An (0.04)

Genre:
- Research Report (1.00)

Technology:
- Information Technology
  - Sensing and Signal Processing > Image Processing (1.00)
  - Artificial Intelligence
    - Natural Language > Large Language Model (0.94)
    - Vision > Image Understanding (0.66)
    - Machine Learning > Neural Networks
      - Deep Learning (1.00)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found