Generalized Decoding for Pixel, Image, and Language

Zou, Xueyan; Dou, Zi-Yi; Yang, Jianwei; Gan, Zhe; Li, Linjie; Li, Chunyuan; Dai, Xiyang; Behl, Harkirat; Wang, Jianfeng; Yuan, Lu; Peng, Nanyun; Wang, Lijuan; Lee, Yong Jae; Gao, Jianfeng

Dec-21-2022 – arXiv.org Artificial Intelligence

We present X-Decoder, a generalized decoding model that can predict pixel-level segmentation and language tokens seamlessly. X-Decoder takes as input two types of queries: (i) generic non-semantic queries and (ii) semantic queries induced from text inputs, to decode different pixel-level and token-level outputs in the same semantic space. With such a novel design, X-Decoder is the first work that provides a unified way to support all types of image segmentation and a variety of vision-language (VL) tasks. Further, our design enables seamless interactions across tasks at different granularities and brings mutual benefits by learning a common and rich pixel-level visual-semantic understanding space, without any pseudo-labeling. After pretraining on a mixed set of a limited amount of segmentation data and millions of image-text pairs, X-Decoder exhibits strong transferability to a wide range of downstream tasks in both zero-shot and finetuning settings. Notably, it achieves (1) state-of-the-art results on open-vocabulary segmentation and referring segmentation on eight datasets; (2) finetuned performance that is better than or competitive with other generalist and specialist models on segmentation and VL tasks; and (3) flexibility for efficient finetuning and novel task composition (e.g., referring captioning and image editing). Code, demo, video, and visualization are available at https://x-decoder-vl.github.io.
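
The two-query design described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it assumes a single cross-attention step with no learned projections, and every name in it (`cross_attend`, `decode`, `latent_queries`, `text_queries`, `vocab_embeddings`) is hypothetical. It only shows how non-semantic and text-derived queries might share one decoder and yield both pixel-level and token-level outputs from the same embedding space.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(queries, features):
    # Single-head cross-attention without learned projections (an assumption):
    # each query becomes a weighted mix of the image features.
    d = queries.shape[-1]
    weights = softmax(queries @ features.T / np.sqrt(d), axis=-1)
    return weights @ features

def decode(latent_queries, text_queries, pixel_features, vocab_embeddings):
    # pixel_features has shape (H, W, D); flatten it to (H*W, D).
    h, w, d = pixel_features.shape
    flat = pixel_features.reshape(h * w, d)
    # Both query types pass through the same decoder step.
    queries = np.concatenate([latent_queries, text_queries], axis=0)
    updated = cross_attend(queries, flat)
    # Pixel-level output: one mask logit map per query.
    masks = (updated @ flat.T).reshape(len(queries), h, w)
    # Token-level output: similarity of each query to vocabulary embeddings.
    token_logits = updated @ vocab_embeddings.T
    return masks, token_logits

rng = np.random.default_rng(0)
masks, token_logits = decode(
    latent_queries=rng.normal(size=(4, 16)),    # generic non-semantic queries
    text_queries=rng.normal(size=(3, 16)),      # queries induced from text
    pixel_features=rng.normal(size=(8, 8, 16)),
    vocab_embeddings=rng.normal(size=(50, 16)),
)
print(masks.shape, token_logits.shape)  # (7, 8, 8) (7, 50)
```

In this sketch the mask logits and the token logits are both dot products against the same query outputs, which is one simple way to read "the same semantic space"; the actual model may differ in every detail.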

large language model, machine learning, natural language

Country:
- North America > United States > Wisconsin > Dane County > Madison (0.04)

Genre:
- Research Report (0.81)

Industry:
- Leisure & Entertainment > Sports (0.45)
- Transportation > Ground
  - Road (0.46)

Technology:
- Information Technology
  - Sensing and Signal Processing > Image Processing (1.00)
  - Communications (0.86)
  - Artificial Intelligence
    - Vision (1.00)
    - Representation & Reasoning (1.00)
    - Machine Learning (1.00)
    - Natural Language > Large Language Model (0.35)
