Subobject-level Image Tokenization

Chen, Delong, Cahyawijaya, Samuel, Liu, Jianfeng, Wang, Baoyuan, Fung, Pascale

Apr-23-2024–arXiv.org Artificial Intelligence

Transformer-based vision models typically tokenize images into fixed-size square patches as input units, which lacks the adaptability to image content and overlooks the inherent pixel grouping structure. Inspired by the subword tokenization widely adopted in language models, we propose an image tokenizer at a subobject level, where the subobjects are represented by semantically meaningful image segments obtained by segmentation models (e.g., segment anything models). To implement a learning system based on subobject tokenization, we first introduced a Direct Segment Anything Model (DirectSAM) that efficiently produces comprehensive segmentation of subobjects, then embed subobjects into compact latent vectors and fed them into a large language model for vision language learning. Empirical results demonstrated that our subobject-level tokenization significantly facilitates efficient learning of translating images into object and attribute descriptions compared to the traditional patch-level tokenization.

machine learning, natural language, tokenization, (19 more...)

arXiv.org Artificial Intelligence

Apr-23-2024

arXiv.org PDF

Add feedback

Country:
- Asia > Middle East
  - Israel (0.14)
- Europe > France (0.28)
- North America > United States (0.47)

Genre:
- Research Report (0.82)

Technology:
- Information Technology > Artificial Intelligence
  - Machine Learning (1.00)
  - Natural Language (1.00)
  - Representation & Reasoning > Object-Oriented Architecture (1.00)
  - Vision (1.00)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found