Perceptual Group Tokenizer: Building Perception with Iterative Grouping

Deng, Zhiwei, Chen, Ting, Li, Yang

arXiv.org Artificial Intelligence 

Human visual recognition system shows astonishing capability of compressing visual information into a set of tokens containing rich representations without label supervision. One critical driving principle behind it is perceptual grouping (Palmer, 2002; Wagemans et al., 2012; Herzog, 2018). Despite being widely used in computer vision in the early 2010s, it remains a mystery whether perceptual grouping can be leveraged to derive a neural visual recognition backbone that generates as powerful representations. In this paper, we propose the Perceptual Group Tokenizer, a model that entirely relies on grouping operations to extract visual features and perform self-supervised representation learning, where a series of grouping operations are used to iteratively hypothesize the context for pixels or superpixels to refine feature representations. We show that the proposed model can achieve competitive performance compared to state-of-the-art vision architectures, and inherits desirable properties including adaptive computation without re-training, and interpretability. Specifically, Perceptual Group Tokenizer achieves 80.3% on ImageNet-1K self-supervised learning benchmark with linear probe evaluation, establishing a new milestone for this paradigm. Ever since the surge of deep learning, feature detection has predominated the vision field and become the main principle behind representation learning backbone designs and made impressive progress (Simonyan & Zisserman, 2014; Szegedy et al., 2015; He et al., 2016; Chen et al., 2017; Tan & Le, 2019; Qi et al., 2020; Liu et al., 2022b). The success of the former paradigm is, although striking, raising the question of whether perceptual grouping can also be used as the driving principle to construct a visual recognition model. Different from detecting and selecting distinctive features, perceptual grouping emphasizes on learning feature space where similarity of all pixels can be effectively measured (Uijlings et al., 2013; Arbeláez et al., 2014). With such a feature space, semantically meaningful objects and regions can be easily discovered with a simple grouping algorithm and used as a compact set to represent an image (Uijlings et al., 2013; Arbeláez et al., 2014; Locatello et al., 2020).