Image Understanding Makes for A Good Tokenizer for Image Generation
Luting Wang, Yang Zhao

Neural Information Processing Systems

Modern image generation (IG) models have been shown to capture rich semantics valuable for image understanding (IU) tasks. However, the potential of IU models to improve IG performance remains uncharted. We address this issue using a token-based IG framework, which relies on effective tokenizers to map images into token sequences. Currently, pixel reconstruction (e.g., VQGAN) dominates the training objective for tokenizers. In contrast, our approach adopts the feature reconstruction objective, where tokenizers are trained by distilling knowledge from pretrained IU encoders. Comprehensive comparisons indicate that tokenizers with strong IU capabilities achieve superior IG performance across a variety of metrics, datasets, tasks, and proposal networks.
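
The contrast between the two tokenizer objectives can be sketched in a few lines. The abstract does not specify the architecture or the exact loss, so this is a minimal sketch under stated assumptions: a nearest-neighbour VQ codebook, a cosine-distance feature loss, and random linear maps standing in for the encoder, the decoder heads, and the frozen IU encoder. All of these names and sizes are hypothetical, and the code only evaluates both objectives once; it does not train anything.

```python
import numpy as np

def quantize(z, codebook):
    """Nearest-neighbour vector quantization: map each latent to its closest code."""
    # Squared Euclidean distance between every latent (N, D) and every code (K, D).
    dists = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    ids = dists.argmin(axis=1)       # (N,) discrete token ids
    return ids, codebook[ids]        # token ids and their quantized vectors

def feature_loss(pred, teacher):
    """Feature reconstruction (assumed form): 1 - mean cosine similarity to teacher features."""
    pred_n = pred / np.linalg.norm(pred, axis=-1, keepdims=True)
    teach_n = teacher / np.linalg.norm(teacher, axis=-1, keepdims=True)
    return float(1.0 - (pred_n * teach_n).sum(axis=-1).mean())

def pixel_loss(recon, pixels):
    """Pixel reconstruction as plain MSE; VQGAN's real objective also has perceptual and adversarial terms."""
    return float(((recon - pixels) ** 2).mean())

rng = np.random.default_rng(0)
N, P, D, K, F = 16, 12, 8, 32, 8    # patches, pixels/patch, latent dim, codes, teacher dim (toy sizes)

patches = rng.normal(size=(N, P))
teacher = rng.normal(size=(N, F))   # hypothetical stand-in for a frozen IU encoder's features
W_enc = rng.normal(size=(P, D))     # hypothetical linear encoder
W_feat = rng.normal(size=(D, F))    # hypothetical decoder head into the teacher's feature space
W_pix = rng.normal(size=(D, P))     # hypothetical decoder head back to pixels
codebook = rng.normal(size=(K, D))

# Both objectives share the same discrete tokens; only the reconstruction target differs.
ids, zq = quantize(patches @ W_enc, codebook)
print("tokens:", ids)
print("feature loss:", feature_loss(zq @ W_feat, teacher))
print("pixel loss:  ", pixel_loss(zq @ W_pix, patches))
```

The design point the sketch isolates is that the quantizer and token sequence are identical in both cases; switching from pixel to feature reconstruction changes only what the decoder is asked to reproduce.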

Extending Video Masked Autoencoders to 128 frames

Neural Information Processing Systems

Video understanding has witnessed significant progress, with recent video foundation models demonstrating strong performance owing to self-supervised pre-training objectives, among which Masked Autoencoders (MAE) are the design of choice.
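
The abstract names MAE as the pre-training design of choice but is cut off before describing the method. As a minimal sketch of MAE-style random masking (not this paper's specific scheme), the code below shuffles the spatiotemporal patch tokens, keeps a small random subset for the encoder, and records which tokens must be reconstructed. The 2-frame tubelets, the 14 x 14 spatial grid, the 90% mask ratio, and the helper names are assumptions chosen for illustration.

```python
import numpy as np

def random_masking(tokens, mask_ratio, rng):
    """MAE-style random masking over a flat (N, D) sequence of patch tokens."""
    n = tokens.shape[0]
    n_keep = int(n * (1 - mask_ratio))
    ids_shuffle = np.argsort(rng.random(n))  # random permutation of token indices
    ids_restore = np.argsort(ids_shuffle)    # inverse permutation, to unshuffle later
    ids_keep = ids_shuffle[:n_keep]
    visible = tokens[ids_keep]               # only these tokens enter the encoder
    mask = np.ones(n, dtype=bool)            # True = masked, i.e. to be reconstructed
    mask[ids_keep] = False
    return visible, mask, ids_restore

def num_tokens(frames, tubelet=2, grid=14):
    """Token count for a clip cut into (frames / tubelet) x grid x grid patches (assumed sizes)."""
    return (frames // tubelet) * grid * grid

rng = np.random.default_rng(0)
tokens = rng.normal(size=(num_tokens(16), 4))   # 16-frame clip with toy 4-dim embeddings
visible, mask, ids_restore = random_masking(tokens, mask_ratio=0.9, rng=rng)
print(tokens.shape[0], "tokens ->", visible.shape[0], "visible")

# Token count grows linearly with clip length.
for frames in (16, 128):
    n = num_tokens(frames)
    print(f"{frames} frames: {n} tokens, {int(n * (1 - 0.9))} visible at 90% masking")
```

Under these assumed sizes, a 16-frame clip yields 1,568 tokens, while a 128-frame clip yields 12,544, so even at 90% masking the encoder sees roughly eight times as many tokens. This is one practical reason long clips are costly to pre-train on.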

Language Model Tokenizers Introduce Unfairness Between Languages

Neural Information Processing Systems

Recent language models have shown impressive multilingual performance, even when not explicitly trained for it. Despite this, there are concerns about the quality of their outputs across different languages. In this paper, we show how disparity in the treatment of different languages arises at the tokenization stage, well before a model is even invoked. The same text translated into different languages can have drastically different tokenization lengths, differing by up to a factor of 15 in some cases. These disparities persist even for tokenizers that are intentionally trained for multilingual support.
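
To make the length disparity concrete, a minimal sketch can count how many tokens the same phrase needs in several languages and divide each count by the English one. As an assumption for illustration, it uses a toy byte-level tokenizer (one token per UTF-8 byte, no merges) rather than any tokenizer evaluated in the paper; `byte_tokens` and `length_ratios` are hypothetical helpers.

```python
def byte_tokens(text: str) -> list[int]:
    """Toy byte-level tokenizer: one token per UTF-8 byte (no BPE merges)."""
    return list(text.encode("utf-8"))

def length_ratios(parallel: dict[str, str], reference: str = "en") -> dict[str, float]:
    """Token count of each translation divided by the reference language's token count."""
    ref_len = len(byte_tokens(parallel[reference]))
    return {lang: len(byte_tokens(s)) / ref_len for lang, s in parallel.items()}

# The same phrase ("thank you") in four languages.
parallel = {
    "en": "thank you",
    "ru": "спасибо",
    "ja": "ありがとう",
    "hi": "धन्यवाद",
}

for lang, ratio in length_ratios(parallel).items():
    print(f"{lang}: {len(byte_tokens(parallel[lang]))} tokens, {ratio:.2f}x English")
```

Under this toy tokenizer, ASCII text costs one token per character, Cyrillic letters cost two, and hiragana and Devanagari characters cost three, so the disparity here comes purely from the encoding. Trained subword tokenizers merge frequent byte sequences, so their ratios depend on the training corpus and can be much larger, as with the gaps of up to 15 times reported above.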