Elhoseiny, Mohamed
A Simple Baseline that Questions the Use of Pretrained-Models in Continual Learning
Janson, Paul, Zhang, Wenxuan, Aljundi, Rahaf, Elhoseiny, Mohamed
With the success of pretraining techniques in representation learning, a number of continual learning methods based on pretrained models have been proposed. Some of these methods design continual learning mechanisms on the pre-trained representations and only allow minimum updates or even no updates of the backbone models during the training of continual learning. In this paper, we question whether the complexity of these models is needed to achieve good performance by comparing them to a simple baseline that we designed. We argue that the pretrained feature extractor itself can be strong enough to achieve a competitive or even better continual learning performance on Split-CIFAR100 and CoRe 50 benchmarks. To validate this, we conduct a very simple baseline that 1) use the frozen pretrained model to extract image features for every class encountered during the continual learning stage and compute their corresponding mean features on training data, and 2) predict the class of the input based on the nearest neighbor distance between test samples and mean features of the classes; i.e., Nearest Mean Classifier (NMC). This baseline is single-headed, exemplar-free, and can be task-free (by updating the means continually). This baseline achieved 88.53% on 10-Split-CIFAR-100, surpassing most state-of-the-art continual learning methods that are all initialized using the same pretrained transformer model. We hope our baseline may encourage future progress in designing learning systems that can continually add quality to the learning representations even if they started from some pretrained weights.
Guiding Online Reinforcement Learning with Action-Free Offline Pretraining
Zhu, Deyao, Wang, Yuhui, Schmidhuber, Jรผrgen, Elhoseiny, Mohamed
Offline RL methods have been shown to reduce the need for environment interaction by training agents using offline collected episodes. However, these methods typically require action information to be logged during data collection, which can be difficult or even impossible in some practical cases. In this paper, we investigate the potential of using action-free offline datasets to improve online reinforcement learning, name this problem Reinforcement Learning with Action-Free Offline Pretraining (AFP-RL). We introduce Action-Free Guide (AF-Guide), a method that guides online training by extracting knowledge from action-free offline datasets. AF-Guide consists of an Action-Free Decision Transformer (AFDT) implementing a variant of Upside-Down Reinforcement Learning. It learns to plan the next states from the offline dataset, and a Guided Soft Actor-Critic (Guided SAC) that learns online with guidance from AFDT. Experimental results show that AF-Guide can improve sample efficiency and performance in online training thanks to the knowledge from the action-free offline dataset. Code is available at https://github.com/Vision-CAIR/AF-Guide.
ChatGPT Asks, BLIP-2 Answers: Automatic Questioning Towards Enriched Visual Descriptions
Zhu, Deyao, Chen, Jun, Haydarov, Kilichbek, Shen, Xiaoqian, Zhang, Wenxuan, Elhoseiny, Mohamed
Asking insightful questions is crucial for acquiring knowledge and expanding our understanding of the world. However, the importance of questioning has been largely overlooked in AI research, where models have been primarily developed to answer questions. With the recent advancements of large language models (LLMs) like ChatGPT, we discover their capability to ask high-quality questions when provided with a suitable prompt. This discovery presents a new opportunity to develop an automatic questioning system. In this paper, we introduce ChatCaptioner, a novel automatic-questioning method deployed in image captioning. Here, ChatGPT is prompted to ask a series of informative questions about images to BLIP-2, a strong vision question-answering model. By keeping acquiring new visual information from BLIP-2's answers, ChatCaptioner is able to generate more enriched image descriptions. We conduct human-subject evaluations on common image caption datasets such as COCO, Conceptual Caption, and WikiArt, and compare ChatCaptioner with BLIP-2 as well as ground truth. Our results demonstrate that ChatCaptioner's captions are significantly more informative, receiving three times as many votes from human evaluators for providing the most image information. Besides, ChatCaptioner identifies 53% more objects within the image than BLIP-2 alone measured by WordNet synset matching. Code is available at https://github.com/Vision-CAIR/ChatCaptioner
ArtELingo: A Million Emotion Annotations of WikiArt with Emphasis on Diversity over Language and Culture
Mohamed, Youssef, Abdelfattah, Mohamed, Alhuwaider, Shyma, Li, Feifan, Zhang, Xiangliang, Church, Kenneth Ward, Elhoseiny, Mohamed
This paper introduces ArtELingo, a new benchmark and dataset, designed to encourage work on diversity across languages and cultures. Following ArtEmis, a collection of 80k artworks from WikiArt with 0.45M emotion labels and English-only captions, ArtELingo adds another 0.79M annotations in Arabic and Chinese, plus 4.8K in Spanish to evaluate "cultural-transfer" performance. More than 51K artworks have 5 annotations or more in 3 languages. This diversity makes it possible to study similarities and differences across languages and cultures. Further, we investigate captioning tasks, and find diversity improves the performance of baseline models. ArtELingo is publicly available at https://www.artelingo.org/ with standard splits and baseline models. We hope our work will help ease future research on multilinguality and culturally-aware AI.
StyleGAN-V: A Continuous Video Generator with the Price, Image Quality and Perks of StyleGAN2
Skorokhodov, Ivan, Tulyakov, Sergey, Elhoseiny, Mohamed
Videos show continuous events, yet most - if not all - video synthesis frameworks treat them discretely in time. In this work, we think of videos of what they should be - time-continuous signals, and extend the paradigm of neural representations to build a continuous-time video generator. For this, we first design continuous motion representations through the lens of positional embeddings. Then, we explore the question of training on very sparse videos and demonstrate that a good generator can be learned by using as few as 2 frames per clip. After that, we rethink the traditional image and video discriminators pair and propose to use a single hypernetwork-based one. This decreases the training cost and provides richer learning signal to the generator, making it possible to train directly on 1024$^2$ videos for the first time. We build our model on top of StyleGAN2 and it is just 5% more expensive to train at the same resolution while achieving almost the same image quality. Moreover, our latent space features similar properties, enabling spatial manipulations that our method can propagate in time. We can generate arbitrarily long videos at arbitrary high frame rate, while prior work struggles to generate even 64 frames at a fixed rate. Our model achieves state-of-the-art results on four modern 256$^2$ video synthesis benchmarks and one 1024$^2$ resolution one. Videos and the source code are available at the project website: https://universome.github.io/stylegan-v.
RelTransformer: Balancing the Visual Relationship Detection from Local Context, Scene and Memory
Chen, Jun, Agarwal, Aniket, Abdelkarim, Sherif, Zhu, Deyao, Elhoseiny, Mohamed
Visual relationship recognition (VRR) is a fundamental scene understanding task. The structure that VRR provides is essential to improve the AI interpretability in downstream tasks such as image captioning and visual question answering. Several recent studies showed that the long-tail problem in VRR is even more critical than that in object recognition due to the compositional complexity and structure. To overcome this limitation, we propose a novel transformer-based framework, dubbed as RelTransformer, which performs relationship prediction using rich semantic features from multiple image levels. We assume that more abundantcon textual features can generate more accurate and discriminative relationships, which can be useful when sufficient training data are lacking. The key feature of our model is its ability to aggregate three different-level features (local context, scene, and dataset-level) to compositionally predict the visual relationship. We evaluate our model on the visual genome and two "long-tail" VRR datasets, GQA-LT and VG8k-LT. Extensive experiments demonstrate that our RelTransformer could improve over the state-of-the-art baselines on all the datasets. In addition, our model significantly improves the accuracy of GQA-LT by 27.4% upon the best baselines on tail-relationship prediction. Our code is available in https://github.com/Vision-CAIR/RelTransformer.
Imaginative Walks: Generative Random Walk Deviation Loss for Improved Unseen Learning Representation
Elhoseiny, Mohamed, Jha, Divyansh, Yi, Kai, Skorokhodov, Ivan
We propose a novel loss for generative models, dubbed as GRaWD (Generative Random Walk Deviation), to improve learning representations of unexplored visual spaces. Quality learning representation of unseen classes (or styles) is crucial to facilitate novel image generation and better generative understanding of unseen visual classes (a.k.a. Zero-Shot Learning, ZSL). By generating representations of unseen classes from their semantic descriptions, such as attributes or text, Generative ZSL aims at identifying unseen categories discriminatively from seen ones. We define GRaWD by constructing a dynamic graph, including the seen class/style centers and generated samples in the current mini-batch. Our loss starts a random walk probability from each center through visual generations produced from hallucinated unseen classes. As a deviation signal, we encourage the random walk to eventually land after t steps in a feature representation that is hard to classify to any of the seen classes. We show that our loss can improve unseen class representation quality on four text-based ZSL benchmarks on CUB and NABirds datasets and three attribute-based ZSL benchmarks on AWA2, SUN, and aPY datasets. We also study our loss's ability to produce meaningful novel visual art generations on WikiArt dataset. Our experiments and human studies show that our loss can improve StyleGAN1 and StyleGAN2 generation quality, creating novel art that is significantly more preferred. Code will be made available.
Aligning Latent and Image Spaces to Connect the Unconnectable
Skorokhodov, Ivan, Sotnikov, Grigorii, Elhoseiny, Mohamed
In this work, we develop a method to generate infinite high-resolution images with diverse and complex content. It is based on a perfectly equivariant generator with synchronous interpolations in the image and latent spaces. Latent codes, when sampled, are positioned on the coordinate grid, and each pixel is computed from an interpolation of the nearby style codes. We modify the AdaIN mechanism to work in such a setup and train the generator in an adversarial setting to produce images positioned between any two latent vectors. At test time, this allows for generating complex and diverse infinite images and connecting any two unrelated scenes into a single arbitrarily large panorama. Apart from that, we introduce LHQ: a new dataset of \lhqsize high-resolution nature landscapes. We test the approach on LHQ, LSUN Tower and LSUN Bridge and outperform the baselines by at least 4 times in terms of quality and diversity of the produced infinite images. The project page is located at https://universome.github.io/alis.
VisualGPT: Data-efficient Image Captioning by Balancing Visual Input and Linguistic Knowledge from Pretraining
Chen, Jun, Guo, Han, Yi, Kai, Li, Boyang, Elhoseiny, Mohamed
In this paper, we aim to improve the data efficiency of image captioning. We propose VisualGPT, a data-efficient image captioning model that leverages the linguistic knowledge from a large pretrained language model (LM). A crucial challenge is to balance between the use of visual information in the image and prior linguistic knowledge acquired from pretraining.We designed a novel self-resurrecting encoder-decoder attention mechanism to quickly adapt the pretrained LM as the language decoder on a small amount of in-domain training data. The pro-posed self-resurrecting activation unit produces sparse activations but is not susceptible to zero gradients. When trained on 0.1%, 0.5% and 1% of MSCOCO and Conceptual Captions, the proposed model, VisualGPT, surpasses strong image captioning baselines. VisualGPT outperforms the best baseline model by up to 10.8% CIDEr on MS COCO and up to 5.4% CIDEr on Conceptual Captions.We also perform a series of ablation studies to quantify the utility of each system component. To the best of our knowledge, this is the first work that improves data efficiency of image captioning by utilizing LM pretrained on unimodal data. Our code is available at: https://github.com/Vision-CAIR/VisualGPT.
CIZSL++: Creativity Inspired Generative Zero-Shot Learning
Elhoseiny, Mohamed, Yi, Kai, Elfeki, Mohamed
Zero-shot learning (ZSL) aims at understanding unseen categories with no training examples from class-level descriptions. To improve the discriminative power of ZSL, we model the visual learning process of unseen categories with inspiration from the psychology of human creativity for producing novel art. First, we propose CIZSL-v1 as a creativity inspired model for generative ZSL. We relate ZSL to human creativity by observing that ZSL is about recognizing the unseen, and creativity is about creating a likable unseen. We introduce a learning signal inspired by creativity literature that explores the unseen space with hallucinated class-descriptions and encourages careful deviation of their visual feature generations from seen classes while allowing knowledge transfer from seen to unseen classes. Second, CIZSL-v2 is proposed as an improved version of CIZSL-v1 for generative zero-shot learning. CIZSL-v2 consists of an investigation of additional inductive losses for unseen classes along with a semantic guided discriminator. Empirically, we show consistently that CIZSL losses can improve generative ZSL models on the challenging task of generalized ZSL from a noisy text on CUB and NABirds datasets. We also show the advantage of our approach to Attribute-based ZSL on AwA2, aPY, and SUN datasets. We also show that CIZSL-v2 has improved performance compared to CIZSL-v1.